The Evolving Role of the Data Engineer

big data. The new data stores look back in computing history to databases with simpler structures that don't try to be all things to all people. This means you have to choose the particular data store that's best for each application, then determine how to structure the data to make access as fast as possible.

Modern data environments tie together data sets of many types and sizes, refreshing them from multiple data sources. Because these collections of data are handled so differently from data warehouses, organizations like to call them data lakes.

Often, organizations also link to outside sources and retrieve data from them as needed. For instance, SQL Server's PolyBase lets you join multiple sources in an SQL query: internal and external, or relational and nonrelational.

Provenance and catalogs are key to making a data lake work. If you lose track of what you have in this diverse collection, you end up with what data engineers fearfully call a data swamp.

Database Options

The major categories of data stores are:

Relational
    The traditional databases developed in the 1980s and 1990s: Oracle, IBM DB2, MySQL, PostgreSQL, and so on.

Document
    These store data as labeled fields, like XML or JSON. Each record or row can be of arbitrary length. Sometimes these stores follow schemas, as in formats such as Avro, ORC, and Parquet. Other document stores, such as MongoDB and CouchDB, allow each record to have a unique structure, naming each field.

Key/value
    These store each element as a key (which may or may not have to be unique) and a value. These databases hash the key to make retrieval by key fast, so key/value stores are sometimes called hash tables.

The distinction between document stores and key/value stores is fuzzy. For instance, Cassandra started out as a key/value store, but has evolved into more of a document store. Any key/value store can store a document as the value.
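
The overlap between key/value stores and document stores can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: a plain dict stands in for the hash-based store, and the helper functions (put, get), the keys (user:1001, user:1002), and the field names are all invented for the example.

```python
import json

# Assumption: a plain dict stands in for a hash-based key/value store.
# Real stores add persistence, replication, and distribution on top.
store = {}

def put(key, document):
    """Serialize a document (a nested dict) and store it under the key."""
    store[key] = json.dumps(document)

def get(key):
    """Retrieve the value by key and decode it back into a document."""
    raw = store.get(key)
    return None if raw is None else json.loads(raw)

# Two records with different structures, each naming its own fields,
# as a schemaless document store allows.
put("user:1001", {"name": "Ada", "tags": ["admin", "ops"]})
put("user:1002", {"name": "Grace", "email": "grace@example.com"})

print(get("user:1001")["tags"])
print(sorted(get("user:1002").keys()))
```

The store itself sees only an opaque string for each key. Anything that treats the value as a document, such as querying on a field inside it, has to happen in the application, and that is roughly the line a document store moves into the database.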

