White Papers

The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713


And what if another application retrieves item fields from the product fields? A purchase can probably contain several products, and you might want to know with which product each item is associated. So you might keep the item nested within the product or create a new field that combines the product and item.

Best Practice

When you duplicate fields, to maintain a single source of truth, use metadata to mark the original copy that becomes the source for other copies. Enforce a data flow that updates the original copy and then lets the update flow out to other copies.

In the old way of working, a DBA would give users access to selected fields and rows through views, avoiding the need to copy data to a new database. Views are also supported by some of the newer database engines, such as MongoDB, and can be useful in data engineering.

Structured Storage Formats

Some data stores have their own internal formats. But many big data projects need input or output in a standard, interchangeable format. The Hadoop Distributed File System (HDFS) also allows data to have any structure of your choice. So data engineers can spend considerable time researching different storage formats and choosing the right ones for their applications.

One major choice is whether to store data row by row or column by column. Traditional databases store data row by row, with all the columns in a row stored together (except for a few particular cases, such as large unstructured fields known as binary large objects [BLOBs]). This storage makes it easier to write data because most operations add or update a set of rows through a WHERE clause (for instance, WHERE COUNTRY = 'US'). These databases are used mostly for transactional applications, which retrieve several columns from a particular row for a customer, a product, or something similar. Row storage here is natural and convenient because the data wanted by each query would probably be stored together on disk.
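The difference between the two layouts can be sketched in a few lines of Python. This is only an in-memory illustration, not how any particular database or file format lays out bytes on disk; the sample records and the helper names (`to_columns`, `get_row`, `rows_where_country`, `total_amount`) are hypothetical and chosen for this example.

```python
# Hypothetical sample data: three purchase records with three columns each.
rows = [
    {"customer": "alice", "country": "US", "amount": 30.0},
    {"customer": "bob", "country": "DE", "amount": 12.5},
    {"customer": "carol", "country": "US", "amount": 7.5},
]


def to_columns(row_list):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in row_list:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def get_row(row_list, index):
    """Transactional-style lookup: all fields of one record come back together."""
    return row_list[index]


def rows_where_country(row_list, country):
    """Row-style filter, similar in spirit to WHERE COUNTRY = 'US'."""
    return [row for row in row_list if row["country"] == country]


def total_amount(columns):
    """Analytical-style scan: only the 'amount' column is touched."""
    return sum(columns["amount"])


columns = to_columns(rows)
```

In the row layout, `get_row(rows, 1)` returns every field for one customer in a single access, which matches the transactional pattern described above. In the column layout, `total_amount(columns)` reads one list and never touches the customer or country values, which is the access pattern that analytical queries tend to favor.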
Many modern applications perform much better with columnar storage because column sizes can vary more widely. Therefore, more
