The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713


If data arrives in a key/value structure such as JSON, storage is easy. But unstructured data, such as postings on social media or multimedia files, requires some work to add identifying metadata. For instance, you can add a date-and-time stamp to serve as a key.

The MapReduce programming model used by Hadoop expects each record to start with a key. The rest of the record is the value. It's the job of the programmer to create each key. Thus, if a MapReduce program is seeking to collect data on different countries, it might extract the "country" field from its input and use that as the key. The key will also start the record when it is stored.

Best Practice

If you know that a field will be specified often in queries, consider adding an index to it. Most modern data formats also support indexes, which serve queries in ways similar to relational databases. In addition to support by databases, indexes appear in several formats that are popular for storing big data. It's optimal to add these indexes after data has been loaded so that data is smaller during initial ingestion.

Example: Duplication and Normalization

"Architectural evolution" on page 16 explained how the current generation of data maintainers has accepted the duplication of data. Let's look now at a small example of structured data to see why you need the flexibility that modern data formats offer instead of sticking to the relational model. Imagine a simplified data structure for a purchase, as shown in Figure 1-1.
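
To make the earlier point about MapReduce keys concrete, here is a minimal sketch in plain Python rather than Hadoop itself. It assumes JSON input lines with a "country" field; the function names map_record and reduce_counts, the sample records, and the "UNKNOWN" fallback key are all hypothetical, chosen only for illustration.

```python
import json
from collections import defaultdict


def map_record(line):
    """Map step: parse one JSON record and emit (key, value).

    The "country" field is pulled out to serve as the key; the
    remaining fields form the value. "UNKNOWN" is an assumed
    fallback for records that lack the field.
    """
    record = json.loads(line)
    key = record.pop("country", "UNKNOWN")
    return key, record


def reduce_counts(pairs):
    """Reduce step: group values by key and count records per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: len(values) for key, values in groups.items()}


# Hypothetical sample input, one JSON record per line.
lines = [
    '{"country": "FR", "amount": 12.5}',
    '{"country": "US", "amount": 7.0}',
    '{"country": "FR", "amount": 3.25}',
]

pairs = [map_record(line) for line in lines]
counts = reduce_counts(pairs)
print(counts)
```

In a real Hadoop job the framework would handle the grouping between the map and reduce steps; here that grouping is folded into reduce_counts to keep the sketch self-contained.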
