The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713

Best Practice

Maintaining robust metadata on the objects you store is crucial, so that users can find objects later. Object stores require study because they differ a lot from one another, as well as from block storage. You may have to learn a special API to read from and write to each object store. Investigate the architecture of the object stores that interest you to understand how best to make use of traits such as scalability.

Metrics such as bytes added and deleted can help you make better use of the object store. For instance, metrics may let you know that it's not such a good idea to archive data after a year, because people are reading and writing it more often than you anticipated. "Metrics and Evaluation" on page 44 discusses the types of metrics you can capture and how they might be valuable.

Partitioning

Big data works by partitioning, or sharding, data. This lets distributed systems store huge quantities of data on different servers as well as process the data by dividing it up by natural sections.

Column selection

Choosing the right field on which to define partitions, along with choosing the right keys and indexes, is crucial for efficient data processing. Think of slicing a grapefruit: if you do it properly, you can easily extract the pulp, but if you slice the fruit at an odd angle, all the pulp is stuck in hard-to-access places.

A similar concept turns up in Kafka, a popular message broker, as topics. Publishers assign a topic, which is simply a keyword, to each record they submit to Kafka. Thus, a stock reporting tool might use the stock symbol of the company as a topic (MSFT for Microsoft Corporation, for instance). Consumers subscribe to topics in order to get just the records of interest to them.

Criteria for choosing keys and partitions

As one classic example of key/value pairs for big data, let's look at the paper introducing MapReduce by Jeffrey Dean and Sanjay Ghemawat of Google. Google needed to create a list of web pages from
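
As a rough illustration of the column selection point made earlier, the Python sketch below hashes one field of each record to choose a partition. Everything in it is an assumption made for illustration: the record layout, the field names (trade_id, symbol, exchange), the helper names, and the partition count are hypothetical, and real systems ship their own partitioners. It only shows why the choice of field matters.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number using a stable hash.

    Hypothetical helper. Python's built-in hash() of a string changes
    between processes, so a digest is used to keep the mapping stable.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_records(records, column, num_partitions):
    """Group records into partitions by hashing the value of one column."""
    partitions = {p: [] for p in range(num_partitions)}
    for record in records:
        p = partition_for(str(record[column]), num_partitions)
        partitions[p].append(record)
    return partitions


def partition_sizes(partitions):
    """Return the number of records in each partition, in partition order."""
    return [len(partitions[p]) for p in sorted(partitions)]


# Made-up sample data: 100 trades spread over four stock symbols,
# all of them on the same exchange.
records = [
    {"trade_id": i, "symbol": symbol, "exchange": "NASDAQ"}
    for i, symbol in enumerate(["MSFT", "AAPL", "GOOG", "AMZN"] * 25)
]

# Partitioning by symbol keeps all records for one company together.
by_symbol = partition_records(records, "symbol", 4)

# Partitioning by exchange sends every record to a single partition,
# because the column holds only one distinct value.
by_exchange = partition_records(records, "exchange", 4)

if __name__ == "__main__":
    print("sizes when partitioned by symbol:  ", partition_sizes(by_symbol))
    print("sizes when partitioned by exchange:", partition_sizes(by_exchange))
```

Partitioning on the exchange field is the badly sliced grapefruit: with only one distinct value in the column, one partition holds all the data and the others sit empty. The symbol field at least groups each company's records together, although with only four distinct values a hash can still assign two symbols to the same partition.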