
The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713


low they are. And yet another may want all messages, which it stores in a database. The proper messages will be delivered to each application. This is called multiplexing.

Sources are called spouts in Storm, and data passes from the spouts through one or more bolts to be processed. Like Flume, Storm allows multiple stages in a pipeline.

Streaming Data Processing

The processing of streaming data is an important task in today's analytics. Storm came along to enable the era of streaming data processing. Other tools with similar purposes include Flink and Spark, which can handle both streaming and batch processing.

These tools can also be useful in data engineering because you may be responsible for cleaning and prepping data. Streaming processors can be useful for the task of transformation and enrichment (adding provenance data, creating aggregate data, etc.) mentioned in "Data Engineering Today" on page 4. Analytics can help you reduce your data; for instance, your analysis of malfunctions in automobiles might turn up 6,000 possible causes (also called features or dimensions), and analytics might reveal that you need to save only 12 of them to accurately predict a malfunction.

Streaming tools grab incoming data, whether batch or real time, and transform or run analytics on it. Their programming interfaces exploit method chaining, a common syntax in modern programming languages, to set up pipelines. Spark can ingest data in batches (Datasets or DataFrames) or as a continuous stream through a more recent feature called Spark Streaming. Flink and Storm were designed to work with continuous streams. These tools are widely used by both data scientists and data engineers.

Example Workflow for Streaming Tools

Suppose you have a data set regarding restaurant visits that is updated with new data several times a day, and that you have agreed to perform a sequence of tasks on each data set to give your analysts better data:

1. Add two fields indicating the date and time at which the data set was received (provenance metadata).
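
To make the idea of method chaining concrete, here is a minimal sketch of step 1 in plain Python. It does not use Spark, Flink, or Storm: the Pipeline class, the add_provenance helper, the record fields, and the field names received_date and received_time are all hypothetical, chosen only for illustration. The filter stage is likewise an assumed cleaning step, included just to show more than one stage in the chain.

```python
from datetime import datetime, timezone


class Pipeline:
    """Hypothetical, minimal stand-in for a chainable data processing API."""

    def __init__(self, records):
        self._records = records  # any iterable of dict records

    def map(self, fn):
        # Return a new Pipeline so that calls can be chained.
        return Pipeline(fn(r) for r in self._records)

    def filter(self, predicate):
        return Pipeline(r for r in self._records if predicate(r))

    def collect(self):
        # Run the lazily chained stages and return the results as a list.
        return list(self._records)


def add_provenance(received_at):
    """Build a stage that stamps each record with the receipt date and time."""
    def stage(record):
        # Copy the record rather than mutating the caller's data.
        return {
            **record,
            "received_date": received_at.date().isoformat(),
            "received_time": received_at.time().isoformat(timespec="seconds"),
        }
    return stage


# Hypothetical restaurant-visit records; the field names are assumptions.
visits = [
    {"restaurant": "Cafe A", "party_size": 2},
    {"restaurant": "Cafe B", "party_size": 0},
    {"restaurant": "Cafe C", "party_size": 5},
]

# In practice this would be the moment the data set arrived.
received_at = datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc)

result = (
    Pipeline(visits)
    .filter(lambda r: r["party_size"] > 0)  # assumed cleaning step
    .map(add_provenance(received_at))       # step 1: provenance metadata
    .collect()
)
```

Each call returns a new Pipeline, which is what lets the stages read top to bottom as a single expression. In Spark, a comparable step would probably be written as chained withColumn calls on a DataFrame, although the exact form depends on how the data is loaded.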
