The Evolving Role of the Data Engineer

transfer it, thus mimicking the functionality of Spark and Flink. Flume can also alter data through interceptors, which can add or remove headers, or use mechanisms such as regular expressions to perform the subtler alterations associated with streaming processors. Tools from the older database era also evolve to meet modern needs.

Data Ingestion and Transfer: Message Brokers

Message brokers are the most efficient way to store incoming data, so long as you need little or no transformation. There are several reasons to store the data in its original form and perform any transformations you want later. First, the volume and velocity of incoming data could be so high that you risk dropping data if you take the time to process it. Second, a poorly programmed tool could corrupt your data, and you'll want to be able to return to the raw form to process it correctly. Finally, you might not anticipate some need among your users, and you might need to go back to the raw data and process it differently.

So most likely, you will set up a message broker such as Kafka to ingest data into a raw zone in your data center or the cloud. You can then use a streaming processor, as described in the following section, to run transformations and enrichment requested by users on this data in real time.

Message brokers all operate similarly, although they offer different interfaces and have different architectures. Each broker accepts streams of data from multiple sources, or producers, and sends them to multiple recipients, or consumers. The sources are often called publishers and the recipients subscribers, so the architecture as a whole is called a pub/sub protocol. The data is put on a queue until the subscribers are ready to receive it.

Message brokers provide certain guarantees. Messages are always delivered in the order in which the queue receives them. Note that if multiple publishers send data to a broker, the messages may not arrive in the order in which they were generated (in technical terms, delivery is nondeterministic). Usually, therefore, each publisher opens its own set of queues. Each item of data is placed on one queue, and messages on each queue are delivered to subscribers in the order in which they entered that queue (with various exceptions). The delivery of messages from different queues has a nondeterministic order.
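To make the pub/sub flow concrete, here is a minimal sketch in Python using the confluent-kafka client against a Kafka broker. Kafka's partitions play the role of the queues described above: messages that share a key land on the same partition and are delivered to subscribers in order. The broker address, the topic name (raw-events), the key, and the consumer group are illustrative assumptions, not details from this paper.

from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"   # assumed local Kafka broker
TOPIC = "raw-events"        # hypothetical raw-zone topic

# Producer (publisher): write each record untransformed, keyed by source,
# so all messages from one source land on one partition and keep their order.
producer = Producer({"bootstrap.servers": BROKER})
for i in range(5):
    producer.produce(TOPIC, key="sensor-1", value=f"reading {i}".encode())
producer.flush()  # block until the broker acknowledges all messages

# Consumer (subscriber): read the raw messages back in per-partition order.
# Any transformation or enrichment happens here, not at ingestion time.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "enrichment-job",     # hypothetical consumer group
    "auto.offset.reset": "earliest",  # start from the beginning of the topic
})
consumer.subscribe([TOPIC])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue                  # no message within the timeout
        if msg.error():
            raise RuntimeError(msg.error())
        print(msg.key(), msg.value()) # arrives in per-partition order
finally:
    consumer.close()

Note the design choice in the sketch: keying every message by its source is one way to realize the "each publisher opens its own set of queues" pattern, since ordering is guaranteed only within a single partition, not across partitions.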
