White Papers

The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713


One way to handle scheduling is through operating system schedulers, such as cron (configured through crontab files) on Linux and other Unix-like systems. More sophisticated schedulers are provided as programming libraries, such as Quartz in the Java language, and as part of analytics platforms, such as Qubole Scheduler. Airflow is particularly compatible with a tool called Celery, which sets up task queues. Celery in turn depends on other tools: a message broker such as RabbitMQ, and a system called ZooKeeper for fault recovery.

Fault Tolerance and Checkpoints

It would be tedious to manually check whether every job finishes, especially when the number of jobs runs into the thousands. Many modern systems check automatically for jobs that fail or take too long, and restart them as necessary. It might be impossible to tell a failed job from a slow one, because both virtual and physical machines sometimes fail silently.

As in database replication, a checkpoint or snapshot in streaming data preserves a coherent view of the data at a particular point in time, so that if a job fails you can pick up from a recent place instead of repeating the whole job. For instance, in Spark or Spark Streaming you issue the checkpoint call to set a checkpoint and can restart a job from the most recent checkpoint.

Checkpoints require more state information in streaming than in database replication, because many aggregate operations require some knowledge of previously processed data. Checkpoints are designed to hide the complexity of saved state and give programmers a simple interface. The administrator configures whether old data is saved or discarded, and whether the state information is stored on the local node or on a distributed filesystem (which imposes more overhead but preserves the state if the local node fails).

Conclusion

The evolving data engineering profession is less than two decades old. While researching this report, I discovered a scarcity of online information about the role.
Although many popular tools can be used productively in pursuit of data engineering, most current documentation focuses on analytics instead of the storage and data transformation tasks for which a data engineer is responsible. Thus,

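The checkpoint behavior described under Fault Tolerance and Checkpoints can be illustrated with a minimal Python sketch. It is not Spark's checkpoint API. The function names (save_checkpoint, run_sum_job, demo), the JSON file format, and the simulated failure are all assumptions made for illustration. The saved state holds both the stream position and the running total, because the aggregate depends on data that has already been processed.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Persist the job state (hypothetical format: a small JSON file)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # Replace the old checkpoint in one step so a crash mid-write
    # never leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)


def run_sum_job(records, checkpoint_path, checkpoint_every=3, fail_at=None):
    """Sum a sequence of numbers, checkpointing every few records.

    If a checkpoint exists, the job resumes from it instead of
    starting over. `fail_at` simulates a crash at a given index.
    """
    state = {"offset": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for index in range(state["offset"], len(records)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError(f"simulated failure at record {index}")
        state["total"] += records[index]
        state["offset"] = index + 1
        if state["offset"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)

    save_checkpoint(state, checkpoint_path)
    return state["total"]


def demo():
    """Run the job, let it fail, then restart it from the checkpoint."""
    records = list(range(1, 11))  # the numbers 1 through 10
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "sum.ckpt")
        try:
            run_sum_job(records, path, fail_at=7)
        except RuntimeError:
            pass  # a real scheduler would detect this and restart the job
        with open(path) as f:
            saved = json.load(f)
        total = run_sum_job(records, path)  # resumes from the checkpoint
    return saved, total


if __name__ == "__main__":
    print(demo())
```

In this sketch the first run fails at the eighth record, after a checkpoint was written at record six. The restart therefore processes only the last four records and still arrives at the full total of 55. Writing the checkpoint to a local file corresponds to the "local node" option in the text. Pointing the path at a distributed filesystem would trade extra overhead for surviving the loss of the node.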