White Papers

The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713


worked for a caterer. This description is not based on reality, but is just a metaphor to explain the concepts in this section.)

In computing, the basic elements of orchestration are resource management, scheduling, and fault tolerance (also known as high availability). Let's use the catering metaphor to look at each.

Resource Management

Resource management, as discussed under "Development Best Practices" on page 42, generally involves choosing a server with sufficient CPU, memory, and network bandwidth to run each task. Trial runs can help you determine what resources you need for a particular tool given a particular volume of data.

Virtualization lets you specify resources such as CPU and memory in very precise amounts, whether through virtual machines, containers, or the aforementioned online services known as serverless computing. Running on-premises, you might spin up a container with the specified resources. In the cloud, you might start a new instance of some virtual machine.

In the catering metaphor, imagine that some events offer only appetizers, while others have sit-down meals. If the event offers only appetizers, it needs five waiters to circulate among the crowd. For a sit-down meal, it needs eight waiters. So the caterer hires a personnel manager who makes sure that the proper number of waiters is available when needed. If the facility runs out of waiters, some events might have to wait until the waiters are no longer needed by other events. But nobody gets more waiters than they need, so the waiters are always busy.

In computing, the CPU and memory are the waiters. Hadoop offers YARN to do resource scheduling on top of a Hadoop cluster.

Scheduling

Let's turn now to the scheduling part of orchestration. For data engineering, the equivalent of chefs could be developers checking new data-reading tools into a version control system, and the waiters could be test suites. The manager in charge of scheduling would be an orchestration tool that creates a workflow connecting the version control system to the test suites. Each check-in to the version control system automatically launches regression tests, which tell you whether you introduced a change that will break the application.
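
The check-in workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the API of any particular orchestration tool: the Workflow class, its register_suite and on_check_in methods, and the two toy test suites are hypothetical names invented for this example, and each test suite is modeled as a plain function that returns True when it passes.

```python
from typing import Callable, Dict, List

# A test suite is modeled as a callable that receives the checked-in
# change and returns True when the suite passes. This is an assumption
# made for illustration; real suites would usually run as separate jobs.
TestSuite = Callable[[str], bool]


class Workflow:
    """Hypothetical workflow linking check-ins to regression test suites."""

    def __init__(self) -> None:
        self._suites: Dict[str, TestSuite] = {}

    def register_suite(self, name: str, suite: TestSuite) -> None:
        """Attach a test suite that should run on every check-in."""
        self._suites[name] = suite

    def on_check_in(self, change: str) -> List[str]:
        """Run all registered suites against a change.

        Returns the names of the suites that failed; an empty list
        means the change did not break any regression test.
        """
        failed = []
        for name, suite in self._suites.items():
            try:
                passed = suite(change)
            except Exception:
                # A suite that crashes is treated as a failure.
                passed = False
            if not passed:
                failed.append(name)
        return failed


# Toy suites (hypothetical): one checks for a header field, one rejects tabs.
workflow = Workflow()
workflow.register_suite("parses_header", lambda change: "header" in change)
workflow.register_suite("no_tabs", lambda change: "\t" not in change)

print(workflow.on_check_in("header,value"))   # clean change: no failed suites
print(workflow.on_check_in("value\tonly"))    # breaking change: both suites fail
```

In this sketch the developer never launches the tests by hand. The check-in itself is the trigger, which is the role the text assigns to the manager in charge of scheduling.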
