An enterprise data analytics and reporting platform typically runs long, complex data pipelines that span many services, programs, tools, and scripts interacting together. These jobs need to run on an ad-hoc basis, depend on other existing datasets, and have other jobs that depend on them in turn. This quickly becomes a tangled mesh of compute- and memory-intensive processes, leading to a maintenance nightmare, instability, and poor performance. It creates the need for a scalable, optimized workflow management solution. While a plethora of open source solutions exist to solve these problems, they may not fit everyone's needs. This talk provides an under-the-hood view into the architectural patterns of such solutions, and considerations for companies that choose to build a more customizable, simple, and elegant solution without reinventing the wheel.
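At the core of the dependency problem described above is a directed acyclic graph (DAG) of jobs: each job can run only after its upstream datasets exist. A minimal sketch of that idea, using Python's standard-library `graphlib` (the job names and dependency map are hypothetical, purely for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each job maps to the jobs it depends on.
deps = {
    "clean":  {"ingest"},           # clean waits for ingest
    "join":   {"clean", "lookup"},  # join needs two upstream datasets
    "report": {"join"},
}

# static_order() yields a valid execution order in which every job
# appears after all of its dependencies; it raises CycleError if the
# jobs depend on each other circularly.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Real workflow managers layer scheduling, retries, and resource limits on top of this ordering step, but the topological sort is the piece that untangles the mesh of inter-job dependencies.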