Managing Transactions in Data Lakes – Joydeep Sen Sarma, CTO & Co-founder, Qubole

An enterprise data analytics and reporting platform typically runs data pipelines built from long, complex jobs that span many services, programs, tools, and scripts interacting with one another. These jobs need to run on an ad-hoc basis, depend on existing datasets, and have other jobs that depend on them in turn. This quickly becomes a tangled mesh of compute- and memory-intensive processes, leading to a maintenance nightmare, instability, and poor performance, and it calls for a scalable, optimized workflow management solution. While there is a plethora of open source solutions that address these problems, they may not fit everyone's needs. This talk provides an under-the-hood view of the architectural patterns behind such solutions, along with considerations for companies that choose to build a more customizable, simple, and elegant solution without reinventing the wheel.