Key considerations for building a scalable transactional data lake
Data-driven companies are driving rapid business transformation with cloud data lakes. Cloud data lakes enable new business models and near real-time analytics that support better decision-making. However, as the number of workloads migrating to cloud data lakes increases, companies are compelled to address data management issues.
The combination of data privacy regulations and the need for data freshness alongside data integrity is creating a need for cloud data lakes to support ACID transactions when updating, deleting, or merging data. Several architectural considerations help cloud data lakes address this requirement:
Transactionality on Data Lakes
Data lakes are no longer used as cold data stores, but rather as sources for ad-hoc analytics of near real-time data combined with hot data in data warehouses. Data lakes have evolved considerably to enable enterprises to gain real-time insights from business intelligence dashboards or to build artificial intelligence capabilities. To build a reliable analytics platform that supports these expanded use cases, data engineers need mechanisms to support:
- Slowly changing dimensions (Type-I and Type-II): This is a common requirement for any analytical data system and requires the ability to INSERT, UPDATE, and UPSERT data
- Data restatement: Organizations integrate data from a wide variety of sources, ranging from transactional databases, CRM, ERP, and IoT systems to other SaaS applications and social media. This can introduce incorrect or poor-quality data that must be rectified in a subsequent step. Business rules that rely on this data require it to be clean, complete, accurate, and up to date, which further increases the importance of data restatement.
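The INSERT/UPDATE/UPSERT semantics above can be sketched locally. The snippet below uses Python's built-in sqlite3 as a stand-in for a transactional table; on a real data lake this would be a Hive ACID or Spark SQL MERGE statement, and the table and column names here are purely illustrative:

```python
import sqlite3

# Local stand-in for a transactional table; table/column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        city        TEXT,
        updated_at  TEXT
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Austin', '2021-01-01')")

# Type-I slowly changing dimension: overwrite the attribute in place.
# SQLite's UPSERT expresses the INSERT-or-UPDATE semantics described above.
conn.execute("""
    INSERT INTO dim_customer (customer_id, city, updated_at)
    VALUES (1, 'Denver', '2021-06-01')
    ON CONFLICT (customer_id) DO UPDATE SET
        city = excluded.city,
        updated_at = excluded.updated_at
""")
conn.commit()

city = conn.execute(
    "SELECT city FROM dim_customer WHERE customer_id = 1"
).fetchone()[0]
print(city)  # prints "Denver"
```

A Type-II dimension would instead insert a new row and close out the old one, but it relies on the same transactional INSERT and UPDATE primitives.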
Security & Privacy regulations & compliance
A new requirement for the “Right to Erasure,” or “Right to Be Forgotten” (RTBF), has stemmed from a series of new and expanding global data privacy regulations. These regulations govern consumers’ rights to their data and impose stiff financial penalties for non-compliance. Given that the penalties are significant (as much as 4% of global turnover), they cannot be overlooked. Businesses face the challenge of meeting these data privacy and protection requirements while ensuring business continuity. RTBF requires the targeted deletion of specific data (a record, row, or column) that may reside anywhere in the data lake, within a limited amount of time. With extensive data proliferation in the data lake, it is challenging to delete specific subsets of data without disrupting existing data management processes. While some solutions are emerging from various vendors, not all meet these requirements adequately, so organizations are still building custom solutions to comply with the new regulations. As with most in-house solutions, these present problems around updates, maintenance, and auditability, among others.
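As a minimal sketch of what targeted deletion looks like, the following again uses sqlite3 in place of a transactional data lake table, where the same statement would be a transactional `DELETE FROM ... WHERE` against a Hive ACID table; the table name and user identifier are hypothetical:

```python
import sqlite3

# Illustrative stand-in for an event table in the data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (1, "c"), (3, "d")])

# Erase only the records belonging to the user who invoked RTBF,
# without rewriting or disrupting the rest of the dataset.
conn.execute("DELETE FROM events WHERE user_id = ?", (1,))
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(remaining)  # prints 2 — only the targeted user's rows were removed
```

On a real data lake, built-in transactional delete support means this runs as one atomic operation within the compliance SLA, rather than as a bespoke partition-rewrite job.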
Performance
Fast, interactive analytics on “gold standard” datasets allow users to trust results and lower time-to-insight. Fast reads require prepared data and the right analytical engine. Data engineers are constantly asking “what is the best data format for my data types?” and “what are the right file and partition sizes for faster performance?”
Typical distributed systems experience additional overhead, beyond network latency, when completing writes. The overhead stems from writing to staging locations before writing to cloud storage, or from updating entire partitions rather than individual records. The impact on overall performance is significant and quickly becomes a major concern as organizations operate data lakes at scale.
Data consistency & integrity
Concurrency control is important for a data lake because it must support multiple users and applications, and conflicts are bound to happen: one user may want to write to a file or partition while another is reading from it, or two users may want to write to the same file or partition. A modern data lake architecture needs to ensure data consistency, integrity, and availability in such scenarios. It must also guarantee that concurrent operations do not violate the completeness, accuracy, or referential integrity of the data, which would lead to erroneous results.
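The reader/writer scenario above can be illustrated with snapshot-style isolation. In this sketch, sqlite3's WAL mode stands in for the transaction isolation an ACID data lake provides; the file path and table are illustrative:

```python
import os
import sqlite3
import tempfile

# A file-backed database so two connections can act as concurrent users.
path = os.path.join(tempfile.mkdtemp(), "lake.db")
writer = sqlite3.connect(path)
writer.execute("PRAGMA journal_mode=WAL")  # allows readers during a write
writer.execute("CREATE TABLE metrics (k TEXT, v INTEGER)")
writer.execute("INSERT INTO metrics VALUES ('rows_loaded', 100)")
writer.commit()

reader = sqlite3.connect(path)

# The writer starts an update but has not committed yet.
writer.execute("UPDATE metrics SET v = 200 WHERE k = 'rows_loaded'")

# A concurrent reader still sees the last committed, consistent value;
# partial writes are never visible, which is the guarantee ACID
# transactions bring to simultaneous readers and writers.
before = reader.execute(
    "SELECT v FROM metrics WHERE k = 'rows_loaded'").fetchone()[0]

writer.commit()
after = reader.execute(
    "SELECT v FROM metrics WHERE k = 'rows_loaded'").fetchone()[0]
print(before, after)  # prints "100 200"
```

The design point is the same at data lake scale: readers always see a complete committed snapshot, and writers never corrupt in-flight reads of the same file or partition.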
Preserving the choice of the right compute engine and cloud for the job
The rapid growth in demand for insights and information has resulted in an exponential increase in the data collected and stored by virtually every business. The strategic imperative to harness this data to improve customer experience requires businesses to adopt a data architecture that serves multiple use cases of today while preserving the choice of data processing engine, cloud infrastructure, and vendor portability to serve the use cases of tomorrow.
At Qubole, we have put these considerations at the forefront of our data platform’s design:
- It supports full transactionality on a data lake, regardless of the cloud—AWS, Azure, or GCP.
- It provides built-in support for delete operations, which enables customers to comply with regulatory and privacy requirements for “Right to Erasure” within established SLAs.
- It writes directly to cloud object stores, eliminating extra overhead while guaranteeing data integrity at the best possible performance.
- Most importantly, we continue to provide freedom of choice of data processing engine—Apache Spark, Presto, Hive, etc.—with a full implementation of ACID capabilities based on Hive transactional tables.
And finally, we are open-sourcing Presto and Spark connectors that work directly with Hive ACID tables for high throughput reads on data lakes. You can find our contributions here: