Data lake architecture stores large volumes of data in its original form – structured, semi-structured, and unstructured. Ideal for machine learning use cases, data lakes provide SQL-based access to data as well as support for programmatic distributed data processing frameworks. A data lake can store data in the same format as its source systems or transform it before storing. Data lakes also support native streaming, where streams of data are processed and made available for analytics as they arrive.
According to Dr. Kirk Borne, Principal Data Scientist & Data Science Fellow, Booz Allen Hamilton:
“With the data lake, business value maximization from data is within every organization’s reach.”
The primary purpose of a data lake is to make organizational data from different sources accessible to various end-users such as business analysts, data engineers, data scientists, product managers, and executives, enabling these personas to leverage insights in a cost-effective manner for improved business performance. Today, many forms of advanced analytics are only possible on data lakes.
At the Data Lake Summit, Caleb Jones, Senior Staff Software Architect, The Walt Disney Company, said:
“Domain-driven data platform aligns data and product source or target experts. It aligns with architectural and product evolution. The data lake also decouples domains so they can evolve independently, creates teams that are more focused, specialized, and have expertise around the domains, and also gives product domains greater autonomy in their backlogs.”
Let’s take a closer look at the key characteristics of a data lake and the benefits they provide:
Data Ingestion and Storage
The data lake ingests data from sources such as applications, databases, data warehouses, and real-time streams. A data lake supports both pull-based and push-based ingestion of data: pull-based ingestion through batch data pipelines, and push-based ingestion through stream processing. For batch data pipelines, it supports row-level inserts and updates (UPSERT) to datasets in the lake. UPSERT capability with snapshot isolation (and, more generally, ACID semantics) greatly simplifies ingestion compared to rewriting entire data partitions or datasets.
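The row-level UPSERT behavior described above can be sketched in plain Python. This is illustrative only: a dict keyed on a primary key stands in for a lake table, and real engines (for example, Hive transactional tables) add snapshot isolation and ACID guarantees on top of the same merge semantics.

```python
# Minimal sketch of UPSERT (merge) semantics on a keyed dataset.
# A dict stands in for the lake table; the `id` key is an assumed
# primary key for illustration.

def upsert(table, incoming, key="id"):
    """Insert new rows and update existing ones, matched on `key`."""
    merged = {row[key]: row for row in table}
    for row in incoming:
        # Update the existing row if the key matches, else insert.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())

current = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
batch = [{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}]

result = upsert(current, batch)
# Row 2 is updated and row 3 inserted, without rewriting row 1 --
# which is the point: no full partition or dataset rewrite.
```

The contrast is with append-only storage, where applying the same batch would require rewriting every partition that contains an affected row.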
Data Processing and Continuous Data Engineering
A data lake stores the raw data from various data sources in a standardized open format. However, use cases such as data exploration, interactive analytics, and machine learning require that the raw data be processed to create use-case-driven trusted datasets. For data exploration and machine learning use cases, users continually refine datasets for their analysis needs. As a result, every data lake implementation must enable users to iterate between data engineering and use cases such as interactive analytics and machine learning. We call this “Continuous Data Engineering.”
Continuous data engineering involves the interactive ability to author, monitor, and debug data pipelines. In a data lake, these pipelines are authored using standard interfaces and open source frameworks such as SQL, Python, Apache Spark, and Apache Hive.
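One such pipeline step can be sketched with Python's built-in `sqlite3`, which here stands in for a lake engine's SQL interface (such as Spark SQL or Hive); the table and column names are invented for illustration. The step refines a raw dataset into a trusted, use-case-driven one:

```python
import sqlite3

# Illustrative only: sqlite3 stands in for a lake SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# One "continuous data engineering" step: refine raw events into a
# curated, analysis-ready table.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount) AS total
    FROM raw_events
    GROUP BY user_id
""")
rows = conn.execute(
    "SELECT user_id, total FROM user_totals ORDER BY user_id").fetchall()
# rows == [(1, 15.0), (2, 7.5)]
```

In practice an analyst would inspect `user_totals`, notice a gap, adjust the SQL, and re-run: that author-inspect-refine loop is the iteration the section describes.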
Data Access and Consumption
The most visible outcome of a data lake is the types of use cases it enables. Whether the use case is Data Exploration, Interactive Analytics, or Machine Learning, access to data is vital. Access to data can be through SQL or programmatic languages such as Python, R, Scala, etc. While SQL is the norm for interactive analysis, programmatic languages are used for more advanced applications like machine learning and deep learning.
A data lake supports data access through a standards-based implementation of SQL with no proprietary extensions, and enables external tools to access that data through standards such as ODBC and JDBC. A data lake also supports programmatic access to data via standard programming languages such as R, Python, and Scala, and standard libraries for numerical computation and machine learning such as Apache Spark MLlib, TensorFlow, MXNet, Keras, and scikit-learn.
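The two access paths can be sketched side by side. Here `sqlite3`'s DB-API connection stands in for an ODBC/JDBC-style connection to the lake, and the table is invented for illustration; the point is that SQL and programmatic access read the same data:

```python
import sqlite3

# Illustrative stand-in for a standards-based lake connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, n INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("home", 3), ("docs", 5)])

# SQL path: an interactive, standards-based query.
sql_total = conn.execute("SELECT SUM(n) FROM clicks").fetchone()[0]

# Programmatic path: pull rows into Python and compute there, as an
# ML library would before feature engineering.
rows = conn.execute("SELECT page, n FROM clicks").fetchall()
py_total = sum(n for _, n in rows)

# Both paths agree on the same underlying data.
assert sql_total == py_total
```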
Preserving the Choice of the Right Compute Engine For the Job
The rapid growth in demand for insights has resulted in an exponential increase in the data collected and stored by virtually every business. The strategic imperative to harness that data to improve customer experience requires enterprises to adopt a data architecture that serves today's use cases while preserving the choice of data processing engine, cloud infrastructure, and vendor portability to serve the use cases of tomorrow.
When data ingestion and data access are implemented well, data can be made widely available to users in a democratized fashion. Once multiple teams start accessing the data, data architects need to exercise oversight for governance, security, and compliance purposes. Data is often hard to find and comprehend, and not always trustworthy, so users need to discover and profile datasets for integrity before they can trust them for their use case. Since the first step is to find the required datasets, it is essential to surface metadata to end-users so they can explore it, see where the data resides and what it contains, and determine whether it is useful for answering a particular question. A data lake therefore provides an open metadata repository; the Apache Hive metastore, for example, is an open metadata repository that prevents vendor lock-in for metadata.
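Metadata-driven discovery can be sketched as follows. SQLite's built-in catalog stands in for an open metadata repository such as the Hive metastore, and the `orders` table is invented for illustration; the shape of the exercise is the same: list what tables exist, then inspect their columns before trusting them.

```python
import sqlite3

# Illustrative stand-in for a metadata repository lookup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

# Discovery step 1: what datasets exist?
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]

# Discovery step 2: what does this dataset contain?
columns = [r[1] for r in conn.execute("PRAGMA table_info(orders)")]
# tables == ['orders']; columns == ['order_id', 'amount']
```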
Furthermore, increasing accessibility to the data requires data lakes to support strong access control and security features on the data. A data lake does this through non-proprietary security and access control APIs. For example, deep integration with open source frameworks such as Apache Ranger and Apache Sentry can facilitate table, row, and column level granular security.
The ability to delete specific subsets of data without disrupting a data management process is also essential. A data lake supports this ability on open formats and open metadata repositories. In this way, they enable a vendor-agnostic solution to compliance needs.
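A compliance-style deletion can be sketched the same way. Again `sqlite3` is only a stand-in (a transactional lake table, such as a Hive ACID table, supports the same row-level DELETE), and the `profiles` table and its values are invented for illustration:

```python
import sqlite3

# Sketch of a 'Right to Erasure' request: delete one user's rows
# without rewriting or disrupting the rest of the dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (user_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO profiles VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com")])

# Row-level delete targeting only the affected subject.
conn.execute("DELETE FROM profiles WHERE user_id = ?", (1,))
remaining = conn.execute("SELECT user_id FROM profiles").fetchall()
# remaining == [(2,)]
```

Without row-level delete support, satisfying the same request would mean rewriting every file or partition that contains the subject's rows.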
At Qubole, we have put the following considerations at the forefront of our data platform’s design:
- The QDS platform supports full transactionality on a data lake, regardless of the cloud—AWS, GCP, or Azure.
- It provides built-in support for delete operations, enabling customers to comply with regulatory and privacy requirements for ‘Right to Erasure’ within established SLAs.
- It writes directly to cloud object stores, eliminating extra overhead while guaranteeing data integrity at the best possible performance.
- Most importantly, we continue to provide freedom of choice of data processing engine (Apache Spark, Presto, Hive, etc.), with a full implementation of ACID capabilities based on Hive transactional tables.