When done right, data lake architecture on the cloud provides a future-proof data management paradigm, breaks down data silos and facilitates multiple analytics workloads at any scale and at very low cost. Key considerations to get data lake architecture right include:
Data Lake Architecture – Data ingestion and storage
An Open Data Lake ingests data from sources such as applications, databases, real-time streams, and data warehouses. It stores the data in its raw form or an open data format that is platform-independent.
The ingest capability supports real-time stream processing and batch data ingestion; ensures zero data loss and writes exactly-once or at-least-once; handles schema variability; writes in the most optimized data format into the right partitions and provides the ability to re-ingest data when needed.
The data is stored in a central repository that is capable of scaling cost effectively without fixed capacity limits; is highly durable; is available in its raw form and provides independence from fixed schema; and is then transformed into open data formats such as ORC and Parquet that are reusable, provide high compression ratios and are optimized for data consumption.
Big Data Lake Architecture
Data Lake Architecture – Data processing and continuous data engineering
An Open Data Lake supports concurrent, high throughput writes and reads using open standards. Data is transformed to create use-case driven trusted datasets.
Ingested data is frequently reconciled to reflect continuous business operations. Concurrent data writes and reads need to be supported with the desired levels of transaction guarantees that are dictated by the use case and the data.
Data is cleaned, classified, denormalized and prepared for a variety of use cases using continuously running data engineering pipelines. Data lakes that run into petabyte scale footprints need massively scalable data pipelines that also provide sophisticated orchestration capabilities. Under the hood, data processing engines such as Apache Spark, Apache Hive, Presto provide desired price-performance, scalability, reliability for a range of workloads.
Data Lake Architecture – Data access and consumption
An Open Data Lake enables different use cases such as ad hoc analytics, data discovery, business intelligence reports, machine learning. Different types of data access and tooling are supported. Data access can be through SQL or programmatic languages such as Python, Scala, R etc.
It can be hard to find data in the lake. Without a data catalog, users can end up spending the majority of their time just trying to discover and profile datasets for integrity before they can trust them for their use case. A data catalog crawls and classifies datasets, documents them and supports a search interface to aid discovery.
For business intelligence reports, SQL is the lingua franca and runs on aggregated datasets in the data warehouse and also the data lake. Ad hoc analytics uses both SQL and non-SQL and typically runs on raw and aggregated datasets in the lake as the warehouse may not contain all the data or due to limited non-SQL access. Third party SQL clients, BI tools are supported using a high-performance connectivity suite of ODBC, JDBC drivers and connectors. Machine learning users need a variety of tooling and programmatic access through single node local Python kernels for development; Scala and R with standard libraries for numerical computation and model training such as TensorFlow, Scikit-Learn, MXNet; ability to serialize and deploy, monitor containerized models.
Data Lake Architecture – Governance – Cost control, Security, Compliance
Cost of big data projects can spiral out of control. This is exacerbated by the lack of native cost controls and lifecycle policies in the cloud. Fine grained cost attribution and reports at the user, cluster, job, account level are necessary to cost efficiently scale users and usage on the data lake.
Data in the lake should be encrypted at rest and in transit. Cloud providers provide services to do this using keys either managed by the cloud-provider or keys fully created and managed by the customer.
An Open Data Lake integrates with non-proprietary security tools such as Apache Ranger to enforce fine grained data access control to enforce principle of least privilege while democratizing data access.
Perimeter security for the data lake includes network security and access control. Cloud providers support methods to map the corporate identity infrastructure onto the permissions infrastructure of the cloud provider’s resources and services. LDAP and/or Active Directory are typically supported for authentication
Expanded data privacy regulations, such as GDPR and CCPA, have created new requirements around “Right to Erasure” and “Right to Be Forgotten”. An Open Data Lake not only supports the ability to delete specific subsets of data without disrupting data consumption, but offers easy to use non-proprietary ways to do so.
Data Lake Architecture – Infrastructure and operations
An Open Data Lake provides a platform runtime with automation on top of cloud primitives such as programmatic access to instance types, low cost compute (Spot on AWS, Low priority VMs on Azure, Preemptible VMs on GCP). This frees up organizations to focus on building data applications.
Whether the data lake is deployed on the cloud or on-premise, each cloud provider has specific implementation to provision, configure, monitor, and manage the data lake as well as the resources it needs.
An Open Data Lake is cloud-agnostic and is portable across any cloud-natives environment including public and private clouds. This enables administrators to leverage benefits of both public and private cloud from an economics, security, governance and agility perspective.