Cloud Data Lakes Best Practices

February 21, 2020 (updated March 27, 2024)

This is an abridged version of the article that appears on NewStack.

BI tools have been the go-to for data analysts who help businesses track top-line, bottom-line, and customer experience metrics. BI tools analyze relatively small sets of relational data (a few terabytes) in a data warehouse, and their queries require only small data scans (a few gigabytes) to execute.

But businesses are now looking beyond BI to interactive, streaming, and clickstream analytics, machine learning, and deep learning in order to gain a data-led advantage. For these types of analytics applications, data lakes are the preferred option. Data lakes can ingest data at any volume, variety, and velocity and stage and catalog it centrally. The data is then made available to a variety of analytics applications, at any scale, in a cost-efficient manner.

Let’s look at best practices in setting up and managing data lakes across three dimensions –

  1. Data ingestion
  2. Data layout
  3. Data governance

Data Ingestion

Ingestion can be in batch or streaming form. The data lake must ensure zero data loss and write data exactly once or at least once. It must also handle schema variability, write data into the right partitions in the most optimized format, and provide the ability to re-ingest data when needed.

Batch Data Ingestion

For batch ingestion of transactional data, the data lake must support UPSERT – row-level inserts and updates – to datasets in the lake. UPSERT capability with snapshot isolation and ACID semantics simplifies the task, as opposed to rewriting data partitions or entire datasets. ACID semantics ensure that concurrent reads and writes on the data lake do not compromise data integrity or degrade read performance.
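
As a concrete illustration, here is a minimal PySpark sketch of a row-level upsert, assuming Delta Lake as the transactional table format (the article does not prescribe a specific format; Apache Hudi and Apache Iceberg offer similar MERGE semantics). The paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Spark session configured with the Delta Lake extensions.
spark = (
    SparkSession.builder
    .appName("orders-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical incremental batch of transactional records.
updates = spark.read.parquet("s3://example-bucket/staging/orders/")

target = DeltaTable.forPath(spark, "s3://example-bucket/lake/orders/")

# Row-level UPSERT: update matching rows, insert new ones.
# The MERGE runs with snapshot isolation and ACID guarantees,
# so concurrent readers keep seeing a consistent view of the table.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```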

Streaming Data Ingestion

For streaming data, the data lake must guarantee that data is written exactly once or at least once. A recommended approach is Spark Structured Streaming processing streams that arrive at variable velocity from message queues such as Kafka and Amazon Kinesis. A stream-processing data lake solution should integrate with the message queue's schema registry and must support replay, so that pipelines can keep up with evolving business requirements and re-process or reinstate outdated events.
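
Below is a minimal sketch of streaming ingestion with Spark Structured Streaming reading JSON-encoded events from Kafka; the topic name, schema, and paths are hypothetical. Checkpointing is what lets the pipeline recover and replay after failures; with a file sink it also provides end-to-end exactly-once delivery.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

# Hypothetical schema for JSON-encoded clickstream events.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # hypothetical brokers
    .option("subscribe", "clickstream")                   # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", to_date(col("event_time")))
)

# The checkpoint records which offsets have been committed, so the
# stream can be restarted or replayed without losing or duplicating data.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/lake/clickstream/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/clickstream/")
    .partitionBy("event_date")
    .start()
)
query.awaitTermination()
```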

Apart from batch and stream ingestion modes, data lakes must also provide for

  • Source-to-target schema conversion – intelligently detect the source schema, create logical tables on the fly, and flatten semi-structured JSON, XML, or CSV into columnar file formats (a sketch of JSON-to-columnar flattening follows this list).
  • Monitoring data movement – connect pipelines and the underlying infrastructure to rich monitoring and alerting tools such as Datadog, Prometheus, and SignalFx, to shorten the time to recovery after a failure.
  • Keeping data fresh – data restatement and row-level inserts using UPSERT are key to keeping data fresh.
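
As a rough sketch of source-to-target conversion, the snippet below flattens nested JSON into a columnar Parquet layout with PySpark; the paths and nested field names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Infer the source schema from semi-structured JSON on the fly.
raw = spark.read.json("s3://example-bucket/raw/devices/")

# Flatten nested fields into top-level columns so analytical engines
# can prune columns instead of parsing JSON at query time.
flat = raw.select(
    col("device.id").alias("device_id"),            # hypothetical nested fields
    col("device.model").alias("device_model"),
    col("metrics.temperature").alias("temperature"),
    col("ingested_at"),
)

flat.write.mode("append").parquet("s3://example-bucket/lake/devices/")
```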

Data Layout

Data generation and data collection across semi-structured and unstructured formats are both bursty and continuous. Inspecting, exploring, and analyzing these datasets in their raw form is tedious because analytical engines must scan the entire dataset across multiple files. We recommend five ways to reduce the data scanned and cut query overheads –

  • Columnar data formats for read analytics – use open-source columnar formats such as ORC and Parquet to reduce data scans and avoid queries that have to parse raw JSON with functions such as json_parse and json_extract.
  • Partition data – partition by time, geography, or line of business to reduce data scans, and tune partition granularity to the dataset under consideration (by hour vs. by second); a sketch of these layout techniques follows this list.
  • Compaction to chunk up small files – asynchronously compact small files into bigger ones to reduce network overheads.
  • Collect statistics for cost-based optimization – collect dataset statistics such as file sizes, row counts, and histograms of values so the optimizer can reorder joins.
  • Use Z-order indexed materialized views for cost-based optimization – a Z-order index serves queries that filter on multiple columns in any combination, not just data sorted on a single column.
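
The sketch below illustrates three of these layout techniques in PySpark and Spark SQL on a hypothetical events dataset: partitioning by date, compacting a partition's small files into larger ones, and collecting statistics for the cost-based optimizer. Paths, table names, and file counts are illustrative only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-maintenance").getOrCreate()

events = spark.read.parquet("s3://example-bucket/lake/events_raw/")

# 1. Partition by a coarse time column so queries scan only the
#    partitions they need; tune the granularity to the dataset.
(
    events.write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/lake/events/")
)

# 2. Compact one day's worth of small files into a handful of larger
#    files to cut per-file and network overheads.
one_day = spark.read.parquet(
    "s3://example-bucket/lake/events/event_date=2024-03-01/"
)
(
    one_day.coalesce(8)
    .write.mode("overwrite")
    .parquet("s3://example-bucket/lake/events_compacted/event_date=2024-03-01/")
)

# 3. Collect table and column statistics so the optimizer can reorder
#    joins (assumes an `events` table registered in the metastore).
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR ALL COLUMNS")
```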

Managed data lakes can deliver autonomous data management capabilities to operationalize the aforementioned data layout strategy.

Data Governance

With data lakes, multiple teams start accessing the data. There needs to be a strong focus on oversight, regulatory compliance, and role-based access control, along with delivering meaningful experiences. A single interface for configuration management, auditing, obtaining job reports, and exercising cost control is key. Here are three recommendations for data governance:

Data Catalog

A data catalog helps users discover and profile datasets for integrity by enriching metadata through different mechanisms, documenting datasets, and supporting a search interface:

  • Use crawlers and classifiers to catalog data. Automatically adding descriptions of how data, especially unstructured data, arrived, and keeping metadata and data in sync, speeds up the end-to-end cycle from discovery to consumption (a sketch of automated cataloging follows this list).
  • Data dictionary and lineage. Data dictionaries contain table and column descriptions, the most frequent users and usage statistics, and canonical queries for a specific table. Data lineage allows users to trust data for business use by showcasing a data life cycle map that indicates all its modifications from its origin
  • Metadata management. Answering questions like a customer churn analysis typically requires wrangling new and disparate datasets. It is essential to surface a data dictionary to end-users for exploration, to see where the data resides and what it contains, and to determine if it is useful for answering a particular question.
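
As a sketch of the automated cataloging mentioned above, the snippet below registers and runs a crawler with AWS Glue via boto3. The article does not mandate a specific catalog, and the crawler name, IAM role, database, and bucket are hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans the raw zone, classifies file formats,
# and keeps table definitions in the catalog in sync with the data.
glue.create_crawler(
    Name="clickstream-raw-crawler",                       # hypothetical
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="lake_raw",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/clickstream/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
)

glue.start_crawler(Name="clickstream-raw-crawler")
```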

Regulatory Compliance

New or expanded data privacy regulations, such as GDPR and CCPA, have created new requirements around the Right to Erasure and Right to Be Forgotten. Therefore, the ability to delete specific subsets of data without disrupting a data management process is essential. In addition to the throughput of DELETE itself, you need support for special handling of PCI/PII data and auditability.
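
For illustration, a row-level erasure on a transactional table format might look like the sketch below, again assuming Delta Lake; the table path and subject identifier are hypothetical. Auditability of the delete would come from the table's transaction history plus separate audit logging.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("gdpr-erasure").getOrCreate()

profiles = DeltaTable.forPath(spark, "s3://example-bucket/lake/user_profiles/")

# Erase all rows for a single data subject without rewriting the whole
# dataset or interrupting other readers and writers.
profiles.delete("user_id = 'a1b2c3d4'")   # hypothetical subject ID

# Note: physically removing the deleted files also requires running
# VACUUM once the table's retention period has elapsed.
```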

Permissions and Financial Governance

Using the Apache Ranger open-source framework, which facilitates granular table-, row-, and column-level access, architects can grant permissions against user roles already defined in the Identity and Access Management (IAM) solutions of cloud service providers. With wide-ranging usage, monitoring and audit capabilities are essential to detect access violations and flag adversarial queries. To give P&L owners and architects a bird's-eye view of usage, they need cost attribution and exploration capabilities at the cluster, job, and user level from a single interface.
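
As a rough sketch of granular access control, the snippet below creates a column-level policy through Ranger's public REST API. The Ranger host, service name, group, and column list are hypothetical, and the exact payload fields depend on the service definition in use (the example assumes a Hive-style service).

```python
import requests

# Hypothetical Ranger admin endpoint.
RANGER_URL = "https://ranger.example.com:6182/service/public/v2/api/policy"

policy = {
    "service": "hive_lake",                   # hypothetical Ranger service
    "name": "analysts-orders-readonly",
    "resources": {
        "database": {"values": ["sales"]},
        "table": {"values": ["orders"]},
        # Expose only non-sensitive columns to the analyst group.
        "column": {"values": ["order_id", "order_date", "amount"]},
    },
    "policyItems": [
        {
            "groups": ["analysts"],            # hypothetical group
            "accesses": [{"type": "select", "isAllowed": True}],
        }
    ],
}

resp = requests.post(RANGER_URL, json=policy, auth=("admin", "changeme"))
resp.raise_for_status()
```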

Conclusion

These data lake best practices can help you build a sustainable advantage from the data you collect. A cloud data lake can break down data silos and support multiple analytics workloads at scale and at lower cost.
