Data Lake Essentials, Part 3 – Data Catalog and Data Mining

February 28, 2020 Jorge Villamariona


In this multi-part series, we take you through the architecture of a data lake, which we explore across three dimensions.

In this edition, we look at Data Catalog, Metadata, and Search.

Key Considerations

Any data lake design should incorporate a metadata storage strategy to enable business users to search, locate, and learn about the datasets that are available in the lake. While traditional data warehousing stores a fixed and static set of meaningful data definitions and characteristics within the relational storage layer, data lake storage is intended to support the flexible application of schema at read time. This means, however, that a separate storage layer is required to house cataloging metadata that captures both technical and business meaning. Organizations sometimes simply accumulate content in a data lake without a metadata layer, but this is a recipe for an unmanageable data swamp rather than a useful data lake.

There is a wide range of approaches and solutions for ensuring that appropriate metadata is created and maintained. Here are some important principles and patterns to keep in mind. Note that a single dataset can have multiple metadata layers depending on the use case (e.g., the Hive Metastore, AWS Glue), and the same data can be exported to a NoSQL database, which would present a different schema.
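As a concrete sketch of what such a cataloging layer stores, here is a minimal metadata record in Python. The field names are illustrative assumptions for this post, not the schema of any particular catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal metadata record for one dataset in the lake."""
    dataset_name: str   # technical identifier, e.g. storage path or table name
    file_format: str    # e.g. "parquet", "json"
    schema: dict        # column name -> type, applied at read time
    owner: str          # business contact for the dataset
    description: str    # business meaning of the data
    tags: list = field(default_factory=list)  # searchable business terms

# Example entry for a hypothetical raw orders dataset.
entry = CatalogEntry(
    dataset_name="s3://lake/raw/orders/",
    file_format="parquet",
    schema={"order_id": "bigint", "amount": "decimal(10,2)"},
    owner="sales-analytics",
    description="Raw order events from the e-commerce platform",
    tags=["orders", "sales", "raw"],
)
```

Note how the record mixes technical fields (format, schema) with business fields (owner, description, tags) — both are needed for the lake to stay searchable.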

Enforce a metadata requirement

The best way to ensure that appropriate metadata is created is to enforce its creation: make sure that every method through which data arrives in the core data lake layer enforces the metadata creation requirement, and require any new data ingestion routine to specify how that requirement will be enforced.
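A minimal sketch of such an enforcement gate, assuming a hypothetical set of required fields, could look like this:

```python
# Hypothetical set of fields every ingestion must supply.
REQUIRED_METADATA = {"dataset_name", "owner", "schema", "description"}

def validate_metadata(metadata: dict) -> None:
    """Reject an ingestion request whose metadata is incomplete."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"Ingestion rejected; missing metadata: {sorted(missing)}")

# A complete record passes silently; an incomplete one raises.
validate_metadata({"dataset_name": "orders", "owner": "sales-analytics",
                   "schema": {"order_id": "bigint"},
                   "description": "Raw order events"})
```

Calling such a check at every ingestion entry point is what turns the metadata requirement from a convention into a guarantee.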

Automate metadata creation

Like nearly everything in the cloud, automation is the key to consistency and accuracy. Wherever possible, design for automatic creation of metadata extracted from the source material.
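As an illustration, a small routine can derive technical metadata directly from a source file at ingestion time; the fields extracted here are illustrative, not a standard:

```python
import csv
import os
import tempfile
from datetime import datetime, timezone

def extract_metadata(path: str) -> dict:
    """Derive technical metadata automatically from a CSV source file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        columns = next(reader)              # header row -> column names
        row_count = sum(1 for _ in reader)  # remaining rows
    return {
        "dataset_name": os.path.basename(path),
        "columns": columns,
        "row_count": row_count,
        "size_bytes": os.path.getsize(path),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

# Example: build a catalog record for a small sample file.
sample = os.path.join(tempfile.gettempdir(), "sample_orders.csv")
with open(sample, "w", newline="") as f:
    f.write("order_id,amount\n1,19.99\n2,5.00\n")
record = extract_metadata(sample)
```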

Prioritize cloud-native solutions

Wherever possible, use cloud-native automation frameworks to capture, store and access metadata within your data lake.

Metadata Searching

Data catalog solutions such as Alation allow users to search against this metadata – for example, "Which is the hottest (most frequently queried) table in the store?"
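A sketch of how such a search might rank tables by recent query activity, using invented catalog entries (this is not Alation's API, just an illustration of the idea):

```python
# Invented catalog entries; a real catalog would track query activity itself.
catalog = [
    {"table": "orders",      "tags": ["sales"], "query_count_30d": 1840},
    {"table": "customers",   "tags": ["crm"],   "query_count_30d": 920},
    {"table": "clickstream", "tags": ["web"],   "query_count_30d": 310},
]

def hottest_tables(entries, top_n=1):
    """Rank tables by recent query activity recorded in the catalog."""
    return sorted(entries, key=lambda e: e["query_count_30d"], reverse=True)[:top_n]

hottest_tables(catalog)[0]["table"]  # -> "orders"
```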

Data Lake – Access and Mining the Lake

Schema on Read

‘Schema on write’ is a tried and tested pattern of cleansing, transforming and adding a logical schema to the data before it is stored in a ‘structured’ relational database. However, as noted previously, data lakes are built on a completely different pattern of ‘schema on read’ that prevents the primary data store from being locked into a predetermined schema. Data is stored in a raw or only mildly processed format, and each analysis tool can impose on the dataset a business meaning that is appropriate to the analysis context. There are many benefits to this approach, including enabling various tools to access the data for various purposes.
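A small Python sketch of schema on read: the same raw, immutable events are parsed with two different read-time schemas, one per analysis context (the event fields are invented for illustration):

```python
import json

# Raw events as they land in the lake; no schema is enforced at write time.
raw_events = [
    '{"user": "a1", "amount": "19.99", "ts": "2020-02-28T10:00:00Z"}',
    '{"user": "b2", "amount": "5.00",  "ts": "2020-02-28T11:30:00Z"}',
]

# One consumer applies a billing-oriented schema: amounts as numbers.
billing_view = [{"user": e["user"], "amount": float(e["amount"])}
                for e in map(json.loads, raw_events)]

# Another consumer applies an activity-oriented schema: timestamps only.
activity_view = [{"user": e["user"], "ts": e["ts"]}
                 for e in map(json.loads, raw_events)]
```

The raw store stays untouched; each view imposes only the meaning its analysis needs.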

Data Processing

Once you have the raw layer of immutable data in the lake, you will need to create multiple layers of processed data to enable various use cases in the organization. These are examples of the structured storage described earlier in this blog series. Typical operations required to create these structured data stores involve:

  • Combining different datasets (i.e. joins)
  • Denormalization
  • Cleansing, deduplication, householding
  • Deriving computed data fields

Apache Spark has become the leading tool of choice for processing the raw data to create various value-added, structured data layers.
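In Spark these operations would typically be expressed with DataFrame joins, dropDuplicates, and withColumn; the plain-Python sketch below shows the same logic on invented sample records:

```python
# Raw-layer records (invented); note the duplicate order event.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 40.0},
    {"order_id": 1, "customer_id": "c1", "amount": 40.0},  # duplicate event
    {"order_id": 2, "customer_id": "c2", "amount": 15.0},
]
customers = [{"customer_id": "c1", "region": "EMEA"},
             {"customer_id": "c2", "region": "APAC"}]

# Deduplicate on the business key.
deduped = list({o["order_id"]: o for o in orders}.values())

# Join orders to customers (denormalization) and derive a computed field.
regions = {c["customer_id"]: c["region"] for c in customers}
curated = [{**o,
            "region": regions[o["customer_id"]],
            "amount_with_tax": round(o["amount"] * 1.2, 2)}
           for o in deduped]
```

The `curated` list is the kind of value-added, structured layer the raw data is processed into.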

Data Warehousing

For some specialized use cases (think high performance data warehouses), you may need to run SQL queries on petabytes of data and return complex analytical results very quickly. In those cases, you may need to ingest a portion of your data from your lake into a column store platform. Examples of tools to accomplish this would be Google BigQuery, Amazon Redshift or Azure SQL Data Warehouse.
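To see why a column store helps here, a toy sketch contrasting the two layouts: an aggregate over a columnar layout scans only the one column it needs, rather than every record.

```python
# Row layout: every query touches whole records.
rows = [{"order_id": i, "region": "EMEA", "amount": float(i)} for i in range(5)]

# Column layout: each column is stored contiguously, so an analytical
# aggregate reads only the column it needs.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}

total = sum(columns["amount"])  # scans one column, not every record
```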

Interactive Query and Reporting

There are still a large number of use cases that require support for regular SQL query tools to analyze these massive data stores. Apache Hive, Presto, Amazon Athena, and Impala are all specifically developed to support these use cases by creating or utilizing a SQL-friendly schema on top of the raw data.
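The pattern these engines share can be illustrated with the stdlib sqlite3 module standing in for Presto or Athena: a SQL-friendly schema is projected over the data, then queried interactively (the table and values here are invented):

```python
import sqlite3

# sqlite3 stands in for an interactive SQL engine such as Presto or Athena,
# which project a SQL-friendly schema over raw lake files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EMEA", 40.0), (2, "APAC", 15.0), (3, "EMEA", 25.0)])

result = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
# result -> [("APAC", 15.0), ("EMEA", 65.0)]
```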

Data Exploration and Machine Learning

Finally, a category of users who are among the biggest beneficiaries of the data lake are your data scientists, who now have access to enterprise-wide data, unfettered by various schemas, and who can explore and mine data for high-value business insights. Many data science tools are either based on, or can work alongside, Hadoop-based platforms that access the data lake.

Data Lake Platform – features to look for

Your Data Lake Platform Should Offer:

  • Multiple data processing engine options, such as Spark, Hadoop/Hive, and Presto. This is essential to support a wide array of use cases.
  • A metastore anchored on an open standard, such as Hive, which can then be used from Hive, Presto, and Spark SQL.
  • Catalog integration with AWS Glue.
  • Support for AIR (Alerts, Insights, and Recommendations) that surfaces useful information from the metadata.
  • Support for the Kafka Schema Registry (for streamed datasets).
  • Connectors to data warehousing solutions such as Snowflake, Redshift, BigQuery, Azure SQL Database, etc.
  • Connectors for popular commercial databases like MySQL, Oracle, MongoDB, Vertica, SQL Server, etc.
  • Serverless computing options (e.g., Presto) to cost-effectively meet interactive query requirements.
  • A unified browser-based UI for analysts to run their queries.
  • JDBC/ODBC drivers to query from BI tools like Tableau, Looker, QlikView, Superset, Redash, etc.
  • Jupyter/Zeppelin notebooks for data scientists and analysts.
  • UI-based data science package management for Python and R.

Missed Part 2? Data Lake Essentials, Part 2 – File Formats, Compression And Security


Conclusion

In this blog, we’ve shared major components of the data lake architecture along with Qubole’s solutions for each of those. We encourage you to continue your journey with a Qubole test drive!
