Data Sheets

What is an Open Data Lake?

Qubole Data Sheets


A data lake is a system or repository that stores data in its raw format, alongside transformed, trusted datasets, and provides both programmatic and SQL-based access to this data for diverse analytics tasks such as machine learning, data exploration, and interactive analytics. The data stored in a data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).

An Open Data Lake is a data lake in which data is stored in an open format and accessed through open, standards-based interfaces. This adherence to an open philosophy, aimed at preventing vendor lock-in, extends to every aspect of the system: data storage, data management, data processing, operations, data access, governance, and security.

We define an open format as a format that is based on an underlying open standard, developed and shared through a publicly visible, community-driven process, without vendor-specific proprietary extensions. For example, an Open Data Format is a platform-independent, machine-readable data format, such as ORC or Parquet, whose specification is published to the community so that any organization can create tools and applications to read data in that format.

A typical data lake has the following capabilities:

• Data ingestion and storage
• Data processing and support for continuous data engineering
• Data access and consumption
• Data governance: discoverability, security, and compliance
• Infrastructure and operations

In the following sections, we will describe openness requirements for each capability.
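As a small illustration of the definition above (raw data kept in an openly specified format and reachable both programmatically and through SQL), here is a minimal Python sketch. It is a toy under stated assumptions, not a description of any particular product: the event records, the field names, and the helper functions are all hypothetical, JSON Lines stands in for an open format such as ORC or Parquet, and an in-memory SQLite database stands in for a SQL engine that, in a real data lake, would typically query the stored files directly.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events, as they might land in a lake's raw storage area.
RAW_EVENTS = [
    {"user_name": "alice", "event_type": "login", "duration_ms": 120},
    {"user_name": "bob", "event_type": "login", "duration_ms": 340},
    {"user_name": "alice", "event_type": "query", "duration_ms": 910},
]


def write_raw(path, events):
    """Store events unchanged as JSON Lines, a plain-text, openly documented layout."""
    with open(path, "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def read_raw(path):
    """Read the raw file back; any tool that understands JSON could do the same."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def total_ms_programmatic(path):
    """Programmatic access: aggregate duration per user in plain Python."""
    totals = {}
    for event in read_raw(path):
        name = event["user_name"]
        totals[name] = totals.get(name, 0) + event["duration_ms"]
    return totals


def total_ms_sql(path):
    """SQL access: load the same raw file into SQLite (a stand-in engine) and query it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (user_name TEXT, event_type TEXT, duration_ms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (:user_name, :event_type, :duration_ms)",
        read_raw(path),
    )
    rows = conn.execute(
        "SELECT user_name, SUM(duration_ms) FROM events GROUP BY user_name"
    ).fetchall()
    conn.close()
    return dict(rows)


def demo():
    """Write the raw file once, then answer the same question through both paths."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "events.jsonl"
        write_raw(path, RAW_EVENTS)
        return total_ms_programmatic(path), total_ms_sql(path)


if __name__ == "__main__":
    print(demo())
```

Both access paths read the same stored file and return the same per-user totals. Because the file layout is openly documented, either reader could be swapped for a different tool without touching the data.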
