Overview of Data Lake
Data Lakes have become a core component for companies moving to modern data platforms as they scale their data operations and Machine Learning initiatives. Data lake infrastructures provide users and developers with self-service access to what was traditionally disparate or siloed information.
Today, with developments in cloud computing, companies and data teams can measure new projects according to the ROI and cost of an individual workload in order to determine if the project should be scaled out. The production-readiness and security of cloud computing is one of the largest breakthroughs for enterprises today. This model provides near unlimited capabilities for companies’ analytics lifecycles:
Ingest – Data arrives in any raw format, and is stored for future analysis or disaster recovery. Companies typically segment out several data lakes depending on privacy, production access, as well as the team’s that will be leveraging the incoming information.
Store – Data lakes allow business to manage and organize nearly infinite amounts of information. Cloud object stores (AWS S3, Azure Blob, Google Cloud Storage, etc.) offer high availability access for big data computing, at an extremely low-cost.
Process – With cloud computing, infrastructure is now simply an API call-away. This is when data is taken from its raw state in the data lake and formatted to be used with other information. This data is also often aggregated, joined, or analyzed with advanced algorithms. Then the data is pushed back into the data lake for storage and further consumption by business intelligence or other applications.
Consume – When companies talk about having a self-service data lake, Consume is typically the stage in the life-cycle they are referencing. At this point, data is made available to the business and customers for analytics as their needs require. Depending on the type of complex use cases, end-users may also indirectly or directly be using the data in the form of predictions (forecasting weather, financials, sport performance, etc) or perceptive analytics (recommendation engines, fraud detection, genome sequencing, etc).
Cloud Data Lake vs. On-Premise Data Lake
Enterprise data lakes are most commonly split in two general industry trends: on-premise data lakes and cloud-based data lakes.
In an on-prem data lake, companies must manage both the software and the hardware assets that house their data. If they’re data volumes grow beyond the capacity of the hardware they’ve purchased, then companies have no choice but to buy more computing power themselves.
In cloud data lakes, companies are able to pay for only the data storage and compute they need. This means they are able to scale up or down as their data requires. This scalability has been a huge breakthrough in Big Data’s adoption, driving the increased popularity of data lakes.
How does a Cloud Data Lake Work?
By definition, a data lake is an operation for collecting and storing data in its original format, and in a system or repository that can handle various schemas and structures until the data is needed by later downstream processes.
The primary utility of a data lake is to have a single source for all data in a company — ranging from raw data, prepared data, and 3rd party data assets. Each of these are used to fuel various operations including data transformations, reporting, interactive analytics, and machine learning. Managing an effective production data lake also requires organization, governance, and servicing of the data.
Big Data Lake Architecture
The cloud data lake is a centralized data store (typically referred to as object storage) which allows for highly scalable data storing capabilities, and high availability for quick access by cloud computing servers or other applications. Commonly a data lake is setup with layers of object stores: ones that are the source of truth for gathering all ingested files, raw data that can be used for DR (Disaster Recovery); and another that is optimized for analytics and other downstream processes.
In order to fully realize the cost advantages of a cloud data lake, the big data workflow needs to be architected to take advantage of the separation of compute and storage. However, the challenge is having a system that can help different big data workloads autoscale according to the nature of their workloads (whether bursty or steady stream).