Overview of a Data Warehouse
Businesses rely on accurate analytics, reports, and monitoring to make critical decisions. These insights are powered by data warehouses that are optimized for handling the variety of information that feeds these reports. The information in a data warehouse is most commonly sourced from a combination of disparate data sources (e.g., CRM, product sales, online events). A data warehouse provides an organized schema for that information, which allows end users to more easily interpret the underlying data.
What Are the Capabilities of a Cloud Data Warehouse (CDW)?
Traditional data warehouses were built to handle mostly batch workloads that process large data volumes while reducing I/O for better performance per query. Because storage is tied directly to compute in these systems, data warehouse infrastructures can quickly become outdated and expensive. Today, with the capabilities of cloud data warehousing, companies can scale out horizontally to meet either compute or storage requirements as necessary. This has significantly reduced the concern about wasting potentially millions of dollars on over-provisioned servers to handle bursty data requirements or a project that may only be short-term.
Cloud Data Warehouse vs. Cloud Data Lake
There are two fundamental differences between cloud data warehouses and cloud data lakes: data types and processing framework. In a cloud data warehouse model, you have to transform the data into the right structure before loading it in order to make it usable. This is often referred to as “schema-on-write.”
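As a rough illustration of schema-on-write, the sketch below transforms records into a fixed table structure before loading them and sets aside anything that does not fit. It uses Python's built-in sqlite3 module as a stand-in for a warehouse; the table name, column names, and sample records are assumptions made up for this example, not part of any particular product.

```python
import sqlite3

# Hypothetical raw records from a source system; field names are assumptions.
RAW_EVENTS = [
    {"user": "a1", "amount": "19.99", "ts": "2024-01-05"},
    {"user": "b2", "amount": "5", "ts": "2024-01-06"},
    {"user": "c3", "amount": "not-a-number", "ts": "2024-01-07"},
]


def transform(record):
    """Coerce a raw record into the table's row shape, or raise an error."""
    return (str(record["user"]), float(record["amount"]), str(record["ts"]))


def load_schema_on_write(conn, records):
    """Create a fixed-schema table and load only the rows that fit it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "user_id TEXT NOT NULL, amount REAL NOT NULL, sale_date TEXT NOT NULL)"
    )
    rejected = []
    for record in records:
        try:
            row = transform(record)
        except (KeyError, ValueError):
            # The record does not match the schema, so it never reaches the table.
            rejected.append(record)
            continue
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")  # in-memory database standing in for a warehouse
rejected = load_schema_on_write(conn, RAW_EVENTS)
total = conn.execute("SELECT ROUND(SUM(amount), 2) FROM sales").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

The structure is enforced at load time, so every later query can rely on the columns and types being present. The cost is that the third record is rejected up front rather than kept for later interpretation.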
In a cloud data lake, you can load raw data, unstructured or structured, from various sources. The data is transformed and structured only when you are ready to process it. This is called “schema-on-read.” When you combine this operational model with the cloud’s virtually unlimited storage and compute availability, businesses can scale their operations with growing data volumes, a wider variety of sources, and higher query concurrency, while paying only for the resources they use.
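A minimal sketch of schema-on-read, assuming raw JSON lines have already landed in the lake unchanged: the structure is applied only when the data is read, so the same raw records can serve different schemas. The record shapes, field names, and the read_with_schema helper are hypothetical and exist only for this example.

```python
import json

# Raw lines as they might land in a data lake; shapes vary by source (assumed data).
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "source": "crm"}',
    '{"user": "b2", "clicks": 3, "source": "web"}',
    '{"user": "c3", "amount": 5, "source": "sales"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep records that have every field, cast each one."""
    for line in lines:
        record = json.loads(line)
        try:
            row = {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            # This record does not fit the schema being applied; skip it for this read.
            continue
        yield row


# Two different schemas applied to the same raw data.
sales_schema = {"user": str, "amount": float}
click_schema = {"user": str, "clicks": int}

sales = list(read_with_schema(RAW_LINES, sales_schema))
clicks = list(read_with_schema(RAW_LINES, click_schema))
```

Nothing is rejected at load time here. A record that does not fit one schema is simply skipped for that read and remains available to any other schema applied later.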
Modern Cloud Data Warehouse with Big Data
As companies advance in understanding the information they own, their need grows for improved infrastructure that can handle the larger compute requirements of complex analytics and workflows. This has paved the way for cloud-based data integration and management platforms such as Informatica and Talend, which allow users to apply compute from different technologies to the same data. With cloud infrastructure, companies can now grow their advanced analytics and ETL operations separately from their data warehouse workloads.
Using Qubole as the central cloud operations platform for the data lake, companies can seamlessly integrate with their data warehouses so that end-users can easily access data across their data lake and warehouses. This allows data teams to develop predictive analytics applications without disrupting the system that products and business intelligence rely on.