Quark: Simplify and Optimize SQL Queries Across Hadoop and RDBMS
Data analysts typically have multiple copies of data to choose from for their analysis. In this talk, we will introduce Quark (https://github.com/qubole/quark), an open-source, cost-based SQL optimizer built using Apache Calcite, and explain how it can be used to track and efficiently use derived datasets spread across different data stores. Quark models relationships between datasets using well-known database concepts such as materialized views and OLAP cubes. Today, Quark can route SQL queries across data warehouses, big-data SQL engines, and data marts. We will describe use cases in large companies where Quark is being used to grapple with the explosion of derived datasets. For example, hot data or an OLAP cube may be stored in a fast data warehouse, while cold data sits in a mass storage system like HDFS. At scale, determining the best dataset to use becomes a mental overhead for analysts and wastes both human time and compute resources. Once Quark is set up, data analysts simply submit all queries against the base tables to Quark, either directly or through a BI tool, and Quark chooses which dataset and data store should serve each query. We will cover the functionality available in the open-source Quark project as well as its SaaS implementation, available via Qubole Data Service.
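
To make the idea of cost-based dataset selection concrete, here is a minimal Python sketch. It is not Quark's API or its actual algorithm: the `Dataset` class, the `choose_dataset` function, the table names, and the cost numbers are all hypothetical, and the coverage check (does a dataset contain every column the query needs?) is a deliberate simplification of real materialized-view matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A hypothetical catalog entry: a base table or a dataset derived from it."""
    name: str
    store: str            # where the data lives, e.g. "hdfs" or "warehouse"
    columns: frozenset    # columns this dataset can supply
    cost: float           # assumed relative cost of scanning it


def choose_dataset(needed_columns, candidates):
    """Return the cheapest candidate that supplies every needed column."""
    needed = frozenset(needed_columns)
    usable = [d for d in candidates if needed <= d.columns]
    if not usable:
        raise LookupError(f"no dataset covers columns {sorted(needed)}")
    return min(usable, key=lambda d: d.cost)


# Illustrative catalog: cold base table on HDFS, small summary in a warehouse.
CANDIDATES = [
    Dataset("orders", "hdfs",
            frozenset({"order_id", "region", "day", "amount"}), 100.0),
    Dataset("orders_by_region_day", "warehouse",
            frozenset({"region", "day", "amount"}), 5.0),
]

if __name__ == "__main__":
    for cols in (["region", "amount"], ["order_id", "amount"]):
        chosen = choose_dataset(cols, CANDIDATES)
        print(cols, "->", chosen.name, "on", chosen.store)
```

In this sketch, a query that needs only `region` and `amount` is served by the summary in the warehouse, while a query that needs `order_id` falls back to the base table on HDFS. The analyst writes the query against the base table in both cases.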