The Qubole Data Service (QDS) is a Software-as-a-Service analytics platform that runs on leading cloud offerings such as AWS. Targeted at data analysts, data scientists, and ETL engineers, it helps users start analyzing data in a matter of minutes.
“Simplicity is the ultimate sophistication.” - Leonardo da Vinci
Using Qubole, a user can go from raw data to insights and build data-driven applications in minutes:
- Setup is quick and simple: just enter the AWS credentials required to access data in S3 and provision machines in EC2
- Start analyzing data with (Hive) SQL and scripting languages of your choice (e.g., Python or Perl) through a browser-based interface
- Schedule queries to run periodically and export results to external databases, powering everything from simple reporting to sophisticated applications like recommendation engines
Organizations no longer have to set up and manage a score of systems, evaluate multiple open source projects, and hire domain experts just to start analyzing their data. Some of the key features that make it simple to process your data are:
Auto-Scaling Hadoop Clusters
Qubole has built the world's first auto-scaling Hadoop clusters: the service provisions, grows, shrinks, and terminates Hadoop and Hive clusters on AWS entirely on demand. It monitors workloads to perform these functions transparently, while making maximum use of paid-for cloud resources. Clusters are transparently shared among users of the same organization, leading to higher responsiveness for users and greater efficiency.
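The scaling behavior described above can be sketched as a simple sizing policy: grow the cluster when the task backlog exceeds capacity, shrink it when demand falls, and stay within configured bounds. The function name, the tasks-per-node ratio, and the default bounds below are illustrative assumptions, not Qubole's actual implementation.

```python
# Illustrative auto-scaling sizing policy; all names and defaults are
# assumptions for the sketch, not Qubole's real code.

def desired_cluster_size(pending_tasks, tasks_per_node=8,
                         min_nodes=2, max_nodes=20):
    """Pick a node count that covers the backlog, within min/max bounds."""
    needed = -(-pending_tasks // tasks_per_node)  # ceiling division
    return max(min_nodes, min(max_nodes, needed))
```

A scheduler tick would call this with the current backlog and add or remove nodes to converge on the returned size; keeping a nonzero `min_nodes` avoids cold starts between bursts of queries.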
Data Definition Wizards
Qubole provides intuitive data definition wizards that let users create tables and table partitions on data stored in Amazon S3 in a matter of minutes. Its interfaces interpret data formats and generate the appropriate table properties, significantly reducing the time it takes to move from data blobs to analysis and insight.
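Conceptually, a wizard like this inspects sample data, infers column types, and emits the Hive DDL for an external table over an S3 location. The sketch below shows that idea; the function names, the type mapping, and the delimited-text assumption are hypothetical, not Qubole's actual wizard output.

```python
# Hypothetical sketch of data-definition-wizard output: infer Hive column
# types from one sample row and emit CREATE EXTERNAL TABLE DDL.

def infer_hive_type(value):
    # bool must be checked before int: bool is a subclass of int in Python
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"

def create_table_ddl(table, sample_row, s3_location):
    cols = ",\n  ".join(f"{name} {infer_hive_type(v)}"
                        for name, v in sample_row.items())
    return (f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
            f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
            f"LOCATION '{s3_location}';")
```

Because the table is external, dropping it in Hive leaves the underlying S3 data untouched, which suits data that other jobs also read.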
Fast Sampled Query and Expression Evaluation
Qubole provides mechanisms to interact with sampled data sets, significantly reducing the time spent authoring queries and transformations. Users no longer have to wait for a query to finish over a big data set only to discover a semantic error afterwards. By supporting expression and query evaluation on data samples, Qubole catches these errors early rather than late, saving end users a significant amount of time.
With Qubole's intuitive Query Builders, users can focus on writing semantically correct queries and transformations quickly and confidently. Query templates further cut down the time and errors involved in authoring common queries against users' data sets.
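The sampled-evaluation idea above can be sketched in a few lines: run the same transformation over a small slice of the data first, so a semantic error surfaces in seconds instead of after a full-scale run. The function names and sampling parameters here are assumptions for illustration, not Qubole's API.

```python
# Illustrative sketch of sampled query evaluation; names and defaults
# are assumptions, not Qubole's actual mechanism.

def sample(records, fraction=0.01, minimum=100):
    """Take a small, deterministic head sample of the data set."""
    k = max(minimum, int(len(records) * fraction))
    return records[:k]

def dry_run(transform, records):
    """Apply a transformation to a sample; bad references fail fast."""
    return [transform(r) for r in sample(records)]
```

A transformation that references a nonexistent field raises immediately on the sample, which is exactly the class of error that would otherwise only appear after a long full-data run.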
Data Pipeline Support
Qubole leverages Apache Oozie to provide an integrated toolset and infrastructure for writing simple data pipelines that can be scheduled to execute periodically or on the availability of input data sets. With these features, users can quickly go from creating queries to executing them continually against newly arriving data. Results can be exported back to S3 or to databases like MySQL to power data-driven applications.
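A pipeline of this shape, in the spirit of an Oozie coordinator, waits for its input data set, runs a sequence of steps, and exports the result. The sketch below is a minimal illustration under assumed names; it is not Oozie's XML workflow definition or Qubole's scheduler code.

```python
# Minimal data-pipeline sketch: trigger on input availability, run steps
# in order, export the result. All names are illustrative assumptions.

def run_pipeline(input_ready, steps, export):
    """Return the export result, or None if the input isn't ready yet."""
    if not input_ready():
        return None  # scheduler will retry on the next tick
    result = None
    for step in steps:
        result = step(result)
    return export(result)
```

A periodic scheduler would invoke this on each tick; the data-availability check is what lets the pipeline start as soon as new input lands rather than on a fixed clock alone.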
“Big will not beat small anymore. It will be the fast beating the slow” - Rupert Murdoch
The Qubole stack is optimized to run in the cloud and deliver strong performance with minimal intervention. Qubole users do not have to configure obscure options correctly to get good performance. A big part of the win comes from Hadoop auto-scaling: clusters automatically expand to provide the performance a query requires. Other key facets are as follows:
Fully Tuned Hadoop Clusters
We started with Facebook's Hadoop distribution, based on Apache Hadoop 0.20.1, and over the last few months have configured and tested all the key options required to make Hadoop perform well. Every cluster spawned by Qubole takes advantage of this accumulated knowledge, giving customers one less thing to worry about. Some key differentiators:
- Our Hadoop deployment includes the most up-to-date version of the Hadoop Fair Scheduler, which schedules jobs much faster.
- It incorporates the most recent advances in speculative execution, providing strong protection against badly behaved, runaway jobs.
- Qubole uses optimal defaults for AWS instance types and has finely tuned the Hadoop configuration to make maximal use of each instance's disk and CPU resources.
- We use myriad techniques to make clusters spin up faster, from eliminating unnecessary Linux services to bringing up software and hardware in parallel.
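The instance-specific tuning mentioned above amounts to deriving Hadoop settings (such as map and reduce slot counts) from each instance type's CPU and memory. The sketch below illustrates that kind of derivation; the ratios, defaults, and function name are assumptions, not Qubole's actual tuning values.

```python
# Illustrative derivation of Hadoop slot counts from instance resources.
# The oversubscription factor, memory-per-slot, and 2:1 map/reduce split
# are assumed values for the sketch, not Qubole's real configuration.

def slot_config(cores, memory_mb, mb_per_slot=1024):
    by_cpu = cores * 2              # mild CPU oversubscription
    by_mem = memory_mb // mb_per_slot
    slots = max(1, min(by_cpu, by_mem))
    return {"map_slots": max(1, (2 * slots) // 3),
            "reduce_slots": max(1, slots // 3)}
```

Computing the limits from both CPU and memory, then taking the smaller, keeps a memory-light instance type from being oversubscribed on RAM and a memory-heavy one from being oversubscribed on cores.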
Columnar Cloud Cache (C3)
One issue with data stores like S3 is that they can be slower and more erratic in performance than local disk. Furthermore, we see a profusion of data sets in JSON and other non-relational formats that can be fairly expensive to parse. To solve these issues, Qubole invented C3: data sets stored in S3 are automatically converted to more efficient formats where possible and kept in a local disk cache. These caching and transformation techniques have given us up to 5x performance improvements, which benefit not only developer productivity but also the cost of running queries in the cloud.
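The core C3 idea, paying the JSON parsing cost once and serving later reads from a local columnar copy, can be sketched as follows. The cache structure, function names, and in-memory storage here are illustrative assumptions; the real C3 persists converted data on local disk.

```python
# Conceptual sketch of the C3 idea: parse JSON rows once, keep a columnar
# copy keyed by S3 object, and serve repeat reads from the cache.
# Names and the in-memory dict cache are assumptions for illustration.
import json

_cache = {}

def columnarize(json_lines):
    """Turn newline-delimited JSON rows into per-column lists."""
    rows = [json.loads(line) for line in json_lines]
    return {key: [r.get(key) for r in rows] for key in rows[0]}

def read_columns(s3_key, fetch):
    """Fetch, convert, and cache on first access; hit the cache after."""
    if s3_key not in _cache:
        _cache[s3_key] = columnarize(fetch(s3_key))
    return _cache[s3_key]
```

The columnar layout is what makes repeat queries cheaper: a query touching one field scans a single contiguous list instead of re-parsing every JSON record.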