Presto, developed by Facebook, is a real-time distributed SQL (Structured Query Language) query engine that can query a wide range of data stores, including Hadoop. Hadoop dominates Big Data projects thanks to its popularity, attractive ecosystem of tools, and cost effectiveness. SQL on Hadoop has been an important development in democratizing Big Data, giving business users, analysts, data scientists, and programmers easy access to previously locked-away large data sets.
Related Blog: Real-Time Data Query: The Next Competitive Advantage
The original SQL-on-Hadoop engine is Apache Hive, which in its earliest form translated SQL queries into MapReduce programs to read, transform, and aggregate data stored on Hadoop’s distributed file system or S3 cloud storage. Because it employs Hadoop’s core data processing engine, this approach is exceptionally scalable, extensible, and fault tolerant. Hive was a huge success at Facebook, where it was developed by the now founders of Qubole, and went a long way toward democratizing data access: it permitted thousands of business users to work with Facebook’s full data sets to build the most successful social media website in the world.
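To make this concrete, a typical Hive aggregation of this kind might look as follows (the table and column names here are hypothetical, purely for illustration):

```sql
-- Hypothetical HiveQL: count daily page views per country.
-- Hive compiles this statement into one or more MapReduce jobs
-- that scan the underlying files on HDFS or S3.
SELECT country, COUNT(*) AS page_views
FROM web_logs
WHERE log_date = '2014-01-15'
GROUP BY country;
```

The same declarative statement scales from gigabytes to petabytes because the generated MapReduce jobs parallelize across the cluster.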
Today, Hive as a Service is an important offering for businesses to trial, test, and develop big data projects. Hive excels at analytics tasks that do not require real-time performance and complements Pig in ETL (Extract, Transform, and Load) workflows. Current development on the Hive project focuses on enriching it with more SQL features and speeding it up so that simple queries complete interactively; these efforts are being brought to the fore through the Stinger initiative led by Hortonworks.
Facebook, potentially the biggest user of Hive today, found that for a certain class of workloads a faster query engine was needed to offer interactive speeds for data exploration. Many business analytics and visualisation tools that use SQL interfaces expect responses in seconds or less. Such queries and tools traditionally required data warehouse solutions costing many millions of dollars. Presto aims to deliver this level of performance at a fraction of the cost and at a scale beyond traditional data warehouse deployments.
At Facebook, over 1,000 users run more than 30,000 queries a day on hundreds of petabytes of data with Presto today (source). Being battle tested at such a scale is something that separates Presto from many other fast SQL-on-Hadoop solutions. Facebook was convinced that Presto fills an architectural gap in the Hadoop ecosystem and therefore open sourced it. Consequently, users of Presto are not locked in to a vendor, nor do they have to rely blindly on vendor promises: anyone can download and try out Presto, or use a Presto as a Service solution to experience the performance instantly.
A distinguishing feature of Presto is its design to support a wide range of data sources. Presto can access data stored in Hive, i.e. on HDFS (Hadoop Distributed File System) or Amazon Web Services’ S3, as well as HBase, relational database management systems, Scribe, or any other data source. Presto’s pluggable data backend design makes it extensible to accommodate even legacy or custom data stores.
Presto has a pluggable data backend (source)
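As an illustration of this pluggable design, each data backend is registered with Presto as a catalog through a small properties file; a Hive connector might be configured roughly like this (the host name is a placeholder):

```properties
# etc/catalog/hive.properties -- registers a Hive connector as the "hive" catalog
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
```

Adding another data source is a matter of dropping in another catalog file for the corresponding connector; queries then address each source by its catalog name.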
Importantly, in the future any number of data sources can be accessed and their data easily combined with Presto. For example, various SQL and NoSQL stores and data sinks can be queried from one interface, and data from them can be combined and loaded into any of them (the latter is in development at the moment). This makes Presto ideal for combining data of any variety and volume, and it addresses two important aspects of contemporary big data challenges.
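For instance, with a Hive catalog and a relational catalog both configured, a single Presto query can join across the two; the catalog, schema, and table names below are hypothetical:

```sql
-- Join web clicks stored in Hive/HDFS with user records
-- in a relational store, all in one Presto query.
SELECT u.region, COUNT(*) AS clicks
FROM hive.web.clicks AS c
JOIN mysql.crm.users AS u ON c.user_id = u.id
GROUP BY u.region;
```

Each table is addressed by its fully qualified `catalog.schema.table` name, so the federation happens transparently at the SQL level.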
Qubole is providing the first Presto as a Service in the cloud – inexpensive and scalable – to everyone. You can sign up for our development preview and try it out yourself. Qubole also provides Hadoop as a Service, Hive as a Service and Pig as a Service which complement the Presto offering.
Learn More: Best Practices Writing Presto Queries