Presto, developed by Facebook, is an open-source distributed SQL query engine that was systematically designed and written for interactive analytics. It compares with the speed of commercial data warehouses. A major chunk of Big Data analytics involves interactive and exploratory queries that can be written in ANSI-SQL – and Presto focuses on this very use case.
SQL Ad-Hoc Analytics
Presto is the fastest and most reliable SQL query engine for data engineers who struggle with managing multiple query languages and interfaces to siloed databases and storage by providing a simple ANSI SQL interface for all their data analytics and open lakehouse. It can plug into various storage systems such as HDFS, S3, etc. Furthermore, Presto has an API that allows users to author and use their own storage options as well. It has its own scheduler and resource manager, which makes the execution of queries much faster and is analogous in design to many relational databases. In fact, Presto performs at par with several badged RDBMSs for large query operations.
Is Presto a Data Warehouse?
Presto is referred to as a SQL query execution engine and is different from a traditional data warehouse. Data warehouses control how data is written, where it resides, and how it is read. Once the data is received in the warehouse, it is difficult to get it back out. Presto adopts a different approach by decoupling the storage of data from its processing while providing support for the same ANSI SQL query language as usual.
Presto executes queries over data sets provided by plug-ins, especially Connectors that help Presto to read and write data onto an external data system. The Hive Connector is one of the standard connectors which uses the same metadata used to interact with HDFS or Amazon S3. Due to this connectivity, Presto is a drop-in replacement for organizations that continue to use Hive today. It is enabled to read data from the same schemas and tables using the same data formats — ORC, Avro, Parquet, JSON, etc. Connectors are frequently being added to Presto, granting it the potential to access data anywhere it lives. The benefit of a decoupled storage model is that Presto is able to provide a single integrated view of all data, therefore enhancing the capabilities of ad hoc querying to levels unreached before, while also providing interactive query times over your large data sets.
Ad-Hoc Reporting With Presto
Presto was designed as a means to provide end-users access to large data sets to perform ad hoc analysis. Before Presto, Facebook used to use internally-developed Hive, which was later donated to the Apache Software Foundation, to execute this kind of analysis. Gradually, as Facebook’s data sets grew, Hive was found to be too slow and insufficiently interactive. This was mainly because the foundation of Hive is MapReduce, which required intermediate data sets to be held onto HDFS at the time.
Instead of retaining the intermediate data on HDFS, Presto allows for the data to be pulled into memory and perform operations on this data instead of holding on all of the intermediate data sets to disk. Using Presto, the data is kept where it is stored and in-memory executions are performed on these data sets across various distributed systems by shuffling data between servers as needed, also resulting in speeding up query execution time.
As the data set grows, the Presto cluster can also be grown in order to maintain the same expected runtimes. This level of performance, combined with the flexibility to query virtually any data source, will help empower a business to get more value from their data than ever before — all while retaining the data where it is and doing away with expensive transfers and engineering time to consolidate the data into a single place for analysis.
Integration of queries and query data: Presto can query relational and NoSQL databases, data warehouses, data lakes, and more and has dozens of connectors available today. It also provides for querying data where it is stored and a single Presto query can amalgamate data from multiple sources, allowing for analytics across the entire organization.
Lightning-fast analytics: Presto is an in-memory distributed SQL engine, which is faster than other compute engines in the disaggregated stack.
Standardization of SQL with one engine: Presto can be utilized for interactive and batch workloads, varying sizes of data, and a wide range of a number of users. Presto offers one familiar ANSI SQL language and one engine for data analytics doing away with the requirement to upgrade to another lakehouse engine.
Open source: Presto is a neutrally governed open-source project under the Linux Foundation with several member companies.
Presto Strict Mode
When it comes to Presto, Qubole refers to Hive’s design coupled with the typical query patterns seen among the Presto customer base. In R55, Qubole has added a new feature called Presto Strict Mode, which, once enabled, restricts the user from executing certain queries.
Data Lake Cost-Savings with Presto on Qubole
Handling Spot Interruption with Presto on Qubole helps in Saving 60% on Cost
Qubole’s Spot interruption handling helps Presto on Qubole to achieve up to 15% higher success rates, without any query failures, as compared to the competition.
Presto optimizes cluster utilization by assigning tasks to the cluster nodes uniformly across all operating queries. This prevents nodes from getting overloaded, which would otherwise lead to a slacking up of queries due to overloaded nodes becoming a computing bottleneck.
Presto’s current scheduling algorithm works extremely well if all task items allotted to the nodes are equal in terms of CPU and I/O stipulations.
Local Scheduling In Presto On Qubole
Presto provides a few ways in which locality can be prioritized while assigning data to workers:
- The ‘force-local-scheduling’ property compels Presto to always assign data splits to local worker nodes. It has been discovered that this might be detrimental to performance if the distribution of data across workers is not uniform or if certain nodes are processing splits slower than others. Both these cases could result in certain worker nodes becoming performance bottlenecks.
- Presto has a Topology Aware Scheduler that relies on a hierarchy of network segments such as workers at the rack level, etc., and assigns splits based on the proximity of data to workers. In flat Topology Aware Scheduling, there is a single level of the hierarchy of network segments, and the locality of splits to each machine is given some preference. However, in this case, the locality is prioritized by reserving a certain number of slots for local splits in each worker. This reservation behavior depends on the initial distribution of data and on the order in which splits fall for scheduling. At times, this could result in the under-utilization of resources and cause splits to be stuck in the waiting state for long periods.
Both options for locality-aware scheduling in Presto have certain corresponding drawbacks. Hence, there is a need for a better scheduler that takes the locality of splits into account when Presto is utilized alongside RubiX while sustaining uniform distribution.
Cost reduction through Spot Nodes with Presto on Qubole
Spot Nodes on AWS and Preemptible VMs on GCP are extremely popular among customers who find them very efficacious in the reduction of cloud costs. Qubole helps Presto customers utilize Spot nodes without surrendering reliability through built-in features that elegantly handle Spot interruptions.
Dynamic sizing of Presto clusters through Resource Groups
Resource groups are an important construct in Presto to manipulate resource usage and query scheduling. Broadening them to augment limits on scaling provides a simple solution to place limitations on auto-scaling and supplementing resource groups with cost-isolation mechanisms. After this feature, admins can not only introduce user-level restrictions on auto-scaling but also combine multiple clusters into a single entity for simpler management and setup.
Presto Query Optimization
Presto on Qubole works faster than its competitors
In the past few years, Presto has set the standards for fast analytical processing in modern cloud data lake architectures. Qubole has been providing a Presto service as part of its open data lake platform since 2014. The Qubole team has strived to offer the best performances and user-friendly Presto-based services.
Benchmarking results demonstrate that Presto on Qubole has proved to be faster than ABC Presto in terms of overall Geomean of the 100 TPC-DS queries for the no-stats run. Principally, the queries were faster on Presto on Qubole with significant queries being faster than other queries. Additionally, dynamic filtering, which is not available on ABC, was enabled on Presto on Qubole for both sets of runs. Even on the availability of Hive metastore statistics, they were faster on Presto on Qubole while some queries were at least 1.5x or faster on Presto on Qubole. For the no-stats run, size-based statistics on Presto on Qubole were enabled. Size-based statistics is the latest feature that is distinctive to the Qubole platform.
A major credit of these performance gains can be attributed to Dynamic Filtering and Size-based statistics. Besides, Presto service on Qubole offers this superlative performance at a lower cost than its competition by leveraging distinguished features around the utilization of low-cost computing with better reliability and workload-aware autoscaling.
Data is fast becoming a focal point for large organizations, and it is imperative to query multiple sources accurately and efficiently. With Presto, businesses can operate ad-hoc queries of data to tackle challenges around time to discover and the amount of time it takes to perform ad hoc analysis.
Presto Use Cases
SQL can be used to run ad hoc queries at any instance. Presto permits the user to query data where it is stored so that the requirement to ETL data into a separate system is eliminated. The usage of Presto connectors and their in-place execution helps platform teams quickly provide access to datasets of interest to the analysts. Additionally, Presto is fast and runs queries in seconds instead of hours, enabling analysts to iterate quickly on innovative hypotheses.
Presto can query data across multiple sources to build a single view of reports and dashboards for self-service BI business intelligence. Presto reporting helps data scientists and analysts with the ability to perform a source-wide query on data to eliminate the constraint of platform expertise.
Presto helps in the consolidation of huge chunks of data across multiple sources and runs efficient ETL queries. Presto can be used to run resource-efficient and high throughput queries as against legacy batch processing systems.
Data Lake Analytics
Data can be queried directly on a data lake without any need for transformation. Presto helps the user to query any type of data- both structured and unstructured.
Data can be queried across several different data sources including databases, data lakes, lake houses, on-premises, or in the cloud. Presto also allows the user to consolidate answers back in the Presto in-memory database.
To summarize, Presto is a widely-embraced distributed SQL engine for data lake analytics. Presto can also execute ad-hoc querying of data, which helps to solve challenges regarding time to discover and the amount of time it takes to perform ad hoc analysis. Furthermore, it also provides users with new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector enabling added benefits around performance, scale, and ecosystem.
Presto also provides many advantages for organizations of different sizes. Particularly, it boasts of the ability to query data such that it reduces the amount of time required by data engineers to build complex ETLs. This implies that clients or customers can be provided with faster replies to their questions.
Qubole’s Presto-as-a-Service is primarily intended for Data Analysts who need to transform business questions into SQL queries. Since these questions are often ad-hoc, trial and error are involved to some degree; achieving the final results may require a series of SQL queries. By reducing the response time of these queries, the platform can cut down on the time to insight and prove to be extremely beneficial for the business.