
Presto Ruby Client in QDS

September 8, 2016 (updated September 8, 2021)

Co-authored by Somya Kumar, Member of the Technical Staff at Qubole.

Presto is an open-source distributed SQL query engine developed at Facebook and built for interactive analytics queries. Just as Qubole Data Services (QDS) offers other Big Data processing engines such as Hive, Hadoop, and Apache Spark as a service, Qubole also offers Presto as a service for querying data stored in the cloud or on HDFS.

In this blog post, we will review the integration of the Presto Ruby client in QDS and look at two Presto query execution paths: one that uses the Presto Java client and one that uses the Presto Ruby client. We will also cover the performance benefits of the Ruby client over the Java client.

Presto Ruby Client in QDS

QDS currently uses the Presto Java client packaged in Presto’s open-source distribution to interact with the Presto server. However, from our extensive experience working with the Presto Java client, we’ve learned that it can take up to 3 seconds just to invoke the JVM. This goes against the very nature of interactive and ad hoc queries, which usually have strict SLAs. To eliminate this client-side performance overhead, we explored other Presto clients and chose Ruby because it integrates well with our existing Ruby on Rails web application and provides the other benefits outlined below.

We’ve created a Ruby gem by extending the open-source Presto Ruby client developed by Treasure Data. We chose monkey patching for this integration so that we can keep picking up open-source updates while still adding functionality specific to our needs.
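As a rough illustration of what monkey patching means in Ruby, the sketch below reopens a class and layers extra behaviour on top of it without editing the original source. `UpstreamClient` and `QueryLogging` are hypothetical names invented for this example; they are not part of the Treasure Data client or of the QDS gem. The sketch uses `Module#prepend`, which is one of several ways to patch a class.

```ruby
# Hypothetical stand-in for a class shipped by an upstream gem.
class UpstreamClient
  def run(query)
    "ran: #{query}"
  end
end

# Site-specific behaviour kept in its own module. Because the module is
# prepended, its #run is called first and `super` reaches the upstream method.
module QueryLogging
  def run(query)
    log << query # local addition: remember every query that was run
    super        # upstream behaviour, left untouched
  end

  def log
    @log ||= []
  end
end

# The monkey patch itself: reopen the existing class and prepend the module.
class UpstreamClient
  prepend QueryLogging
end

client = UpstreamClient.new
puts client.run('SHOW TABLES')
puts client.log.inspect
```

Because the upstream class body is never edited, a newer version of the upstream gem can be dropped in and the patch is simply applied on top of it again, provided the patched method keeps the same name and arguments.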

One of the main benefits of using QDS is auto-scaling: it adds and removes cluster nodes dynamically and automatically manages compute and storage resources to keep up with the processing needs of Big Data applications. To complement this functionality, the Ruby client code was extended to start a Presto cluster, process queries, get logs and results in real time from the Presto server, and update query statistics and session management information.

Furthermore, our implementation uses a SOCKS proxy to connect to the master node, and the Ruby client uses the Faraday gem to interact with the server. Note: Faraday is an HTTP client library that provides a common interface over many adapters such as Net::HTTP, which supports HTTP proxies but not SOCKS proxies. So we’ve extended the Faraday gem using monkey patching (see code below) to add support for the SOCKS proxy.

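require 'faraday'
# Net::HTTP::SOCKSProxy is not part of the Ruby standard library; it is
# assumed here to be provided by the socksify gem.
require 'socksify/http'

module Faraday
  class Adapter
    class NetHttp < Faraday::Adapter
      # Builds the Net::HTTP connection for a request, routing it through the
      # configured proxy (SOCKS or plain HTTP) when one is set.
      def net_http_connection(env)
        proxy = env[:request][:proxy]
        http_class =
          if proxy && proxy[:socks]
            # Support for SOCKS proxy added here
            Net::HTTP::SOCKSProxy(proxy[:uri].host, proxy[:uri].port)
          elsif proxy
            Net::HTTP::Proxy(proxy[:uri].host, proxy[:uri].port, proxy[:user], proxy[:password])
          else
            Net::HTTP
          end
        # Use the scheme's default port when the URL does not specify one.
        default_port = env[:url].scheme == 'https' ? 443 : 80
        http_class.new(env[:url].host, env[:url].port || default_port)
      end
    end
  end
end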
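
The patch works because each branch evaluates to a class that responds to `new`, so the caller does not need to know which kind of connection it is getting. The standard-library sketch below shows this for the plain HTTP proxy case only (the SOCKS variant needs an extra gem and is left out). The host names and ports are placeholders, not real QDS endpoints.

```ruby
require 'net/http'

# Net::HTTP::Proxy returns an anonymous subclass of Net::HTTP that remembers
# the proxy settings it was given. Host and port are placeholders.
proxied_class = Net::HTTP::Proxy('proxy.example.com', 8080)

# Creating the connection object does not open a socket yet, so this runs
# without any network access.
connection = proxied_class.new('presto-master.example.com', 8081)

puts proxied_class.ancestors.include?(Net::HTTP)
puts connection.proxy_address
puts connection.proxy_port
```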

Presto Query Execution Paths: Java vs Ruby

To elaborate on the differences between the two clients, let’s review their query execution paths.

The “Old” flow (see diagram below) shows how Presto queries are currently handled in QDS using the Presto Java client. When a query is submitted via the QDS web interface or through the APIs, the task is assigned to one of the worker processes, which invokes a Presto client JVM that interacts with the Presto server to carry out the query execution.

Compare that to the “New” flow, which uses the Presto Ruby client: the worker process interacts with the Ruby client, which then initiates a REST API call to execute the query on the Presto server.
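
To make the REST step concrete, here is a minimal sketch of how a client can talk to Presto over HTTP: the SQL text is POSTed to `/v1/statement`, and the client then follows the `nextUri` link in each JSON response until the server stops returning one. This is not the QDS gem or the Treasure Data client; the helper names (`build_statement_request`, `collect_rows`, `run_query`), the host, and the user are assumptions made for illustration only.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the initial request: the SQL text is the body of a POST to
# /v1/statement, and the X-Presto-User header identifies the caller.
def build_statement_request(base_url, sql, user:)
  uri = URI.join(base_url, '/v1/statement')
  request = Net::HTTP::Post.new(uri)
  request['X-Presto-User'] = user
  request.body = sql
  request
end

# Follow nextUri links until none is returned, collecting result rows.
# `fetch_page` is any callable that maps a URI string to a parsed JSON page,
# which keeps this loop independent of the HTTP library in use.
def collect_rows(first_page, fetch_page)
  rows = []
  page = first_page
  loop do
    raise "Presto error: #{page['error']['message']}" if page['error']

    rows.concat(page['data'] || [])
    next_uri = page['nextUri']
    break unless next_uri

    page = fetch_page.call(next_uri)
  end
  rows
end

# Hypothetical end-to-end helper; it needs a reachable Presto server.
def run_query(base_url, sql, user:)
  request = build_statement_request(base_url, sql, user: user)
  uri = request.uri
  first_page = Net::HTTP.start(uri.host, uri.port) do |http|
    JSON.parse(http.request(request).body)
  end
  fetch_page = ->(next_uri) { JSON.parse(Net::HTTP.get(URI(next_uri))) }
  collect_rows(first_page, fetch_page)
end
```

With a server available, a call might look like `run_query('http://presto-master.example.com:8081', 'SHOW TABLES', user: 'analyst')`. No JVM is started anywhere on this path, which is the point of the “New” flow.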

[Diagram: presto-ruby-client, showing the “Old” and “New” query execution flows]

With this “New” architecture and flow that uses the Presto Ruby client, we’ve already seen the query execution lifecycle shortened by at least 3 to 4 seconds simply because the JVM invocation stage has been eliminated.

Another important benefit is greater scalability. With the Presto Java client, a cluster can process only as many queries as the number of Java clients it can start in parallel before hitting memory limitations. The Presto Ruby client, on the other hand, runs within QDS (see diagram above) rather than on the cluster like the Java client, so it can scale out to keep up with Big Data processing workloads.

Performance Boost

As a result of these optimizations, when comparing our Presto Ruby client to the packaged Presto Java client, we’ve seen a performance boost for queries that normally take less than 10 seconds to execute.

These are the scenarios we took into consideration:

  • DDL Query. For example, SHOW TABLES
  • Query with a syntax error

Here are the results:

Scenario | Presto Java client (wait time in secs) | Presto Ruby client (wait time in secs) | Improvement
DDL Query | 4 | 1 | Ruby client is 4x faster than Java client when processing DDL queries
Query with a syntax error | 4 | 1 | Ruby client is 4x quicker than Java client in reporting back errors generated by Presto server
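
The improvement factor is simply the ratio of the two wait times. The sketch below shows one way such a wait time could be measured and how the ratio is computed. The helper names `wait_time` and `speedup` are made up for this example, and the post does not describe how the original measurements were actually taken.

```ruby
# Time a block with the monotonic clock and return the elapsed seconds.
def wait_time
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Ratio of the baseline wait time to the new wait time.
def speedup(baseline_secs, new_secs)
  baseline_secs.to_f / new_secs
end

# Wait times reported above: 4 seconds with the Java client, 1 with Ruby.
puts speedup(4, 1)

# Example measurement of an arbitrary block (a short sleep stands in for a query).
elapsed = wait_time { sleep 0.01 }
puts elapsed >= 0.01
```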

Summary

Using the Presto Ruby client in QDS provides better overall performance and therefore makes data analysts who run interactive analytics queries through the web interface more efficient. It also makes data engineering applications that use the QDS APIs more performant, because API commands go through the same execution path as the QDS web interface.

We will be rolling out the Presto Ruby client in QDS for all accounts in the coming weeks. If you’d like to get started today, please email us at [email protected] and we’ll enable it for your account.
