Caching on EMR using RubiX: Performance Benchmark and Benefits

November 27, 2017

Last year Qubole introduced RubiX, a fast caching framework for big data engines. The framework enables big data engines such as Presto, Hive, and Spark to cache data from a cloud object store on local disk for faster access.

Currently, 90% of Qubole’s customers using Presto have RubiX enabled by default in their clusters, and they are experiencing significant performance gains for their read-intensive workloads. Today we are announcing support for RubiX on open source Presto. In this blog post we will walk you through how to configure your EMR Presto cluster to enable RubiX, and show you the performance gains we have experienced.

RubiX Architecture

RubiX works on the following basic principles.

Split Ownership Assignment

The entire file is divided into chunks and, with the help of consistent hashing, a node is assigned to be the owner of each chunk. RubiX uses this ownership information to provide hints to the respective task schedulers so they can assign tasks to those nodes.
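As an illustration of the idea (not RubiX's actual code), the following Java sketch places nodes on a consistent-hash ring and assigns each file chunk an owner; the node names, the number of virtual points per node, and the hash function are arbitrary choices made for the example.

import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: nodes are placed on a hash ring; a file chunk
// (identified by path + chunk index) is owned by the first node at or
// after the chunk's hash, wrapping around the ring.
public class SplitOwnership {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public SplitOwnership(List<String> nodes) {
        for (String node : nodes) {
            // A few virtual points per node smooth out the distribution.
            for (int i = 0; i < 16; i++) {
                ring.put((node + "#" + i).hashCode(), node);
            }
        }
    }

    // Returns the node that owns the given chunk; a scheduler can use this
    // hint to place the task that reads the chunk on the owning node.
    public String ownerOf(String filePath, long chunkIndex) {
        int hash = (filePath + ":" + chunkIndex).hashCode();
        SortedMap<Integer, String> tail = ring.tailMap(hash);
        Integer key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(key);
    }

    public static void main(String[] args) {
        SplitOwnership owners =
                new SplitOwnership(List.of("node1", "node2", "node3"));
        System.out.println(owners.ownerOf("s3://bucket/warehouse/part-0000.orc", 3));
    }
}

Because every node computes the same hash over the same inputs, ownership can be determined independently on each node without any coordination, which is what makes the scheduling hint engine-independent.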

Block Level Caching

Each file chunk mentioned above is divided into logical blocks of a configurable size (1MB by default). The metadata of these blocks (whether cached or not) is kept in the BookKeeper daemon and also checkpointed to local disk. When a read request comes in, it is converted into logical blocks, and based on the metadata of those blocks the system decides whether to read the data from the remote location (the object store in this case) or from the cache of a local or non-local node.
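A minimal sketch of this read path, assuming 1MB logical blocks and a simple in-memory bitmap standing in for the metadata that the BookKeeper daemon tracks:

import java.util.BitSet;

// Illustration only: a simplified version of the block-level read path.
// A BitSet stands in for the per-block cache metadata that BookKeeper
// keeps (and checkpoints to disk) in RubiX.
public class BlockCacheReadPath {
    static final long BLOCK_SIZE = 1024 * 1024; // 1 MB logical blocks (the default)

    private final BitSet cachedBlocks = new BitSet(); // true = block already on local disk

    // Decide, block by block, where the bytes for [offset, offset + length) come from.
    public void read(long offset, long length) {
        long firstBlock = offset / BLOCK_SIZE;
        long lastBlock = (offset + length - 1) / BLOCK_SIZE;
        for (long block = firstBlock; block <= lastBlock; block++) {
            if (cachedBlocks.get((int) block)) {
                System.out.println("block " + block + ": read from local cache");
            } else {
                System.out.println("block " + block + ": read from object store, then cache locally");
                cachedBlocks.set((int) block); // later reads of this block hit the cache
            }
        }
    }

    public static void main(String[] args) {
        BlockCacheReadPath file = new BlockCacheReadPath();
        file.read(0, 3_500_000);       // cold read: every block comes from the object store
        file.read(1_000_000, 500_000); // warm read: served entirely from the local cache
    }
}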

This architecture gives RubiX the following advantages:

Works well with columnar file formats – RubiX reads and caches only the data that is required.

Engine-independent scheduling logic – Task assignment works on the locality of the data, which is determined by consistent hashing of the file chunks.

Shared cache across JVMs – Because the metadata of the cache is stored in a single JVM outside of any engine, any big data framework can use this information to benefit from the cache.

Performance Analysis

To analyze the performance impact of RubiX, we compared how long a job takes to run when the data is read from S3 with how long the same job takes when the data is served from local disk through the RubiX cache.

We used TPC-DS data at scale 500 in ORC format as our base dataset and selected 20 queries with a good mix of interactive and reporting workloads, so as to eliminate any kind of bias. The queries can be found here.

For both S3 read and local RubiX cache read, the queries were run three times; the best run is represented here. For the local RubiX cache read, we have pre-populated the cache via earlier runs of the same queries.

The following graph shows the performance improvement provided by RubiX.

As you can see from the above chart, RubiX provides better performance for every query; the improvement ranges from 6 percent to 32 percent. To summarize:

Max Improvement: 32% (Query 42)

Min Improvement: 6% (Query 59)

Average Improvement: 20%

The percentage improvement is calculated as (Time without RubiX – Time with RubiX) / Time without RubiX, expressed as a percentage.
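For example, with hypothetical numbers, a query that takes 100 seconds reading directly from S3 and 80 seconds reading from the RubiX cache shows an improvement of (100 – 80) / 100 = 20%.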

Because the RubiX cache is optimized for reading data from local disk, there are greater performance gains for read-intensive jobs. For CPU-intensive jobs, the gains made while reading the data are offset by the time spent on CPU cycles.

Setting up RubiX

Cluster Configuration

Presto Version – 0.184

Number of Slave Nodes – 2

Node Type – r3.2xlarge

  1. Log in to the master node as the Hadoop user once the cluster is up, and set up passwordless ssh among the cluster nodes:
ssh -i <key> hadoop@<master-node-public-dns>
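One way to set up passwordless ssh, assuming the same EC2 key pair works for every node in the cluster, is to copy the key to the master node and make it the default identity there (the paths and hostnames below are placeholders):

# On your local machine: copy the cluster key to the master node.
scp -i <key.pem> <key.pem> hadoop@<master-node-public-dns>:~/.ssh/id_rsa
# On the master node: restrict permissions and test a connection to a slave.
chmod 600 ~/.ssh/id_rsa
ssh hadoop@<slave-node-private-dns>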
  2. Install Rubix Admin on the master node. This is the admin tool for installing RubiX on the cluster and starting the necessary daemons.
sudo yum install gcc libffi-devel python-devel openssl-devel
sudo pip install rubix_admin
rubix_admin -h

This creates a file ~/.radminrc with the following content:

hosts:
- localhost
remote_packages_path: /tmp/rubix_rpms
  3. Add the public DNS addresses of the slave nodes to the hosts section, as shown in the example below.
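For example, with the two slave nodes used here, ~/.radminrc might look like the following (the DNS entries are placeholders for your own cluster's addresses):

hosts:
- localhost
- <slave-1-public-dns>
- <slave-2-public-dns>
remote_packages_path: /tmp/rubix_rpms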
  4. Use rubix_admin to install RubiX. This installs the latest version of RubiX.
rubix_admin installer install

You can also specify the rpm path if you need to install any other version of RubiX:

rubix_admin installer install --rpm <path to the rpm>
  5. Once RubiX is installed, start the RubiX daemons:
rubix_admin daemon start
  6. Verify that the daemons are up:
sudo jps -m
pid1 RunJar /usr/lib/rubix/lib/rubix-bookkeeper-0.2.12.jar com.qubole.rubix.bookkeeper.BookKeeperServer
pid2 RunJar /usr/lib/rubix/lib/rubix-bookkeeper-0.2.12.jar com.qubole.rubix.bookkeeper.LocalDataTransferServer
  7. If the daemons have not started, check the logs at the following locations:
For BookKeeper -- /var/log/rubix/bks.log
For LocalDataTransferServer -- /var/log/rubix/lds.log

Running a Simple Example

We will run a very simple Presto query that reads data from a remote location in S3 and does some basic aggregation. The first run of the query warms up the cache by downloading the required data, and subsequent runs use the local cache data to do the necessary aggregation.

  1. Run Hive as follows to create the external tables:
hive --hiveconf hive.metastore.uris="" --hiveconf fs.rubix.impl=com.qubole.rubix.hadoop2.CachingNativeS3FileSystem

We are starting Hive pointing to its embedded Metastore server. Because the RubiX-related jars are pushed into the Hadoop lib directory during RubiX installation, an already-running Thrift Metastore server would need to be restarted to become aware of the custom rubix scheme. We also need to set the AWS credentials for the rubix scheme via fs.rubix.awsAccessKeyId and fs.rubix.awsSecretAccessKey.
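For example, all of these settings can be passed on the same hive command line (the key values below are placeholders for your own credentials):

hive --hiveconf hive.metastore.uris="" \
  --hiveconf fs.rubix.impl=com.qubole.rubix.hadoop2.CachingNativeS3FileSystem \
  --hiveconf fs.rubix.awsAccessKeyId=<your-access-key-id> \
  --hiveconf fs.rubix.awsSecretAccessKey=<your-secret-access-key>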

  2. Run the following command in Hive to create an external table:
CREATE EXTERNAL TABLE wikistats_orc_rubix 
(language STRING, page_title STRING,
hits BIGINT, retrived_size BIGINT)
STORED AS ORC
LOCATION 'rubix://emr.presto.airpal/wikistats/orc';
  3. Start the Presto client:
presto-cli --catalog hive --schema default
  4. Run the following query:
SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats_orc_rubix
WHERE language = 'en'
AND page_title NOT IN ('Main_Page',  '404_error/')
AND page_title NOT LIKE '%index%'
AND page_title NOT LIKE '%Search%'
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

The cache statistics are pushed to an MBean named rubix:name=stats. We can run the following query through Presto's JMX catalog to see the stats:

SELECT Node, CachedReads, 
ROUND(ExtraReadFromRemote,2) AS ExtraReadFromRemote, 
ROUND(HitRate,2) AS HitRate, 
ROUND(MissRate,2) AS MissRate,  
ROUND(NonLocalDataRead,2) AS NonLocalDataRead, 
NonLocalReads, 
ROUND(ReadFromCache,2) AS ReadFromCache, 
ROUND(ReadFromRemote, 2) AS ReadFromRemote, 
RemoteReads 
FROM jmx.current."rubix:name=stats";

After the first run (cache warmup) of the Presto job, the numbers look like the following:

Node    CachedReads   ExtraReadFromRemote   HitRate   MissRate   NonLocalDataRead   NonLocalReads   ReadFromCache   ReadFromRemote   RemoteReads
node1   17            16.59                 0.13      0.87       0.17               49              66.53           500.14           109
node2   23            35.07                 0.1       0.9        0.14               13              59.5            306.47           207

Conclusion

RubiX was started to eliminate the I/O latency of object stores in public clouds for fast ad-hoc data analysis. Its architecture is the result of running shared-storage, auto-scaling Presto, Hive, and Spark clusters that process petabytes of data stored in columnar formats. We have made RubiX available as open source to work with the big data community on similar issues across big data engines, Hadoop services, and distributions, on premises or in public clouds. If you are interested in using RubiX or developing on it, please join the community.

