Faster analytics on cloud with RubiX – Shubham Tagra, Technical Director, Qubole

Cloud stores are inexpensive and infinitely scalable, which has led to them becoming the de facto standard for data lakes. When it comes to performance, however, local SSDs easily outperform them, especially with newer advancements like NVMe. Furthermore, because they are accessed over the network, cloud stores often struggle to provide consistent performance, and in certain cases, such as inter-region network access in AWS, that access can be expensive. Given that all cloud providers offer machines with high-speed local disks, many of the shortcomings of cloud stores can be avoided by using these disks as a cache. RubiX is Qubole’s homegrown, open-source data caching framework that integrates with big data engines like Hive, Presto, and Spark to provide a data cache over any cloud store. In this talk, we will find out how RubiX works, how it integrates with the big data engines and the cloud store, what kind of improvements can be expected, and what new features are in the works.