Blog

 
 

Qubole announces Heterogeneous Clusters on AWS – Reduce costs up to 90% with Spot Fleet

 

Co-authored by Hariharan Iyer, Member of the Technical Staff at Qubole. Introduction Big data engines like Hadoop and Spark are known to work well when running on homogeneous clusters. This allows the underlying resource manager to optimally place tasks on the nodes and also lets users tune their jobs as per the configuration of a […]

 
Read More..

The Importance of being Data Driven

 

Understanding the business side of big data is just as important as the technical side, particularly when it comes to ensuring the real-world success of big data projects. Ashish Thusoo, Co-founder/CEO Qubole, recently shared with attendees of Data Driven NYC some of the key aspects of Qubole’s approach to big data. Including making data insights […]

 
Read More..

Qubole Announces Support for SaaS Subscriptions on AWS Marketplace

 

Customers purchasing Qubole through AWS Marketplace will get the first two weeks of Qubole free. SAN FRANCISCO – Nov. 16, 2016 – Qubole, the big data-as-a-service company, today announced that it is now available on AWS Marketplace with support for the new SaaS Subscriptions functionality, enabling customers to subscribe directly on AWS Marketplace and benefit […]

 
Read More..

QDS on Oracle Bare Metal Cloud Service is Generally Available

 

  We previously announced our partnership with the Oracle Cloud Platform and also shared the results of our preliminary benchmark on the Oracle Bare Metal Cloud Service. Today, the Oracle Bare Metal Cloud Service is generally available for access and usage. We at Qubole are continuing to work closely with the Oracle team to bring Qubole […]

 
Read More..

Advanced security using AWS Identity Access Management (IAM) on QDS

 

  For Big Data analyses and processing, Qubole Data Service (QDS) orchestrates storage and compute resources owned in the customer’s account. To enable this, customers delegate the necessary permissions to QDS. With IAM Roles promoted as security best practice on AWS, customers no longer need to provide access and secret keys to QDS. Thereby, making […]

 
Read More..

Airflow as a service on QDS is Generally Available

 

Co-authored by Yogesh Garg and Sumit Maheshwari, Members of the Technical Staff at Qubole. Sumit Maheshwari is also part of Apache Airflow PPMC. We are excited to announce that Airflow as a service on Qubole Data Service (QDS) is GA and joins the family of Hadoop 1, Hadoop 2, Spark, Presto, and HBase offered as […]

 
Read More..

IBM and Qubole take data science and Apache Spark to the public cloud

 

This morning IBM and Qubole made an exciting announcement that will provide the growing number of data scientists a comprehensive environment based on public cloud infrastructure. IBM is well known for its long standing leadership in data science and its Watson Data Platform. It’s also a major committer to Apache Spark. The recently announced IBM […]

 
Read More..

Qubole selected to enter into Joint Venture Partnership with the National Technical Information Service (NTIS)

 

The U.S. Commerce Department’s National Technical Information Service (NTIS) announced Oct 19 that, following a rigorous merit review process, it has selected Qubole as an eligible joint venture partner (JVP) of the NTIS. Once the JV agreement is finalized, Qubole will have the opportunity to compete to work with NTIS on groundbreaking data projects conducted […]

 
Read More..

Cost Analysis of Building Hadoop Clusters Using Cloud Technologies

 

This is a guest post written by Shailesh Garg, Director of Engineering at RevX. A programmatic ad-tech platform like RevX generates terabytes of data on a daily basis. To effectively process and leverage this data, we use big data tools like Hadoop for reporting and analytics. Our infrastructure is hosted in Amazon AWS across multiple […]

 
Read More..

Intelligence in QDS

  • By Xing Quan
  • September 27, 2016
 

The concept of intelligent automation has always played a key role in Qubole Data Service (QDS). It’s one of the main reasons why we can help our customers bring self-service Big Data access across the enterprise. Intelligent automation also plays a big role in the small ops footprint that QDS requires, helping our customers achieve […]

 
Read More..

Creating Customized Plots in Qubole Notebooks

  • By Mohan Krishnamurthy
  • September 22, 2016
 

Important stories live in our data, and data visualization is a powerful means to discover and understand these stories, and then to present them to others. Within Qubole notebooks users can leverage the built-in charting tools to create visualizations. In addition to the built-in charting capabilities, users sometimes find the need to create custom charts. […]

 
Read More..

Apache Spark On Qubole: Sky Is the Limit

 

At Qubole, we’ve made significant progress on our adoption of Spark on QDS with new features and scalability. Here are some recent stats pertaining to Apache Spark on Qubole Data Service (QDS):   Apache Spark on QDS–New Features and Highlights To accommodate growing demands and leverage technological advancements made by the Apache Spark community, we […]

 
Read More..

Presto Ruby Client in QDS

 

Co-authored by Somya Kumar, Member of the Technical Staff at Qubole. Presto is an open source distributed SQL query engine developed at Facebook. It’s built for interactive analytics queries, and like other Big Data processing engines such as Apache Spark, Hadoop, and Hive offered as a service on Qubole Data Service (QDS), Qubole also offers […]

 
Read More..

SparkSQL in the Cloud: Optimized Split Computation

 

When it comes to Big Data processing in the cloud compared to on-premise, one of the fundamental differences between the two is how the data is stored and accessed. Not having a clear understanding of this underlying difference between, for example, AWS S3 in the cloud and HDFS on-prem leads to a suboptimal service to […]

 
Read More..

The Value of Auto-scaling

 

Intro In a recent blog post, we benchmarked auto-scaling and demonstrated that an auto-scaling cluster was a lot less expensive and only a little bit slower than a static, max-sized cluster. In this post, we decided to quantify this benefit in terms of dollars and cents. Based on our results, we estimate that auto-scaling is […]

 
Read More..

Qubole’s Notebook Integration with GitHub is Generally Available

  • By Mohan Krishnamurthy
  • August 10, 2016
 

We are excited to announce the general availability of GitHub integration for QDS Notebooks. GitHub is an effective way to collaborate on development projects. It hosts Git repositories, whose version control allows users to track the changes they make to their code, easily revert those changes, share development efforts and […]

 
Read More..

Benchmarking Auto-scaling Spark Clusters

 

Intro Have you ever had trouble deciding how large to make a cluster? Do you sometimes feel like you’re wasting money when a cluster isn’t being fully utilized? Or do you feel like your analysts’ time is being wasted, waiting for a query to return? At Qubole, we developed auto-scaling in order to help combat […]

 
Read More..

Qubole Continues Strong Momentum, Reports a Strong First Half of 2016

  • By Jo McDougald
  • August 4, 2016
 

Lyft, Box, Amgen and Scripps Join Growing Roster of Top-Tier Customers MOUNTAIN VIEW, CA–(Marketwired – Aug 4, 2016) – Qubole, the big data-as-a-service company, reported exponential growth over the past six months and strong momentum heading into the second half of 2016. Following its $30 million Series C funding round in January, Qubole has continued […]

 
Read More..

Optimize Queries with Materialized Views and Quark

  • By Rajat Venkatesh
  • July 14, 2016
 

This blog post explores how queries can be sped up by keeping optimized copies of the data. First we will explore the techniques and benchmark some sample results. Later, we talk about how one can use Quark (which we detailed in a previous post) to easily implement these performance optimizations in a Big Data analytics […]

 
Read More..

Quark: Control and Optimize SQL Across Hadoop and RDBMS

  • By Rajat Venkatesh
  • June 27, 2016
 

One of the important functions of a database administrator is to manage storage structures to optimize performance in a relational database. Admins use tables, views, indexes, and cubes to tune the database as well as control the behavior of users (e.g., discourage full table scans and cross joins). There are similar well-known techniques in the […]

 
Read More..

RubiX: Fast Cache Access for Big Data Analytics on Cloud Storage

  • By Shubam Tagra
  • June 21, 2016
 

Qubole introduced first-generation Caching for S3 files in Presto in 2014 and documented the observed performance gains. In a nutshell: for CPU-efficient engines like Presto and Spark, caching remote files on local disk storage improves performance by removing bottlenecks in network IO. Our users also benefited from these performance gains, as this blog post from […]

 
Read More..

Qubole’s HBase-as-a-Service is Generally Available on AWS

  • By Rajat Venkatesh
  • June 9, 2016
 

The HBase team at Qubole is happy to announce the general availability of QDS HBase-as-a-Service on AWS. Through the Beta program, QDS has helped administrators run HBase at scale in production with higher uptime and reliability while exploiting cloud elasticity for more agile deployments. In building our HBase offering, we worked closely with early customers […]

 
Read More..

Qubole and Looker Join Forces to Empower Business Users to Make Data-Driven Decisions

  • By Ari Amster
  • April 27, 2016
 

Qubole, the big data-as-a-service company, and Looker, the company that is powering data-driven businesses, today announced that they are integrating Looker’s business analytics with Qubole’s cloud-based big data platform, giving line of business users across organizations access to powerful, yet easy-to-use big data analytics. Business units face an uphill battle when it comes to gleaning […]

 
Read More..

Qubole Extends Big Data-as-a-Service Platform with StreamX

  • By Ari Amster
  • April 26, 2016
 

Qubole, the big data-as-a-service company, today announced it has open sourced StreamX, an ingestion service to help data teams efficiently and reliably capture large scale, real-time data. Qubole will be adding support for StreamX as a managed service on the Qubole Data Service (QDS) platform to simplify and automate the ingestion of data for big […]

 
Read More..

Qubole Open Sources Quark for SQL Virtualization

  • By Ari Amster
  • April 5, 2016
 

Qubole, the big data-as-a-service company, today announced that it has open sourced Quark, a cost-based SQL optimizer that helps to simplify and optimize access to data for data analysts. Traditionally, the data sets generated by data teams are aggregated and copied to multiple analytics systems to balance performance and cost, making it near impossible to […]

 
Read More..

Moving past infrastructure limitations

  • By Ari Amster
  • March 24, 2016
 

This is a guest post written by Rory Sawyer, Software Engineer at MediaMath Here at MediaMath, we’re quite fond of data. It would be surprising to hear someone say they’re not fond of data, of course, but we’ve spent the last 18 months proving to ourselves and our clients that we really mean it. Our […]

 
Read More..

Qubole Appoints its Head of Web Services Division

  • By athusoo
  • March 18, 2016
 

The appointment of Suresh Ramaswamy will help Qubole scale its multi-tenant SaaS platform and develop highly responsive big data platforms to cater to industry demands. Qubole, the big data-as-a-service company, today announced that it has appointed Suresh Ramaswamy as Qubole’s Head of Web Services. In this role, Suresh will help Qubole scale the web services […]

 
Read More..

Qubole Appoints its First Chief Information Security Officer

  • By athusoo
  • March 10, 2016
 

Andrew Daniels brings more than 20 years of experience in enterprise security to address industry-specific needs. Qubole, the big data-as-a-service company, today announced that it has appointed Andrew Daniels as Qubole’s first chief information security officer (CISO) and vice president of security, compliance and privacy. As CISO, Daniels will focus on developing industry-leading security […]

 
Read More..

Qubole Extends Customer Support with New Education Program

  • By Ari Amster
  • March 7, 2016
 

Qubole, the big data as-a-service company, announced today it will be extending its customer support services with the launch of Qubole Education, an extensive resource to empower data users throughout an organization with the skills needed to successfully implement a cloud-based data project. Qubole’s cloud-agnostic big data platform allows users to implement the right data […]

 
Read More..

Qubole Donates Access to Big Data Cloud Platform for University Research

  • By Ari Amster
  • February 18, 2016
 

Students Will Be Able to Conduct Data Analysis on Any Size Data Sets Using the Latest Technologies Such as Apache Spark, Presto, Hive and Hadoop on Qubole’s Self-Service, Infinitely Scalable Cloud Platform Qubole, the big data as-a-service company, announced today it will be donating time on the Qubole Data Service (QDS) to university classes, giving […]

 
Read More..

Open Source Integration of Airflow and Qubole

  • By Xing Quan
  • February 17, 2016
 

This post was written by Yogesh Garg and Sumit Maheshwari, who are Members of the Technical Staff at Qubole. We are pleased to announce that Qubole has open sourced an Airflow extension to connect with Qubole Data Service (QDS). Using this extension, our customers will be able to use Airflow for creation and management of […]

 
Read More..

Our own Swati Singhi at the Grace Hopper Celebration

  • By Xing Quan
  • February 8, 2016
 

Swati Singhi, a Member of the Technical Staff at Qubole, was recently featured as a speaker at the Grace Hopper Celebration of Women in Computing, held in Bangalore, India. The Grace Hopper Celebration is the world’s largest technical conference for women in computing, and it is designed to bring the research and career interests of […]

 
Read More..

Optimizing S3 Bulk Listings for Performant Hive Queries

  • By Amogh Margoor
 

Introduction We previously wrote about the optimizations we made to Hadoop and Hive on S3. Since then, we’ve applied those same changes across the rest of our Big Data analytics offerings, including Spark and Presto. Today, we’ll discuss some recent optimizations we’ve made to make querying data more performant and efficient for […]

 
Read More..

Infographic: Big Data Belongs in the Cloud

  • By Xing Quan
  • February 4, 2016
 

Big Data infrastructure is complex, difficult to build and operate, and often requires highly specialized talent to maintain. To alleviate these challenges, businesses are turning to the cloud to provide simplicity, flexibility and agility. The graphic below highlights Qubole customers’ leadership due to the ease of administration, scaling, lifecycles, flexibility, and costs.     Qubole […]

 
Read More..

Qubole Closes $30 Million Investment to Extend Leadership in Big Data in the Cloud

  • By Jonathan Buckley
  • January 20, 2016
 

IVP leads Series C financing along with existing investors CRV, Lightspeed Venture Partners and Norwest Venture Partners Qubole, the big data-as-a-service company, today announced that it has closed a $30 million Series C financing, bringing its total funding to $50 million. IVP led the financing and General Partner Somesh Dash will join the Qubole board […]

 
Read More..

Building Qubole: Metrics and Alerts

  • By Rajat Venkatesh
  • January 11, 2016
 

In this blog post, we’ll show you how we collect metrics and set up alerts to ensure the availability of Qubole Data Service (QDS).   QDS Architecture Before getting into the details about monitoring, we’ll give a quick introduction to the QDS architecture.   QDS runs and manages Hadoop/Spark/Presto clusters in our customers’ AWS, GCP, […]

 
Read More..

Qubole Appoints Jonathan Trail as Vice President of Customer Success

  • By Jonathan Buckley
  • December 22, 2015
 

Qubole, the big data as-a-service company, today announced that it has appointed Jonathan Trail as Qubole’s first Vice President of customer success. As VP of customer success, Trail will work closely with Jonathan Buckley, SVP of marketing, and Marcy Campbell, SVP of worldwide sales and business development. Together, they will work to continue the company’s […]

 
Read More..

Qubole Ignites Apache Spark on Google Cloud Platform

  • By Jonathan Buckley
  • December 17, 2015
 

Qubole, the big data-as-a-service company, today announced the availability of Apache Spark on Qubole Data Service (QDS) for Google Cloud Platform. The integration will enable Google Cloud Platform customers to use QDS’s 1-click persistent Spark Notebooks for fast data analysis, and auto-scale Spark clusters that deliver the right compute power for specific workloads. Qubole Data […]

 
Read More..

Getting started with Spark on QDS for Google Cloud Platform

 

Starting today, Qubole Data Service (QDS) users can launch Auto-scaling Spark Clusters and 1-click Persistent Notebooks to analyze data persisting in Google Cloud Storage. To set up a trial account, follow the instructions in our Google Cloud Platform Quick Start Guide. With auto-scaling, you no longer need to manually set the cluster size to achieve […]

 
Read More..

Share RDDs Across Jobs with Qubole’s Spark Job Server

 

When we launched our Spark as a Service offering in February, we designed it to run production workloads. Users would write standalone Spark applications and run them via our UI or API. We then enhanced the offering by adding support for running these standalone Spark applications on a schedule using our scheduler or as part […]

 
Read More..

Riding the Spotted Elephant

 

Introduction: One of the benefits of moving Hadoop workloads to the cloud is reducing cost and risk. No upfront capital expense on hardware is required and ongoing expenditure scales only in response to actual usage. This greatly lowers risk. Services like Qubole eliminate administration overhead as well. Amazon EC2 offers multiple instance purchasing options. […]

 
Read More..

Share Data Across Accounts with Data Exchange

  • By Xing Quan
  • November 4, 2015
 

This post was written by Vikram Agrawal and Aswin Anand, who are both lead engineers at Qubole. Qubole has the concept of users and accounts. While customers sign in as a single user, they can also belong to one or more accounts. This account segregation provides some nice logical separation for compute clusters and metadata. […]

 
Read More..

Introducing Hadoop, Spark, and Presto Clusters With Zero Local Disk Storage

 

We’re excited to announce that Qubole can now run Hadoop, Spark, and Presto clusters with zero local disk storage. We now support AWS M4 and C4 instance types, which do not include local disk storage and instead utilize either S3 (for long-lived data) or EBS (network attached disk-storage for holding intermediate and temporary data) for […]

 
Read More..

Interning at Qubole: What I Learned From Working on Hive, Spark, and Sqoop

  • By Xing Quan
  • October 5, 2015
 

This is a guest post from Akhilesh Anandh, who was an engineering intern with us. My journey with Qubole began in January 2015, when I joined as an intern for 6 months (my final semester of college) under the PS-2 programme of my alma mater BITS Pilani. I spent another 2 months at Qubole from […]

 
Read More..

Announcing Support for AWS IAM Roles

  • By Xing Quan
  • September 3, 2015
 

We’re excited to announce support for Identity and Access Management (IAM) Roles for delegating permissions and access to Qubole. IAM Roles are a security best practice on AWS. Customers no longer need to provide access and secret keys to Qubole, making access control more secure. Here’s some background on why Qubole requires access to our […]

 
Read More..

Multi-tenant Job History Server for Ephemeral Hadoop and Spark Clusters

 

Introduction Qubole Data Service (QDS) allows users to configure logical Hadoop and Spark clusters that are instantiated when required. These clusters auto-scale according to the workload and shut down automatically when there is a period of inactivity, resulting in substantial cost savings. This feature, however, presents an additional challenge for supporting and debugging logs. For […]

 
Read More..

SQL-On-Hadoop Evaluation by Pearson

  • By Nate Philip
  • August 13, 2015
 

This is a guest post written by Sumit Arora, Lead Big Data Architect at Pearson, and Asgar Ali, Senior Architect at Happiest Minds Technologies Pvt. Ltd. About Pearson Pearson is the world’s leading learning company, with 40,000 employees in more than 80 countries working to help people of all ages to make measurable progress in […]

 
Read More..

Qubole’s Big Data as a Service Platform Gains Rapid Traction in Mobile Data Applications

  • By Nate Philip
  • July 30, 2015
 

MOUNTAIN VIEW, Calif.—July 30, 2015—Qubole, the big data-as-a-service company founded by the team that developed Facebook’s data infrastructure, today reported rapid adoption of its self-service big data analytics platform for mobile applications in the first half of 2015. The Qubole big data as a service platform processes data stored on the three major public clouds: […]

 
Read More..

Presto-Amazon Kinesis Connector for Interactively Querying Streaming Data

  • By Sivaramakrishnan Narayanan
  • July 16, 2015
 

This content was authored by Qubole and originally published on the AWS Big Data Blog. Amazon Kinesis is a scalable and fully managed service for streaming large, distributed data sets. As applications (particularly on mobile and wearable devices) start to collect more and more data, Amazon Kinesis is becoming the starting point for data ingestion […]

 
Read More..

Drag-n-Drop upgrades of Hadoop, Spark and Presto Clusters

  • By Mayank Ahuja
  • July 15, 2015
 

Introduction As the Big Data stack has matured, many companies have started using large clusters for running business critical applications. Workloads in such clusters are often long running (for hours or even days) and restarting a cluster poses a big problem: What happens to jobs that are already running? Restarting all these jobs wastes a […]

 
Read More..

Hive JDBC Storage Handler

  • By Divyanshu Goyal
  • July 14, 2015
 

As a part of my summer internship project at Qubole, I worked on an open-source Hive JDBC storage handler (GitHub). This project helped me improve my knowledge of distributed systems and gave me exposure to working on a team on large projects. In many big data projects, integrating data from multiple sources is […]

 
Read More..

Announcing Saved Queries for Qubole Data Service

  • By Raghunandan Balachandran
  • July 2, 2015
 

We are always striving to add features to simplify the experience of our customers using Qubole Data Service (QDS). One of the major feature requests that has come up time and again is the ability to design queries and save them in a design-time repository. This concept would allow separation of design-time artifacts […]

 
Read More..

Qubole Recognized as Advanced Technology Partner by Amazon Web Services

  • By Nate Philip
  • July 1, 2015
 

With Qubole on AWS, any size organization can become data-driven with self-service access to the latest big data technologies MOUNTAIN VIEW, Calif., July 1, 2015—Qubole, the big data-as-a-service company founded by the team that developed Facebook’s data infrastructure, today announced it is now an Amazon Web Services (AWS) Advanced Technology Partner. Qubole’s self-service platform for […]

 
Read More..

CUBE Keyword in Apache Hive

  • By Rajat Venkatesh
  • June 19, 2015
 

Introduction As part of a recent project, I had to experiment with CUBE functionality in Hive. This functionality was added somewhat recently to Hive (version 0.10) and is an advanced use case in Hive. Perhaps for these reasons, it is difficult to find examples other than the one in the Hive Wiki. In […]

 
Read More..

Rebalancing Hadoop Clusters for Higher Spot Utilization

  • By Hariharan Iyer
  • June 9, 2015
 

Running Hadoop clusters efficiently is an important customer use case at Qubole. When running in AWS, this often means using Spot instances efficiently. In this post we introduce the notion of Rebalancing Hadoop clusters to achieve a higher mix of Spot instances – while still maintaining reliability and meeting SLAs. Spot Instances At Qubole, many […]

 
Read More..

Apache Hadoop 2.6.0 Now Generally Available on Qubole

  • By Xing Quan
  • June 4, 2015
 

We’re excited to announce that Apache Hadoop 2.6.0, the latest stable release* of Apache Hadoop, is now generally available on Qubole. Hadoop 2.6.0 is compatible with all of the usual services that Qubole offers, including Spark, Hive, Pig, and MapReduce. In addition, the optimizations that we’ve made for operating in the cloud, such as auto-scaling […]

 
Read More..

Announcing Qubole’s HBase-as-a-Service for AWS

  • By Jonathan Buckley
  • May 6, 2015
 

Today we are pleased to announce the Beta offering of Qubole’s HBase-as-a-Service. QDS can now provide HBase 1.0.0 running on Hadoop 2.6.0 as a fully managed service on the AWS Cloud. Introduction to HBase Apache HBase is an integral part of the Apache Hadoop ecosystem. When fast reads and writes with high concurrency and […]

 
Read More..

Bridging HDFS2 with HDFS1

  • By Rajat Jain
  • March 14, 2015
 

The industry is rapidly moving to adopt Hadoop 2.x. With every upgrade process, especially one this large, there is a level of complexity involved. Qubole has already started offering a beta service to our customers. Our customers have started to try out Hadoop 2 as well, and as with any […]

 
Read More..

Hadoop with Enhanced Networking on AWS

  • By Hariharan Iyer
  • March 13, 2015
 

Introduction At Qubole, many of our customers run their Hadoop clusters on AWS EC2 instances. Each of these instances is a Linux guest on a Xen hypervisor. Traditionally each guest’s network traffic goes through the hypervisor, which adds a little bit of overhead to the bandwidth. EC2 now supports Single Root I/O Virtualization (called Enhanced […]

 
Read More..

Plugging in Presto UDFs

  • By Sivaramakrishnan Narayanan
  • March 4, 2015
 

Presto is a great query engine for a variety of SQL workloads. We’ve been offering Presto-as-a-Service for many months now and a frequent question that comes up is: “How can I plug in custom user-defined functions in Presto?” In this blog post, we will answer this very question. We’ve created a Presto UDF Project on GitHub that […]

 
Read More..

Qubole Adds Apache Spark to Hadoop-based Cloud Offering

 

One of the things customers love about Qubole is that they’re able to use the latest and greatest technologies without having to fiddle with deploying them on their own. Continuing this tradition, I’m pleased to announce that we’ve expanded our portfolio of services on the Qubole Data Service (QDS) platform to include Apache Spark. Data scientists […]

 
Read More..

Qubole on Azure

  • By Swati Singhi
  • February 9, 2015
 

Qubole is the leading provider of Hadoop as a service. Our mission is to provide a simple, integrated, high-performance big data stack that businesses can use to derive actionable insights from their data sources quickly. Qubole Data Service (QDS) offers self-service and auto-scaling Hadoop in the cloud (patent pending) along with an integrated suite of data […]

 
Read More..

Re-using JVMs across Hadoop jobs

  • By Sivaramakrishnan Narayanan
  • December 22, 2014
 

One of the oft-discussed problems with Hadoop is that it launches new JVMs for each map or reduce task. Launching a new JVM and loading all the classes is pretty expensive and can take anywhere from 4-8 seconds. If the job is a small one, this startup overhead can be a substantial part of overall […]

 
Read More..

High Performance Hadoop with New Generation AWS Instances

 

Welcome New Generation Instance Types Amazon Web Services (AWS) offers a range of instance types for supporting compute-intensive workloads. The compute optimized instance family has a higher ratio of compute power to memory. The older generation C1 and CC2 instance types have been very useful in batch data processing frameworks such as Hadoop. Late last […]

 
Read More..

Securely sharing data across Organizations with Qubole

 

Customers love that Qubole enables collaboration via a shared workbench across multiple analysts in an organization. Increasingly though, we have started finding use cases where organizations want to share data across Qubole accounts. Departments in different geographies want to share selected data sets with each other. Also, organizations want to share data with their partner […]

 
Read More..

Caching in Presto

 

Qubole’s Presto-as-a-Service is primarily targeted at Data Analysts who are tasked with translating ad-hoc business questions into SQL queries and getting results. Since the questions are often ad-hoc, there is some trial and error involved. Therefore, arriving at the final results may involve a series of SQL queries. By reducing the response time of these […]

 
Read More..

Qubole Releases Industry’s First Auto-Scaling Presto Clusters

 

Qubole was the first big data platform to offer a true auto-scaling Hadoop-as-a-Service solution. Now, Qubole is pleased to announce the industry’s first auto-scaling Presto-as-a-Service solution. Why Auto-Scaling Presto-as-a-Service Explorative analytics is one area that can get quite bursty. A single business question can easily require multiple short queries. For example, let’s say a data […]

 
Read More..

June 2014 Product Update

 

At Qubole, we’re continually improving our platform and making the features and functionality that matter most to our users a reality. This month, we’re proud to announce the launch of several vital new features on our platform. Multi-Cluster – We are pleased to announce a new feature in Qubole Data Service (QDS) – support for […]

 
Read More..

Canonicalizing Hive queries to find top workloads

 

Motivation One of Qubole Data Service’s most popular offerings is Hive-as-a-Service in the cloud. Users run a large number of ad-hoc, analytical Hive queries against their data in S3 or HDFS. It wasn’t apparent to us how many of these queries were truly unique and how many were simple variants. The hypothesis was that if […]

 
Read More..

Presto Performance

  • By Sivaramakrishnan Narayanan
  • April 14, 2014
 

Presto is an open source distributed SQL query engine, developed by Facebook. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses. Qubole started its Presto-as-a-Service program a few weeks ago to make it easily accessible with a single click for its users. A good […]

 
Read More..

Save Time Executing Hive Queries Using Command Templates

 

A common characteristic of many analytics queries is that they are mostly invariant in form and function. Over multiple invocations of the query or command, one would find that only the range of inputs varies in the form of a couple of inputs, while the major part of the query remains the same. Command templates […]

 
Read More..

Easy reusable commands with templates

  • By Hariharan Iyer
  • January 7, 2014
 

A common characteristic of many analytics queries is that they are mostly invariant in form and function. Over multiple invocations of the query or command, one would find that only a couple of inputs vary, while the major part of the query remains the same. Command templates […]

 
Read More..

Waiting for Mr. Ntpd

  • By Swati Singhi
 

In one of our earlier blog posts, we announced the availability of the Qubole Hadoop Platform on Google Compute Engine. This was also featured on the Google Cloud Platform Blog. In this post we talk about a critical issue that we faced (and eventually managed to circumvent) a few days before the Qubole-on-GCE beta release. […]

 
Read More..

Qubole Available on Google Compute Engine

  • By Joydeep Sen Sarma
  • December 7, 2013
 

Qubole is a leading provider of Hadoop as a service with the mission of providing a simple, integrated, high-performance big data stack that businesses can use to derive actionable insights from their data sources quickly. Qubole Data Service offers self-service and auto-scaled Hadoop in the cloud along with an integrated library of data connectors and […]

 
Read More..

Deploy Demotrends using the Scheduler

 

Introduction: In previous blog posts, we explained how to create a data pipeline to process the raw data, generate a list of trending topics and export it to the web app. In this blog, we will explain to you how to deploy the data pipeline using the Qubole scheduler. The data pipeline shown in the […]

 
Read More..

Incremental Hive for Workflows

 

Introduction: At Qubole, we are working on various products to make analysis on top of big data easier and simpler. One of the core products in this lineup is the Scheduler, which offers our users a way to schedule periodic workflows that run at specific intervals. Scheduler is an important part of the analytic pipeline […]

 
Read More..

Faster Hive with a single click

 

Background: One of our goals at Qubole is to make analysis of large data sets simple and optimal – especially in cloud environments. In our experience working with data analysts and engineers, we observed a common pattern. During the query authoring phase, analysts end up writing a number of ad-hoc queries as part of an […]

 
Read More..

Qubole Hive Server

 

Qubole offers Hive as a service. When a user logs in to Qubole, he/she sees the tables and functions associated with their account and can submit a HiveQL command via the composer pane. Qubole takes care of executing the HiveQL command, spawning a Hadoop cluster if necessary and saving results and logs. Now, multiple users […]

 
Read More..

A Performance Comparison of Qubole and Amazon Elastic MapReduce (EMR)

 

Summary: Here at Qubole, our core focus is on providing the best platform to analyze data in the Cloud. We are simplifying the complicated infrastructure necessary for analytics and making the tools accessible for users with various skill sets and experience levels. We have also continued to optimize Hadoop and Hive for use in the […]

 
Read More..

Sqoop as a Service

 

Background: As Qubole Data Service has gained adoption, many of our customers have asked for an import and export facility from their relational data sources into the Cloud (S3). Dimension data from such data sources is an important part of data analysis. Log files (a.k.a. fact tables) in S3 are often desired to be joined with […]

 
Read More..

Top-K Optimization

 

It started with an innocent tweet in response to a blog post on how to optimize top-k queries. My colleague, Shrikanth, pointed out that Hive does not, in fact, have this optimization. After a couple of months, I finally got a chance to implement the optimization, and that is the topic of this blog. What’s Top-K? […]

 
Read More..

Optimizing Hadoop for S3 – Part 1

 

Introduction: Users of Qubole Data Service use Hive queries or Hadoop jobs to process data that resides in Amazon’s Simple Storage Service (S3). S3 has many advantages including data security mechanisms and high reliability. However, S3 is much slower than HDFS and direct attached storage. In this first of a series of posts, we dive […]

 
Read More..

Industry’s First Auto-Scaling Hadoop Clusters

 

Background: In 2009 I first started playing around with Hive and EC2/S3. I was blown away by the potential of the cloud. But it bothered me that the burden of sizing the cluster was put on the user. How would an analyst know how many machines were required for a given query or a job? […]

 
Read More..