Qubole has been a huge win for us. Qubole has proven to be stable at petabyte scale and has given us 30%-60% higher throughput than Amazon EMR. It has also made it extremely easy to onboard non-technical users.
Pinterest Data Engineer
Big data plays a big role at Pinterest. With more than 30 billion Pins in the system, Pinterest is building the most comprehensive collection of interests online. One of the challenges associated with building a personalized discovery engine is scaling its data infrastructure to traverse the interest graph to extract context and intent for each Pin.
Pinterest currently logs 20 terabytes of new data each day and has around 10 petabytes of data in S3. The company uses Hadoop to process this data, which enables it to put the most relevant and recent content in front of Pinners through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows Pinterest to put every user-facing change through rigorous experimentation and analysis.
In order to build big data applications quickly, Pinterest evolved its single-cluster Hadoop infrastructure into a ubiquitous self-serve platform. Though Hadoop is a powerful processing and storage system, it’s not a plug-and-play technology. Because it wasn’t designed with cloud or elastic computing, or non-technical users, in mind, its original design falls short as a self-serve platform.
Early on, Pinterest used Amazon’s Elastic MapReduce to run all of its Hadoop jobs. EMR played well with S3 and Spot Instances, and was generally reliable. As the company scaled to a few hundred nodes, EMR became less stable. Pinterest chose Hive for most of its Hadoop jobs primarily because the SQL interface is simple and familiar to people across the industry. However, Pinterest started running into limitations of EMR’s proprietary versions of Hive.
For large dependencies that take a while to install, Pinterest preinstalled them on an Amazon Machine Image (AMI), including Hadoop Libraries and a Natural Language Processing Library Package for internationalization. However, support for this approach by Hadoop service providers was difficult to find.
The company had already built so many applications on top of EMR that it was hard to migrate to a new system. Pinterest also didn’t know what it wanted to switch to because some of the nuances of EMR had crept into the actual job logic. In order to experiment with other flavors of Hadoop, Pinterest implemented an executor abstraction and moved all the EMR-specific logic into the EMRExecutor. This gave Pinterest the flexibility to experiment with a few flavors of Hadoop and Hadoop service providers, while enabling it to do a gradual migration with minimal downtime.
Pinterest ultimately migrated its Hadoop jobs to Qubole, a rising player in the Hadoop as a Service space. Given that EMR had become unstable at Pinterest’s large scale, the company had to move quickly to a provider that played well with AWS (specifically, Spot Instances) and S3. Qubole Data Service (QDS) supported AWS/S3 and was relatively easy to get started on.
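The executor abstraction described above might look roughly like the following sketch. Only the name EMRExecutor comes from the text; the base class, the QuboleExecutor class, the run_hive_query method, the get_executor helper, and the sample query are hypothetical names chosen for illustration, and the provider calls are stubbed out rather than taken from any real SDK.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class JobResult:
    """Outcome of a submitted job (hypothetical shape, for illustration)."""
    executor: str
    query: str
    succeeded: bool


class Executor(ABC):
    """Provider-neutral interface that job logic depends on."""

    name = "base"

    @abstractmethod
    def run_hive_query(self, query: str) -> JobResult:
        """Submit a Hive query and report the outcome."""


class EMRExecutor(Executor):
    """All EMR-specific logic is isolated here (stubbed in this sketch)."""

    name = "emr"

    def run_hive_query(self, query: str) -> JobResult:
        # A real implementation would submit the query to EMR here.
        return JobResult(self.name, query, succeeded=True)


class QuboleExecutor(Executor):
    """Hypothetical second provider, added without touching job logic."""

    name = "qubole"

    def run_hive_query(self, query: str) -> JobResult:
        # A real implementation would submit the query to the provider here.
        return JobResult(self.name, query, succeeded=True)


# Registry so a job (or a config flag) can pick a provider by name.
EXECUTORS = {cls.name: cls for cls in (EMRExecutor, QuboleExecutor)}


def get_executor(name: str) -> Executor:
    """Return an executor instance for the given provider name."""
    return EXECUTORS[name]()


def daily_metrics_job(executor: Executor) -> JobResult:
    """Job logic sees only the Executor interface, never a provider."""
    return executor.run_hive_query("SELECT COUNT(*) FROM pins")
```

With this shape, a gradual migration amounts to switching individual jobs from get_executor("emr") to get_executor("qubole") one at a time, since the job logic itself no longer contains provider-specific details.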
After vetting Qubole and comparing QDS performance against alternatives (including managed clusters), Pinterest decided to go with Qubole for a few reasons:
“Overall, Qubole has been a huge win for us, and we’ve been very impressed by the Qubole team’s expertise and implementation,” comments Mohammad Shahangian, Pinterest Data Engineer. “Over the last year, Qubole has proven to be stable at petabyte scale and has given us 30%-60% higher throughput than EMR. It’s also made it extremely easy to onboard non-technical users.”
With Pinterest’s current setup, Hadoop is a flexible service that’s adopted across the organization with minimal operational overhead. Pinterest has over 100 regular MapReduce users running over 2,000 jobs each day through QDS’ web interface, ad-hoc jobs and scheduled workflows.
Pinterest has six standing Hadoop clusters comprising over 3,000 nodes, and developers can choose to spawn their own Hadoop cluster within minutes. The company generates over 20 billion log messages and processes nearly a petabyte of data with Hadoop each day.
Pinterest is currently experimenting with managed Hadoop clusters, including Hadoop 2, but for now, using cloud services such as S3 and QDS is the right choice for the company because it frees Pinterest up from the operational overhead of Hadoop and allows the company to focus its engineering efforts on big data applications.
Qubole is a significantly more polished product than EMR. Data scientists can explore their data in S3, create tables and query those tables all via an easy-to-use web UI.
Qubole’s fantastic support has been key in our successful deployment. They continue to deliver new features and revisit the ones that we ask for.
Our goal at MediaMath was to take our existing industry-leading infrastructure to the next level, handling new, complex analytics tasks. Qubole has helped us achieve this goal with minimal risk.
Instead of worrying about provisioning clusters of machines or job flows or whatever, Qubole lets you focus on your data and your queries … The Qubole guys have been extremely helpful!
The service spins up users’ clusters only when a job is started, then automatically scales or contracts them based on the workload, and spins the servers down once the job is done.
Qubole’s Hadoop and Hive interfaces are vastly superior to the default CLIs, which scare business analysts and hinder meaningful analyses of the gaming logs that we collect. With Qubole, business analysts are self-sufficient in using a Big Data platform to meet their advanced analytic needs.
Online Gaming Company
Top-performing technologies in the data industry are definitely taking aim at democratizing data tools and bringing the power of data to smaller businesses. This is a major change in the data industry, and Qubole Data Service is a great example.
I’m very happy to be using Qubole in production. Qubole has saved me a lot of time, effort, and trouble in getting my data processing pipelines up and running. My data pipelines process Appnexus data in Amazon S3 which is then stored in Vertica. The engineering team understands the complexities and provided awesome support!
Real-time Ads Retargeting Startup
There’s a whole world of web companies, SMBs and other non-Facebooks or Yahoos that will want to use Hadoop but not want to run it in-house…offering a cloud service makes it easier for these users to get started with the platform and for Qubole to keep improving.
Qubole offers a big data ETL and exploration service through auto-scaling Hadoop clusters with a web user interface for data exploration and integration with various data sources. The service can do (nearly) everything EMR can do, and it goes further
Big Data Republic
Simba knows Big Data access. Qubole knows Big Data. Qubole’s founders authored Apache Hive, built key parts of the Hadoop ecosystem and brought Apache HBase to Facebook.
“The integration of Tableau and Qubole makes it faster and easier for our customers to operationalize Big Data…lowers the resource barriers to deriving the benefits of Big Data because customers can deploy our joint solution seamlessly and cost effectively.”