Case Study Pinterest

Pinterest Builds a Self-Service Platform for Hadoop Using Qubole Data Service

Qubole has been a huge win for us. Qubole has proven to be stable at petabyte scale and has given us 30%-60% higher throughput than Amazon EMR. It has also made it extremely easy to onboard non-technical users.

Mohammad Shahangian

Pinterest Data Engineer

Challenges

Big data plays a big role at Pinterest. With more than 30 billion Pins in the system, Pinterest is building the most comprehensive collection of interests online. One of the challenges associated with building a personalized discovery engine is scaling its data infrastructure to traverse the interest graph to extract context and intent for each Pin.

Pinterest currently logs 20 terabytes of new data each day and has around 10 petabytes of data in S3. The company uses Hadoop to process this data, which enables it to put the most relevant and recent content in front of Pinners through features such as Related Pins, Guided Search, and image processing. Hadoop also powers thousands of daily metrics and allows Pinterest to put every user-facing change through rigorous experimentation and analysis.

Objectives
  • Social pinboard optimization
  • 100 MapReduce users running over 2,000 jobs each day through Qubole’s web interface, ad-hoc
    jobs and scheduled workflows
  • Data: 10 petabytes, six Hadoop clusters
    comprised of over 3,000 nodes
  • Developers spawn Hadoop clusters in minutes
  • Users process a petabyte of data with Hadoop daily
  • Support for 100% spot instance clusters
  • Horizontally scalable to 1000s of nodes on a single cluster
  • Superior integration with Hive

Building a Self-Serve Platform for Hadoop

To build big data applications quickly, Pinterest evolved its single-cluster Hadoop infrastructure into a ubiquitous self-service platform. Though Hadoop is a powerful processing and storage system, it is not a plug-and-play technology. Because it was not designed with cloud or elastic computing, or non-technical users, in mind, its original design falls short as a self-serve platform.

Early on, Pinterest used Amazon’s Elastic MapReduce to run all of its Hadoop jobs. EMR played well with S3 and Spot Instances, and was generally reliable. As the company scaled to a few hundred nodes, EMR became less stable. Pinterest chose Hive for most of its Hadoop jobs primarily because the SQL interface is simple and familiar to people across the industry. However, Pinterest started running into limitations of EMR’s proprietary versions of Hive.

Pinterest preinstalled large dependencies that take a while to install, including Hadoop libraries and a natural language processing library package for internationalization, on an Amazon Machine Image (AMI). However, support for this approach by Hadoop service providers was difficult to find.

The company had already built so many applications on top of EMR that migrating to a new system was hard. Pinterest also didn’t know what it wanted to switch to, because some of the nuances of EMR had crept into the actual job logic. To experiment with other flavors of Hadoop, Pinterest implemented an executor abstraction and moved all the EMR-specific logic into the EMRExecutor. This gave Pinterest the flexibility to experiment with a few flavors of Hadoop and Hadoop service providers, while enabling it to carry out a gradual migration with minimal downtime.
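The executor abstraction described above can be illustrated with a minimal sketch. This is not Pinterest's actual code: only the name EMRExecutor comes from the text, while Executor, QuboleExecutor, JobResult, pick_executor, run_job, and the job names are hypothetical, and the backends are stubbed rather than calling any real EMR or Qubole API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class JobResult:
    """Outcome of one job run (hypothetical structure for illustration)."""
    job_name: str
    backend: str
    succeeded: bool


class Executor(ABC):
    """Interface that job logic depends on instead of a specific provider."""

    @abstractmethod
    def run_hive_query(self, job_name: str, query: str) -> JobResult:
        ...


class EMRExecutor(Executor):
    """Holds all EMR-specific logic. Stubbed here: no real EMR call is made."""

    def run_hive_query(self, job_name: str, query: str) -> JobResult:
        # A real implementation would submit the query to an EMR cluster.
        return JobResult(job_name=job_name, backend="emr", succeeded=True)


class QuboleExecutor(Executor):
    """Hypothetical second backend. Stubbed: no real Qubole call is made."""

    def run_hive_query(self, job_name: str, query: str) -> JobResult:
        # A real implementation would submit the query through the provider's API.
        return JobResult(job_name=job_name, backend="qubole", succeeded=True)


def pick_executor(job_name: str, migrated_jobs: set) -> Executor:
    """Route a job to the new backend only once it is marked as migrated."""
    if job_name in migrated_jobs:
        return QuboleExecutor()
    return EMRExecutor()


def run_job(job_name: str, query: str, migrated_jobs: set) -> JobResult:
    """Job logic sees only the Executor interface, never a provider."""
    executor = pick_executor(job_name, migrated_jobs)
    return executor.run_hive_query(job_name, query)


if __name__ == "__main__":
    migrated = {"daily_metrics"}  # hypothetical job names
    for name in ("daily_metrics", "related_pins"):
        result = run_job(name, "SELECT 1", migrated)
        print(result.job_name, "ran on", result.backend)
```

Because jobs depend only on the interface, moving a job to a new provider amounts to adding its name to the migrated set, which is one way a gradual migration with minimal downtime could be staged.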

Why Qubole Data Service?

Pinterest ultimately migrated its Hadoop jobs to Qubole, a rising player in the Hadoop as a Service space. Given that EMR had become unstable at Pinterest’s scale, the company had to move quickly to a provider that played well with AWS (specifically, spot instances) and S3. Qubole Data Service (QDS) supported AWS/S3 and was relatively easy to get started on.

After vetting Qubole and comparing QDS performance against alternatives (including managed clusters), Pinterest decided to go with Qubole for a few reasons:

  1. Horizontally scalable to 1000s of nodes on a single cluster
  2. Responsive 24/7 data infrastructure engineering support
  3. Tight integration with Hive
  4. Google OAuth ACL and a Hive Web UI for non-technical users
  5. API for simplified executor abstraction layer + multi-cluster support
  6. Baked Amazon Machine Image customization (available with premium support)
  7. Advanced support for spot instances – with support for 100% spot instance clusters
  8. S3 eventual consistency protection
  9. Graceful cluster scaling and auto-scaling

Results

“Overall, Qubole has been a huge win for us, and we’ve been very impressed by the Qubole team’s expertise and implementation,” comments Mohammad Shahangian, Pinterest Data Engineer. “Over the last year, Qubole has proven to be stable at petabyte scale and has given us 30%-60% higher throughput than EMR. It’s also made it extremely easy to onboard non-technical users.”

With Pinterest’s current setup, Hadoop is a flexible service that’s adopted across the organization with minimal operational overhead. Pinterest has over 100 regular MapReduce users running over 2,000 jobs each day through QDS’ web interface, ad-hoc jobs and scheduled workflows.

Number of scheduled daily workflow jobs run over time. Does not include ad-hoc jobs.

Pinterest has six standing Hadoop clusters comprised of over 3,000 nodes, and developers can choose to spawn their own Hadoop cluster within minutes. The company generates over 20 billion log messages and processes nearly a petabyte of data with Hadoop each day.

The Future

Pinterest is currently experimenting with managed Hadoop clusters, including Hadoop 2, but for now, using cloud services such as S3 and QDS is the right choice for the company because it frees Pinterest from the operational overhead of Hadoop and allows the company to focus its engineering efforts on big data applications.
