Case Study: Pinterest’s Journey to Qubole

July 29, 2014. Updated December 5th, 2017

With 20 terabytes of new data logged each day, managing big data is not optional at Pinterest. To provide an optimal user experience with the most relevant and recent content, Pinterest turned to Hadoop to help process the data.

Unfortunately, Hadoop in its raw form doesn’t act as a self-serve platform: it is built only for technical users, and it lacks elasticity.

To overcome these limitations, Pinterest first turned to Amazon Elastic MapReduce (EMR) to run its Hadoop jobs. However, as the workload scaled to a few hundred nodes, Amazon EMR became less stable, and the engineering team started to run into limitations with EMR’s versions of Hive.

As a result, Pinterest decided to migrate its Hadoop jobs to Qubole. In a post on Pinterest’s engineering blog, Mohammad Shahangian outlined why the company chose Qubole over other big data platforms:

  • Ability to scale horizontally to 1,000s of nodes on a single cluster
  • 24/7 engineering support for data infrastructure
  • A user interface for non-technical users
  • Hive integration
  • Multi-cluster support and a simplified executor abstraction layer
  • Baked AMI customization
  • Support for spot instances
  • S3 eventual consistency protection
  • Graceful autoscaling clusters

Currently, Pinterest has more than 100 regular users of MapReduce who run more than 2,000 jobs each day. To learn more about how Pinterest built its self-serve platform, check out their blog post.