5 Big Data Infrastructure Implementations
- By Dharmesh Desai
- July 28, 2016
One of the great things about the big data industry is how willing practitioners are to share their knowledge, thought processes, and experience. We love it when our customers talk about their implementations, and it’s amazing to see what they’ve accomplished. Here’s a collection of some of our favorite blog posts:
1. Powering Big Data at Pinterest
Mohammad Shahangian, a data engineer at Pinterest, describes how the engineering team built a self-serve platform for Hadoop.
“We currently log 20 terabytes of new data each day, and have around 10 petabytes of data in S3. We use Hadoop to process this data, which enables us to put the most relevant and recent content in front of Pinners through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.
In order to build big data applications quickly, we’ve evolved our single cluster Hadoop infrastructure into a ubiquitous self-serving platform.” Read More
2. The Big Data Lifecycle at TubeMogul
Chris Chanyi, Senior Data Architect at TubeMogul, goes in-depth on the history of the company’s data architecture and how it has transformed to better manage big data.
“One of our recent blog posts detailed how we handle over A Trillion HTTP Requests a Month – a dizzying number. All of these HTTP requests mean that TubeMogul receives and stores some kind(s) of data for each request that needs to move through our data pipeline. While getting that data isn’t easy, storing, retrieving, querying, and aggregating all of this data is even more difficult. Thanks to a number of cloud services and improvements to our services, handling this data has gotten easier over the years. To understand how we handle this volume of data, it’s important to understand how we started…” Read More
3. Big Data: New Options for Implementation
MediaMath, a digital marketing technology company, has seen increasing demand from customers for direct access to data. To address this demand, the company built a data platform on Amazon Web Services and uses Qubole to give clients direct access to data.
“In recent years, MediaMath has experienced increasing client demand for direct access to the data. To address that demand, the company created a scalable data platform built on Amazon Web Services (aws.amazon.com); the data platform consumes terabytes of data and nearly 10 billion log records each day. With the scale and access challenges under control, MediaMath began testing Qubole (qubole.com), a self-service portal for big data analytics, to support the growing analytic needs both for clients and for internal use. Qubole’s analytics suite test was successful, and MediaMath rolled out the offering, first internally, then directly to clients.” Read More
4. Moving Past Infrastructure Limitations
Rory Sawyer, a Software Engineer at MediaMath, describes the company’s journey of developing a data platform and why the company abandoned its on-premises hardware.
“Here at MediaMath, we’re quite fond of data. It would be surprising to hear someone say they’re not fond of data, of course, but we’ve spent the last 18 months proving to ourselves and our clients that we really mean it. Our company is built around driving concrete, measurable results, and our clients – both internal and external – have sophisticated analytics teams that want access to the data we generate for their own analysis, owned marketing, budgeting, and more. In this post we will describe the journey from data warehouse to data platform and the success of ditching our on-premises hardware. It has enabled our business to grow and for different personas in our organization to innovate, be more productive and expand their roles and how they impact the overall business.” Read More
5. Evaluating Interactive Query Solutions
Sumit Arora, Lead Big Data Architect at Pearson, and Asgar Ali, Senior Architect at Happiest Minds Technologies Pvt., describe how they evaluated various interactive query solutions to meet the company’s research requirements.
“There are many interactive query solutions available in the big data ecosystem. Some of these are proprietary solutions (like Amazon Redshift) while others are open source (Spark SQL, Presto, Impala, etc.). Similarly, there are a number of file formats to choose from – Parquet, Avro, ORC, etc. We “theoretically” evaluated five of these products (Redshift, Spark SQL, Impala, Presto and H2O) based on the documentation/feedback available on the web and decided to shortlist two of them (Presto and Spark SQL) for further evaluation.” Read More
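As a rough illustration of why columnar file formats such as Parquet and ORC pair well with interactive query engines, here is a minimal, self-contained Python sketch (not taken from the post above; the table and column names are invented for illustration). It contrasts a row-oriented scan, which materializes every field of every record, with a column-oriented layout, where an aggregate over one column touches only that column’s values:

```python
# Toy illustration: row-oriented vs column-oriented storage.
# Columnar formats (Parquet, ORC) let a query engine read only the
# columns a query needs, which is why they suit analytical scans.

# Row-oriented table: one dict per record (hypothetical data).
rows = [
    {"user_id": i, "country": "US", "clicks": i % 7}
    for i in range(1000)
]

# Row-oriented scan: every record is visited in full even though
# the query only needs the "clicks" field.
total_row = sum(r["clicks"] for r in rows)

# Column-oriented layout: each column stored contiguously, so the
# same aggregate reads a single array of values.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "country": [r["country"] for r in rows],
    "clicks": [r["clicks"] for r in rows],
}
total_col = sum(columns["clicks"])

assert total_row == total_col
print(total_col)  # both layouts agree on the aggregate
```

Real columnar formats add encoding, compression, and predicate pushdown on top of this basic layout, but the I/O argument is the same: a query that touches 2 of 50 columns can skip roughly 96% of the data.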