Ibotta Builds a Cost-Efficient, Self-Service Data Lake Using Qubole

April 2, 2019

This blog is a customer guest post written by Ibotta. Ibotta is a mobile technology company that is transforming the traditional rebates industry by providing in-app cashback rewards on receipts and online purchases for groceries, electronics, clothing, gifts, supplies, restaurant dining, and more for anyone with a smartphone.

Today, Ibotta is one of the most-used shopping apps in the United States, driving more than $7 billion in purchases per year to companies like Target, Costco, and Walmart. Ibotta has more than 27 million total downloads and has paid out more than $500 million to users since its founding in 2012.

Maintaining a competitive edge in the ecommerce and retail industry is extremely difficult, because it requires building an engaging and unique shopping experience for consumers.

Ibotta’s Previous Data Infrastructure

Prior to moving to a big data platform with Qubole, Ibotta’s data and analytics infrastructure was based on a cloud data warehouse that was static and rigid. This worked as long as the data sets were well-structured and in tabular format. However, as the business grew, newer and more complex data formats were being developed and ingested.

At the same time, Ibotta was heavily investing in new data analytics teams such as data engineering, decision science, and machine learning. The teams needed access to the same data, but each team sought to interact with the data in a different way.

Data Engineering needed a set of tools that allowed it to perform extract, transform, and load (ETL) processes in many different ways using MapReduce, Apache Hive, Spark, and/or Presto. The Machine Learning team wanted to use Spark for feature engineering and to train and deploy its models. Decision Science wanted to use SQL, R, and Python to extract insights and business recommendations from the data.

Moving Beyond Descriptive Analytics

Ibotta needed to grow beyond descriptive analytics, which were complementary to its products, into a purely data-driven company. The organization needed to be divided into dedicated teams, each staffed to pursue the following goals:

  • For the Data Engineering team: Design the data lake, manage technologies, provide data services, and create automated pipelines that feed into various data marts
  • For the Machine Learning team: Create new product features and move to predictive and prescriptive analytics with use cases ranging from personalization to optimization
  • For the Decision Science team: Develop and deliver a self-service insights platform for internal stakeholders and external client partners

Ibotta needed a way for every user to have self-service access to data and be able to use the right tools for their use cases with big data engines like Spark, Hive, and Presto. At the same time, the Data Engineering team needed to be able to prepare data for easy consumption. To address the various goals of its data teams, Ibotta built a cost-efficient, self-service data lake using a cloud-native platform.

Building a Self-Service Data Lake

Ibotta realized that the first step in building a self-service platform was to identify which data the analytics teams needed in order to meet key business milestones. At the time, users were combining data from the transactional system and the data warehouse to run their models.

After the value of each dataset was defined, the data engineering team could begin building pipelines that extracted data from the data warehouse and Amazon Aurora and converted it to JSON format, which was then stored in the raw storage area.

From there, additional pipelines converted the JSON data into the Optimized Row Columnar (ORC) and Parquet columnar formats and stored the results in the optimized storage area. Because Airflow can monitor the metastore for new partitions, downstream pipelines could start running as soon as the new data locations were exposed to the Hive metastore.
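The partition-triggered hand-off described above can be illustrated with a minimal polling sketch. This is plain Python rather than Airflow's own API, and every name in it (`wait_for_partition`, `FakeMetastore`, the `ds=...` partition keys) is a hypothetical stand-in for illustration, not something taken from Ibotta's pipelines.

```python
import time
from typing import Callable, Iterable


def wait_for_partition(
    list_partitions: Callable[[], Iterable[str]],
    partition: str,
    poll_seconds: float = 60.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `partition` is listed; return False once the timeout is reached."""
    waited = 0.0
    while True:
        if partition in set(list_partitions()):
            return True  # the downstream job can start now
        if waited >= timeout_seconds:
            return False  # give up so the scheduler can mark the task failed
        sleep(poll_seconds)
        waited += poll_seconds


class FakeMetastore:
    """Hypothetical stand-in for a metastore: exposes one more partition per call."""

    def __init__(self, partitions):
        self._partitions = list(partitions)
        self.calls = 0

    def list_partitions(self):
        self.calls += 1
        return self._partitions[: self.calls]


if __name__ == "__main__":
    metastore = FakeMetastore(["ds=2019-04-01", "ds=2019-04-02"])
    ready = wait_for_partition(
        metastore.list_partitions,
        "ds=2019-04-02",
        poll_seconds=1,
        timeout_seconds=5,
        sleep=lambda seconds: None,  # skip real waiting in the demo
    )
    print("partition ready:", ready)
```

In a real deployment the `list_partitions` callable would query the Hive metastore, and the orchestrator would own the polling loop; the sketch only shows the wait-then-trigger idea.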

To mitigate the legacy data warehouse constraints, Ibotta now has ETL jobs loading data from Hive into Snowflake for consumption by its business intelligence (BI) tool, Looker. Ibotta uses Hive and Spark jobs to process raw data into production-ready tables for the Decision Science team. All of this is orchestrated with Airflow, whose hooks into Qubole make it easier to automate jobs through the API. Airflow gives more control over orchestration than cron or AWS Data Pipeline. It also provides performance benefits, including parallelization and the flexibility to schedule jobs as a directed acyclic graph (DAG) instead of assuming a linear chain of dependencies.
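To make the DAG-versus-linear point concrete, here is a small sketch that uses Python's standard `graphlib` module (Python 3.9+) instead of Airflow itself. The task names are hypothetical labels loosely based on the pipeline stages described in this post; they are not Ibotta's actual job names.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract_warehouse": set(),
    "extract_aurora": set(),
    "convert_to_columnar": {"extract_warehouse", "extract_aurora"},
    "build_decision_science_tables": {"convert_to_columnar"},
    "load_snowflake": {"build_decision_science_tables"},
    "train_models": {"convert_to_columnar"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks in the same wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

With these assumed dependencies, the six tasks finish in four waves, because the two extracts can run side by side and so can the table build and model training. A strictly linear schedule would need six sequential steps.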

Leveraging Big Data for ML, ETL, and Ad Hoc Querying

Ibotta uses Qubole to provision and automate its big data clusters. Specifically, it uses Spark for machine learning and other complicated data processing tasks; Hive and Spark are used for ETL processes; and Presto is used for ad hoc queries like exploratory analytics.

Using this platform, Ibotta has empowered the Decision Science team to use BI tools to produce real-time dashboards for hundreds of users. Since instituting its new data platform, Ibotta has more than tripled the volume of processed data within four months, and it now runs more than 30,000 queries per week through Qubole.

Ibotta’s Decision Science team was empowered as soon as Qubole was in place: the team gained self-service access to the data and could efficiently scale compute resources in Amazon Elastic Compute Cloud (Amazon EC2) for big data workloads. Within a month, the Machine Learning team was launching new prescriptive analytics features in the product, including a recommendation engine, an A/B testing framework, and an item-text classification process.

Conclusion

By using Qubole on AWS, teams at Ibotta are able to provision resources themselves without having to engage a central administration group. Big data clusters run a mix of 60 to 90 percent Spot instances alongside on-demand nodes. Combined with Qubole’s heterogeneous cluster capability, this mix makes it easy to keep the running cost of big data workloads low without sacrificing reliability.
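As a rough illustration of why the Spot share matters, the sketch below computes a blended hourly cost for a cluster. The prices are made-up placeholders (assumptions for the arithmetic only, not AWS list prices or Ibotta's figures), and the function name is hypothetical.

```python
def blended_hourly_cost(nodes, spot_fraction, on_demand_price, spot_price):
    """Hourly cost of a cluster where `spot_fraction` of the nodes are Spot instances."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    # Weighted average price per node, then scaled by the cluster size.
    per_node = spot_fraction * spot_price + (1.0 - spot_fraction) * on_demand_price
    return nodes * per_node


# Placeholder prices in dollars per node-hour (assumptions, not real quotes).
ON_DEMAND_PRICE = 1.00
SPOT_PRICE = 0.30

for fraction in (0.0, 0.6, 0.9):
    cost = blended_hourly_cost(10, fraction, ON_DEMAND_PRICE, SPOT_PRICE)
    print(f"{fraction:.0%} Spot: ${cost:.2f}/hour")
```

Under these placeholder prices, a 10-node cluster costs $10.00 per hour with no Spot nodes, $5.80 at a 60 percent Spot share, and $3.70 at 90 percent. Actual savings depend on the instance types and on Spot pricing at the time.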

Additionally, autoscaling and cluster lifecycle management provide significant savings to Ibotta’s cloud infrastructure costs. This means that managing budget and ROI is much easier, and Ibotta can forecast how to scale different features and projects accordingly.

Ibotta is focusing on delivering next-generation ecommerce features and products that help drive both a better user experience and partner monetization. Qubole allows Ibotta to spend its time developing and productionizing scalable data products. More importantly, Ibotta can concentrate on bringing value back to users and customers.

Want more information? Read Ibotta’s full story about building a self-service data lake with Qubole.
