Build a Data Pipeline with Qubole

September 18, 2013. Updated September 16th, 2021



We get many questions about how Qubole makes data science easy. The one-line answer is that we have developed a number of features to help with every stage of a data pipeline’s lifecycle: importing data, analyzing it, visualizing it, and deploying to production. However, the best way to answer these questions is to let the product do the talking. We have created a demo to showcase how Qubole lowers the barrier to leveraging the power of Apache Hive with Hadoop.

Wikimedia, the organization behind Wikipedia, publishes the page views of each topic for every hour of the day. From these page views we can find trending topics, i.e. topics with sudden jumps in page views. We are not the first to attempt such an exercise; Data Wrangling has documented a previous attempt here.

We have also deployed a Rails app to visualize the results. The code for the Rails app is available on GitHub.


Requirements to run the demo

  1. An Amazon Web Services account
  2. A Qubole account (Free Plan or Trial Plan)

PageCount Data

On the Wikimedia website, the data is partitioned by year and then by month. There is one directory per year, starting from 2007, and each year has one directory per month. For example, for the year 2013 there are directories named:

  • 2013-01
  • 2013-02
  • 2013-03
  • 2013-04
  • 2013-05
  • 2013-06
  • 2013-07
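As a small illustration of the naming scheme above, the month directory names for a year can be generated programmatically. This is only a sketch: the helper name `month_directories` and its `last_month` parameter are hypothetical and are not part of any Wikimedia or Qubole tooling.

```python
def month_directories(year, last_month=12):
    """Return month directory names such as '2013-01' for the given year.

    last_month is a hypothetical parameter that limits the list to the
    months published so far.
    """
    if not 1 <= last_month <= 12:
        raise ValueError("last_month must be between 1 and 12")
    return [f"{year:04d}-{month:02d}" for month in range(1, last_month + 1)]


# The seven directories listed above for 2013.
dirs_2013 = month_directories(2013, last_month=7)
print(dirs_2013)
```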


Each month directory has one file for every hour of the day.



Let’s take a peek at the page count data feed. Each row ends with a newline (‘\n’). Each row has 4 columns, and the columns are separated by a space.

  • Group: a category chosen by Wikipedia
  • Title: name of the topic
  • Page Views: number of views of each topic in an hour
  • Bytes Sent: number of bytes served by the server in an hour
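To make the row layout concrete, here is a minimal Python sketch of a parser for one row, assuming exactly four space-separated columns as described. The type and function names are hypothetical, and the sample row is invented for illustration; it is not taken from the real feed.

```python
from typing import NamedTuple


class PageCount(NamedTuple):
    group: str
    title: str
    page_views: int
    bytes_sent: int


def parse_pagecount_line(line):
    """Parse one space-separated row into a PageCount record."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
    group, title, views, size = parts
    return PageCount(group, title, int(views), int(size))


# Invented sample row, for illustration only.
row = parse_pagecount_line("en India 120 4567890\n")
print(row.title, row.page_views)
```

Rows that do not have exactly four columns raise an error here; a real clean-up step might prefer to skip or log them instead.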


A sample of the rows:

Group | Title | Page Views | Bytes Sent

Resolve Synonyms

The page titles in different rows can be synonyms of one another. For example, “Bhart”, “Al_Hind” and “India” all redirect to the same Wikipedia article on India, i.e. these three titles are synonyms, and Wikipedia keeps track of them. Wikipedia calls the synonym page titles “Redirect Titles”.


The page count data discussed above reports page views for “Bhart”, “Al_Hind” and “India” separately. To get the actual page view count for “India”, we need to aggregate the data for all of its synonyms (Redirect Titles). Wikipedia publishes the page synonym data in the form of two dimension tables.

Page table: Page Id | Page Title
Redirect table: Redirect from | Page Title

The Page table has an entry for each page that exists in the wiki. The Redirect table has entries only for redirect page ids. We need to join these two tables to get a lookup table.

Redirect Id | Redirect Title | True Title | True Id
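The join can be sketched in plain Python as follows. This is an in-memory illustration, not the Hive query used in the demo. It assumes that "Redirect from" holds the page id of the redirecting page and that the Redirect table's "Page Title" holds the title of the target page; the function name and the page ids in the sample data are made up.

```python
def build_lookup(pages, redirects):
    """Join the Page and Redirect tables into a lookup table.

    pages:     iterable of (page_id, page_title)
    redirects: iterable of (redirect_from_id, target_title)
    Returns rows of (redirect_id, redirect_title, true_title, true_id).
    """
    title_by_id = dict(pages)
    id_by_title = {title: page_id for page_id, title in title_by_id.items()}

    lookup = []
    for redirect_id, true_title in redirects:
        redirect_title = title_by_id.get(redirect_id)
        true_id = id_by_title.get(true_title)
        if redirect_title is None or true_id is None:
            continue  # behave like an inner join: drop unmatched rows
        lookup.append((redirect_id, redirect_title, true_title, true_id))
    return lookup


# Made-up page ids, for illustration only.
pages = [(1, "India"), (2, "Bhart"), (3, "Al_Hind")]
redirects = [(2, "India"), (3, "India")]
lookup = build_lookup(pages, redirects)
print(lookup)
```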

We can then use this lookup table to find the True Title for any Redirect Title.
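A minimal sketch of that resolution step, assuming the lookup has been reduced to a mapping from Redirect Title to True Title. The function name and the view counts are invented for illustration; titles without an entry in the mapping are treated as already being true titles.

```python
from collections import Counter


def aggregate_views(rows, true_title_by_redirect):
    """Sum page views per true title.

    rows: iterable of (title, page_views)
    true_title_by_redirect: dict mapping a redirect title to its true title
    """
    totals = Counter()
    for title, views in rows:
        # Titles that are not redirects resolve to themselves.
        totals[true_title_by_redirect.get(title, title)] += views
    return dict(totals)


# Invented view counts, for illustration only.
redirect_map = {"Bhart": "India", "Al_Hind": "India"}
rows = [("Bhart", 5), ("Al_Hind", 3), ("India", 100), ("Hadoop", 7)]
totals = aggregate_views(rows, redirect_map)
print(totals)
```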

Final Output

The final output exported to the web app looks like:

Date | Title | Monthly Trend | Daily Trend



The data pipeline is visualized in the flowchart shown above. There are four distinct parts:

  1. Import and clean up Pagecount
  2. Import and join Pages & Redirect tables to create a lookup table.
  3. Resolve Synonyms and calculate trend ranks for each topic.
  4. Export the results to the webapp database.
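The trend calculation itself is not defined in this part of the series. Purely as an illustration of what "topics with sudden jumps in page views" could mean, the sketch below ranks topics by the ratio of their latest daily views to the average of the preceding days. This metric is an assumption made for the example, not the one used in the demo, and the function names and numbers are hypothetical.

```python
def jump_ratio(daily_views):
    """Ratio of the latest day's views to the mean of the preceding days."""
    if len(daily_views) < 2:
        raise ValueError("need at least two days of views")
    *history, latest = daily_views
    baseline = sum(history) / len(history)
    return latest / baseline if baseline else float("inf")


def rank_by_jump(views_by_title):
    """Return titles ordered from the largest jump to the smallest."""
    scores = {title: jump_ratio(views) for title, views in views_by_title.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Invented daily view counts, oldest day first.
views = {
    "India": [100, 110, 90, 400],
    "Hadoop": [50, 50, 50, 55],
}
ranking = rank_by_jump(views)
print(ranking)
```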

In the second part of the blog series, we’ll describe the ETL steps required to process the raw data.

In the third part, we’ll describe the analysis and export steps.
