The Importance of Data Due Diligence

March 26, 2018 (updated October 31, 2018)

Earlier this month, a study was published indicating that the widely used "Reddit dataset" (released in 2015 by Jason Baumgartner) had significant, previously unidentified gaps. This study was surprising and incredibly relevant to me personally, since I had used the Reddit dataset for an analysis blog post without ever validating that the data was complete or even representative. Clearly, this oversight on my part was not uncommon: the study cites dozens of published academic papers that mistakenly relied on the dataset. The authors of these papers failed to perform their "data due diligence," and their results have therefore been called into question.

The reliance on this faulty dataset is a clear example of a widespread problem: data scientists and analysts failing to thoroughly validate the integrity of the data that they rely on. As the authors of the study write in the corresponding Medium post, “…researchers have a duty to check the datasets rather than assume their quality on faith.” This is true of both academic researchers and the data scientists and analysts who have chosen to work in industry.

Too often, a data scientist might train and even deploy a machine learning model without first checking the validity of the training data. Similarly, a data analyst might write a report without vetting the underlying data to make sure that it is accurate and comprehensive. In the best case, this means re-training the model or re-running the report, costing the data scientist or analyst hours or days of work. In the worst case, releasing a report built on faulty data, or deploying a badly trained model, can cost a company millions of dollars.
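In practice, this kind of pre-training due diligence can start with a few simple assertions over the raw records. Here is a minimal sketch in Python; the field names, sample records, and thresholds are hypothetical, not from any particular dataset:

```python
# Minimal sketch of "data due diligence" checks to run before
# training a model or writing a report. Field names and the
# minimum-row threshold are illustrative placeholders.

def validate_records(records, required_fields, min_rows):
    """Return a list of human-readable problems found in the data."""
    problems = []
    if len(records) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(records)}")
    for i, rec in enumerate(records):
        for field in required_fields:
            if rec.get(field) in (None, ""):
                problems.append(f"row {i}: missing value for '{field}'")
    return problems

records = [
    {"id": 1, "author": "alice", "body": "hello"},
    {"id": 2, "author": "", "body": "world"},  # missing author
]
print(validate_records(records, ["id", "author", "body"], min_rows=2))
# → ["row 1: missing value for 'author'"]
```

An empty result list means the checks passed; anything else is a reason to stop and investigate before spending hours of compute on training.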

One tool that Qubole provides to enable our customers to quickly check the validity of their datasets is the lightning-fast Presto query engine. With Presto, data scientists and analysts can quickly identify gaps or discrepancies in their datasets and work to fix them. The real-time querying power of the engine means that users can run sanity checks on their data (such as confirming that a table contains as many rows as expected) without sacrificing their valuable time. At Qubole, we consider data integrity in every aspect of product design and are actively working on packaging these features into a product offering.
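The gap-checking idea behind these sanity checks — counting records per time period and flagging periods that fall suspiciously short — can be sketched as follows (in plain Python rather than Presto SQL; the month labels and counts are made up for illustration):

```python
from collections import Counter

# Hypothetical per-record month labels. In practice these would come
# from a query that groups an event table by its timestamp column.
months = ["2015-01"] * 900 + ["2015-02"] * 12 + ["2015-03"] * 880

def find_suspicious_periods(labels, min_expected):
    """Flag periods whose record count falls below a minimum."""
    counts = Counter(labels)
    return sorted(period for period, n in counts.items() if n < min_expected)

print(find_suspicious_periods(months, min_expected=100))
# → ['2015-02']  # this month looks incomplete
```

A month with only a dozen records amid months of hundreds is exactly the kind of gap that went unnoticed in the Reddit dataset, and exactly the kind a quick aggregate query surfaces in seconds.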
