The Importance of Data Due Diligence

March 26, 2018 (updated September 4, 2021)

Earlier this month, a study was published indicating that the widely used “Reddit dataset” (released in 2015 by Jason Baumgartner) had significant, previously unidentified gaps. The study was surprising and, for me, uncomfortably relevant: I had used the Reddit dataset for an analysis blog post without ever validating that the data was complete or even representative. Clearly, this oversight was not uncommon, as the study cites dozens of published academic papers that mistakenly relied on the dataset. The authors of those papers failed to perform their “data due diligence,” and their results have therefore been called into question.

The reliance on this faulty dataset is a clear example of a widespread problem: data scientists and analysts failing to thoroughly validate the integrity of the data they rely on. As the authors of the study write in the corresponding Medium post, “…researchers have a duty to check the datasets rather than assume their quality on faith.” This applies equally to academic researchers and to data scientists and analysts working in industry.

Too often, a data scientist will train and even deploy a machine learning model without first checking the validity of the training data. Similarly, a data analyst might write a report without vetting the underlying data to make sure it is accurate and comprehensive. In the best case, this means re-training the model or re-running the report, costing the data scientist or analyst hours or days of work. In the worst case, releasing a report built on false data, or deploying a badly trained model, can cost companies millions of dollars.
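As a sketch of what even a minimal pre-training check might look like, the snippet below flags missing days in a dataset's date coverage. The data and date range are hypothetical; real validation would cover far more (nulls, duplicates, value ranges), but gaps like the ones found in the Reddit dataset are exactly what a check of this shape catches:

```python
from datetime import date, timedelta

def find_missing_days(days, start, end):
    """Return every date in [start, end] that is absent from `days`."""
    present = set(days)
    missing = []
    d = start
    while d <= end:
        if d not in present:
            missing.append(d)
        d += timedelta(days=1)
    return missing

# Hypothetical ingest log: one record per day, with one day silently missing.
observed = [date(2018, 3, 1), date(2018, 3, 2), date(2018, 3, 4)]
gaps = find_missing_days(observed, date(2018, 3, 1), date(2018, 3, 4))
print(gaps)  # a complete dataset would print []
```

Running a check like this before training or reporting costs seconds; discovering the gap after publication costs far more.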

One tool that Qubole provides to enable our customers to quickly check the validity of their datasets is the lightning-fast Presto query engine. With Presto, data scientists and analysts can quickly identify gaps or discrepancies in their datasets and work to fix them. The engine's real-time querying power means that users can run sanity checks on their data (such as confirming that a table contains as many rows as expected) without sacrificing valuable time. At Qubole, we consider data integrity in every aspect of product design and are actively working on packaging these features into a product offering.
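To make the sanity checks above concrete, here is a small set of standard ANSI SQL queries of the kind Presto executes quickly: a row count, date coverage, and a duplicate-key check. The table and column names (`comments`, `created_at`, `id`) are hypothetical, and the snippet only assembles the queries; running them requires a connection to a Presto cluster via a client library:

```python
# Hypothetical table and column names; each query is plain ANSI SQL.
checks = {
    "row_count":     "SELECT count(*) FROM comments",
    "date_coverage": "SELECT min(created_at), max(created_at) FROM comments",
    "duplicate_ids": ("SELECT id, count(*) AS n FROM comments "
                      "GROUP BY id HAVING count(*) > 1"),
}

for name, sql in checks.items():
    print(f"{name}: {sql}")
```

Comparing the row count against the expected total, and the date range against the intended coverage window, is exactly the kind of quick validation that would have surfaced the Reddit dataset's gaps.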
