A version of this post first appeared on InfoWorld.
Why You Should Move Enterprise Data Science to the Cloud
We live in a world inundated with data. Data science and machine learning (ML) techniques have come to the rescue, helping enterprises analyze and make sense of these large volumes of data. Enterprises have hired data scientists (people who apply scientific methods to data to build mathematical software models) to generate insights or predictions that enable data-driven business decisions. Typically, data scientists are experts in statistical analysis and mathematical modeling who are proficient in programming languages such as R or Python.
Barring a few large enterprises, most data science is still carried out on laptops, leading to an inefficient process that is prone to errors and delays. In this post, we will explore the top five reasons why we think ‘laptop data science’ within enterprises is dead in the age of cloud computing.
Enterprise Data Science Is a Team Sport
Algorithms and machine learning models form only one piece of the advanced analytics and machine learning puzzle for enterprises. Data scientists, data engineers, ML engineers, data analysts, and citizen data scientists need to collaborate to deliver machine learning-based insights for business decisions.
In a scenario where data scientists build models on their laptops, they download datasets created by data engineers onto their laptops or on-premises servers to build and train machine learning models. Because of the computing and memory limitations of laptops and on-premises servers, data scientists often must sample the data to create smaller datasets. While these smaller sample sets help a data science project get off the ground, they create many issues further into the data science life cycle.
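The sampling workaround described above often looks something like the following sketch. This is a minimal, hypothetical illustration using pandas; the dataset, column names, and the 1% sampling fraction are all assumptions for the example, not a recommended recipe.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for a dataset that is too large to model on a laptop.
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "customer_id": rng.integers(0, 1_000, size=100_000),
    "amount": rng.normal(50.0, 10.0, size=100_000),
})

# Downsample to 1% so the working set fits in laptop memory.
# A fixed seed keeps the sample reproducible across runs.
sample = full.sample(frac=0.01, random_state=42)
```

The convenience is real, but so is the cost: every model trained on `sample` only sees 1% of the behavior in `full`, which is exactly the representativeness problem described above.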
There are also concerns about the staleness of data. With local copies of the data, there is a risk that data scientists could be building predictions based on an inaccurate snapshot of the real world. The use of larger, more representative sample sets from a centralized source location would alleviate this concern.
Artificial Intelligence and Machine Learning in Data Lakes
The recent surge of interest in artificial intelligence and machine learning is driven by the ability to quickly process and iterate (train and tune the ML model) over large volumes of structured, unstructured, and semi-structured data. In almost all cases, machine learning benefits from being trained on larger, more representative sample sets.
Enterprises can unlock really powerful use cases by combining semi-structured interaction data (website interaction logs, event data) and unstructured data (email text, online review text) with structured transaction data (Enterprise Resource Planning, Customer Relationship Management, Order Management Systems, etc.). The key to unlocking business value from machine learning is the availability of large data sets that combine transactional and interaction data. With the increasing scale of data, these data are often processed on the cloud or in large on-premises clusters. Adding a laptop to this mix creates a bottleneck in the entire flow and leads to delays.
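As a toy illustration of combining interaction and transactional data, the join might look like the sketch below. All table and column names here are hypothetical; in practice these tables would live in a data lake or warehouse and be joined at far larger scale with a distributed engine such as Spark.

```python
import pandas as pd

# Hypothetical structured transaction data (e.g., from an order management system).
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "order_total": [120.0, 45.5, 300.0],
})

# Hypothetical semi-structured interaction data (e.g., website event logs).
clicks = pd.DataFrame({
    "customer_id": [1, 1, 2, 4],
    "page": ["home", "pricing", "home", "docs"],
})

# Aggregate interactions per customer, then join onto the transactions.
clicks_per_customer = (
    clicks.groupby("customer_id").size().rename("click_count").reset_index()
)
combined = orders.merge(clicks_per_customer, on="customer_id", how="left")
combined["click_count"] = combined["click_count"].fillna(0).astype(int)
```

The left join keeps every transacting customer, including those with no recorded interactions; that design choice matters, because silently dropping low-interaction customers would bias any model trained on the combined data.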
Avoid Labor-Intensive Data Infrastructure Management
Today, data scientists can leverage many open-source machine learning frameworks and libraries such as R, scikit-learn, Spark MLlib, TensorFlow, MXNet, and CNTK. However, managing the infrastructure, configuration, and environments for these frameworks is cumbersome on a laptop or on-premises server. This overhead of managing infrastructure takes time away from core data science activities.
However, much of the infrastructure management overhead goes away in the software-as-a-service model of the cloud. The usage-based pricing model of the cloud works well for machine learning workloads, which are bursty in nature. The cloud also makes it easier to experiment with different ML frameworks, with cloud vendors offering model hosting and deployment options. In addition, cloud service providers such as Amazon Web Services, Microsoft Azure, and Google Cloud offer intelligent capabilities as services, thereby lowering the barriers to integrating these capabilities into new products or applications.
Data Accuracy and Model Auditability
The predictions from a machine learning model are only as accurate and representative as the data used to train it. Every modern manifestation of AI/ML is made possible by the availability of high-quality data. For instance, apps that provide turn-by-turn directions, which have been around for decades, are now much better than they were in the past thanks to the larger volumes of data now available.
It is no surprise, then, that a significant part of AI/ML operations revolves around data logistics: collecting, labeling, categorizing, and managing the data sets that reflect the real world we are trying to model with machine learning. For an enterprise with many data users, this problem is further complicated when multiple local copies of a data set exist among the various users.
Concerns around security and privacy are increasingly taking center stage, and enterprise data processes need to comply with data privacy and security regulations. A centralized repository for all data sets not only simplifies the management and governance of data but also ensures data consistency and model auditability.
Faster Time to Value for Data Science
All of the above reasons contribute to a delayed time to value for laptop data science. In a typical workflow, a data scientist working off a laptop would first need to sample the data and download datasets manually, or connect to a database via an ODBC driver. Second, they would need to install and maintain all of the required software tools and packages, such as RStudio, Jupyter, Conda distributions, machine learning libraries, and specific versions of languages such as R, Python, and Java.
When the model is ready to be deployed to production, they would hand it off to an ML engineer. The ML engineer needs to either convert the code to a production language such as Java/Scala/C++ or at least optimize the code and integrate it with the rest of the application. Code optimization would consist of:
- Rewriting any data query into an ETL job.
- Profiling the code to find any bottlenecks.
- Adding logging, fault tolerance, and other production-level capabilities.
Each of these steps presents bottlenecks that can result in delays. For instance, inconsistencies in software or package versions between development and production environments can cause failures, and code built in a Windows or Mac environment can easily break when deployed to Linux.
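One way to catch such environment drift early is to assert at startup that the runtime matches a pinned manifest. The sketch below is a minimal illustration, not a full solution; the package pin and Python version here are assumptions, and in practice the pins would come from a lock file shared between development and production.

```python
import sys
from importlib.metadata import version, PackageNotFoundError

# Illustrative pin; real projects would load this from a shared lock file.
PINNED = {
    "numpy": "1.26",  # expected major.minor version prefix
}

def check_environment(pins=PINNED, python=(3, 9)):
    """Return a list of mismatches between this runtime and the pins."""
    problems = []
    if sys.version_info[:2] < python:
        problems.append(f"Python {sys.version_info[:2]} < required {python}")
    for pkg, prefix in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg} not installed")
            continue
        if not installed.startswith(prefix):
            problems.append(f"{pkg}=={installed}, expected {prefix}.*")
    return problems
```

Running a check like this in both the notebook and the production entry point turns a silent version mismatch into an explicit, actionable error message.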
Enterprise Data Science Platforms Run in the Cloud
All of the above issues with running data science on laptops result in a loss of business value. Data science involves resource-intensive tasks in data preparation, model building, and model validation. Data scientists will typically iterate hundreds of times over features, algorithms, and model specifications before they find the right model for the business problem at hand. These iterations can take a significant amount of time, and bottlenecks around infrastructure and environment management, deployment, and collaboration delay time to value even further.
Data scientists who rely on laptops or local servers are making a trade-off between the ease of getting started and the ease of scaling and productionizing ML models.
While working on a laptop or a local server gets a data science team up and running faster, cloud platforms provide greater long-term advantages, such as virtually unlimited compute and storage, easier collaboration, and faster time to production ML.
The fastest and most cost-effective way to get started with data science and machine learning on the cloud is to use a cloud-native data science and machine learning platform such as Qubole. Sign up to test drive Qubole for free today.