How We Learned to Stop Data Wrangling and Love Machine Learning

September 1, 2017. Updated April 15, 2024.

Nexla is a data operations platform that focuses on enabling data movement between companies with security and scale. The platform is simple enough for the tech-aware business person to use, and powerful enough for engineers to customize. As machine learning grows, companies need to access more and more data from many different sources. We’ve found they can spend a significant amount of time simply wrangling the data to get it into the right shape, form, and system before they can actually use it. These complexities make data inaccessible to much of the organization, and widespread data access is critical to the sustainability of data systems. As we approach a future of increased data volumes and velocities, enabling individuals throughout the organization to access data themselves will be essential. The importance of DataOps is the message we wanted to bring to our colleagues at the Data Platforms 2017 conference.

Our talk at Data Platforms focused on how to automate data operations. As data sizes continue to increase, it’s not realistic to keep wrangling data the way we’re used to. We wanted to drive home the reality that this is a challenge for companies across sectors, not just in the tech industry. Whether you’re an eCommerce company taking data from product suppliers, an insurance company trying to get connected car data, or a healthcare company trying to do better predictive analytics on data you hold internally or import from pharma or other healthcare companies, you are going to have to deal with data operations, including moving data between companies.

The second reality we wanted to drive home is that this is not a problem that can be handled with a dose of ad-hoc engineering work. It requires an operationally repeatable and scalable system. To make this work, you must have effective tools, and they must be usable by non-engineers. Data is used by many different groups within organizations, and they must have ready access to it. With this level of data access, scaling becomes far easier.

Companies must do a good job of providing access and usable tools to employees throughout the organization so they can do their own data analysis and turn more data into more business value. In fact, data operations professionals can be found in marketing departments, analytics departments, and data science departments. We know that scaling data operations by increasing the headcount of technical staff is not sustainable. The only viable solution is to create tools that can be used by non-technical users. The first step to learning to love machine learning is to free yourself and the entire organization from the often mundane and not intrinsically valuable tasks of data operations and get back to creating queries, asking questions, and delivering value.

The First Data Operations Benchmark Survey

At Data Platforms, we also shared details of our recent data operations survey, which examined the real-world challenges of hundreds of data operations professionals. We believe this is the first benchmark survey for data operations. In addition to the technology industry, the survey captured responses from over 40 industries including healthcare, education, government, and military. Some of the results were quite surprising, including responses to questions about the rate at which data is growing at companies outside the technology sector.

We all know the Googles, Facebooks, and Amazons of the world are producing terabytes upon terabytes of data per day. However, we also found that across all industries, 14% of companies are producing one or more terabytes of data per day. In addition, this massive amount of data is not only being used internally by these companies. Our survey found that 73% of companies are currently sharing data with third-party partners or have plans to do so in the future, and 91% are already ingesting data from third-party partners.

This finding quite clearly debunks the idea that inter-company data is an edge case. In fact, it’s central to companies across industries. To support all these growing activities, 70% of the companies we surveyed reported they were planning to expand their data operations staff in the next twelve months. That’s a lot of headcount, but it’ll be necessary to handle growing data volumes and increasing data velocity. However, as we discussed, headcount expansion can only take us so far down the big data road. Turning data operations into a scalable, repeatable process is not a luxury, but a necessity.

Nexla will be presenting at Data Platforms Online, September 19 – 22, 2017.
