DataOps and the Modern Big Data Platform – Data Platforms 2017


Ashish Thusoo: Hi folks. Thank you, David, for that introduction. Before I start, let me tell you a little bit about my background, who I am, where I came from, and then a little bit about the talk that I’m giving today. I started Qubole along with Joydeep in 2011, but before that, I was a practitioner and a builder of data platforms at Facebook. From 2007 to 2011, we were both at Facebook, and the vision there was really how to transform that company into a data-driven company and what the key ingredients are, both from the perspective of technology, but also from the perspective of usage, and also from the perspective of organization and how to make data part of the conversation every day.

I’m going to talk a little bit about that in today’s talk, but before that, I will also talk about why we are where we are in terms of what is causing the explosion of data, and what some of the problems are that a lot of companies grapple with when they are thinking of getting to a state where they can truly call themselves a data-driven enterprise. We are living in a time where data has become a centerpiece. We are all data people here, so let’s look at some of the numbers which back that. These are very, very interesting times, probably the best times to be in the data world. As you saw in the opening video, IDC predicts that by 2020, we will have 44 zettabytes of data.

To put that in perspective, today we are at 4.4 zettabytes, so in the next three years, there is going to be a 10x growth in data. Not just that, today, we are producing about 2.5 exabytes of data every day. If you put that in perspective, that is equivalent to 530 million songs produced every day. More interestingly, that is equivalent to 90 years of HD video created every day. If you think that you’re grappling with a lot of data today, just look at the future and see what is coming down the pike. Not only that, the big question is, why is so much data being created today?

There are multiple drivers of this, but if I were to put my finger on the pulse of a couple of things that are causing this transformation, it is really two things. One, everything is becoming connected, and two, everything is becoming data driven. It is not just the traditional things that we talk about in terms of measurement. We have been in the internet age for a long, long, long time. By now, it’s very, very common to have our online activities be measured by various different entities, but it is also offline activities. It’s things like exercising, things like watching television. All of these things are now being measured by devices, by sensors, and more and more things are getting connected.

We talk about the Internet of Things, and that’s really becoming mainstream. If you look at cars, a lot of those are connected today. Ten years back, my car would not tell me when to get its oil changed; now I get an email which tells me, “Hey, I need to get serviced.” All of those things are essentially driving data, all of those things are essentially driving an explosion in data, and all of those things are essentially fueling the skyrocketing demand for big data. I think all of us in this room, we’re all data practitioners, and we made a brilliant career choice to be here at a stage where data is exploding. We have a very, very bright future in front of us.

Apart from that, it is not just the explosion of data, but we are also living in times where there is a tremendous amount of innovation happening in the systems that are used for processing data. There is innovation in terms of open source projects, with new open source projects coming up every few months claiming to solve one piece of the puzzle. Flink is probably the latest one that I’ve heard of in the last six months that is trying to solve the real-time analytics problem. Apache Heron is another one; you’ll hear about that later in the keynote from Karthik, who is one of the creators of Heron.

Spark continues to move towards becoming more and more dominant in the big data communities. There are a lot of changes happening in the core data platforms and the data infrastructure space. Even more importantly than that, there are also changes happening in the computing fabric that is used for analyzing these data sets. One is the platforms that are used for processing data, but the computing fabric those platforms run on or operate on is also changing. GPU computing is becoming more and more mainstream. There is a lot more power packed in GPUs, and a lot of parallel processing is moving towards GPU computing.

There is a whole trend around serverless computing in the cloud, with all the cloud providers having some sort of an offering in the serverless world. We are living in very interesting times, not just from the perspective of exploding data but also from the perspective of the extreme amount of innovation being driven in the platforms and the infrastructure which are needed for processing these data sets. In addition to that, things are moving to the cloud. Back when we started Qubole, this was in 2011, we saw very early on in the cloud the potential for a huge disruption in terms of how people think about infrastructure, how people think about IT.

It was not always the case. A lot of people at that time believed that cloud would be a place where startups go, cloud would be a place where test workloads go. Really, it was a leap of imagination that was needed to think about cloud being a place where the whole of the IT industry is going to go. We saw that very early and, true enough, that transformation is happening today. A big driver there is that more and more businesses and enterprises are realizing that to compete in today’s fast-moving world, where things change a lot and things are always in flux, you need to rely on an infrastructure that’s able to adapt to that change and that gives you the tools, the flexibility and the agility to compete in this environment.

Cloud is one such thing. For the first time in computing, we have had a platform which essentially turns physical hardware into software, and that’s been a great innovation. In fact, cloud has become so hot... Actually, before I go there, some of the reports around big data and the cloud essentially indicate that the number one priority for big data for many companies is to move to the public cloud. There’s a convergence that is happening there. Big data needs a lot of computing power.

Big data needs a lot of agility because we are in this environment where things are changing all the time in terms of innovation and stuff like that. If you marry that with the appeal of the cloud, which is around giving you computing power, an enormous amount of computing power at your fingertips in a very, very agile environment, it becomes a very, very good fit for those two technologies to come together. Cloud has actually become so hot that last weekend, it won a part of the Triple Crown as well.


Ashish: Jokes aside, it is really something which is transformative.