The Rise and Rise of Apache Spark
Matt Aslett: Okay. Great. Thank you. Good morning, good afternoon and good evening depending on where you are in the world. Welcome to today’s presentation, The Rise and Rise of Apache Spark brought to you by 451 Research, Qubole and featuring Autodesk, during which we will try to answer the question, “Is Spark the answer to all questions posed for big data?” During this stimulating webinar we will hear from Steve Gotlieb, [Principal Architect] Big Data Guru at Autodesk, who will dive into how developers and data scientists are using Spark notebooks to prototype data transformations that can be deployed to an automated ETL pipeline and delivered to data analysts to enable faster time to insights.
Also, joining in the fun will be Dharmesh “Dash” Desai, technology evangelist for Qubole, who will take a look at the real value of a self-service analytics platform and how this value is realized when both business users and data team members have access to raw and aggregated data from a range of sources. Finally, I, Matt Aslett, research director at 451 Research, will kick off the discussion around the impact of the rise of Apache Spark on the big data ecosystem.
Before I begin, I just wanted to remind you to log any questions along the way using the console on the left-hand side, and we will take those at the end of all three presentations. Without further ado, I’ll kick off and I’ll just give you a very brief introduction to 451 Research before we get into the presentation itself. We’re an industry research and advisory company founded in 2000 with 250 plus employees, including over a hundred analysts and over a thousand clients, which include technology and service providers, corporate advisory, finance, professional services, and IT decision makers.
The last thing I’ll mention on this slide, the next number and the most important number, I think, on this slide is 50,000 plus. That’s the number of IT professionals, business users and consumers that we have within our research community. Those are people out there in the field, practitioners, working with technology on a day-to-day basis. They increasingly shape our research, they shape our view on the world by working with us in terms of doing surveys and interviews. An increasingly important part of 451 Research.
In terms of our research, we deliver both written research, and obviously data, across a range of channels, 15 channels, from data center technologies right through to the mobile age with enterprise mobility and mobile telecom. The area that I cover, I’m research director for data platforms and analytics, which is roughly right in the middle of this slide here.
The key question we’re talking about today is about Apache Spark. In a few short years the Apache Spark in-memory data processing engine has risen from pretty much nowhere to become one of the most important projects in the Hadoop ecosystem. For some, it is the anointed successor to MapReduce as Hadoop’s primary data processing engine. Certainly, when we were recently putting together our trends perspectives for the next year, this was definitely one of the key trends that we were highlighting: that Spark will begin to emerge as not just one of the most important technologies behind the use-cases for Hadoop, but actually potentially the dominant technology behind use-cases for Hadoop.
In terms of just setting the scene, I’m sure a lot of people, obviously, on this webinar are aware of Spark and understand it, otherwise they wouldn’t be interested in the webinar in the first place, but just to level set and be absolutely clear for perhaps some people who have not come across it before. What is Spark? Well, Spark is an in-memory data processing engine. It’s based on the directed acyclic graph– I knew I was going to have problems with that. This is the DAG concept, and it supports applications written in Java, Scala or Python.
As well as the core in-memory data processing engine, it also has a number of key sub-projects, specifically Spark Streaming, Spark SQL, MLlib and GraphX. What is it for? It’s designed to complement Apache Hadoop. You can use Spark to read and write data to and from the Hadoop distributed file system as well as other components of the Hadoop ecosystem, including HBase and other data platforms as well, such as Cassandra and AWS S3 cloud storage.
However, one of the things about Spark is that it’s not just something that can be deployed alongside HDFS; it can also run standalone. As you see here, it can run on top of Apache YARN, and it can also run with things like Apache Mesos and the Alluxio in-memory file system. Even though we talk about Spark being an in-memory data processing engine, it’s important to note that Spark is not limited to data stored in memory, so it can spill to disk or recompute partitions that don’t fit in RAM.
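To make the DAG idea and the recompute-from-lineage idea more concrete, here is a minimal pure-Python sketch, not Spark’s actual implementation: transformations only record steps in a lineage, and an action replays that lineage over the source data, which is also how a lost partition could be rebuilt. The `MiniRDD` class and its methods are illustrative names only.

```python
# Minimal sketch of Spark's lazy DAG idea: transformations build a lineage,
# nothing runs until an action is called, and a lost partition could be
# recomputed by replaying that lineage over the source data.

class MiniRDD:
    def __init__(self, data, lineage=None):
        self._data = data                  # the source "partition"
        self._lineage = lineage or []      # recorded transformations (the DAG)

    def map(self, fn):
        # Lazy: record the step, do no work yet
        return MiniRDD(self._data, self._lineage + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._lineage + [("filter", pred)])

    def collect(self):
        # Action: replay the whole lineage now
        out = list(self._data)
        for op, fn in self._lineage:
            if op == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = MiniRDD(range(5)).map(lambda x: x * 10).filter(lambda x: x >= 20)
print(rdd.collect())  # [20, 30, 40]
```

Real Spark does the same bookkeeping at much larger scale: because the lineage is recorded, a partition evicted from RAM can simply be recomputed rather than kept around.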
Just to give a sense of the evolution of Spark over time with this timeline. You can see the project was actually conceived at the University of California Berkeley’s AMPLab in 2009. It was first released as an open-source project, initially under the Berkeley license, in 2010, and switched to the Apache license with version 0.8 in 2013, shortly after it had become an Apache Software Foundation incubation project in early 2013. It actually went through that incubation process pretty quickly, so about a year later it became a top-level Apache Software Foundation project, and version 1.0 followed quickly after that in mid-2014. Version 2, which has been the most recent major release, came out in mid-2016.
The other thing that we’ve highlighted on this slide is Cloudera’s One Platform Initiative towards the end of 2015. This is one initiative from one vendor, but it was quite an important one because for a while, certainly in private conversations we’ve been having with vendors in the Hadoop space, there had been an increasing acknowledgment that Spark could be the long-term successor to MapReduce as the core processing engine in the Hadoop ecosystem.
This announcement from Cloudera was the first time we’d seen one of the prominent vendors in this space actually voicing that publicly and saying not just that it could be, but as far as they were concerned, it was going to be, and they were going to put some effort into making sure that it was. A significant endorsement, but one of many for Spark. Obviously, with any successful open-source project you see a whole ecosystem of vendors supporting it.
In addition to Cloudera, we can see MapR, Hortonworks, IBM, Google, Microsoft Azure, and AWS EMR. Also, the founders of Spark at AMPLab actually founded Databricks in mid-2013. There’s a good ecosystem of vendors out there. The other one to mention, obviously, in terms of the participants on today’s webinar, is Qubole, which was fairly early to this, especially in terms of as-a-service support, announcing active support for Spark on the Qubole Data Service in early 2015.
I think one of the key things that we’ve seen around Spark that’s driven us to conclude that it’s even more significant than the level of adoption within the Hadoop ecosystem is the level of adoption we’ve seen in the wider ecosystem. In recent years, we’ve seen a number of vendors in the data integration space, in the analytics space, in advanced analytics, in machine learning, using Spark as the engine for delivering high-performance analytics or integration. What we’ve seen is that Spark has almost become the default in a very short space of time, the default way in which those ISVs deliver high-performance integration and analytics, either standalone or as part of a Hadoop deployment.
Just another indicator in terms of the momentum we’ve seen. This is a chart from Google Trends, and as you can see, we’re comparing with MapReduce here. If you look at the interest over time in terms of Google searches, there’s very significant growth for Apache Spark in recent years. Perhaps more significant is this chart, which actually comes from Stack Overflow. This isn’t just about general interest, this is interest from developers, people who are out there actually choosing technologies to use in their projects. You can see that, if anything, the chart shows an even accelerated rate of growth in terms of the level of interest and questions being asked around Apache Spark compared to MapReduce.
Why Spark? We talked about what it is, we talked about what it does, but why Spark in particular? There are a number of reasons that we’ve seen in our research that organizations are particularly drawn to Spark. One of those is the fact that it’s open source, which means it’s freely available. We talked about the Apache software license, which is a very permissive license. It’s very flexible. That’s one of the reasons we see ISVs in particular drawn to it, but also end users as well. It’s a true community development project.
We talked about all those vendors that are supporting Spark, either in the Hadoop distributions or in their big data service offerings. The list at the bottom here lists the current committers to the Spark project. You see a lot of those vendors we mentioned there, but you also see a lot of end-user organizations. You also see some vendors from that wider ecosystem as well, in addition to research organizations and academia. It’s a true community and it’s a varied community as well. It’s also one of the most active open-source projects.
There’s significantly more activity around Spark in recent years than there has been elsewhere, especially if we compare it to something like MapReduce. As we mentioned, because of that ecosystem, there’s widespread support for Spark out there. A lot of times, people assume that because it’s an open-source project, you can go to different vendors and get support for it. I think in the case of Spark, that’s absolutely true: there are multiple vendors out there that will assist you with Spark in terms of services, in terms of software, in terms of support, in terms of as-a-service deployment.
Of course, because Spark is an in-memory data processing engine, one of the other prime reasons that we see organizations looking at Spark is performance, its native support for running in-memory. I’m not going to get into too much detail on performance. I think Dash is going to provide some examples of this later on, but the standard baseline claims are up to 10x faster than MapReduce on disk, and up to 100x faster than MapReduce when running in memory. Also, it goes beyond batch: whilst Spark is at heart a batch-processing framework, it does enable interactive query.
Finally, flexibility. I think when we first looked at Spark, obviously, we saw that there was going to be interest in it because it was open source, because it enables high-performance analytics. I think the thing that most struck us looking at this was actually the flexibility that you have in terms of the ways in which you can use it and the ways in which you can deploy it. We already talked about Spark supporting multiple sub-projects, Spark SQL, Spark Streaming, MLlib, GraphX. You can use Java. You can use Scala. You can use Python. You can use R to write applications to run on Spark. You can connect to it using your standard BI tools, using ODBC and JDBC.
As we mentioned earlier, you can deploy it on HDFS and YARN. You can deploy it standalone. You can deploy it on Mesos. You can use it with other things including Alluxio, S3, Cassandra, and HBase. There’s a great amount of flexibility and choice for organizations in terms of the way in which they can use the platform and the applications and workloads they can deploy on Spark. In a lot of cases, they can use their existing skills and their existing tools and bring those to Spark, not necessarily having to learn a new platform or new languages.
Just briefly, in terms of some of the use-cases, obviously, we’re going to have more details on this from some of the other presenters, but at a high level: interactive query, high-performance analysis using SQL, R, Python, and visualization tools in concert with existing visual BI software. Stream processing, including things like event detection, streaming ETL, and integration with static data. Some common examples we see are ad targeting, personalization, fraud detection, recommendation engines– [coughs] Excuse me.
Machine learning is another common use-case, obviously, with the MLlib project. Enabling predictive intelligence, things like customer segmentation, behavioral analysis, another one to stumble over, and sentiment analysis. Also, these days there’s always an IoT angle, and absolutely in terms of IoT and edge computing. The low-latency analytics that can be enabled using Spark make it a potential candidate for deployment at the edge of the network, enabling streaming, machine learning and graph analysis there, for example.
It’s important, though, I think, to be aware here that although, obviously, we’re talking about the rise and rise of Spark, and we see Spark as a very significant part of the big data ecosystem, if you like, it isn’t a panacea. It doesn’t solve all use-cases. This chart is really designed to illustrate that even if you’re deploying Spark standalone, which we see an increasing number of organizations doing, very often it’s standalone but alongside Hadoop and alongside other projects within the Hadoop ecosystem. Certainly, we see in terms of the main core user base, Spark is very much being used by data engineers and data scientists.
As we mentioned, you can connect your BI tools to that environment to enable access to that data in Spark by data analysts and business users, but there are other ways of doing that. There’s other projects within the wider Hadoop Big Data ecosystem that arguably might be better suited for that. I don’t want to give the impression we’re saying Spark is good for everything. We can go into more detail on this in the questions later, but just to illustrate that it is part of a bigger picture.
It isn’t a panacea. We don’t want to give that impression, but absolutely, as I talked about right at the start in terms of our predictions for 2017, we see it actually becoming not just one of the most significant technologies behind use-cases for Hadoop and big data and analytics, but arguably the dominant technology for those use-cases. With that, I want to thank you for your time. Just to remind you, I see some of you are already sending across questions. Thank you for those. We’ll get to those later on. You’ve got my contact details here if you’ve got any questions for me directly about 451. For now, I’ll hand you over to Dash at Qubole, and from there we’ll hear from Steve at Autodesk. For now, thanks very much for your time.
Dharmesh Desai: Great. Thank you so much, Matt, for laying down a great foundation for our webinar today. Hello, everyone. This is Dharmesh or Dash Desai from Qubole. I’m a tech evangelist. For those listeners that are not familiar with Qubole, it’s the fastest and only cloud-native big data platform that’s designed from the ground up to scale. It was founded in 2011 by the creators of the Apache Hive project, the same people that led the data service team at Facebook: Joydeep Sarma and Ashish Thusoo.
Just to give you an idea of the enterprise scale and readiness of the platform, let’s look at some numbers that are on the slide. Qubole today processes about 500 petabytes of data per month, comparable to the leading internet brands. This should give you an idea of how much effort has gone into creating this platform that’s enabling these companies, and a lot of the customers across the board, to become more data-driven, which brings up the next slide.
The vision of the company is to help companies become data-driven, like I mentioned, and also to democratize data and lower the burden on IT teams as well as the data teams across enterprises. This has been made possible by providing a self-service big data platform with these three elements at the core. The first one being an architecture that’s built to be agile, flexible, and scalable. We already saw some of the numbers in one of our earlier slides. That gives you an idea of the enterprise scale of the platform.
The second one is automation and transparency of infrastructure; we’ll get into that in a little bit. Also, how the platform provides simple tools and interfaces for data team members across enterprises. Let’s get into the first one: architecture built to be agile, flexible and scalable. In a traditional big data Hadoop deployment, compute and storage live together. This forces them to scale up and down together as well, which is not really ideal, because then elasticity is much harder to achieve and manage. This also means that you have to provision for peak. Data science and big data workloads, as we all know, are often bursty and unpredictable.
Provisioning for peak can lead to underutilized resources, and it can also drive up the total cost of ownership because of that. The other side effect of having compute and storage together is that the clusters must stay on, otherwise data becomes unavailable. One of the important benefits of moving to the cloud is the possibility of decoupling storage and compute. Storage is getting cheaper and compute can be very expensive. By separating the two, data teams can achieve greater elasticity and more agility within the organization. The idea is to use a persistent storage service such as Amazon S3, as you see on the slide, and computing power selectively and, more importantly, on demand.
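The cost argument above is easy to see with back-of-the-envelope arithmetic. All of the numbers below are made up purely for illustration; the point is only the ratio between an always-on peak-sized cluster and compute that runs on demand for a bursty workload.

```python
# Hypothetical numbers: a 100-node cluster, $0.50 per node-hour, and a
# workload that actually needs the cluster about 10% of the month.

hourly_node_cost = 0.50
peak_nodes = 100
hours_per_month = 720
busy_hours = 72  # cluster actually needed ~10% of the time

# Provisioned for peak: pay for every node, every hour, all month
always_on = peak_nodes * hourly_node_cost * hours_per_month

# Decoupled storage/compute: data stays in S3, compute runs only when needed
on_demand = peak_nodes * hourly_node_cost * busy_hours

print(always_on)  # 36000.0
print(on_demand)  # 3600.0
```

With these assumed numbers the on-demand model is an order of magnitude cheaper, which is the essence of the decoupling argument: the data in S3 stays available even while no cluster is running.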
By doing this, what happens is, you have storage that’s centralized and computation that’s distributed, which means the resources scale based on the workload, for example, compute-heavy versus storage-heavy. Moreover, it’s also easier and faster to test new technologies such as Spark. You don’t get locked into infrastructure where, to try out a new engine or data processing technology, you would have to go through the overhead of creating new clusters and maintaining and managing those. Which brings up the next attribute, flexibility, where every cluster can be brought up and down independently of the others. They run at cloud scale, with individual clusters scaling to thousands of nodes.
Moving on to the automation and transparency aspect of the self-service platform. Qubole has taken a very different approach, in that it lets data team members focus on the analysis part, the data science part, and then automates everything around it. For example, as seen on this slide, once the query is formulated and executed, the platform automatically deploys the cluster, and when finished, the cluster terminates automatically. As you can see in the drop-down on the right-hand side, those are all the clusters available for you to run this query against in this example.
The other big aspect of Qubole as a self-service big data platform is auto-scaling. On the right-hand side, you will see that the cluster is pre-configured to have a minimum of four nodes, and then several auto-scaling nodes that are brought up or down depending on the workload. You could go from, say, four nodes to a thousand nodes, thousands, in fact. The way auto-scaling is structured, or architected, within the platform is that it’s policy-driven. You pre-configure the clusters based on your workloads or your big data processing engines, what-have-you.
Then, when the platform needs to either bring up a cluster or create a cluster from scratch, it uses that as a template or blueprint in real time, just in time. Like I mentioned before, the cluster scales up and down. All of that has been taken out of your day-to-day operations, if you will. All the cluster types that are supported are listed on the slide: Spark, Hadoop 2/YARN, Hadoop and Presto. The other big attribute, or feature, of the self-service big data platform is support for Spot instances. Spot instances are resources, or machines, that Amazon puts out in a bidding-style market, where you can bid for machines for your workloads.
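The policy-driven scaling described above boils down to a simple idea: size the cluster to the workload, clamped between a configured minimum and maximum. Here is a minimal sketch of that policy; the function, its parameters, and the numbers are all hypothetical, not Qubole’s actual API or algorithm.

```python
# Sketch of a policy-driven auto-scaler: pick a node count for the current
# workload, never going below the configured minimum or above the maximum.

def desired_nodes(pending_tasks, tasks_per_node, min_nodes=4, max_nodes=1000):
    """Return how many nodes the cluster should scale to for this workload."""
    needed = -(-pending_tasks // tasks_per_node)  # ceiling division
    return max(min_nodes, min(needed, max_nodes))

print(desired_nodes(0, 10))      # 4    (idle: stay at the configured minimum)
print(desired_nodes(250, 10))    # 25   (scale up with the workload)
print(desired_nodes(50000, 10))  # 1000 (capped at the configured maximum)
```

A real implementation would also factor in things like node spin-up time and scale-down cooldowns, but the clamp-between-min-and-max template is the core of the policy.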
All of that has been automated within the platform, so you don’t have to worry about the pricing and when to bid. It also has the ability to fall back on on-demand instances depending on the workload and the use-case. These are all instantiated in customers’ accounts; they’re secured within your VPC or subnets or what-have-you. Moving on to the next section: Qubole as a big data service platform provides REST APIs as well as a web-based interface for every action that you can think of, basically, right from the SQL editor to command debugging. We also have an extension of Zeppelin integrated into the platform as notebooks.
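The Spot-with-fallback behavior just described can be sketched as a tiny decision function: take Spot capacity when it is available at or below your maximum acceptable price, otherwise fall back to on-demand. The names and prices here are invented for illustration, not the platform’s actual logic.

```python
# Illustrative sketch of the spot-bidding-with-fallback idea. Prices are in
# dollars per instance-hour and are made up for the example.

def provision(current_spot_price, max_spot_price, on_demand_price):
    """Pick Spot when it's available and cheap enough, else on-demand."""
    if current_spot_price is not None and current_spot_price <= max_spot_price:
        return ("spot", current_spot_price)
    return ("on-demand", on_demand_price)

print(provision(0.08, 0.15, 0.40))  # ('spot', 0.08)
print(provision(0.30, 0.15, 0.40))  # ('on-demand', 0.4)  spot too expensive
print(provision(None, 0.15, 0.40))  # ('on-demand', 0.4)  no spot capacity
```

The design point the speaker is making is that this decision, plus retrying when Spot capacity is reclaimed, is handled by the platform so users never think about bids at all.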
We can have various reports created out of the platform, for cluster usage per user, for instance. There’s integration with Ganglia as well as Tableau, and so forth. A lot of different things have been put into place for team collaboration, for example, integration with GitHub within the notebook, and being able to comment and share resources right from the same interface. Now we’ll look at some screenshots of these different aspects of the platform. The first one that I wanted to touch base on is notebooks. It’s an extension of Zeppelin. Data analysts, for example, can view, run, and visualize results of SQL, Scala, and Python in a single collaborative environment.
They can also iterate quickly with saved queries. Queries and results are always persisted, so you can view them even if the cluster’s not running. That’s cool. There’s also the ability to create multiple notebooks targeting different engines and clusters. Like I mentioned before, there’s GitHub integration for version control and tracking changes, as well as collaborating with data team members. Here’s another example of an out-of-the-box feature: if you have latitude and longitude columns in your query or dataset, you can plot those on a map. This is all available out-of-the-box for data team members to use. The next interface I wanted to touch base on a little bit was Explore.
This is where data team members, data engineers, and admins go to explore your current Hive metastore, a unified Hive metastore that is accessible across clusters. You can also browse through all your data stored in the cloud right from this interface. That’s pretty handy. The next interface I wanted to touch base on a little bit was the Scheduler. Data engineers can use this to schedule queries and commands at specific intervals. They can build workflows and schedule jobs targeting multiple engines, including Spark, Spark SQL, and Presto, and they can also import and export data from external data sources.
One of the cool features of the Scheduler is that you can add dependencies. You could say, run a particular job only if a Hive partition exists or an S3 file is there in a given bucket. Speaking a little bit about Apache Spark momentum on Qubole: to accommodate growing demand and leverage technological advancements made by the Apache Spark community, Qubole continues to release enhancements and optimizations pertaining to Spark offered as a service.
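The dependency idea can be sketched in a few lines of plain Python. The `ready_to_run` helper and the URI strings below are made up for illustration; a real scheduler would query the Hive metastore and S3 rather than an in-memory set.

```python
# Toy version of a scheduler dependency check: run a job only once all of
# its inputs (e.g. a Hive partition or an S3 object) actually exist.
# "available" simulates what the metastore and object store would report.

def ready_to_run(required, available):
    """True when every required input is present."""
    return set(required) <= set(available)

available = {"hive://events/dt=2017-01-01", "s3://bucket/input.csv"}
deps = ["hive://events/dt=2017-01-01", "s3://bucket/input.csv"]

print(ready_to_run(deps, available))                             # True
print(ready_to_run(deps + ["s3://bucket/late.csv"], available))  # False
```

A scheduler built around a check like this simply polls it each interval and only launches the job once it returns true, which is exactly the "run only if the partition exists" behavior described.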
Here are some of the highlights that I wanted to touch base on a little bit. The first one is support for Spark 2.0, and the second one is the GitHub integration. It allows data team members to share and restore the entire state of a notebook without having to rewrite queries or commands, even when the associated cluster is not running. That’s great. The next one is that we’ve optimized split computations for SparkSQL, which means optimized AWS S3 listings enable the split computations for these SparkSQL queries to run significantly faster.
As Matt referred to earlier, there are a lot of different things the community is doing to make the Spark platform more performant. Qubole’s done a few things of their own as well. In this particular case, speaking about split computation, if you have a large number of partitions in Hive tables, we recorded up to 6x to almost 50x improvement in query execution with S3 listings. That’s big.
The last one is heterogeneous Spark clusters. Now you can not only have dedicated on-demand or dedicated Spot instance clusters, but you can also have slave nodes in Spark clusters that can be of any instance type. That speaks to lowering your TCO depending on use-cases. We’re also proud to announce that we scaled up the largest Spark cluster in the cloud. It was launched on Qubole Data Service by one of our customers and auto-scaled up to 500 nodes, equal to 16,000 compute cores and 120 terabytes of memory.
In conclusion: Spark is definitely on the rise. But like Matt alluded to, there are definitely other technologies that play a significant role in the bigger picture of the ecosystem.
Next, I’m going to have Steve come on and talk a little bit about their use-cases and how they’re using Spark as well as other technologies within the organization. [pause 00:33:01] I’m sorry about that. Feel free to contact me. Here’s my email: [email protected], and my Twitter handle. With that, I’d like to hand it over to Steve Gotlieb. Thank you.
Matt: Thanks, Dash. Thanks a lot, Dash.
Steve Gotlieb: Hello, my name is Steve Gotlieb. I work for Autodesk; I’m part of our big data engineering team, working on engineering and infrastructure for our product development group. We’re building an enterprise-wide analytics system based on Spark running on Qubole. I’ll take some of the ideas that Matt presented about Spark and some of the things that Dash discussed regarding how Qubole can provision and provide access to Spark, and show how Autodesk is using that.
A little bit, first, about Autodesk. We’re a design company. We’ve been around since 1982 and have really gone through the evolution of a traditional desktop product that shipped on floppy disks, moved into the CD and DVD era, and now we’re really moving into a new wave: a SaaS model, subscription, and lots of web services that are interconnected with our desktop products.
We’re developing what we’re calling 360-branded products. These are really lots of products and services that work together to accomplish things that enable our customers to build. We’re really about making things. We’re software for people who want to build things: companies building cars or buildings, designing a smartphone, or creating special effects for lots of films that have been nominated for special effects Oscars.
Most people have only heard of AutoCAD, but we’re a lot more than that. We are also becoming a data-driven company. We have goals where we want to use data to allow our culture to make decisions based on the data we have, to drive machine learning, ad-hoc data discovery and decision making based on usage events. We want to have a user experience that is really a good one for our users, where adoption of lots of different features in our products can be measured.
We use this data to improve our UI for customers and figure out how we can help them optimize their workflows. Another goal of our big data platform is to support ETL development, and I’m going to get into how we have implemented a self-service model there. The Facebook philosophy, which is what the Qubole founders came from, is really built on providing a data infrastructure that can be used across the company.
We’re turning our data assets into a utility, making data collaborative so that we can have a large number of employees throughout different parts of the organization that can collaborate and access that data. Of course, we need it to be scalable because the more products we have, the more customers and the more usage of those products, the more events we’re capturing.
We see an ever-increasing amount of data that’s being generated through our platform, and then accessed by all different groups like product managers, customer support, marketing, and product teams that are developing new features. Our requirements for our big data platform were spelled out. We first wanted to enable a data exploration layer. We’re good at collecting data, but we haven’t been very good at making that data widely available for users to explore.
It’s always been a little difficult in the past due to the complexity of tools, the lack of scalability, and other things. One of the things we wanted to do was have a data exploration layer where users could access different levels of data aggregated in datasets, where most queries would complete in 30 seconds or less. We also needed a multi-tenant environment, because we know we have lots of users running queries concurrently around the clock, and we want to try to meet that demand without users bumping up against each other over resource limitations and so forth.
Cost, of course, was a consideration as well. We want to be able to maximize our investment when running things on AWS. We want to be able to use the optimal instance types. We want to be able to utilize the Spot market. Another thing here is the auto-scaling feature that Dash mentioned, which really helps us have our demand drive our system resource utilization.
Then, we wanted to be able to have some support. We needed to be able to onboard users quickly. Spark is fairly new, and a lot of people don’t have experience with it, so we want to get people up and running. Establishing relations with Qubole support has helped us to do that. Training also is key, and there’s lots of education that we’re trying to provide users.
I’m going to walk through a journey of what we’ve built in terms of an abstraction of our pipeline, where we found issues, and how we’ve solved them. Reading from top to bottom, left to right, we have our instrumentation layer, in which teams engage and integrate to send their data. We have over 200 product teams, and they each have different types of instrumentation, different levels of detail in their instrumentation, and different things that are important to them to collect. We have a basic set of data, a base schema of required fields that we get from everyone, and then we allow flexibility for products to do some custom instrumentation as well.
They forward data through our transport layer into our data lake, where it can be accessed by the various teams that are interested and consumed through a self-service layer. That’s the idea. Our initial implementation, which was built on Amazon EMR, had some issues.
Ingestion was difficult; there was always a slowdown there. Hooking up with our transformation layer, configuring any kind of data transformations, and then the entire self-service process, defining and building it and getting teams to engage with it, was a slowdown. We solved the problem with the slowdown for instrumenting and ingestion by creating a set of SDKs in a variety of languages.
Right now we’re using Kafka and Flume in our ingestion pattern. If we change that service behind the scenes, users won’t need to do anything; they just have to integrate our SDK. That’s saved us a lot, because in the past, when we’ve had to change things, we had to go to two hundred teams and have them make configuration changes.
It was always painful and difficult to do that. I’m not going to focus much on this half of it. This is getting data into our system, but I wanted to show that these are the problems we had. The other problem, around data access, is where I’m going to focus the rest of my presentation. What we really want to do is build a true self-service model.
This will allow teams to engage, define their business requirements, build data sets that can meet their requirements, and validate those data sets with minimal interaction, in a self-service type of environment. How do we build self-service analytics? This has been a great challenge for us.
First, we have to highlight the value of a self-service pipeline. The goal is to serve the data users and meet our corporate goals; we have a lot of needs from the data users, and everybody is working toward our corporate goals. We don’t want any kind of barrier to adoption of our pipeline.
Teams in the past have gone out and tried to build their own solutions, and we have built a compelling case for users to adopt our pipeline instead. They don’t have to build or pay for something separately; we have one centralized pipeline, and that frees up resources to create customer value instead of building infrastructure. We also want to be able to support analysis across multiple products.
We have one common platform with a schema that has some commonality across products. We can look at cross-product analysis and figure out how many products are used per user and which products are used most frequently together. How can we improve the customer experience by understanding how our products are actually being used?
Also, with a single self-service layer that everyone is using, we can address security and data governance in a centralized manner and provide standard product metrics and KPIs. We can control the data quality of our events and our master data and get closer to a single version of the truth. We can also support and leverage sophisticated analytics tools.
For our taggers, we want to create insights, not infrastructure, at least not a complicated infrastructure. The idea is to allow interactive query, which lets pretty much anyone who is comfortable working with data become an analyst, and we want to make those analysts more productive.
Our solution is interactive query through Spark SQL. Our BI reporting tools connect through JDBC, which is running on HiveServer2 on our cluster. Exploration is through SQL, and we also support DataFrames and iterative exploration through Spark notebooks. Our users are using SQL, Scala, R, and Python, but mostly Spark SQL and PySpark.
The notebooks are really for that advanced exploration and research, and we want to scale across our tools. A little bit of architecture here, and I’ll show how this fits in a few slides down. It shows that we’re running Spark on Qubole, with Zeppelin notebooks that users are accessing for exploration.
We’re currently connecting BI tools through a virtual layer, Denodo, which allows us to connect data sources that are not in our Spark ecosystem and also provides a level of security for some of our BI tools to connect through. I’ll skip over this. The self-service analytics pipeline in detail again has all these different components that we’ve built.
I have an overlay here of the different areas that we’ve dealt with, from instrumentation to data transport to our data lake, and then we have a compute layer. You can see here that the data lake and compute are separate. That was touched upon earlier in the presentations today, where we decoupled storage and compute.
We’re using S3 for storage, our metadata lives in a Hive metastore, and for compute we’re running Spark on Qubole. We can scale compute up and down, and our storage is effectively infinite. We’ve also abstracted out our compute versus fast access, and I’ll get into that in a couple of slides. We have a continuous integration process which enables our self-service.
Underneath the hood of that abstraction layer, and this might be a little bit of an eye chart, this gets into a little bit more of the technology that we’re using behind the scenes. We are running BI tools like Qubole, Tableau, and Qmetrix to consume data, and for our build integration we’re using Jenkins and GitHub.
We really are running three different types of Spark clusters. We have an ETL cluster, a notebook cluster, and fast access clusters, and they each serve a separate purpose. The ETL cluster runs workflows in batch that are submitted through an Oozie scheduler; we have autoscaling enabled, and we have separate Spark applications for each workflow so that we can run things concurrently.
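As a rough sketch of the batch side of this, an Oozie workflow that submits one Spark application per workflow might look like the following. Every name, path, and class here is illustrative; this is only the general shape of an Oozie Spark action, not Autodesk's actual workflow definitions:

```xml
<!-- Hypothetical Oozie workflow sketch: one Spark action per ETL workflow,
     so separate workflows become separate Spark applications and can run
     concurrently on the ETL cluster. All names and paths are illustrative. -->
<workflow-app name="daily-aggregation-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-etl"/>
  <action name="spark-etl">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>yarn-cluster</master>
      <name>daily-aggregation</name>
      <class>com.example.etl.DailyAggregation</class>
      <jar>${nameNode}/apps/etl/daily-aggregation.jar</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>ETL workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```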
Our notebook clusters are specifically designed for users who are accessing through Zeppelin notebooks. These are mostly ETL developers and data scientists. We also have autoscaling enabled here, and again separate Spark applications per user, which allow for concurrent use. Fast access is a separate cluster that we run, which is used by BI tools and SQL clients.
This is where Tableau, or maybe SQL Workbench or TOAD, will connect through our JDBC connector. Right now we don’t have autoscaling enabled there, because we’re trying to deal with some issues around Spark caching and how that works with autoscaling. Our self-service model really fits a variety of users throughout the company.
This inverted pyramid diagram shows that the majority of users are non-technical, but as you get deeper into the data there’s a smaller number of users who need access to finer-grained data, so we have different tool sets for users as they go down the stack. Qubole is really positioned for our data engineers and data scientists to develop and test ETL logic and then deploy through our pipeline.
I earlier referred to our managed Hive metastore, which is a central place where everything is defined as a table. All users are accessing tables that are defined with different schemas, and they can easily navigate through the explorer, or any browser that can show all of the tables in the metastore. We discover and share through our wiki, and we use the Hive metastore to provide access controls hooked into our LDAP system, so we can have fine-grained access controls. We’ve optimized our storage on S3 using reduced redundancy storage, and performance is optimized using JSON and Parquet file formats. I think this slide was a duplicate.
This slide is about how we deploy the pipeline as a service. Qubole notebooks are used for data exploration, and once users have come up with a working prototype, it can be deployed through our ETL pipeline: code is submitted to GitHub (right now that’s manual, but we’re working on getting GitHub integration with our Qubole notebooks) and goes through code review. We have a bunch of folks who are kind of our Spark experts reviewing code developed by other people in the organization against standards defined in our wiki, and they review the logic. They tune resources and make sure that any security concerns are addressed, and so forth. Then we have build automation through Jenkins, which deploys it through our Oozie workflow and runs the job through a direct Spark submit via the Qubole API on our ETL cluster.
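As a loose illustration of that last step, a deploy script might assemble the Spark submit for a batch job like this. The paths, job names, and helper function are all hypothetical; the real pipeline described above goes through Jenkins, Oozie, and the Qubole API rather than a hand-built command:

```python
import shlex

# Hypothetical sketch: compose the spark-submit invocation that build
# automation would ultimately run on the ETL cluster. Nothing here is
# Autodesk's actual tooling; it only illustrates the shape of a batch
# Spark submission with per-job configuration.

def build_spark_submit(job_path, job_name, extra_conf=None):
    """Return a spark-submit argv for a batch ETL job."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
           "--name", job_name]
    # Per-job Spark configuration becomes repeated --conf flags.
    for key, value in (extra_conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(job_path)
    return cmd

argv = build_spark_submit("s3://example-bucket/jobs/daily_agg.py",
                          "daily-aggregation",
                          {"spark.dynamicAllocation.enabled": "true"})
print(shlex.join(argv))
```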
That is a real quick overview of how we’re doing self-service data at Autodesk. I want to thank you for listening. My email address is here, and I think an earlier slide had my LinkedIn as well. With that, I’ll turn this back to Matt for questions.
Matt: Great. Thank you, Steve. That’s a fantastic presentation, and we’ll let you get your breath back for a second. Just to remind everyone, if you’ve got questions you can submit them using the chat on the left-hand side, and we’ve already got quite a few good ones. We’ve got only a small amount of time left, so we’ll try to get through as many as we can. The first question is: “If storage is decoupled, as you talked about, what are the implications in terms of an I/O bottleneck? The reason for bringing compute to the storage was to avoid that issue, so how is that addressed in Qubole in particular?”
Steve: Yes, I think that’s a great question. The key to keep in mind is that even though the storage is separated, or decoupled, from compute, everything including the cluster is still running within the same VPC or subnet, so you wouldn’t really see that big of a performance or latency issue from that point of view. I don’t know if that answered the question.
Matt: Yes, and in relation to that, there’s another question asking about the performance of the system in terms of latency and throughput, and how that changes with an increased number of nodes. Obviously related, so we’ll take that one quickly, and then we’ve got one more topic.
Steve: The impact of an increased number of nodes would definitely depend on the use case. Nodes are brought up only if the Spark jobs require them, for example if they require more containers or more executors. When reading from S3, there’s a maximum amount of throughput per connection, so more tasks on more nodes actually means better read throughput.
Matt: Okay, great, thanks. Next question, Steve, for you. In general, I was interested in what challenges you faced in moving to the self-service approach that you described, but we did have one question in particular which asked how you dealt with master data governance in relation to that architecture with self-service analytics. I don’t know if that fits specifically into the challenges you had, but perhaps you could talk about that?
Steve: Sure, yes, I’ll explain in more detail. In terms of master data governance, we have our SDKs, which enforce a schema. We also have a validation tool that we run on the ingested data to make sure that the incoming data meets certain rules, so that we don’t have any bad data and the required fields are included in the data we ingest. Having a common schema really helps, because the problem we had before was that we had 200 different teams all instrumenting and calling a user ID something different. Now we’ve standardized that and have an evolving schema that can meet growing needs and also help us govern our master data. I would say we’ve had three challenges with self-service. I saw another question about this, which was: what kind of team do we need to manage our analytics pipeline? One of the challenges we’ve had is that this requires much more support than we expected.
We knew that there would be a lot of support required when you’re bringing on users who are brand new to Spark, new to ETL even. There is some hand-holding, and we consider it an upfront investment to build a lot of strong data analysts. There will be some suboptimal implementations, and that’s a challenge to deal with. Then, concurrent usage is another challenge that we’ve had. Qubole has really helped work with us on this one; the solution was to have separate Spark applications per user on a shared cluster, which allows us to leverage YARN resource allocation. We can provide isolated resources to users if we need to, and we can have a shared resource pool using the fair scheduler in YARN. The third challenge is just some tool incompatibility, with Spark still being a fairly young technology. There are some tools out there, BI tools for example, that people are using that don’t support it. There are also challenges around how different tools generate queries when you’re running large queries against big datasets, and trying to set expectations around what the latency should be.
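A minimal fair-scheduler allocation file of the kind Steve describes might look like this. The queue names and resource figures are purely illustrative, not Autodesk's actual configuration; the point is that separate queues let per-user Spark applications share the cluster while still allowing isolated resources where needed:

```xml
<?xml version="1.0"?>
<!-- Hypothetical YARN fair-scheduler.xml sketch. Separate queues give ETL
     workflows and notebook users a shared pool with weighted shares;
     minResources reserves a floor for the ETL queue. Illustrative only. -->
<allocations>
  <queue name="etl">
    <weight>2.0</weight>
    <minResources>8192 mb,8 vcores</minResources>
  </queue>
  <queue name="notebooks">
    <weight>1.0</weight>
    <maxRunningApps>20</maxRunningApps>
  </queue>
  <queueMaxAppsDefault>50</queueMaxAppsDefault>
</allocations>
```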
Matt: Great, thanks. You mentioned one of the challenges there, skills, and it actually relates to one of the questions we got, which pointed to another analyst company noting that there are challenges to Hadoop adoption in general, and one of the things that relates to that is skills. It was a good question and I wanted to point that out. We certainly see those same challenges that Steve discussed, and they clearly also apply to Spark. In relation to skills, and you talked about getting up to speed there, I was wondering if you actually sent any of your team to training with Qubole to help get up to speed real quickly.
Dharmesh: We have. I’ve encouraged our users to take advantage of the training, the public courses offered, and also to leverage some of the other classes. We’re working on trying to schedule additional training, and to make that even a requirement for users to get on our platform.
Speaker: Great, okay. I think we’re pretty much out of time. I know there are other questions that have come in but I’m not sure we’re going to get a chance to address them. We can obviously follow up. We’ll share those questions out between the team and make sure those get addressed. I think for now I would like to thank everyone obviously for joining us on this webinar. Thank you all very much for your time.