Big Data and Oracle BMC – Qubole On Demand Webinar


Maddie: Good morning, good afternoon, and good evening, depending on where you are in the world. Welcome to today’s webinar, entitled Data Warehouse Modernization: Big Data and Cloud Success with Qubole and Bare Metal Cloud. In today’s session, we’re going to be talking about how to leverage Oracle Bare Metal Cloud and Qubole to optimize the cost, performance, and scale of your big data initiative.

At this time, I’d like to introduce our two speakers for today, Craig Carl and Xing Quan. Craig joins us from the Oracle Bare Metal Cloud team, where he is the Director of Solutions Architecture. Joining Craig is Xing, the Senior Director of Product Management here at Qubole. With that, I’ll go ahead and hand it over to you, Xing.

Xing Quan: Thanks. My name is Xing. I am the Head of Product Management for Qubole. We’re here today to talk about big data success in the cloud. First, let’s settle on why this is important. It’s no secret that big data is disrupting markets. If you look at all of these companies and the use cases, the thing that they have in common is that data is becoming a differentiator in the product and customer experience.

Look at something like Netflix, where the recommendations that they show, in terms of what TV shows and what movies you would like, have become a core part of the product experience. The same goes for something like Airbnb, where they are actually suggesting prices for hosts that are based on empirical data of past trends, as well as demand from recent searches from customers. All of these companies have discovered that data is quickly becoming their number one asset. They have built out data platforms to take advantage of this fact.

There are a bunch of challenges with implementing and executing on a big data platform. First of all, there is a large variety of data sources. You can have data that sits in relational databases. You can have data in logs. You can also have data that you purchase or obtain from third parties. All of these pertain to your essential customer. How can you get to a 360-degree, completely centralized view of your customer? Well, you need to be able to integrate all of this data within one central data lake.

Volume also becomes an issue. With volume, you need to have the scale-out architecture and the technology to be able to process massive, petabyte-level volumes of data, in order to get through a large historical analysis, as well as joining all of your various data sources. Running these big data platforms on-premises can be both complex and expensive. It requires very specific expertise to help with the operations and with the engineering.

Let me also quickly touch on why Spark, which has become the hottest big data technology in the space. Spark has been adopted very quickly for a number of reasons. First of all, it does a majority of its processing in memory, which is faster than traditional hard disk drives. There are increasingly more compute options available which have large memory footprints. It has a fully featured ecosystem of use cases, such as SQL for interactive ad hoc queries and stream processing for real-time data analysis. It has built-in machine learning libraries that help you develop and train learning models. Then finally, it has graph processing for very complex correlations in your data sets.

On top of that, it has a very simple API which has been adopted by both data engineers and data scientists. Of course, Spark is open source. It helps you avoid vendor lock-in and technology lock-in. A very large community has also been built around it, in terms of building out extensions as well as tutorials and educational materials.

The problem with Hadoop and Spark, which have similar architectures, is that the traditional model co-locates the compute and storage within a single compute node. Hadoop and Spark both scale very well horizontally, meaning that you can keep adding additional nodes as your workload increases. The problem is that you are forced to scale compute and storage at the same time, which is not ideal. If you have data volumes that keep increasing, but your compute needs are not necessarily increasing, you’re actually scaling inefficiently, because what you’d really like to do is just scale the storage independently.
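
As a rough illustration of that coupling, the sketch below counts how many nodes a co-located cluster needs when storage grows but compute demand stays flat. The per-node capacities (10 TB and 16 cores) and the workload figures are hypothetical values chosen for the example; they do not come from the webinar.

```python
import math


def nodes_needed_coupled(storage_tb, compute_cores, tb_per_node, cores_per_node):
    """Nodes required when every node bundles a fixed amount of storage and compute."""
    by_storage = math.ceil(storage_tb / tb_per_node)
    by_compute = math.ceil(compute_cores / cores_per_node)
    # Whichever resource runs out first dictates the cluster size.
    return max(by_storage, by_compute)


def idle_cores(storage_tb, compute_cores, tb_per_node, cores_per_node):
    """Cores that are paid for but not needed by the workload."""
    nodes = nodes_needed_coupled(storage_tb, compute_cores, tb_per_node, cores_per_node)
    return nodes * cores_per_node - compute_cores


# Hypothetical workload: compute demand fixed at 64 cores, data volume growing.
for storage_tb in (40, 80, 160):
    nodes = nodes_needed_coupled(storage_tb, 64, tb_per_node=10, cores_per_node=16)
    wasted = idle_cores(storage_tb, 64, tb_per_node=10, cores_per_node=16)
    print(f"{storage_tb} TB -> {nodes} nodes, {wasted} idle cores")
```

With storage and compute decoupled, the compute side of this example would stay at four nodes while only the object store grows.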

The other issue that the traditional on-prem model for Hadoop and Spark presents is that the cluster must be persistently on, or else the data is inaccessible. What that means is that all of the consumers of your data lake, whether it be programmatic API actions or interactive ad hoc access by data analysts or data scientists, always need to be querying a cluster that is on. Then when there is no activity, that cluster still must stay on because of the on-prem model.

Ideally, what you’d want to have in a modern data platform is a cloud platform that can give you on-demand and elastic compute capabilities, meaning that when you are ready to do some analysis or access some data, the infrastructure is provisioned automatically and very quickly for you. You want to have scale-out object storage with a dedicated storage service, which is consistently accessible and low cost, that you can use as your data lake. This data platform should also be able to expand and contract, because your workload changes not only over time but also throughout the day, and because you are not always able to predict what that usage will look like.

Finally, you ideally want to have a turnkey service which provides a high degree of automation and acceleration so that you can get very quickly to a self-service data platform story for your data consumers, who can be engineers, data scientists, or even business analysts. With that, I’m going to pass the mic over to Craig, who is the Director of Solutions Architecture for Oracle Bare Metal Cloud. He’s going to walk you guys through the Oracle Bare Metal Cloud service.

Craig Carl: Thank you very much, Xing. Oracle has entered the market recently with a new IaaS, a new compute cloud offering. It has some significant differentiators in the market. Some of those make it a tremendous place to run big data workloads. The combination of our compute infrastructure with Qubole’s management, usability, and scalability tooling makes this one of the very best places, on-premises or in the cloud, to run big data workloads.

The Bare Metal Cloud team has existed for about three years inside of Oracle. We are based in Seattle, because that’s where the cloud talent is. Oracle came here to get cloud talent. A vast majority of the people who work here have been at Oracle for less than those three years. We come from AWS, we come from Azure, we come from other large-scale cloud providers. I call it a late-mover advantage. We’ve built genuine clouds. We’ve made all of those mistakes. We’re now doing it again.

With everything we’ve learned about large-scale distributed compute challenges and delivering large-scale cloud infrastructure, we have started again from scratch. We’re building something new and exciting. Everybody here loves these problems and loves solving them. These are just incredibly interesting problems to solve, and the result of that is we’re building an amazing, amazing product.

Take that and pick up a bunch of people with lots of experience in distributed compute, lots of experience in delivering amazing cloud products. Combine that with Oracle’s laser focus on enterprise workloads and their incredible success in that market for decades and you get a powerful IaaS platform that is ideal for things like big data. We offer, of course, virtual machines like every other cloud provider, but also bare metal instances. This is important, we hand you an entire bare metal host. There are no Oracle agents. There are no hypervisors. This is an entire bare metal host dedicated to you. No noisy neighbors, no shared resources, nothing.

To do that, we’ve pushed all of our secret sauce into the network. Other cloud providers fork hypervisors, they make lots of changes in the hypervisors, they put their secret sauce in the hypervisor. We’ve moved all of ours into just a tremendously impressive software-defined network that lets us offer both bare metal hosts in multiple sizes and virtual machines.

We took a performance-first approach to the market. Of the three instance shapes that we led with, the largest has 36 cores, 512 gigs of memory, a 10-gigabit network, and 28.8 terabytes of NVMe SSDs, the very fastest durable storage you can buy today, as the local ephemeral storage. When we went to market, the smallest machine we had was exactly the same without the NVMe storage. Definitely a performance-first approach.

At the same time, virtual machines are vital. You don’t always need 36 cores to run your applications, to run the tooling that interfaces with your big data, to run your web front ends or your visualization tools. You don’t need bare metal for that. Of course, we offer virtual machines. The experience for customers between virtual machines and bare metal hosts is identical. You come to the console, you use the API, you use Terraform, you use an SDK to launch virtual machines and bare metal with the exact same experience. You just change the shape, the instance name or the instance shape.

Which means it is trivial to combine bare metal instances and virtual machines in a solution. They’re on the same network. They’re connected. They have the same incredibly fast network. We provision at the same speed as other providers, so you can get a bare metal instance, with the 36 cores, 512 gigs of memory, the 10-gigabit network, and the massive amount of NVMe storage, in less than five minutes, a little bit longer for Windows. You can get a Linux virtual machine instance in about 90 seconds.

We offer multiple payment methods. Oracle offers lots of contract vehicles, including hour-by-hour, pay-as-you-go pricing, which is powerful with big data, especially with Qubole. Because Qubole will scale your big data cluster up and scale it down, your bill will adapt as the cluster scales up and down.

The Bare Metal Cloud is built in a regional model. The regional model is, again, critical for big data workloads. We can spread a Hadoop cluster over multiple availability domains. Each availability domain is separated from every other availability domain by some number of miles. At the same time, we took guidance from the Oracle database team and the team that runs Active Data Guard, to make sure that our availability domains are close enough together to support synchronous replication for Oracle workloads.

We’ve hit a real sweet spot between separating them far enough apart that any physical damage, storms, car accidents, natural disasters of any sort, won’t affect multiple availability domains, but close enough together so the interconnects are incredibly low latency. There’s a 10-gigabit network between every instance, including across availability domains. You get 10 gigabits of traffic between every bare metal instance in your environment. A little bit less for virtual machines, depending on the size of the virtual machine. We use a very flat Clos network inside of an availability domain, so you are never more than two hops away from another instance.

It’s a very fast, very low latency network with no oversubscription, and we should talk about that really quickly. Oversubscription in a network has existed primarily to drive down costs for the cloud provider. At a technical level, it’s not difficult to implement no oversubscription. You just have to pour money into that infrastructure, and that’s exactly what Oracle has done. From the host to the top of the rack, no oversubscription. Up into that flat Clos network, no oversubscription. We can commit to you absolutely rock solid, predictable performance at 10 gigabits between your instances.

Our compute product is strong, especially for supporting big data and the applications that surround big data. Most relevant for Qubole and for big data customers are the high I/O and the dense I/O instances. These have the brand new Intel Xeon E5-2600 processors, 36 cores each at 2.3 GHz, the 10-gigabit network we’ve talked about, either 28.8 or 12.8 TB of NVMe SSDs, and half a terabyte of RAM, which again is tremendous for Spark. We support lots of OSs: Oracle Linux, CentOS, Ubuntu, Windows. We support custom images, which Qubole actually uses extensively to deliver the service. You can bring your own custom OSs and images to us as well.

Big data is a great place to answer certain questions and to analyze certain types of data, but in a big data environment it’s typical to also have data that is best analyzed with relational databases. Obviously, this is where Oracle absolutely shines. We can offer a single-node database as a service, two-node redundant, highly available Oracle RAC as a service, and quarter, half, and full rack Exadata as a service.

Exadata is the best combination of a database engine and hardware that you can get on the market today. You can now consume it in a subscription model, a millisecond away from your big data Hadoop and Spark clusters and the same sub-millisecond distance away from your application stack. We can connect these to your own premises using a no-charge VPN interconnect, direct connections through our FastConnect product, and MPLS connections through our FastConnect partner edition.

We can host these great big data workloads. We can attach relational infrastructure to these big data workloads. Of course, there’s a high performance, three replica, designed for 11 nines of durability object store, less than a millisecond away from all of this that we can offer as well.

We take this incredibly impressive IaaS product and we deliver it at a really compelling price. Compared to AWS, and these comparisons are hard because we’re not running the exact same infrastructure, we are at a minimum 20% lower on price to performance. When we move up into the virtual machine infrastructure, we are nearly 40% lower. All of the cloud providers have always charged high retail prices for data transfer. We’re just not doing that.

Data transfer in and out of the internet is one cent per gig outbound and free inbound, with no inter-AD traffic charges, and you get your first 10 terabytes of egress for free. A big chunk of our customers actually never pay for data transfer, and those that do have driven their costs way, way down.
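
To make that pricing concrete, here is a minimal sketch of the transfer charge using the figures quoted in the talk: one cent per gigabyte outbound, free inbound, and the first 10 terabytes of egress free. It assumes the free allowance applies per billing month and that 1 TB is counted as 1,000 GB; neither detail is stated in the webinar, and the function name is made up for the example.

```python
OUTBOUND_CENTS_PER_GB = 1      # "one cent per gig outbound"
FREE_EGRESS_GB = 10 * 1000     # first 10 TB free; assumes 1 TB = 1000 GB


def monthly_transfer_cost_cents(outbound_gb: int, inbound_gb: int = 0) -> int:
    """Internet data transfer charge in cents for one (assumed) billing month.

    Inbound traffic is accepted as an argument only to show that it is free.
    """
    billable_gb = max(0, outbound_gb - FREE_EGRESS_GB)
    return billable_gb * OUTBOUND_CENTS_PER_GB


# 8 TB out stays inside the free allowance; 12 TB out pays for 2 TB.
print(monthly_transfer_cost_cents(8_000))            # 0 cents
print(monthly_transfer_cost_cents(12_000, 50_000))   # 2000 cents, i.e. $20
```

Working in whole cents keeps the arithmetic exact and avoids floating-point rounding in the totals.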

It’s the combination of a whole bunch of different things that results in a tremendous big data platform. It’s no noisy neighbors. Even in our virtual machine fleet, we closely commit resources to virtual machines. Even if every other virtual machine that’s co-located with you on the host is using 100% of its available resources, it does not impact you. There is no oversubscription.

At the same time, you can’t burst into additional capacity, and that’s important because it’s proof that we’re not oversubscribing or sharing resources. It’s an entirely 10-gigabit network, going, believe it or not, to 50 gigabits around Oracle OpenWorld. 10 gigabits is low; we’re going up to 50 gigabits, an even more impressive product. Then there’s the bare metal compute and the NVMe SSDs, which are incredibly low latency and incredibly high performance.

We have the object store. We have the Oracle relational database systems. Of course, you can spin up a MySQL or PostgreSQL server if that’s your relational database of choice. Take all of that and the low latency network, and combine it with Qubole’s experience in managing and scaling big data and making it far easier to consume and far easier to use. I’ve used the product. I love the product. It’s impressive in its management. It’s impressive that I can walk away from the product and it’ll shut itself down, it’ll reduce its size, or it’ll stop entirely, depending on a set of rules that I apply. Then I go back and I run another query and the thing just starts up, and it just takes a little bit longer to run the first query because the cluster is starting.

We take the combination of all of those things and we are the very best place, on-premises, in the cloud, anywhere, to run your Hadoop and Spark workloads. Xing, I’m going to turn it back to you now.

Xing: Great. Thanks, Craig. Let me talk a little bit more about Qubole, which is the turnkey big data service on Oracle Bare Metal Cloud. The Qubole Data Service, why would you want to use it? I think that the big things are that we are a very simple service. It’s a complete data platform where you don’t need to manage the infrastructure. Qubole provisions it on your behalf.

As Craig mentioned, we help size the clusters, we do this automatically with software. The end result is that you end up having Spark and Hadoop clusters in just minutes. It’s all built on top of the Oracle Bare Metal Cloud and all of the cost and performance advantages that Craig had talked about. Finally, all of this is done with pay-as-you-go pricing. You can get to something like self-service of your data with your Spark and Hadoop clusters, and you can do it with very, very little administrative or management costs.

Let me just dive a little bit deeper into the actual product and the Qubole data service. We have designed Qubole to scale across the enterprise for all the data consumers that would need access to data. This includes analysts who are perhaps writing ad hoc SQL queries. It includes data scientists who are doing a variety of data exploration as well as data modeling and in some cases, advanced modeling with machine learning or deep learning type algorithms.

Finally, it also includes data engineers who are setting up data pipelines for ETL and reporting and are also building integrations with vertical applications, meaning applications that programmatically read data from the data lake in order to inform some product experience. We run completely on open source engines. We support Apache Spark and Hadoop for large-scale data transformation. We support Hive for ETL and for some ad hoc queries. Finally, we support Presto, which is a very fast interactive ad hoc query engine. As I mentioned, all of this is built with native integration with the Oracle Bare Metal Cloud Service, which leverages their speed, their performance, and their network architecture.

The results are astounding. We were super excited to work with the Oracle Bare Metal Cloud team because we found that their product really worked and was effective. When we benchmarked Spark SQL with a very typical TPC-DS data set, we found that the combination of Qubole and the Oracle Bare Metal Cloud platform performed more than twice as fast. It performed 115% faster than a comparable on-premises setup. You really experience the impact. Your analysts will get their data quicker and more efficiently, and you’re doing this all in a pay-as-you-go environment, which has great performance and low cost.

Really quickly, what makes us different? I think the big message is that we help you become productive. There’s a lot of automation built into the product, in terms of automatically making use of the Oracle BMC APIs but also in managing the cluster lifecycle. For instance, you don’t have to think about provisioning hardware. You don’t have to think about sizing it and shutting it off. All of this is done automatically by Qubole and is based on context. It is based on the workload and the demand that is on that cluster.

You can imagine if you have, let’s say, a Spark SQL use case where you have many, many analysts running ad hoc data queries. That cluster is going to be very, very busy in the middle of the day when everybody is at work and they’re all looking to grab some data from the warehouse. After work hours, that cluster will be relatively quiet. As the workload changes, Qubole is smart enough to scale the cluster down to a size that is appropriate for the workload that is occurring on the cluster at that time.

The other thing that makes us different is the user productivity. As a consumer of data, you no longer have to think about infrastructure. You don’t have to think about software version. All of this is made available in a simple interface where you can just concentrate on writing your query, looking at the results, and investing that back into a business decision.

I think for the administrators and for the ops personas, Qubole really takes advantage of the elasticity of the cloud. On average, we find that our customer clusters scale 34x. What that means is that if they have a minimum of, say, 10 nodes, then on average they are scaling up to 340 nodes at peak. With this elasticity, and because you’re not provisioning to the max all the time, you’re saving the cost for when the clusters are not running, but you’re also driving increased productivity when the demand is there and you need to burst.
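
The 34x figure can be turned into a quick back-of-the-envelope comparison between an elastic cluster and one provisioned at peak all day. The 10-node minimum and 34x factor are from the talk; the daily profile below (6 busy hours at peak, 18 quiet hours at the minimum) is a hypothetical shape picked only to show the arithmetic.

```python
def peak_nodes(min_nodes: int, scale_factor: int) -> int:
    """Peak cluster size implied by a minimum size and an average scale factor."""
    return min_nodes * scale_factor


MIN_NODES = 10
PEAK = peak_nodes(MIN_NODES, 34)  # 10 nodes * 34 = 340 nodes

# Hypothetical day: 6 busy hours at peak size, 18 quiet hours at minimum size.
hourly_nodes = [PEAK] * 6 + [MIN_NODES] * 18

elastic_node_hours = sum(hourly_nodes)           # what an elastic cluster uses
static_node_hours = PEAK * len(hourly_nodes)     # provisioned to the max all day

print(f"elastic: {elastic_node_hours} node-hours")
print(f"static:  {static_node_hours} node-hours")
print(f"saved:   {1 - elastic_node_hours / static_node_hours:.0%}")
```

Real workloads ramp up and down gradually rather than jumping between two sizes, so the actual saving depends on the shape of the demand curve.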

Then finally, all of this is on top of the core benefits of a cloud IaaS infrastructure with an object store as a data lake, which is infinitely scalable and low cost and durable, as well as network performance, where a lot of data is going to be shared across nodes in the cluster, and having a really good network pipes, as Craig mentioned, really helps the performance.

We like to say that Qubole operates at cloud scale.

If you look at the scale that we’re operating at, we’re processing 500 petabytes of data monthly. We’re basically processing half of an exabyte on behalf of our customers. That’s useful analysis, and that’s directly impacting the business for our customers. We’re running huge Spark clusters of 500 nodes and above in the cloud, and this is across thousands of clusters that get started every month.

Let me dive just a little bit deeper into data engineers and data admins, who are typically setting up data pipelines and ETL jobs as part of the front end of the data pipeline. I think there are two main differentiators that Qubole offers here. The first one is that you have complete control over your costs. As I mentioned before, you can take advantage of things like auto-scaling, and you can even take advantage of aggressive downscaling when you know that your job is not as time-sensitive, which is very typical. If I have some job that needs to complete in eight hours in the middle of the night, maybe I have some flexibility in how I want it to run.

Then the other big thing is that all of this is available via programmatic access. For everything that Qubole offers in its UI, there is a corresponding API action, and we provide SDKs to help you directly integrate with your data workflow.

For data analysts and data scientists, it is all about simplicity, having the toolsets, and having ready access to them. For data analysts, we have a SQL workbench that helps you author your query, gives you autocomplete suggestions, and can even integrate directly with a tool like Tableau for data exploration and visualization. Then for data scientists, we have a notebook interface which is multilanguage, which allows you to do things like develop and train a model and then even push that into production when you’re ready and make it part of a data pipeline workflow.

This is just a visual representation of what auto-scaling means to a practical workload. The way to read this chart is that the x-axis is the time of the day, so going from 7:00 AM to 5:00 PM. The green line shows the number of commands that were issued by a customer in that given hour. The blue line shows what Qubole actually provisioned to meet and match the demand.

What you can see is that it is basically shadowing exactly the demand. Then it caps at 10 nodes during the peak hours because that’s the cost control piece, where an administrator can say, even in the worst case when there’s an extreme burst, “I don’t want to go beyond a certain amount because I want to make sure that I’m operating within my budget.” Again, this advantage means that as an administrator, you don’t have to worry about becoming a bottleneck for your data consumers. Most of the time, when there is not as great a demand, you’re saving money because you’re not overprovisioned on your cluster.
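
The behavior on that chart, where capacity follows demand until it hits the administrator’s ceiling, can be sketched as a simple clamp. This is only an illustration of the idea, not Qubole’s actual scaling algorithm: the rule of sizing by commands per hour, the figure of 10 commands per node, and the hourly demand numbers are all assumptions made for the example. Only the 10-node cap comes from the talk.

```python
import math


def target_nodes(commands_per_hour, commands_per_node, min_nodes, max_nodes):
    """Cluster size that follows demand but stays within admin-set bounds."""
    wanted = math.ceil(commands_per_hour / commands_per_node)
    # Clamp: never below the floor, never above the cost-control ceiling.
    return max(min_nodes, min(max_nodes, wanted))


# Hypothetical commands issued per hour across a working day.
hourly_commands = [5, 20, 45, 80, 120, 150, 90, 40, 10]

plan = [
    target_nodes(c, commands_per_node=10, min_nodes=1, max_nodes=10)
    for c in hourly_commands
]
print(plan)  # the two busiest hours are held at the 10-node cap
```

Raising `max_nodes` trades a higher worst-case bill for less queuing during bursts, which is the budget decision the administrator is making.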

To put this all together, this is showing the data flow and the architecture of the product and how it interacts with the Oracle Bare Metal Cloud. Going from left to right, we have a bunch of interfaces for access. We have a UI, which is a SaaS service offered through the browser. We have SDKs and APIs for custom programmatic access. Then we have ODBC and JDBC drivers for connectivity to BI clients such as Tableau and other software programs. There’s a very lightweight Qubole SaaS layer here which has some control logic.

Really, all of the processing and all of the heavy lifting happens within the Oracle Bare Metal Cloud. We make direct use of the data that you already have stored in the Oracle Bare Metal Cloud object store and we are provisioning and managing in real time, the ephemeral compute instances that you spin up and spin down in the Oracle Bare Metal Cloud compute. All of this is within your own controlled environment and then within your Oracle Bare Metal Cloud account.

Just to close this out and tie it back to the first slide, data is becoming the central asset. Data-driven companies are finding that partnering with Qubole and getting this acceleration and automation, and getting to a story of self-service access for analysts, data scientists, and data engineers, is really, really powerful. You can learn more about Qubole on the Oracle Cloud Marketplace. You can also contact us directly via email. With that, I’m going to throw it back to Maddie.

Maddie: Thanks, Xing. That concludes our session today. Thank you to our speakers and everyone who joined today. If you like what you heard, we encourage you to check out our new ebook on Building the Modern Data Platform or join us in person at the Data Platforms conference on May 24th through 26th. We also have a ton of upcoming events and webinars which you can check out at the link on your screen. Thanks, everyone and I hope you have a great week.