Google Cloud Big Data Analytics

As companies scale their data infrastructure on Google Cloud, they need a self-service data platform with integrated tools that enables easier, more collaborative processing of big data workloads.

Watch this on-demand session with Qubole and Google experts to learn:

  • Why a unified experience with native notebooks, a command workbench, and integrated Apache Airflow is a must for enabling data engineers and data scientists to collaborate using the tools, languages, and engines they are familiar with.
  • The importance of enhanced versions of Apache Spark, Hadoop, Hive, and Airflow, along with dedicated support and specialized engineering teams for each engine, for your big data analytics projects.
  • How workload-aware autoscaling, aggressive downscaling, intelligent Preemptible VM support, and other administration capabilities are critical for proper scalability and reduced TCO.
  • How you can deliver day-1 self-service access to process the data in your GCP data lake or BigQuery data warehouse, with enterprise-grade security.

00:05

Big Data Analytics

Hello and welcome everyone to today’s webcast about enterprise-scale big data analytics on Google Cloud. My name is Jose Videsis from Qubole and I am joined today by Anita Thomas from Qubole and Naveen Punjabi from Google. Welcome guys. 

00:24

Thanks, Jose, nice to be here. 


00:27

Today’s webcast is going to be slightly different in format so that we can make it worth your time. Let me explain how we do that and take care of some housekeeping before we begin. First of all, if you have any questions, please enter them in the browser console and we’ll take care of those at the end of the webcast. Second, about the format: this is not going to be a traditional webcast where we just run through some presentations and slides. Quite the contrary, we want to make this worth your time by focusing on the points that are top of mind for our customers. As such, I am going to ask some questions of our presenters. I prepared questions focusing on the ones our customers typically ask us when they are considering and building their strategies and working through this journey of taking big data analytics and the corresponding machine learning projects to enterprise scale. 


01:29

Enterprise Data Analytics

And that’s what we’re going to focus on today. So why is it going to be worth your time? Again, it’s not just focused on data engineers and machine learning engineers. Those of you who are heads of data, as we call them, in other words, those who have responsibility for the data assets in your company or your organization, are going to benefit from this. Because at the end of this webcast, you’re going to have a list of the primary topics you should be focusing on, not just because we said so, but because those are the things that you ask of us. Second, for those of you in technical roles, this webcast is a checklist, perhaps not comprehensive but quite thorough, of the main technical capabilities you should be focusing on when you’re either building for the first time or taking your implementations of big data analytics and ML on Google to the next level. 


02:31

Data Infrastructure


So that’s what we’re going to do today. Without further ado, let’s get into it. I always ask myself and my team: how real is this? Is it a lot of vendor hype, or are people really asking for this and taking projects to the next level? We’re always checking on that, taking the pulse. And what I can tell you for sure is that with the advent of infrastructure as a service in the last few years, every organization has dramatically increased the speed with which they take these types of projects, not only to the cloud, obviously, but to an enterprise level. And that’s very important. According to Gartner, as you can see on the screen, for the next five years most decisions that have to do with infrastructure are going to be driven by the needs of artificial intelligence, machine learning, and big data analytics. 


03:36

Google Cloud Platform

And they’re going to double every year. That’s real data. That’s not just one or two companies; this comes from a research firm that constantly monitors the market. But most importantly, the organizations that are taking this to the next level and incorporating data into a comprehensive strategy will be ahead of the pack. They will be ahead of their competitors. And while we all know this, it’s always good and important, especially for us, to keep that top of mind and to adapt and adjust to these needs from organizations out there. That’s why back in April, at the Google Next 2019 conference, Qubole and Google announced their expanded partnership, focusing specifically on these needs that we see are slightly different, that are emerging, that are turning from basic, small workgroup-style projects into enterprise-level ones. And at that point, Google also announced its short list of analytics partners, of which Qubole is part. 


05:01

Now this is what brings me to the very first question. Having said that, we are looking at this from a different angle, focusing specifically on what organizations need today. Let me ask you Anita and Naveen, how does this partnership help all these enterprises with their big data strategy? 


05:27

Data Science and Big Data Analytics

Thanks, Jose. As you mentioned, we at Google Cloud continue to see a lot of interest from customers in big data and data science. In particular, 80% of Fortune 500 companies use big data as part of their data strategy. But running these workloads is hard. Setting up, operating, and scaling big data clusters is difficult, time-consuming, and resource intensive. Setting it up for an enterprise big data workload in production is even harder: it takes long lead times and requires difficult capacity planning exercises over a long period. So when we talk to customers, and we talk to them quite often, they tell us four main things about what the pain points are and what they would like from their big data platforms. The first is that they want a faster, more scalable way to get insights from the data they already have, particularly unstructured and structured data together. 


06:37

Big Data Engineering

What they’re essentially saying is they want their platform to be up and running without waiting for hardware or software to be installed and configured. And that’s important. The second thing is they want the resources they have within their organization to get out of owning and monitoring these technologies and get back to implementing net new use cases and innovating. They’d like to design workflows that create clusters very quickly, complete the jobs end to end, and automatically delete them when it’s all done. They also care about having different business needs met with urgency. And the other thing they have highlighted strongly to us is that big data is no longer part of one line of business or one team within their organization; it’s now part of their broader data strategy. So they have developers who are working with big data tools, and they have data engineers implementing workloads to clean the data in certain groups. 


07:47

AI and ML


And they have data science stakeholders who are now building AI and ML models on top of that big data. They are now thinking about how they can have a platform that comprehensively serves all these needs together and enhances the collaboration that is possible within the team. Right now the tooling that exists is quite fragmented and holds them back from moving faster. So when we talk about Qubole on Google Cloud and the partnership, it’s meant to address these needs that customers tell us about. 

08:27

Google Cloud Platform Data Lake


Thanks, Naveen, for that background. Those are definitely the principles on which we have architected the Qubole platform on Google Cloud. Let me cover some of the key benefits this platform provides to a customer. First is a unified experience, in which the tools and interfaces your users need are all bundled as part of the platform. Be it notebooks, dashboards, or a common workbench for your commands, all of this is built right into the platform to provide that unified experience Naveen just mentioned. Second, we have invested in building sophisticated automation into the platform, so that enterprises can operate it at scale in the cloud without being constrained by having to manage the platform daily. Data teams can be freed up from managing and provisioning the platform to getting insights out of your data. 


09:42

Kubernetes infrastructure


When you have sophisticated automation, you can run this platform at scale while still controlling your cost, so we do it in a cost-efficient way. Third is enterprise-grade security: we have made sure from the get-go that we have built granular access controls into the platform, so that while your users can collaborate, the admin can control what resources a user has access to. Whether a user has access to a particular cluster, a command, or a notebook, all of that can be controlled within the platform using fine-grained access controls. And finally, as we see more and more enterprises operating multi-cloud, we offer this platform across clouds and can help users seamlessly migrate their workloads into Google, and we are helping customers do that already. To share at a high level what our architecture looks like: we have Qubole hosted on a modern control plane that runs in Google on a Kubernetes-based infrastructure. 


10:58

Open Source Data Lake

It has all the built-in interfaces and tools, like notebooks, dashboards, and a unified workbench for commands, that your end users need, so they don’t have to go out of band and use other tools outside the platform. All of that is built into the platform to enable the self-service experience your users need. Second, we target this platform at multiple user personas. We have data engineers who use the platform, data scientists who have ML needs, and finally data analysts who want to query and draw insights from the data. All of these users and customers tell us that they need different open-source engines to serve these different use cases, and that’s what we have built into the platform. We have support for multiple open-source engines, because we believe in giving our users a choice, right? 

11:52

Data Processing Engine

From Spark, Hadoop, and Hive to Airflow and Presto, all of these engines are built into the platform to serve the needs of end users. And finally, in keeping with our principle of enterprise-grade security, we have made sure that the data as well as the compute always remains in your Google project. Your Google project, your VPC, is where we bring up the big data cluster, so we use the customer’s Compute Engine instances. For storage, your data can be either in Cloud Storage, which is your data lake, or, like Naveen mentioned, in BigQuery storage, which a lot of customers also use; both are supported in tandem in this product. And finally, on the interface side, while we have all of these interfaces built in, end users can also access the platform via API, UI, SDK, and ODBC/JDBC drivers, so that you can reach your data through BI tools like Looker. 
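
To make this concrete, here is a minimal sketch of what a data engineer might run from the workbench or a notebook against data in the Cloud Storage data lake. The bucket, path, and column names are hypothetical, and the cluster is assumed to already have the Cloud Storage connector configured.

    # Minimal PySpark sketch: querying Parquet data in a GCS data lake.
    # Bucket, path, and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("datalake-query").getOrCreate()

    # Read raw events that have landed in the data lake
    events = spark.read.parquet("gs://example-datalake/events/2019/")

    # Expose them to SQL and run an aggregate query
    events.createOrReplaceTempView("events")
    spark.sql("""
        SELECT event_date, COUNT(*) AS event_count
        FROM events
        GROUP BY event_date
        ORDER BY event_date
    """).show()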


13:01

Data Collaboration Platform

Thank you, guys. That was pretty comprehensive; I appreciate it. And what I like is the fact that both of you have touched on not only enterprise-scale big data but also, in general, what we call the end users, right? Most of the time, we’re focusing on having people access and process data as soon as possible, in the best way possible, so that they can actually do their job. And one of the most common questions we get, either implicitly or explicitly, from every customer is: how can we make it easier for our data engineers, our data scientists, and our ML engineers to collaborate amongst themselves and within their groups? How can we make this possible? So I want to address that at the top of these questions today. Tell me a little bit more about this. How do we improve collaboration? 

14:00

Big Data Platform

Yeah, so, Jose, as I mentioned earlier, we have multiple personas that we target with our big data platform, and for data scientists specifically, we have built all the tools and interfaces they need into the platform so that they have a self-service experience. What we hear from data scientists is that they need notebooks that are not living out of band as a separate service, but are an integral part of the platform. These notebooks can talk to both Cloud Storage and BigQuery, so your data can be in either of these sources. You have the flexibility to use multiple languages like Python, Scala, SQL, and R in these notebooks. Data scientists can then use multiple frameworks to build and train their ML models. And while they’re building and training these models, what is important is collaboration, because they usually work in teams, and being able to share these models and notebooks with other users becomes important. 

15:15

Data Engineering

At the same time, we have built-in security capabilities where you can control what other users can do within these notebooks. Once you have trained and built your models, you can deploy your entire ML pipeline to production using either Airflow, which is a native service within our platform, or even a built-in scheduler in the service. For users who just want to see the end results of your notebooks, you can publish those to dashboards and share them with end consumers, who can simply come in and understand the insights drawn from your notebooks. And finally, for the data engineers, it’s the same design principle of having all the tools they need built into the platform, so that they don’t need to go and build these things out of band in an IDE and bring them into the platform. 
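
For readers who have not used Airflow before, the sketch below shows roughly what such a production pipeline definition looks like. It is a generic Airflow DAG, not Qubole-specific code; the task names and the train/deploy functions are placeholders.

    # Hedged sketch of an Airflow DAG that trains and then deploys an ML model.
    # The functions and task names are placeholders, not Qubole APIs.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def train_model(**kwargs):
        # e.g. submit a Spark ML training job against the data lake
        pass

    def deploy_model(**kwargs):
        # e.g. publish the trained model artifact for downstream serving
        pass

    with DAG(dag_id="ml_pipeline",
             start_date=datetime(2019, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        train = PythonOperator(task_id="train_model", python_callable=train_model)
        deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)
        train >> deploy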


16:12

Data Science Platform Scalability

This increases the scalability of the platform, the end-user experience, and the unified experience we are targeting with this platform. We have a command workbench that is native to the platform, where you can come in and write commands against all the big data engines we support. We surface the Hive tables and also the BigQuery tables, because we built a first-class integration with BigQuery, so you can easily look up these tables and then write your commands. Logs and application UIs, like the Spark UI, are all available right there. Scheduling becomes easy because data engineers can build their DAGs and pipelines in Airflow and execute them easily on Qubole. And finally, all of this is available via UI, API, and SDK. We have data import and export capabilities where you can easily bring in your data from multiple sources, land it in the data lake, and then do your analysis using the engine of your choice. 

17:15

In addition to that, we also have enterprise support for the engines. So Naveen, do you want to shed some light on what you have seen in the enterprise-scale big data space and talk about the support capabilities? 


17:33

Big Data Processing Engine

Yeah, let me chime in here. One of the most important things we have heard from end users across enterprises, be they data scientists or data engineers, has been the need for enterprise support for big data engines. We often see customers face issues in open source that they either need to fix themselves or wait for fixes to be patched into open source. Customers often prefer a model where our partners provide premium support for open source rather than having to figure out these fixes and issues on their own. In fact, at Next a couple of months ago, we announced partnerships with the broader open-source ecosystem, including Redis, Confluent, MongoDB, Elastic, and many others, on the same principle: customers will be able to use those services on Google Cloud with enterprise support coming from our partners. So we’re providing customers both the choice and premium support for open-source technologies. 


18:39

And our partnership with Qubole, and Qubole running on GCP, is in the same context. With Qubole on GCP, customers get enterprise support for the multiple OSS engines that Qubole supports, like Spark, Hadoop, Hive, and Airflow, all of the things that you mentioned, offered by the experts at Qubole. 


19:02

Yeah, this was an important consideration in the partnership. In addition to providing a unified platform that is self-service and has sophisticated automation, Google and our customers definitely told us that having enterprise support for these open-source engines was critical for them. We have engineering teams that are specialized for each of these engines, so it’s not just a platform where we provide best-effort support. These engineering teams add specific capabilities, Qubole-enhanced optimizations like join optimizations, dynamic filtering, and data caching, on top of the open-source engines. Some of these, for example data caching or the ability to diagnose your Spark jobs with Sparklens, we have contributed back to open source. In other cases, there are specific optimizations that we keep within our platform. 


20:16

So, custom to our platform. And the ability for a user not to have to wait is key: if you run into a Spark bug or any other bug in an open-source engine, Qubole can immediately fix that for you and, if necessary, contribute the fix back to open source. That was important to our customer base and to Google’s customers, and we have made sure we provide it in this platform. 


20:43

Good. Thank you, guys. This is really good. And as I said in the beginning, for those of you especially in technical roles, keep in mind our purpose here is to give you a checklist of the things you need to consider and look for in your technology, whatever your platform is, when you embark on this journey on Google Cloud. Right. So another very important question that always comes up, which I refer to as the immediacy of data, is related essentially to patience: how much time are people willing to wait in order to get their data in the right form so that they can take action? And this is critical in an economy, or in an environment, where near-real-time micro-batches of data, or even streaming data, are becoming not just widespread but an important part of what every data department and every data engineer has to consider in their role. 


21:49

So I think it’s worth our while to cover this a little bit. And since we’ve been talking about what end users need, those data engineers, analysts, and ML engineers, let’s now talk about time and the real aspects behind not just having access to the data. Anita and Naveen, can you give us a more realistic perspective on what it takes to get the data into the hands of those who need it? Naveen? 


22:19

Self-Service Analytics

Yeah, I think, like you mentioned in the beginning, you gave a lot of good data points about how enterprises are becoming data-driven. So having a self-service platform that allows them to deploy, operationalize, and get insights quickly from the data is critical to them. I think that’s table stakes now, but it goes beyond how quickly customers can deploy the platform. It requires a few more things: how quickly enterprises can scale the big data platform that they need, how quickly they can get their billing contracts done, how they can set up billing configurations in the most optimal way, and, once they have the platform deployed, how quickly they can give access to everybody within the organization. What we have done in this partnership is build tight integrations between Qubole and the Google Cloud Marketplace, so that customers can easily subscribe to Qubole via the Marketplace, receive a single bill, and use an easy contracting mechanism. 


23:31

And the bill comes from Google. Today in the Marketplace we have a pay-as-you-go, usage-based plan. In fact, we are actively working to create a better experience for enterprises and add more options, such as custom quotes and annual commit plans, in the near future. So this is top of mind for us: not only how Qubole addresses the need for self-service for data engineers and data scientists from an access and deployment standpoint, but also the end-to-end experience they care about. 


24:08

That’s very good. Thank you. I like that because it brings a sense of realism to what it takes to do this at enterprise scale. Any additional comments? 


24:18

Anita: Yeah. So in addition to what Naveen covered in terms of being able to easily procure and subscribe to the service from the 


24:27

Marketplace, 


24:31

Google Cloud Platform Cost Optimization

and having integrated billing where you receive one single bill from Google for both your Qubole usage and your Google usage, we have also made sure that we have really simplified the onboarding experience. Your users can easily authenticate into the platform using their Google account credentials, and the deployment is essentially a one-click deployment on the Qubole side. Once your users have authenticated into the platform and everything is set up, access to data becomes the next hurdle. So we have made sure that we have easy-access connectors to multiple data sources: specifically for Google, that’s Cloud Storage and BigQuery, and then, generically, within our platform we have connectors to multiple databases like MySQL, Postgres, and MongoDB, so that you can bring data from these into the data lake and quickly start processing it and drawing insights from it. 
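
As one hedged illustration of that import path, the sketch below pulls a table out of MySQL with Spark's JDBC reader and lands it as Parquet in the Cloud Storage data lake. Host, credentials, table, and paths are placeholders; Qubole's built-in import connectors can achieve the equivalent through the UI.

    # Illustrative only: copy a MySQL table into the GCS data lake with Spark.
    # Connection details, table names, and paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-to-datalake").getOrCreate()

    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://db-host:3306/sales")
              .option("dbtable", "orders")
              .option("user", "reporting_user")
              .option("password", "example-password")
              .load())

    # Land the table as Parquet so any engine on the platform can read it
    orders.write.mode("overwrite").parquet("gs://example-datalake/raw/orders/")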


25:34

Good, thank you. Now that we’ve covered all these aspects of immediacy, end users, and self-service, let’s talk about the hardest part of it all. Naveen mentioned this earlier as well, and I’d like to touch on it. I think this is a very important question, because for some organizations this can be the elephant in the room. And that is: yes, I can do all of this, but as my environment grows, as the world itself generates so much more data and we want to process it and use it in all our projects and analytics, the cost of administering a platform keeps growing as well. And this is where I would say we’ve focused tremendously as a core foundation of this partnership, and that is the synergy between Google and Qubole. Focusing not just on simplifying, but on making administration easier at the lowest possible cost, is what has brought us together. 


26:43

Nobody else can bring this together the way that Qubole and Google can together. So I definitely want you guys to cover this a little bit more. Let’s expand on it, because as I said, this is top of mind, yet many organizations don’t consider it and don’t start to feel these pains until after they have embarked on these projects. So can we talk about this, Anita? 


27:08

Let me chime in on that a little bit, Jose. I think when we talk about administration, most systems start with easy administration but start facing challenges as they scale. Google is known for solving customer problems and running services at scale, whether it’s Search or YouTube, and we at Google Cloud are focused on bringing those scalable services to end customers along with our partners. So the intent is that as we scale our solutions with partners, the administration component doesn’t become harder. That’s a critical part of how we think about it. The other aspect which I think is important to keep in mind is security and compliance. It’s top of mind for businesses; companies have to get it right, and the cloud can help them a lot. In fact, Gartner talks about how cloud infrastructure will be less prone to security incidents in the next few years than traditional data centers. 


28:13

But not many people see it or believe it yet. And it adds to the overall overhead of administration, because you have to do a lot of things to ensure security and compliance, which makes it harder than it should be. So when we think about what we are doing with Qubole from an enterprise-grade security and administration standpoint, bear in mind that we at Google take security as a key pillar. And when we think about partner integrations and partners running on Google, we want to make sure the partners are using the best of those services so that they can deliver the capabilities to the end customer. 


28:57

Data Lake Security

In keeping with those principles, we have built enterprise-grade security into the platform, starting with fine-grained access controls for users and groups. So while we have an easy, collaborative platform, at the same time the admin has fine-grained control over which resources users have access to and at what level. Be it commands and command history, which users have access to which clusters, or how notebooks are shared, all of these are controlled by ACLs. Second, there are also data access controls: it becomes important to define which users have access to what data, and for that we have controls like Hive authorization and tools like Apache Ranger supported by the platform, so you can control user access to those data sources. 


30:03

In addition to that, we have a tight integration with Google Cloud IAM to make sure we can support really custom and granular IAM permissions that a customer grants to Qubole to allow the service to spin up clusters in their Google project. 


30:23

So all of those capabilities have been built in 


30:27

Data Lake Scalability

to ensure we address the enterprise-grade security needs of our customers. The second aspect is scalability. Like Jose mentioned, this is usually a day-two problem: initially customers onboard a platform, but as they scale it, they start feeling the pain of being unable to scale the platform and unable to control the cost it drives, and that’s where the automation comes in. Having sophisticated automation in the platform helps you run huge big data workloads without incurring high cost and without losing the self-service capability of the platform. So we have built in things like workload-aware autoscaling, where clusters scale themselves: when a user submits a command, the clusters start by themselves, and they can scale based on the SLA of the workload. 


31:32

Cluster Autoscaling

So it’s scaling based on awareness of the workload, and when the workload dies down, the clusters downscale and ultimately terminate. This includes storage autoscaling as well, where the persistent disks within the clusters can also grow, so your jobs never run out of storage space. And finally, we’ve built intelligent support for preemptible VMs into the platform, so that while you scale out, you also have the choice of using either preemptible or 


32:09

regular VM instances within your platform. All a user has to do is specify the percentage of preemptible VMs they want, and we take care of intelligently acquiring them and handling the loss of a VM when it gets preempted, so that your jobs don’t see any failures. That automation, resiliency, and cost efficiency built into the platform ultimately allows some of our customers to run with admin-to-user ratios of 1 to 200, and some really large ones run with one admin for every thousand users. So both enterprise-grade security and automation have been addressed within the platform, so that when you acquire and deploy it, the scalability of the platform and your day-two problems have also been tackled.
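
To give a feel for what an administrator actually tunes, here is a purely illustrative cluster configuration. The field names are assumptions made for this example and do not reflect the actual Qubole API; the point is that autoscaling bounds, preemptible-VM share, and idle termination are declared once and then handled automatically.

    # Hypothetical cluster settings; field names are illustrative, not the Qubole API.
    cluster_config = {
        "label": "spark-prod",
        "min_nodes": 2,                       # start small when a command arrives
        "max_nodes": 50,                      # workload-aware autoscaling upper bound
        "preemptible_percentage": 60,         # share of workers on preemptible VMs
        "disk_autoscaling": True,             # grow persistent disks so jobs don't run out of space
        "idle_cluster_timeout_minutes": 30,   # downscale and terminate when the workload dies down
    }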


33:11

I think that was very important, because as both of you mentioned, this is something that comes up later, but it is very important to plan for ahead of time. If we don’t consider all these elements, we can get caught very quickly with surprises in our implementations. Now, at this point, I want to ask another very important question. Typically, when we’re talking to any of our joint customers or prospects, the question comes up: okay, great, we’ve gone down into the weeds, and you’ve given me a lot of important details, as we’ve also covered in this webcast. Can you tell me a little bit more about the product integration itself? So that’s the next question for you guys. 


34:06

Google Cloud Storage

Naveen, can you start?

Definitely. Before we go deeper into specific integrations and how they work, I want to shed some light on a couple of key GCP services and why their integration with Qubole matters. Anita referred to them earlier when she was talking about architecture. The first one I want to touch upon is Google Cloud Storage. It is our unified object store for developers and enterprises, and it allows you to store, process, and analyze data efficiently and in an agile way. With Google Cloud Storage, you can land your data in the platform in its raw state, whether it’s structured or unstructured data sets, in file formats like Avro, Parquet, or JSON, and store it separately from compute resources, so storage isn’t bound to compute and you can scale each independently. Google Cloud Storage also offers multiple storage classes, so you can configure different levels of access latency and optimize for your use case as needed. 
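
A minimal sketch of that landing step, assuming the google-cloud-storage Python client and a hypothetical bucket, might look like this:

    # Land a raw file in the Cloud Storage data lake; bucket and paths are placeholders.
    from google.cloud import storage

    client = storage.Client()  # uses application default credentials
    bucket = client.bucket("example-datalake")
    blob = bucket.blob("raw/clickstream/2019-07-01.json")
    blob.upload_from_filename("clickstream-2019-07-01.json")
    print("Landed gs://{}/{}".format(bucket.name, blob.name))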


35:29

BigQuery 

The second service I want to talk about is BigQuery. It’s our petabyte-scale data warehouse, and it is really good for well-understood, structured types of data, such as orders, order details, and inventories: things that can be organized in rows, columns, and tables. Typically, when customers deploy their big data architectures on Google, we see both of these services being used together, depending on the use case. One of the most common scenarios is customers initially storing the data in a data lake in Google Cloud Storage, having it pre-processed with Qubole, for example, to make the data ready for analytics, and once it’s ready and in a well-defined schema, storing it in BigQuery for easier access by their BI platforms. Those are the patterns that we see, and hence, when we thought about Qubole integrations, we prioritized these and focused on optimizing them more than others. So I’ll let Anita talk about the specific integrations. 
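
The tail end of that pattern, loading curated Parquet files from the data lake into a BigQuery table for BI access, might look roughly like the following sketch using the BigQuery Python client. Project, dataset, table, and paths are placeholders.

    # Hedged sketch: load curated Parquet from GCS into BigQuery for BI access.
    # Project, dataset, table, and paths are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "example-project.analytics.orders_curated"

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.PARQUET
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

    load_job = client.load_table_from_uri(
        "gs://example-datalake/curated/orders/*.parquet", table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    print("Loaded {} rows into {}".format(client.get_table(table_id).num_rows, table_id))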


36:41

Yeah, in terms of specific integrations, in addition to making sure we have a platform that is scalable and self-service, we paid special attention to being well integrated with the main services in the Google ecosystem. The main ones, like Naveen just mentioned, are BigQuery and Cloud Storage. For both of these we have built Hadoop and Spark connectors, to Google Cloud Storage as well as to BigQuery storage. BigQuery now has direct read APIs, which we integrate with to read data that is in BigQuery storage. We see customers having data in both of these, because Cloud Storage is where their unstructured data lives, and there are a lot of applications whose data comes directly into BigQuery. So we have built integrations to both of these storage sources. 


37:44

Big Data Clusters

Second is Compute Engine. Since we bring up our big data clusters within a customer’s Google project, we have integrated with Compute Engine for both regular and preemptible VM instances, and also with persistent disks, because the capacity of those clusters is defined by the persistent disks, and the ability to scale them along with the Compute Engine resources becomes important. As I mentioned earlier, the Marketplace is also very important, because it is an easy way to procure the service and to have a single bill from Google for your Qubole as well as your Google usage. And finally, IAM, because having granular IAM permissions to define the level of access that Qubole has into your Google project, to spin up these clusters and orchestrate their autoscaling, is something the customer can control. 


38:49

Those are the first-level integrations that we have built, and of course we will be building this out further; we have other services like Bigtable, etc., that we are planning to integrate with down the road. One specific integration I wanted to cover here is the integration with BigQuery. Cloud Storage we obviously support, because that happens to be the data lake where most unstructured data lands, but BigQuery storage is also important: a lot of data comes directly into BigQuery, from sources like DoubleClick for Publishers and Google Analytics. So we have built Spark connectors to both BigQuery storage and Cloud Storage, so that Spark can be a common engine that reads from both of these sources, analyzes your data, and draws insights from it, which is important for our end customers. 


40:06

Google Cloud Platform Machine Learning

So what we’ve done here is work with the BigQuery team to build connectors for both Spark and Hadoop, and we see customers adopting this for two use cases. The first is machine learning, where you want to build ML models using Spark on data sitting in BigQuery storage and probably also in Cloud Storage. You can combine these two data sources, build your ML models using Spark ML, train them in notebooks, and finally surface the results there. The second is the ETL use case, where you have some data sitting in BigQuery and some in Cloud Storage, and you want to combine or prep this data before loading it back into BigQuery, or continue analyzing it in Cloud Storage, depending on your use case. For both of these use cases, we have built Spark and Hadoop connectors for direct reads. 
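
As a hedged sketch of the first use case, the snippet below joins a BigQuery table with Parquet data from Cloud Storage and trains a simple Spark ML model. Table names, paths, and columns are placeholders, and the BigQuery read assumes a Spark-BigQuery connector is available on the cluster.

    # Illustrative only: combine BigQuery and GCS data, then train a Spark ML model.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("bq-gcs-ml").getOrCreate()

    clicks = (spark.read.format("bigquery")
              .option("table", "example-project.marketing.clicks")
              .load())
    profiles = spark.read.parquet("gs://example-datalake/curated/profiles/")

    training = clicks.join(profiles, on="user_id")

    assembler = VectorAssembler(inputCols=["clicks_7d", "sessions_7d", "age"],
                                outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="converted")
    model = Pipeline(stages=[assembler, lr]).fit(training)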


41:02

And the second thing we have done is take the BigQuery tables and surface them within our UI, so that, just like Hive tables, users can easily look up what is in these BigQuery tables as they write their queries, whether for Spark SQL and ETL workloads or for ML and notebook use cases. 


41:31

Great. Thank you, guys. So now let’s bring it back up a level. Before we open it up for questions online, folks, remember: if you have any questions, please submit them; you will also have access to all of these slides at the end of the webcast. Let’s talk about some customers. I would love to wrap this up with some stories and dive a little deeper into what companies have actually done. So how about you start, Naveen? Tell us a little bit about AgileOne. 


42:08

Yes, AgileOne is a great customer of Google Cloud. For folks who are not aware, AgileOne offers a customer data platform, a quickly growing solution area in the market. What it does is give enterprises, especially the marketers within those companies, the power to integrate consumer data across their digital channels, physical channels, and mobile ads; all of that comes together into one platform to deliver analytics. So they can do predictive insights, run campaigns, and do 360-degree profiling of their end customers. And when they were looking at Google, one of their key considerations was how they could integrate really well, not just run on Google Cloud, but also take advantage of the data within the broader Google ecosystem, around Google AdWords, the marketing platform, DoubleClick, and YouTube. They really wanted to optimize their platform from an integration standpoint, leverage those assets, and run on Google effectively. 


43:28

And as we were having discussions with them, one of the key things we learned was how critical Qubole is to their overall data strategy in terms of enabling ML for their organization. So they’re a fantastic customer. They have deployed Qubole on Google Cloud and are running fully on Google Cloud now. The CDP is up and running and is in fact helping a lot of retailers out there enable analytics and serve their customers better. 


44:07

And specifically in terms of the benefits we delivered to AgileOne: first, overall, we helped AgileOne deploy their ML pipelines to production in a scalable way. Before Qubole, they were building these ML models locally on their laptops, and collaboration was not easy. Having a self-service platform where their data scientists could come in, build and train these models, and collaborate with each other was very important to them. Second, after that was done, having a good way to deploy these ML pipelines to production with Airflow was very beneficial and really helped them scale how they build these pipelines and deploy them to production. And in addition to that, the next aspect was the scalability of the platform. 


45:14

Data Science Bottlenecks

Before Qubole, they were always bottlenecked by their ops team. The data scientists often had to wait for ops to provision clusters, which created a lot of dependency on ops and a lot of delays. With the automation built into the platform, the admin or DevOps team got a lot of bandwidth back, and the dependency on that team was greatly reduced, because no provisioning of infrastructure was required: the platform takes care of making sure the clusters are up and running, they scale to the needs of the workload, and finally they shut down when not required. Things like software updates became zero-downtime for them, because we take care of that organically. Within the platform, we have the concept of labels that you attach to clusters, so you can easily move those labels around and hence upgrade one cluster while the other is running, then move the label over to the one that is already upgraded. 


46:20

So it just makes it a lot easier to have zero-downtime updates. On their side, portability across clouds was also very important: they were somewhat locked into the AWS ecosystem, and having a platform like Qubole allowed them to migrate to Google Cloud. And finally, the support from our teams was very critical to them, because whenever they got stuck with issues, there were delays on both sides: the data scientists were delayed, and DevOps would get stuck troubleshooting some of these issues. They had no form of support for open source before we showed up. With Qubole, they now have support from specialized engineering teams, so whenever they hit a bug in open source, we can handle it for them and fix it within their platform. So all of these end-to-end capabilities, the ability for their data scientists to collaborate at scale, the ability to deploy pipelines to production easily with Airflow, and the sophisticated automation so that you’re not blocked by a DevOps team, were critical for them. That’s how we got them into production, and they’re running very successfully now. 


47:35

Well, that’s great. Thank you very much; I appreciate it. That sounds very interesting. Unfortunately, we’re getting close to the end of our slot here, and we always want to leave some time for questions. So what I’m going to do is put up this slide to encourage all of you to try Qubole on GCP, on Google Cloud Platform. I think that’s the best way for you to get started: take a free test drive, start looking at it, and understand how it can enhance or improve your own projects, your journey toward ML and big data analytics. I think it’s very important for us to ensure that all of you have that option available as we move forward. Now, looking at some of the questions, guys, I’m going to first take a shot at answering some of these, and feel free to chime in, Anita and Naveen. First, of course, a reminder to all of you to continue submitting your questions online. 


48:51

Here’s one that talks about services in particular. It says: do you offer services to help companies move to GCP from other technologies or other clouds? Of course, the answer is yes, but let me go deeper into this. I wish I had some of our professional services folks here to respond in more detail, but what we offer are different levels of professional services, not just from the Qubole perspective, but also with our partners. First of all, we have a menu of what we call Activators. Activators are a comprehensive, very detailed menu of on-demand sessions or small projects; they can be bigger as well, but they’re very specific in terms of what you can do. So you can go in there, depending on where you are in your journey, whether the needs are related to administration tasks, security, integration, or specific to one or more open-source engines or frameworks. There’s a wealth of detail on the Qubole website, on the professional services side, where you can look up individual, very specific services and get help depending on where you are. 


50:17

That way you can adopt those services based, as I said, on administration tasks, on end-user tasks, and on engines, so that you can move projects forward. Now, if I remember correctly, we also have two other levels: one called Strategic and the other Tailored. The Strategic one focuses a little more on advisory services and help throughout the whole project, journey, and strategy. And the Tailored one, of course, is very specific to your situation; that’s also where we work with our system integrator partners like Cognizant and others, where we can come in, jointly with Google, and help you orchestrate your projects at your own pace and based on your own needs. I think that’s as far as I want to take this answer; I’m sure I haven’t done complete justice to the professional services side, but please go and take a look at that. 


51:32

Right, thanks, guys. Let me take another question here. This is a question about AgileOne, and the question is: can you explain a little bit more about how you helped them? Let me rephrase it slightly: how did you help them improve their prototyping? That’s the gist of the question, about their machine learning model prototyping. I can take a stab at that as well. In some of the conversations we’ve had, especially after they started their deployments on GCP, one of the very important things the team of data scientists had faced was how quickly they could have the data ready for training new models, or for using it in models that had been customized to the needs of, say, a net new customer. AgileOne has a library, if you will, of ML models that they use with their customers, right? 


52:46

And not all of them apply to every situation; it depends on what their particular client’s needs are. At the same time, they need to modify those models, obviously. They may need to tweak some things in specific algorithms, and when you do that, of course, there’s a lot of prototyping and testing that has to come with it. But what’s the most important thing after you’ve tweaked your algorithms? You have to check your logic. You have to train the model and then put it to work on sample sets of real data, so that you can do not only UAT, user acceptance testing, from the perspective of their clients, but most importantly validate all the logic and the process of the model itself. That took a long time. Why? 


53:42

Because, first, preparing the data with some of the other tools they had took a long time, or in some cases they had older versions of Spark that lacked newer capabilities. That was one issue; another was getting extra capacity, extra compute resources, to process all of this data quickly enough to serve their clients’ needs. So you can see how the problem compounds. Yes, what a data scientist could do very quickly was modify and alter the algorithms themselves, but how to test them was a big deal, so that level of prototyping was essential. With Qubole on GCP, that is now very efficient, because they can process the data, or subsets of their data, or new sets of combined data, to train and deploy these models. 


54:44

What this has resulted in is AgileOne having the ability to serve clients much faster than before and acquire clients much faster than before. So I thought that was a pretty interesting question, because it rounds off the whole case study and the impact this has had on their business. I don’t see any other questions on the console, so let’s give it a couple more minutes and see if anybody else has additional questions. In the meantime, Naveen or Anita, I would love for you to add any closing comments while we take a look at the console for additional questions. Is there anything else from your side, guys, before we wrap up? 


55:36

Thanks, Jose, for covering this. I think this framework provides a good reference for everybody listening in terms of how they should be thinking about their big data strategy, what key considerations they should keep in mind, and, in general, how Qubole on GCP can help them through their journey. 


56:00

Thank you, Naveen. Much appreciated. And again, thank you for your presence today; it’s great to work with you regularly. With that, I don’t see any other questions online, so I think we are ready to wrap up. Once again, Anita, thank you very much for joining us today, and Naveen, thank you for joining. I hope all of you got what you needed out of this webcast. Remember, you’ll have access to these slides, so you can take with you the list of top priorities that enterprises are looking at, as well as, for the technical folks, the checklist of key product and technology capabilities for your project. Thank you again, guys. Until next time. 

56:46

Thank you for having us, Jose. Thanks.