Introducing Qubole Support
Qubole processes over 250 petabytes of data a month, and the diversity of the data we process, the cloud platforms we run on, and the technologies we provide is often staggering. Here are a few points of interest regarding the Qubole platform:
- We have customers in four public clouds – AWS, Azure, Google and Oracle.
- The largest of these environments can support thousands of customer accounts, process 2000+ concurrent commands, and run hundreds of Notebooks at any given time.
- These workloads are a mix of Hive, Spark and Presto jobs – each with multiple versions – as well as workflows from Airflow.
- Serving our customer workloads involves a software stack with dozens of system services.
A powerful platform that provides the choice and flexibility of Qubole needs a powerful support team, filled with superheroes, to keep customer workloads running at peak performance and to get them back on track quickly when necessary.
Our global Support team of 15 amazing people is able to handle the existing backlog of issues and take on up to 50 new complex tickets, sometimes more, every day, in addition to fielding questions through numerous Slack conversations. The quality of the expert support we provide is often a major factor in why customers choose Qubole. In the words of a customer from a recent case study:
Qubole’s dedicated, expert support made a huge difference in stabilizing Zillow’s analytics infrastructure based on Presto clusters. Improved stability plus better autoscaling also greatly improved the performance of those clusters.
“The reason that we’re saving this much in terms of productivity is because Qubole helped us identify key configurations that needed to be adjusted – stuff you can’t just look up in the Presto documentation,” says Rhodes. “It took somebody with quite a bit of expertise digging into it with us to find what the issue was.”
Crazy Days of Summer
To better understand the types of issues and challenges our support team handles, let’s follow one of our support superheroes, Pratham Vasa, over the course of a few days. Pratham has been a member of the US-based Qubole Support team for just over 3 years.
(Staff Big Data Support Engineer, Qubole HQ, Santa Clara)
The second half of June promised to be like any other summer, except that things were a little hot at one of our large healthcare customers that we will call H_Co. Pratham received a request from the customer via our web submission system that the customer had tagged as critical. According to the ticket, the user’s Notebook UI had a severe lag which slowed down their team’s data science project delivery. The customer described their issue as:
When we try to write a few SQL’s on the Qubole UI, immediately the words are not appearing, it will reflect after a few seconds. Frequently, we are getting the following alert “Page is unresponsive.” We can talk quickly, if needs be. My number is 000-000-0000.
Pratham recognized the urgency and immediately called the customer to discuss the problem and record the details. He collected all the logs, reproduced the issue, validated that it was a bug, and escalated it to the Qubole development team responsible for Notebook functionality. He then worked collaboratively with the Notebook development team and the customer, conveying all the required information and analysis, to keep the ball rolling.
While the Notebook team was working to identify and create a fix, there came another critical ticket from another healthcare customer, I_Co:
SHOW STOPPER:: Caused by: java.lang.ClassCastExceptio…….
“Show Stopper” – the ticket description read. The user’s command had failed due to an exception. In such scenarios, asking the right questions can help isolate the issue and accelerate the resolution process. Pratham instantly looked into the user’s command and asked the user for the exact time of failure, any known change in the underlying data, examples of previously successful commands, and permission to re-run the command if necessary.
While waiting for I_Co to respond, Pratham continued to work with the Notebook team and scheduled deployment of the new hot-fix that the team had provided for H_Co. He informed H_Co about the pending deployment, assuring them that the fix would only be applied to the test accounts they had discussed, and provided a deployment time that evening.
By now, I_Co had responded to Pratham’s questions. With the new information, Pratham looked into the failed command, re-ran the command, isolated the table that was causing the issue, investigated the Application Master and the Resource Manager logs associated with the command execution, re-ran some test commands, and identified a mismatch between the user’s input data and the target schema. Pratham provided the user with an email outlining the exact column where the mismatch was occurring. I_Co’s critical issue was resolved!
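To make the idea of a data-versus-schema mismatch concrete, here is a minimal Python sketch of that kind of check. It is purely illustrative and is not the tooling Pratham used: the column names, types, sample rows, and the `find_mismatches` helper are all made up for this example.

```python
from datetime import date

# Hypothetical target schema: column name -> expected Python type.
TARGET_SCHEMA = {"patient_id": int, "visit_date": date, "charge": float}


def find_mismatches(rows, schema):
    """Return (row_index, column, expected_type_name, actual_value) for each bad cell."""
    mismatches = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            value = row.get(column)
            # A missing or None value is treated as NULL and allowed.
            if value is None:
                continue
            if not isinstance(value, expected):
                mismatches.append((i, column, expected.__name__, value))
    return mismatches


# Made-up input rows; the second one carries a string where an int is expected.
rows = [
    {"patient_id": 1, "visit_date": date(2019, 6, 17), "charge": 120.5},
    {"patient_id": "2", "visit_date": date(2019, 6, 18), "charge": 80.0},
]

for row_index, column, expected, value in find_mismatches(rows, TARGET_SCHEMA):
    print(f"row {row_index}: column '{column}' expected {expected}, got {value!r}")
```

A report like this points straight at the offending column, which is the same kind of detail Pratham sent to I_Co.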
In debugging I_Co’s problem, Pratham had used one of the biggest advantages of the Qubole platform – that it keeps a record of all commands run in the system. This level of detail is invaluable for debugging repeated workloads and for expediting support’s ability to respond to and resolve customer issues quickly.
Unfortunately, the fix for H_Co had not worked. The customer had responded saying they continued to see the problem after the hot-fix had been applied. This was curious as Pratham had already tested the hot-fix on his test cluster and it worked fine, but somehow, it did not take effect on the user’s cluster. Why?
Once again, Pratham offered to get on a call and, after further investigation, identified that this was not a Qubole-specific error but a missing IAM role configuration on the user’s end.
The hot-fix was released as a jar file in a Qubole S3 bucket. As depicted in the diagram above, the user’s cluster was trying to access the Qubole S3 bucket, but because it did not have the right permissions to access the packages, the Notebook team’s hot-fix did not take effect on the user’s cluster.
Pratham figured this out by running diagnostic commands on the cluster and finding the S3 access errors that identified the culprit. He asked the customer to reconfigure their IAM roles to allow access to that S3 bucket. Finally, after much diligence from the Notebook team together with Pratham’s efforts, this issue was resolved! Once the hot-fix was verified by H_Co, Pratham scheduled the deployment of this fix across various accounts and clusters, like a true support hero!
These few days were business as usual for Pratham. He exhibited excellent problem-solving skills through active listening, empathy towards his customers, patience, resourcefulness, and the ability to remain calm under pressure. All of this got his customers back up and running in short order, with constant progress updates along the way. Experiences like these, delivered by all of Qubole’s Support Superheroes, are why customers give us a CSAT rating of nearly 100% on Support tickets.
Never a dull moment
A peek from a dance video, choreographed by him
Like our other support team members, Pratham is full of life both at work and after hours. Outside of work, Pratham is passionate about dancing, hiking and drinking Boba tea. He started dancing at just 8 years old and has nearly 5 years of experience teaching Bollywood, HipHop and BollyHop dance forms.
He has performed at various award shows in India, including the Screen Awards and the Filmfare Awards, with famous Bollywood celebrities such as Shah Rukh Khan and Hrithik Roshan, to name a few. He was also a part of two Bollywood movies, Bhootnath and God Tussi Great Ho. So next time he works on your tickets, strike up a discussion on dancing. 😉