Interning at Qubole: What I Learned From Working on Hive, Spark, and Sqoop
This is a guest post from Akhilesh Anandh, who was an engineering intern with us.
My journey with Qubole began in January 2015, when I joined as an intern for 6 months (my final semester of college) under the PS-2 programme of my alma mater, BITS Pilani. I spent another 2 months at Qubole from July to September. I chose Qubole for the opportunity to work on interesting projects in big data infrastructure, a field that I am highly passionate about, and to gain first-hand insight into the culture of a small but rapidly growing startup.
Working at Qubole has been a wonderful and fun experience. Over the last 8 months, I got to work on some amazing stuff and learned a great deal. I was guided by some awesome mentors who brought me up to speed quickly, patiently helped me with all my little doubts and questions, and pushed me in the right direction. (Shout-out to Rajat Gupta and Pavan Srinivas!)
Some of the stuff I worked on:
- Hive Storage Handler for Kinesis
A storage handler that allows users to create external tables in Hive that reference data in Kinesis streams. Users can read from and write to Kinesis streams using Hive queries. Checkpointing allows queries to be performed incrementally, by reading only the data that came in after the last query. This project has been open sourced, and can be found here.
- First Class Notebooks
Qubole offers Zeppelin-based notebooks for Spark and HBase clusters. First-class notebooks allow the user to view a list of all notebooks across all clusters in one place, transfer a notebook from one cluster to another, clone a notebook, search for a notebook, etc. I took this up initially, and the team later launched it.
- Data Import/Export Improvements
I worked on a few minor features and fixes in Qubole’s data import/export tool, which is based on Sqoop. These include periodic HDFS cleanup, handling a large number of tables, and allowing the use of pre-configured tunnels when creating a connection from the UI.
- Several miscellaneous improvements and features in Qubole’s Spark and Zeppelin offerings: automatically restarting Zeppelin interpreters after an idle timeout, setting the default interpreter for a Zeppelin group, killing Spark applications whose Spark contexts have been stopped, allowing the user to edit the Spark configuration of a cluster from the UI, supporting external packages in a Spark program, and automated Python tests for a few Spark features.
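To make the checkpointing idea from the Kinesis storage handler concrete, here is a minimal Python sketch. It is not the handler's actual code: `Stream` and `CheckpointedReader` are hypothetical names, and the stream is modelled as a plain in-memory list. A real implementation would presumably track a position per Kinesis shard rather than a single list index.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Hypothetical stand-in for a stream: an append-only list of records."""
    records: list = field(default_factory=list)

    def append(self, record):
        self.records.append(record)


@dataclass
class CheckpointedReader:
    """Returns only the records that arrived after the last checkpoint."""
    stream: Stream
    checkpoint: int = 0  # index of the first record not yet read

    def read_new(self):
        new_records = self.stream.records[self.checkpoint:]
        # Advance the checkpoint so the next query skips these records.
        self.checkpoint = len(self.stream.records)
        return new_records


stream = Stream()
reader = CheckpointedReader(stream)

stream.append("event-1")
stream.append("event-2")
first = reader.read_new()   # both records are new to this reader

stream.append("event-3")
second = reader.read_new()  # only the record added since the last read
```

Each call to `read_new` plays the role of one incremental query: it sees only what arrived since the previous call.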
Dabbling with such a diverse set of projects was a tremendous learning experience, which boosted the breadth and depth of my knowledge and skills to a great extent. Here are some of my takeaways from this:
- I gained a deeper understanding of several tools commonly used for big data processing such as Hadoop, Hive, Spark, Zeppelin, etc. Working with them helped me get to know what tools are suitable for what use cases.
- I got a lot of exposure to AWS and learned about its myriad services – S3, EC2, Kinesis, DynamoDB, etc. This made me appreciate the power and versatility of the cloud.
- Working on the web tier made me (somewhat) proficient in Ruby and Rails, and gave me a better idea of web development in general. I learned how to use the model-view-controller pattern, how to write REST APIs, etc.
- I became familiar with quite a few programming languages – I learned and used Scala while working on Spark, used Python for writing automated tests and scripts, and used Java for building the Hive storage handler and for some Zeppelin work.
- My shell scripting skills have improved a great deal.
From my very first day here, I found Qubolers (if you will) to be a bunch of outstandingly talented and passionate people, focused on building cool products and getting things done. Everyone is highly collaborative (the Slack and Skype chat rooms, teeming with activity sometimes even into the wee hours of the night, are testimony to this) and always ready to help out (Joydeep, our co-founder, could be seen helping interns unscrew their laptops to add RAM on their first day). Qubolers enjoy a great deal of flexibility in choosing working hours that suit them, working from home when they want, etc. There is a flat hierarchy, with minimal policies and processes. It is the employees’ dedication and sense of responsibility that make all of this work. The founders have been working exceptionally hard to realize their vision.
At the same time, Qubolers don’t hold back on having fun. Team lunches, birthday celebrations, all-hands meetings over pizza, etc. happen all the time. Festivals like Holi are celebrated in the office in full swing. And to cool off from all the slogging, there’s the occasional off-site. Food expenses are reimbursed, which meant the other interns and I had a fun time trying out the many restaurants in the area for lunch.
Overall, I enjoyed every moment I spent at Qubole, and I am sure the experience and learnings I gained here will be quite useful in my career as a software engineer. I would definitely recommend Qubole to anyone interested in the big data space.
Thanks for the post, Akhilesh! P.S. Check out our career page if you’re interested in exploring employment and internship opportunities with Qubole.