As Qubole Data Service has gained adoption – many of our customers asked for import and export facility from their relational data sources into the Cloud (S3). Dimension data from such data sources are an important part of data analysis. Log files (aka. Fact tables) in S3 are often desired to be joined with such dimension data as part of data analysis and/or pipelines. In many cases – end results of data analysis – whether of an ad-hoc query or a data pipeline – are often required to be exported back into relational databases. Stored in this form – the results are often used to drive reporting tools or power online applications. Fortunately, we found an excellent tool in open source – Apache Sqoop – for moving data to and from relational databases. This post describes our experiences using Sqoop and how we made it easy to integrate a lot of different data sources in the Cloud.
Apache Sqoop can import and export data from relational databases over JDBC to HDFS. Sqoop allows importing full table, selected columns – and even allows the flexibility of specifying free-form queries to extract data and write into HDFS. Moreover it can do this using multiple parallel connections where required. In doing so it takes advantage of the inherent parallelism of Hadoop. Hadoop is already one of the core offerings of QDS – and our self-service and auto-scaling Hadoop clusters make adopting new Hadoop based applications a breeze. However – we had to solve a number of limitations before we could offer Sqoop as a service to our users:
- S3 support: Sqoop can be used to dump data to and from HDFS – but the target for data import and export in QDS is frequently S3. Hadoop clusters and HDFS instances bought up by QDS on behalf of customers are ephemeral.
- Exporting Hive Tables and Partitions: Although Sqoop supports importing to a Hive table/partition, it does not allow exporting from a table or a partition.
- Upsert Support: Upserts are a common mode for exporting data to a database – where existing rows are updated and new rows are inserted. Unfortunately, Sqoop does not support upserts for many databases, including Mysql.
Loading data to S3
To solve the first problem, we first use Sqoop to copy data into HDFS – and then generate Hive queries to copy/load that data into tables in S3. Both of these activities, today, occur on a shared Hadoop cluster run by Qubole. Fortunately, auto-scaling means we have little administration overhead of running a continuously variable customer workload. In addition the various optimizations we have made to fix Hive performance on S3 have been critical in making this arrangement work.
We have added upsert support for Mysql over JDBC. Sqoop exports data using SQL queries. We alter the generation of these queries to achieve the desired effect. Note that the notion of upsert in MySql is slightly different from the generic notion of upserts in databases.
Exporting Hive Tables/Partitions
While Hive Tables/Partitions can always be dumped to HDFS and exported by sqoop from there – an obvious optimization is to read data directly from the files backing Hive Tables in cases where it’s feasible to do so. We have made these enhancements where possible.
Integration into QDS
One of the best things about building a service is that we can integrate useful primitives like Sqoop throughout our service:
- Users can define database end points using the DbTap abstraction (see Documentation). These database endpoints can be used for various activities throughout our system – including for exporting and importing data via Sqoop.
- One-time Export and Import commands available via our REST API are implemented using Sqoop.
- Import and Export commands can be embedded inside workflows and invoked in our batch processing infrastructure called the Scheduler.
Further integration points seem attractive. Users find it useful to run a query and then to later export it’s results to a relational database. An export option can be displayed alongside historical queries visible via Qpal.
A persistent focus at Qubole has been in making Big Data components accessible and relevant to analysts throughout an organization. Interfaces that make it easy to author and test uses of powerful constructs like Sqoop are essential. With respect to Sqoop – these priorities resulted in many back-end and front-end improvements:
It is easy to erroneously configure a data import/export process. Simple errors (for example – mismatched schemas, or bad table/column names) can take a significant while to discover. To that end – we have added a test-mode to our Sqoop executions where we short-circuit expensive steps (like running Hive queries or Map-Reduce jobs) and just go through the rest of the steps. This has allowed us to quickly validate commands and give feedback on a lot of errors to users early.
Often a user may try to move too much data by mistake (an incorrect where clause on the source data for example!). In order to catch these – we impose reasonable limits on total data transfer by a single import/export command (that can always be overridden in genuine use cases).
We allow users to kill a running import/export command which might have been mistakenly launched. This is a generic functionality that is supported for all QDS commands. For Sqoop – we borrowed code from Hive to keep track of any map-reduce jobs spawned by the Sqoop command and requesting the Job-Tracker to kill them on receiving any signal.
Sqoop’s parallelism is a boon when dealing with large data volumes – but can also make it harder to debug any execution failures. Users must go through the failed tasks belonging to any launched Hadoop Jobs and pull up their logs to find out the reason for failure. Apache Hive short-circuits this process by showing logs from the tasks with most failures automatically. We borrowed this idea/code from Hive and added it to Sqoop to help make it easier to debug failures in Sqoop runs.
Making sophisticated tools like Sqoop as widely accessible as possible is always a challenge. A case study in point is Sqoop import configuration. After looking at the configuration options carefully – we decided to categorize the configurations into one of two types:
- Simple Mode: where the user specified columns and where conditions for data extractions
- Advanced Mode: where the user can specify a free form query to extract data
Sophisticated configuration around parallel extracts were only made available in the Advanced mode – and that too only if the user actually wanted a parallel extract. Some of these flows are shown below. We hope that this makes Sqoop imports extremely accessible for the vanilla use case – while at the same time allowing the power users full flexibility and helping them avoid errors.
Conclusion and Roadmap
Apache Sqoop has been an excellent starting point for our data integration efforts. For the most part – it just works and the quality of the codebase is excellent (our sincere thanks to its developer community). The functionality described in this post is now available generally – users can signup for QDS and create a free account. We are actively working on additional integration points of Sqoop in our browser application. Going forward we will be integrating support for importing/exporting data from other types of databases – like MongoDB and CouchDB – into QDS. We are also working on integrating Sqoop’s incremental extraction features into our product. We are also investigating closer integration between Sqoop and Hive – to let users have direct access to external data sources using Hive. Democratizing data, it turns out, creates interesting engineering problems. All the way from creating intuitive User Interfaces, to coming up with future-proof abstractions and apis, writing operational software to run complex distributed systems without downtime and finally to having a rocking back-end data processing stack. If these sorts of problems interest you, ping us at [email protected]. We’re hiring!