Apache Sqoop 1.4.7 – 9 reasons why you need it

January 10, 2020 Saiyam Agarwal

The sixth release of Apache Sqoop, version 1.4.7, is out! It is one of the most significant updates to the Sqoop platform. Here are 9 reasons why you need Apache Sqoop 1.4.7, including the enhanced Sqoop on the Qubole Data Platform, which adds features that help you run Extract-Transform-Load (ETL) pipelines more efficiently and connect securely to data warehouses such as Google BigQuery, Amazon Redshift, or Snowflake.

What is Apache Sqoop?

Apache Sqoop is designed to efficiently transfer enormous volumes of data between Apache Hadoop and structured datastores such as relational databases. It helps to offload certain tasks, such as ETL processing, from an enterprise data warehouse to Hadoop, for efficient execution at a much lower cost. Sqoop also makes it easy to extract data from Hadoop and export it to external structured datastores.
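To make the two directions concrete, here is a minimal sketch that builds the command lines for a Sqoop import (database table to HDFS) and a Sqoop export (HDFS directory to database table). The helper names, JDBC URL, table name, paths, and user are hypothetical placeholders; only the `sqoop import` and `sqoop export` options themselves come from Sqoop. The script prints the commands rather than running them.

```python
import shlex


def sqoop_import_argv(jdbc_url, table, target_dir, username, password_file, num_mappers=4):
    """Build the argument list for `sqoop import` (database table -> HDFS directory)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # file holding the password, kept off the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


def sqoop_export_argv(jdbc_url, table, export_dir, username, password_file):
    """Build the argument list for `sqoop export` (HDFS directory -> database table)."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--export-dir", export_dir,
    ]


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration.
    url = "jdbc:mysql://db.example.com/sales"
    print(shlex.join(sqoop_import_argv(url, "orders", "/data/orders", "etl_user", "/user/etl/.pw")))
    print(shlex.join(sqoop_export_argv(url, "orders_summary", "/data/orders_summary", "etl_user", "/user/etl/.pw")))
```

Building the command as a list keeps each option and its value separate, which avoids shell-quoting surprises if the list is later passed to `subprocess.run`.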

How does an Apache Sqoop import work?

[Figure: Sqoop import process]

Apache Sqoop 1.4.7 on Qubole – 3 enhancements

Apache Sqoop 1.4.7 on Qubole supports Snowflake

1. Supports secure data transfers with Snowflake: Apache Sqoop 1.4.7 on Qubole supports Snowflake as a data store, so you can import data from Snowflake into Hive or export data from Hive to Snowflake. This functionality has also been extended to support AWS PrivateLink. PrivateLink enables the creation of a highly secure network between Snowflake and other AWS VPCs (virtual private clouds) in the same AWS region, fully protected from unauthorized external access.

Apache Sqoop 1.4.7 on Qubole uses HiveServer2 Clusters

2. Increases efficiency using Sqoop with a HiveServer2 cluster: Once data reaches HDFS, Apache Sqoop 1.4.7 on Qubole uses a scalable HiveServer2 cluster, which runs on the parent Hadoop 2 cluster, to push the data from HDFS to Hive, so Sqoop jobs run more efficiently. The advantages of using a HiveServer2 cluster are:

1. Horizontal Scalability – the HiveServer2 cluster scales horizontally by adding worker nodes.
2. High Availability – no single point of failure for a query. Queries can run on any worker node.
3. Limited Isolation – queries can run on any worker node, and if a HiveServer2 JVM crashes, the impact is restricted to the queries on that worker node.
4. Ease of Configuration – any number of worker nodes can be configured easily at runtime.

For more information about HiveServer2 clusters, click here.

Apache Sqoop 1.4.7 on Qubole preserves the data location of a Hive table

3. Increases speed and efficiency by preserving the data location of a Hive table: Previously, when you imported data into a non-partitioned Hive table with the Qubole Data Import command, the table was deleted and created again. For an external table, this changed the table's location, so each subsequent import had to be preceded by re-creating the table. Qubole Sqoop now preserves the location of external tables and overwrites the data in the same location.

Apache Sqoop 1.4.7 open-source updates

Apache Sqoop 1.4.7 allows definition of the decimal(n,p) data type

4. You can define a decimal(n,p) data type in the --map-column-hive option while importing data from an external data store to Hive. For example, a Decimal(1,1) Hive column can now be defined.
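As a rough sketch of what such a mapping might look like, the helper below builds a value for the `--map-column-hive` option from a dictionary of column names and Hive types. One assumption to verify against your Sqoop version: to our understanding, the Sqoop user guide asks for the comma inside a type such as `DECIMAL(10,2)` to be URL-encoded as `%2C`, because a plain comma separates the individual column mappings. The helper name and the column names are hypothetical.

```python
from urllib.parse import quote


def hive_column_mapping(mappings):
    """Build a --map-column-hive value from a {column: hive_type} dict.

    Assumption: commas inside a type must be URL-encoded (%2C), because a
    plain comma separates one column mapping from the next.
    """
    parts = []
    for column, hive_type in mappings.items():
        # Keep parentheses readable; encode the comma (and any spaces).
        encoded_type = quote(hive_type, safe="()")
        parts.append(f"{column}={encoded_type}")
    return ",".join(parts)


# Hypothetical columns, for illustration only.
mapping = hive_column_mapping({"price": "DECIMAL(10,2)", "id": "BIGINT"})
print(mapping)  # price=DECIMAL(10%2C2),id=BIGINT
```

The resulting string would then be passed as the value of `--map-column-hive` on a `sqoop import --hive-import` command line.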

Apache Sqoop 1.4.7 allows direct import of data into AWS S3 using Kite

5. You can import data directly into AWS S3 using Kite SDK 1.1.0. The latest version of Kite supports importing an HDFS file directly into AWS S3.

Apache Sqoop 1.4.7 now handles unsigned BIGINT data types during data imports

6. For MySQL, Sqoop data imports now handle unsigned BIGINT columns. This helps where column values can exceed the signed BIGINT range and reach up to the unsigned BIGINT maximum.
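The arithmetic behind this is worth spelling out. The sketch below is not Sqoop's implementation; it only illustrates why an unsigned 64-bit value can be a problem for a tool that stores it in a signed 64-bit integer (such as a Java `long`): the top half of the unsigned range does not fit and would wrap around to a negative number. The function names are made up for illustration.

```python
SIGNED_BIGINT_MAX = 2**63 - 1    # largest value of a signed 64-bit integer
UNSIGNED_BIGINT_MAX = 2**64 - 1  # largest value of an unsigned 64-bit integer


def fits_signed_64(value):
    """Return True if the value can be stored in a signed 64-bit integer."""
    return -2**63 <= value <= SIGNED_BIGINT_MAX


def wrap_to_signed_64(value):
    """Reinterpret the low 64 bits of value as a two's-complement signed integer."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - 2**64 if value >= 2**63 else value


print(SIGNED_BIGINT_MAX)                       # 9223372036854775807
print(UNSIGNED_BIGINT_MAX)                     # 18446744073709551615
print(fits_signed_64(UNSIGNED_BIGINT_MAX))     # False
print(wrap_to_signed_64(UNSIGNED_BIGINT_MAX))  # -1
```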

Apache Sqoop 1.4.7 supports data imports from tables with column names containing special characters

7. Sqoop supports data imports from a table with column names containing special characters. With this, data pipelines won’t fail if the column name contains special characters.
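To see why special characters can break a pipeline, consider how a SELECT statement is assembled from column names. The sketch below is an illustration only, not Sqoop's code: it quotes MySQL identifiers with backticks (doubling any embedded backtick), which keeps names such as `order-id` or `total $` from being parsed as expressions. The table and column names are hypothetical.

```python
def quote_mysql_identifier(name):
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_select(table, columns):
    """Build a SELECT statement whose identifiers are all quoted."""
    column_list = ", ".join(quote_mysql_identifier(c) for c in columns)
    return f"SELECT {column_list} FROM {quote_mysql_identifier(table)}"


# Unquoted, `order-id` would be read as "order minus id".
print(build_select("orders", ["order-id", "total $", "note"]))
# SELECT `order-id`, `total $`, `note` FROM `orders`
```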

Apache Sqoop 1.4.7 is more secure with support for JDK 6 discontinued

8. Security enhancements have been made, such as dropping support for JDK 6. This makes Apache Sqoop more secure, robust, and up to date.

Apache Sqoop 1.4.7 has improved error messaging

9. Error messages have been improved to make troubleshooting and debugging easier.

Reap the benefits of a much-improved Apache Sqoop 1.4.7 on the Qubole Data Platform

For more information on the Qubole Data Platform, click here

The post Apache Sqoop 1.4.7 – 9 reasons why you need it appeared first on Qubole.
