What is Sqoop in Hadoop | Simple Commands for beginners




What is Sqoop in the Hadoop ecosystem?

Sqoop is one of the components built on top of the Hadoop Distributed File System (HDFS) and is meant for communication with an RDBMS.

i.e., either importing RDBMS tabular data into Hadoop or exporting processed data from Hadoop to an RDBMS table.

  • Sqoop is only meant for data ingestion; it is not used for processing data with business logic, which is done with MapReduce, Pig, Hive, etc.
  • The Sqoop component is not bundled in the default Hadoop installation, so it must be installed separately on top of the Hadoop boxes.
  • For both import and export, Sqoop internally uses only the MapReduce Mapper phase.

Some of the key observations with respect to Sqoop:

1. Both import and export happen only through Hadoop HDFS; Sqoop does not communicate with the LFS (Local File System).

2. If you are communicating from Hadoop to any relational database using Sqoop, the target RDBMS must be a Java-compatible (i.e. JDBC-supported) database.

3. If you are communicating with an RDBMS from Hadoop using Sqoop, the RDBMS-specific connector JAR file must be placed in the directory below:

$SQOOP_HOME/lib
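
For example, for a MySQL source the Connector/J JAR can simply be copied into that directory (the JAR file name and version below are only placeholders for whatever connector you downloaded):

# copy the RDBMS-specific connector JAR into Sqoop's lib directory
cp mysql-connector-java-5.1.49.jar $SQOOP_HOME/lib/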

By default, Sqoop uses only the Mapper phase to import data into Hadoop (HDFS); there is no Reducer phase involved.

A simple Sqoop command to import an entire table from source to destination is shown below:

sqoop import \
--connect jdbc:mysql://localhost:3306/testdata \
--username sqoopuser \
--password sqooppaswd \
--table tablename
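
If we also want to control where the imported files land in HDFS, the --target-dir option can be added. A minimal sketch, where the database name, credentials, and HDFS path are placeholders:

sqoop import \
--connect jdbc:mysql://localhost:3306/testdata \
--username sqoopuser \
--password sqooppaswd \
--table tablename \
--target-dir /user/sqoopuser/tablename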

Sqoop by default uses 4 mappers; however, we can change the number of mappers using the command below:

sqoop import \
--connect jdbc:mysql://localhost:3306/testdata \
--username sqoopuser \
--password sqooppaswd \
--table tablename \
--num-mappers 10
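
If the table has no primary key, Sqoop cannot decide on its own how to divide the rows across multiple mappers; in that case a --split-by column must be supplied. A sketch, where the column name is a placeholder:

sqoop import \
--connect jdbc:mysql://localhost:3306/testdata \
--username sqoopuser \
--password sqooppaswd \
--table tablename \
--split-by id \
--num-mappers 10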

Note: When we are using the import-all-tables option (a sketch of the command follows below):

1. Either we have to use the default HDFS path, or
2. The Hive warehouse path (use the Hive warehouse path as the location the data is imported into).
  • Sqoop uses file formats such as Parquet, ORC, CSV, etc. for tabular data.
  • We can also use Sqoop's integration with Hive and HBase.
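
A sketch of the import-all-tables command, assuming the Hive warehouse sits at the usual /user/hive/warehouse path (the connection details and path are placeholders):

sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/testdata \
--username sqoopuser \
--password sqooppaswd \
--warehouse-dir /user/hive/warehouse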




Error while running Sqoop jobs in Hadoop Cluster





Apache Sqoop: Sqoop is a tool for importing/exporting large volumes of data from HDFS to an RDBMS and from an RDBMS to HDFS, i.e. in both directions.
Talend: Talend is an open-source ETL tool that provides data integration, Big Data, and cloud-storage capabilities. Nowadays it is one of the most popular tools for data integration and Big Data.
While running Talend jobs in the Hadoop cluster environment, you may get a Sqoop connection error like the one below.

Caused by: java.sql.SQLRecoverableException: IO Error: Connection reset
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:498)
    at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:553)
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:528)
    at java.sql.DriverManager.getConnection(DriverManager.java:571)
    at org.apache.sqoop.mapreduce.db.DBConfiguration.getConnection(DBConfiguration.java:302)
    ... 46 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:196)

The above issue is a connection (network) issue between the integration tool and the database management system.

Resolution 1:

Step 1: Make sure the correct Java Database Connectivity (JDBC) driver is being used.
Step 2: Then configure DNS resolution on both systems.
Step 3: Restart the database and the tools in the Hadoop cluster (a quick connectivity check is sketched below).
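
Before re-running the job, the JDBC connectivity can also be verified from the Hadoop side with a quick Sqoop call (the Oracle host, service name, and credentials below are placeholders):

sqoop list-tables \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username dbuser \
--password dbpassword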

Resolution 2:

Step 1: Open the file $JAVA_HOME/jre/lib/security/java.security.
Step 2: Set securerandom.source=file:/dev/urandom (see the sketch below).
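
A minimal sketch of what the relevant line in java.security should look like after the change (the exact file location depends on the JDK installation; in many JDKs the default value points at /dev/random, which can block while waiting for entropy):

# $JAVA_HOME/jre/lib/security/java.security
securerandom.source=file:/dev/urandom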

After completing all the steps, restart Apache Sqoop.