What is Sqoop in the Hadoop ecosystem?
Sqoop is one of the components built on top of the Hadoop Distributed File System (HDFS) and is meant for communication with an RDBMS,
i.e. either importing RDBMS tabular data into Hadoop or exporting processed data from Hadoop to an RDBMS table.
- Sqoop is meant only for data ingestion; it is not used for processing data with business logic, which is done by frameworks such as MapReduce, Pig, and Hive.
- As Sqoop is not bundled with the default Hadoop installation, it must be installed separately on top of the Hadoop boxes (see the sketch after this list).
- For both import and export, Sqoop internally uses only the Mapper phase of MapReduce.
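A minimal installation sketch, assuming a Sqoop 1.4.x binary tarball and an already configured Hadoop environment; the archive name and install paths are illustrative, not prescriptive:

# Download and unpack a Sqoop release (archive name is illustrative)
wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
tar -xzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt
# Point the environment at the new install (paths are assumptions)
export SQOOP_HOME=/opt/sqoop-1.4.7.bin__hadoop-2.6.0
export PATH=$PATH:$SQOOP_HOME/bin
# Verify the install
sqoop version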
Some of the key observations with respect to Sqoop:
1. Both import and export happen only through Hadoop HDFS; Sqoop does not communicate with the LFS (Local File System).
2. If you are communicating from Hadoop to any relational database using Sqoop, the target RDBMS must be a Java (JDBC) compatible database.
3. If you are communicating with an RDBMS from Hadoop using Sqoop, the RDBMS-specific connector jar file must be placed in the directory shown in the sketch below:
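A minimal sketch, assuming MySQL as the RDBMS; the connector jar name is illustrative, and $SQOOP_HOME/lib is the conventional location for driver jars but may differ per installation:

# Copy the RDBMS-specific JDBC driver into Sqoop's lib directory
cp mysql-connector-java-8.0.28.jar $SQOOP_HOME/lib/
# Verify connectivity to the database
sqoop list-databases \
  --connect jdbc:mysql://localhost:3306/ \
  --username sqoopuser \
  --password sqooppaswd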
By default, Sqoop uses only the Mapper phase to import data into Hadoop (HDFS); there is no Reducer phase involved.
To import an entire table from the source RDBMS to the destination in HDFS, use the simple import command below:
sqoop import \
  --connect jdbc:mysql://localhost:3306/testdata \
  --username sqoopuser \
  --password sqooppaswd \
  --table tablename
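By default the imported data lands in HDFS under the running user's home directory, in one output file per mapper. A quick way to check, assuming the user and table names from the command above (paths are illustrative):

# Imported files appear under /user/<username>/<tablename> by default
hdfs dfs -ls /user/sqoopuser/tablename
# Typically shows part-m-00000 ... part-m-00003 for the default 4 mappers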
Sqoop by default uses 4 mappers; however, we can change the number of mappers using the command below:
sqoop import \
  --connect jdbc:mysql://localhost:3306/testdata \
  --username sqoopuser \
  --password sqooppaswd \
  --table tablename \
  --num-mappers 10
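When more than one mapper is used, Sqoop needs a column on which to split the work; it defaults to the table's primary key, and --split-by can name a column explicitly. A hedged sketch, assuming the table has a numeric column named id (the column name is an assumption):

sqoop import \
  --connect jdbc:mysql://localhost:3306/testdata \
  --username sqoopuser \
  --password sqooppaswd \
  --table tablename \
  --split-by id \
  --num-mappers 10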
Note: When we are using the import all tables option (sqoop import-all-tables):
1. either we have to use the default HDFS path, or
2. the Hive warehouse path (use the Hive warehouse path to import the data).
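A minimal import-all-tables sketch, assuming every table in the testdata database has a primary key and that the Hive warehouse path /user/hive/warehouse is used as the target directory; both are assumptions:

sqoop import-all-tables \
  --connect jdbc:mysql://localhost:3306/testdata \
  --username sqoopuser \
  --password sqooppaswd \
  --warehouse-dir /user/hive/warehouse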
- Sqoop can use file formats such as Parquet, ORC, and CSV for the imported tabular data.
- Sqoop also integrates with Hive and HBase, as sketched below.
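Two hedged sketches of these options; the connection details carry over from the earlier examples, and the Hive table name is an assumption. The first imports the table as Parquet files, the second imports it directly into a Hive table:

# Import as Parquet files instead of the default text files
sqoop import \
  --connect jdbc:mysql://localhost:3306/testdata \
  --username sqoopuser \
  --password sqooppaswd \
  --table tablename \
  --as-parquetfile

# Import directly into a Hive table (created if it does not already exist)
sqoop import \
  --connect jdbc:mysql://localhost:3306/testdata \
  --username sqoopuser \
  --password sqooppaswd \
  --table tablename \
  --hive-import \
  --hive-table tablename

For HBase, the analogous flags are --hbase-table and --column-family, which load the imported rows into an HBase table instead of HDFS files.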