How to Copy Data from a Hadoop Cluster to Cloud S3 | BigData | Hadoop | AWS

In this article, we explain how to copy large data sets from a Hadoop cluster to the cloud.



Copy from HDFS to Cloud using Distcp command:

Here we use Hadoop commands to migrate data from a Hadoop cluster to the cloud, or from one Hadoop cluster to another.

HDFS to AWS S3:

Here the source is the Hadoop Distributed File System (HDFS) and the destination is an Amazon S3 bucket. This is a cluster-level data migration from HDFS to the cloud (AWS S3).

hadoop distcp hdfs://source_directory/Files s3a://destination_directory

AWS S3 (Cloud) to HDFS

hadoop distcp s3a://source_files /destination[/tmp/datasets]
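The two commands above use placeholder locations. As a rough sketch of what fully qualified URIs might look like, the script below builds both commands and only prints them unless DRY_RUN is set to 0. The host namenode.example.com, port 8020, bucket example-bucket, and the function names are assumptions made for illustration; replace them with values from your own cluster.

```shell
# Sketch: build (and optionally run) DistCp commands in both directions.
# Host, port, bucket, and paths are hypothetical placeholders.

# Print the command when DRY_RUN=1 (the default here); run it otherwise.
run_or_print() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy from source URI $1 to destination URI $2 with DistCp.
distcp_copy() {
  run_or_print hadoop distcp "$1" "$2"
}

# HDFS to S3: hdfs://<namenode-host>:<port>/<path> to s3a://<bucket>/<path>
distcp_copy "hdfs://namenode.example.com:8020/data/files" "s3a://example-bucket/data/files"

# S3 to HDFS: the destination is a path on the cluster's default filesystem
distcp_copy "s3a://example-bucket/data/files" "/tmp/datasets"
```

In an hdfs:// URI the part right after the scheme is the NameNode host and port, and in an s3a:// URI it is the bucket name. Printing the command first makes it easy to check both before any data is moved.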

If the above command does not work properly, troubleshoot it as follows.




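Make sure that AWS S3 filesystem support has been added to the Hadoop installation used by ${HADOOP_HOME}/bin/hadoop (for s3a:// URIs this means the hadoop-aws module and its AWS SDK dependency are on the classpath). Then run the above distcp command on your Cloudera, Hortonworks, or MapR distribution. The distcp command sets up a MapReduce job that runs the copy in the background. The list of source files is divided among the map tasks, and the -m option sets the maximum number of maps. When the map tasks reach 100% and the job finishes successfully, all the data has been copied from source to destination.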
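One way to check the S3 filesystem support mentioned above is to look for the hadoop-aws module on the classpath that the hadoop command reports. The sketch below is an assumption about how such a check could be written, not a vendor-documented procedure; the function name has_hadoop_aws is made up for this example, and the check is skipped when hadoop is not on the PATH.

```shell
# Sketch: check whether the hadoop-aws module is visible to the hadoop command.

# Succeeds if the colon-separated classpath in $1 contains a hadoop-aws entry.
has_hadoop_aws() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -q 'hadoop-aws'
}

if command -v hadoop >/dev/null 2>&1; then
  # --glob expands wildcard entries such as .../lib/* into individual jars.
  if has_hadoop_aws "$(hadoop classpath --glob)"; then
    echo "hadoop-aws found on the classpath"
  else
    echo "hadoop-aws not found; s3a:// URIs are likely to fail"
  fi
else
  echo "hadoop command not found on PATH"
fi
```

If the module is missing, the usual symptom is an error saying that no filesystem is registered for the s3a scheme, or that the S3AFileSystem class cannot be found.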

If the steps above still fail, try the simple copy command to copy from HDFS to the cloud (Amazon S3).

Copy from HDFS to Cloud using CP command:

hadoop fs -cp /usr/hadoop/files/data.csv s3a://S3-Bucket-Name/

 

What is incremental copy in Hadoop?

An incremental copy transfers only the files that are new or have changed since the previous copy, which avoids copying duplicate data. Incremental backup is essential for large datasets in a production environment.

How to do incremental copy from HDFS to S3:

hadoop distcp -update -delete hdfs://source_directory/Files s3a://destination_directory

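The above command performs an incremental backup from the source, local HDFS (Hadoop Distributed File System), to the cloud, either AWS S3 or Azure. The -update option copies only the files that are missing or different at the destination, and -delete removes destination files that no longer exist at the source.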
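For a scheduled backup, the incremental command can be wrapped in a small script that reports failures. This is a minimal sketch under stated assumptions: the function name incremental_sync, the host namenode.example.com, port 8020, and the bucket example-bucket are hypothetical, and the script only prints the command unless DRY_RUN is set to 0.

```shell
# Sketch: incremental HDFS-to-S3 sync with a basic exit-status check.
# Namenode host, port, bucket, and paths are hypothetical placeholders.

incremental_sync() {
  # -update copies only files that are missing or different at the destination;
  # -delete removes destination files that no longer exist at the source.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "hadoop distcp -update -delete $1 $2"
  else
    hadoop distcp -update -delete "$1" "$2"
  fi
}

# Report a failure on stderr if the copy job exits with a non-zero status.
incremental_sync "hdfs://namenode.example.com:8020/data/files" \
  "s3a://example-bucket/data/files" || echo "incremental sync failed" >&2
```

Because -delete removes data at the destination, printing the command before running it for real is a cheap safeguard against pointing the sync at the wrong bucket or path.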

Summary: In a Hadoop production environment, the distcp command is very useful because it transfers large data sets from a local cluster to the cloud (either AWS or Azure).