[Resolved] Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand | Big Data | Hadoop | MapR

In this article, we will explain how to resolve a DistCp copy error that occurs in Hadoop and other Big Data distributions.



Error: java.io.IOException: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: java.io.FileNotFoundException: Requested file maprfs://mapr/user/File/Copy_File doesn't exist.

Error:

TASK ID failed: attempt_158934xxx_m_0003_1000, Status: FAILED

Error: java.io.IOException: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException: Requested file maprfs://mapr/user//Copy_File doesn't exist.
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:250)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:796)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.io.FileNotFoundException: Requested file maprfs://mapr/user/File/File_Copy does not exist.

Solution:

This issue is common for Hadoop admins. We ran into the above error while copying large data sets from MapR-FS to Amazon S3 (Simple Storage Service), and below is a simple resolution.




Step 1: Go to the source cluster and check whether the files are present.
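As a quick sketch of this check, assuming the same source path used in the distcp command later in this article, you can verify the path from the source cluster's command line:

```shell
# Source path to verify (adjust to your own dataset location).
SRC=maprfs://mapr/user/production/datasets

# List the directory; a missing path prints "No such file or directory".
hadoop fs -ls "$SRC"

# Exit status 0 from -test -e means the path exists.
hadoop fs -test -e "$SRC" && echo "source path exists"
```

If either command fails here, distcp will fail with the same FileNotFoundException shown above, so fix the path before retrying the copy.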

Step 2: If the large datasets/files do exist in the source cluster, try the command below.

nohup hadoop distcp -update -delete maprfs://mapr/user/production/datasets s3a://user/datasets/File > copy.log 2>&1 &

Step 3: Once the job completes, check the log file.
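One simple way to check the log is to scan it for failed copies; the file name copy.log matches the redirect in the command from Step 2:

```shell
# Count lines in the distcp log that mention an error or exception.
errors=$(grep -ciE 'error|exception' copy.log)

if [ "$errors" -eq 0 ]; then
    echo "distcp finished cleanly"
else
    echo "distcp reported $errors problem line(s):"
    grep -iE 'error|exception' copy.log | head -n 20
fi
```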




Summary: The above error is simple for Hadoop or Big Data admins to resolve, and it concerns not only Hadoop admins but also cloud admins (AWS, Azure). Why does distcp fail in the cluster? Because some files, such as AVRO and Parquet files, are not copied properly when they contain very large data sets. In our case, we hit this type of error for all users while copying large Spark log files from the source cluster to the destination cluster.

In case distcp does not work properly, try the SCP command or the Falcon tool; otherwise, use snapshots or the mirroring mechanism provided by your Big Data distribution.
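As a rough sketch of the SCP fallback, assuming a hypothetical edge node (edge-node) reachable from both clusters and a placeholder file name, a single file can be streamed out of MapR-FS over SSH without a local staging copy:

```shell
# Hypothetical names: edge-node, the hadoop user, and the file are placeholders.
FILE=part-00000.parquet

# Stream the file from MapR-FS straight into the remote destination over SSH.
hadoop fs -cat "maprfs://mapr/user/production/datasets/$FILE" |
    ssh hadoop@edge-node "cat > /data/landing/$FILE"
```

This avoids distcp entirely, but it copies one file at a time, so it is only practical for a handful of stubborn files rather than a whole dataset.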
