In this article, we explain how to resolve a disk I/O error that occurs while running Impala queries on a Cloudera Hadoop cluster in a Big Data environment.
This can happen while running a Spark streaming job that writes Avro/Parquet files in a Big Data cluster. Impala may still hold metadata from a previous session (for example, from the Spark driver program in the Spark context), so the query cannot find the data that was written to the HDFS (Hadoop Distributed File System) location and fails with the error below:
ERROR: Disk I/O error on :22000: Failed to open HDFS file hdfs://nameservice1/prd//cascsa/coresq/xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxx.0.avro Error(2): No such file or directory Root cause: RemoteException: File does not exist: /prd/cascsa/coresq/xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxx.0.avro at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
The above error is most likely caused by a race condition in your code: the query tries to read a table immediately after its underlying files have been loaded or changed, before the metadata has caught up. To avoid this, set the following query option:
SET SYNC_DDL=1;
Set this option in the Impala session (the error itself is raised by the Impala daemon), refresh the table metadata, and then re-run the query from the shell session.
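As a sketch, the steps above might look like this in an impala-shell session (the table name `coresq_table` is hypothetical, standing in for the table backed by the Avro files in the error message):

```sql
-- Enable SYNC_DDL for this session so DDL/DML statements return
-- only after the metadata change is visible cluster-wide.
SET SYNC_DDL=1;

-- Refresh the metadata for the affected table so Impala picks up
-- the files written by the Spark streaming job.
REFRESH coresq_table;

-- Re-run the query that previously failed.
SELECT COUNT(*) FROM coresq_table;
```

If the table was newly created outside Impala, `INVALIDATE METADATA coresq_table;` may be needed instead of `REFRESH`, since `REFRESH` only reloads file metadata for a table Impala already knows about.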
How does SYNC_DDL work?
When SYNC_DDL is enabled, each statement forces the metadata update to propagate to all Impala sessions in the cluster before it returns. If you are running a sequence of CREATE DATABASE, CREATE TABLE, ALTER TABLE, INSERT and similar statements within a setup script, enabling this query option ensures that each statement sees the results of the previous one, which minimizes the overall delay caused by retries against stale metadata.
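A minimal sketch of such a setup script, assuming SYNC_DDL is enabled up front (the database, table, and column names here are hypothetical examples, not from the original error):

```sql
-- With SYNC_DDL on, each statement completes only after all
-- coordinators see the new metadata, so the next statement
-- never races against stale metadata.
SET SYNC_DDL=1;

CREATE DATABASE IF NOT EXISTS demo_db;
CREATE TABLE IF NOT EXISTS demo_db.events (id BIGINT, payload STRING)
  STORED AS PARQUET;
ALTER TABLE demo_db.events ADD COLUMNS (ts TIMESTAMP);
INSERT INTO demo_db.events VALUES (1, 'hello', now());

-- Disable it afterwards to avoid the extra latency on statements
-- that do not need cluster-wide synchronization.
SET SYNC_DDL=0;
```

Note the trade-off: SYNC_DDL makes each DDL statement slower because it waits for cluster-wide propagation, so it is usually enabled only around setup scripts rather than left on permanently.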