Hadoop job (YARN staging) error while executing a simple job

In a Hadoop ecosystem, a large number of jobs run at any given time. I was executing a Hive job for data validation on the Hive server in a production environment, and while running it from the Hive command line I got the following error.



at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
22:33:33 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging//.staging/job_1562044010976_0003
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hadoop-yarn/staging//.staging/job_1562044010976_0003/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

The above error indicates a DataNode connection problem while the job was running: the DataNode was not working properly, so the NameNode could not place the job files on it. Follow the resolution below:

Stop all services and start them again:

stop-all.sh
start-all.sh

This restarts all services, including the NameNode, Secondary NameNode, DataNodes, and the remaining services such as Hive, Spark, etc.

If the error still appears, start the distributed file system daemons explicitly:

start-dfs.sh

Check that all the Hadoop daemons (NameNode, Secondary NameNode, DataNode, ResourceManager, NodeManager, etc.) are running, using the command below:

jps

Then check the node information with "hadoop dfsadmin -report" to confirm whether the DataNode is running fine or not.

The above steps apply to local, pseudo-distributed, and standalone modes of the Hadoop ecosystem only.

For Cloudera, Hortonworks, and MapR distributions, simply restart the DataNodes and services like Hive, Spark, etc.




Summary: In a Big Data environment we execute many Hadoop/Spark/Hive jobs, and sometimes the above error appears. When it does, the simple resolution above gets things running again.

Hadoop Admin vs Hadoop Developer

In a Hadoop environment, Hadoop Admin and Hadoop Developer are the two major roles. According to current IT market surveys, admins carry more responsibilities and command higher salaries than Hadoop developers. The roles can be differentiated by the points below:



1. Hadoop Developer:

1. In a Big Data environment the Hadoop developer plays a major role. A developer is primarily responsible for coding, developing with tools such as:

A) Apache Spark – Scala, Python, Java, etc.

B) MapReduce – Java

C) Apache Hive – HiveQL (a SQL-like query language)

D) Apache Pig – Pig scripting language, etc.

2. Familiarity with ETL workflows and data loading and ingestion tools like:

A) Flume

B) Sqoop

3. Some knowledge of the Hadoop admin side as well, such as the Linux environment and the basic commands used while developing and executing jobs.

4. Nowadays, Spark and Hive developers with strong experience are the most sought after and command large pay packages.

2. Hadoop Administration:

1. Hadoop administration is a good and respectable job in the IT industry. The admin is responsible for performing the operational tasks that keep the infrastructure and jobs running.

2. Strong knowledge of the Linux environment; setting up clusters, configuring security authentication such as Kerberos, and testing the HDFS environment.

3. Providing new users access to Hive, Spark, etc.; cluster maintenance such as adding (commissioning) and removing (decommissioning) nodes; and resolving errors like memory issues, user access issues, etc.

4. Must have knowledge of Big Data platforms such as:




A) Cloudera Manager

B) Hortonworks Data Platform

C) MapR

D) Pseudo-distributed and Single node cluster setup etc.

5. Reviewing and managing log files and setting up the XML configuration files.

6. Currently a trending job with good career growth.

7. Compared to Hadoop developers, Hadoop admins are getting higher salary packages in the present market.

Summary: In the Big Data environment, Hadoop offers valuable and trending jobs with large pay packages for both Hadoop developers and Hadoop administrators. Which role to pursue depends on your skill set and your plans for future growth.

Big Data Spark Multiple Choice Questions

Spark Multiple Choice Questions and Answers:

1) Point out the incorrect statement in the context of Cassandra:

A) Cassandra is a centralized key-value store

B) Cassandra was originally designed at Facebook

C) Cassandra is designed to handle a large amount of data across many commodity servers, providing high availability with no single point of failure.

D) Cassandra uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing

Ans: A (Cassandra is a decentralized key-value store, so statement A is the incorrect one)

2) Which of the following are the simplest NoSQL databases in a Big Data environment?

A) Document                                    B) Key-Value Pair

C) Wide-Column                        D) All of the above

Ans: B) Key-value stores are generally considered the simplest NoSQL databases

3) Which of the following is not a NoSQL database?

A) Cassandra                          B) MongoDB

C) SQL Server                           D) HBase

Ans: SQL Server

4) Which of the following is a distributed graph processing framework on top of Spark?

A) Spark Streaming                   B) MLlib

C) GraphX                                          D) All of the above

Ans: GraphX
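
For context, here is a minimal GraphX sketch, assuming a running SparkContext named sc; the vertex and edge data are made up for illustration:

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, property) pairs; edges carry (srcId, dstId, property)
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)          // a distributed property graph on top of RDDs
println(graph.vertices.count())             // 3
println(graph.inDegrees.collect().mkString(", "))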

5) Which of the following leverages Spark Core's fast scheduling capability to perform streaming analytics?

A) Spark Streaming                     B) MLlib

C) GraphX                                       D) RDDs

Ans: Spark Streaming

6) The primary Machine Learning API for Spark (spark.ml) is based on which of the following?

A) RDD                                 B) Dataset

C) DataFrame          D) All of the above

Ans: DataFrame
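
To illustrate, here is a small sketch of the DataFrame-based spark.ml API; the column names and values are made up, and a local SparkSession is assumed:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MLSketch").master("local[*]").getOrCreate()
// spark.ml transformers consume and produce DataFrames, not raw RDDs
val df = spark.createDataFrame(Seq((1.0, 2.0, 0.0), (3.0, 4.0, 1.0))).toDF("f1", "f2", "label")
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
assembler.transform(df).show()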

7) Spark's optimizer is built using which functional programming language?

A) Python                         B) R

C) Java                                   D) Scala

Ans: D) Scala, which is a functional programming language

8) Which of the following is a basic abstraction of Spark Streaming?

A) Shared variable                 B) RDD

C) DStream                                  D) All of the above

Ans: DStream
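
As a quick illustration, here is a minimal Spark Streaming sketch built around a DStream; the host, port, and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))        // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)     // DStream[String] from a socket source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                          // emits a new batch of counts every interval
ssc.start()
ssc.awaitTermination()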

9) Which of the following cluster managers does Spark support?

A) Mesos                                B) YARN

C) Standalone cluster manager   D) Pseudo cluster manager

E) All of the above

Ans: All of the above
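
For reference, the cluster manager is selected through the master URL when building the SparkConf. A sketch is below; the host names and ports are placeholders, and the plain "yarn" form corresponds to newer Spark versions:

import org.apache.spark.SparkConf

val localConf      = new SparkConf().setMaster("local[*]")                 // local / pseudo mode
val standaloneConf = new SparkConf().setMaster("spark://master-host:7077") // Standalone cluster manager
val mesosConf      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Apache Mesos
val yarnConf       = new SparkConf().setMaster("yarn")                     // Hadoop YARN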

10) Which of the following is the reason Spark executes faster than MapReduce?

A) It supports different programming languages like Scala, Python, R, and Java.

B) RDDs

C) DAG execution engine and in-memory computation (RAM based)

D) All of the above

Ans: DAG execution engine and in-memory computation (RAM based)
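
A small sketch of the idea: once an RDD is cached in memory, later actions reuse it instead of re-reading from disk, and the DAG scheduler pipelines the transformations. The input path here is hypothetical:

val data   = sc.textFile("hdfs:///data/sample.csv")       // hypothetical input path
val parsed = data.map(_.split(",")).cache()                // keep the parsed records in memory
val rows   = parsed.count()                                // first action reads the file and fills the cache
val keys   = parsed.map(_.head).distinct().count()         // reuses the cached data, no second read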

BigData and Spark Multiple Choice Questions – I

1. In Spark, a —————– is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

A) Resilient Distributed Dataset (RDD)                  C) Driver

B) Spark Streaming                                                          D) Flat Map

Ans: Resilient Distributed Dataset (RDD)
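
As a minimal sketch (assuming a running SparkContext named sc): the RDD below is partitioned across the cluster, is read-only, and can be rebuilt from its lineage (parallelize, then map) if a partition is lost:

val numbers = sc.parallelize(1 to 100, 4)   // 4 partitions spread across the cluster
val squares = numbers.map(n => n * n)       // transformations return new, read-only RDDs
println(squares.count())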

2. Consider the following statements in the context of Apache Spark:

Statement 1: Spark allows you to choose whether you want to persist Resilient Distributed Dataset (RDD) onto the disk or not.

Statement 2: Spark also gives you control over how you can partition your Resilient Distributed Datasets (RDDs).

A) Only statement 1 is true                 C) Both statements are true

B) Only statement 2 is true                  D) Both statements are false

Ans: Both statements are true
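
Both points can be seen in a short sketch, assuming a running SparkContext named sc:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val persisted   = pairs.persist(StorageLevel.MEMORY_AND_DISK)     // statement 1: choose whether/where to persist
val partitioned = persisted.partitionBy(new HashPartitioner(4))   // statement 2: control the partitioning
println(partitioned.partitioner)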

3) Given the following definition of the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

The join operation is used for joining two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

What is the output of joinrdd when the following code is run?

val rdd1 = sc.parallelize(Seq(("m", 55), ("m", 56), ("e", 57), ("e", 58), ("s", 59), ("s", 54)))
val rdd2 = sc.parallelize(Seq(("m", 60), ("m", 65), ("s", 61), ("s", 62), ("h", 63), ("h", 64)))
val joinrdd = rdd1.join(rdd2)
joinrdd.collect
A) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))
B) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))
C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))
D) None of the mentioned.

Ans: C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62))). Because join is an inner join, the keys "e" and "h", which appear in only one of the two RDDs, are dropped from the result.

4) Consider the following statements:

Statement 1: Scale up means growing your cluster capacity by replacing existing machines with more powerful machines.

Statement 2: Scale out means incrementally growing your cluster capacity by adding more COTS (commodity off-the-shelf) machines.

A) Only statement 1 is true               C) Both statements are true

B) Only statement 2 is true              D) Both statements are false

Ans: Both statements are true

Replication Factor in Hadoop





How the replication factor comes into the picture:

The backup mechanism in a traditional distributed system:

Before Hadoop, the backup mechanism did not provide high availability; that system followed a shared architecture.

A file request first goes to the master node, and the file is divided into blocks. Writing is a continuous process, and if node 1 (slave 1) fails, the data must be served from another node (slave 2).

Replication Factor:

The replication factor controls how many times data is duplicated across different slave machines to achieve high availability.

Replication is a backup, failover, and fault-tolerance mechanism.

In Hadoop, the default replication factor is 3; there is no need to configure it.

Hadoop 1.x: Replication factor is 3.
Hadoop 2.x: Replication factor is also 3.

In Hadoop, the minimum replication factor is 1, which is possible on a single-node Hadoop cluster.

In Hadoop, the maximum replication factor is 512.

If the replication factor is 3, then at least 3 slave nodes are required.

If the replication factor is 10, then at least 10 slave nodes are required.

The simple rule for the replication factor is:

'N' Replication Factor = 'N' Slave Nodes

Note: If the configured replication factor is 3 but only 2 slave machines are available, the actual replication factor is effectively 2.

How to configure Replication in Hadoop?

It is configured in the  hdfs-site.xml file.

/usr/local/hadoop/conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>5</value>
  </property>
</configuration>
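
The replication factor can also be changed programmatically through the Hadoop FileSystem API. Below is a minimal Scala sketch; the file path is hypothetical, and the Hadoop client libraries and configuration are assumed to be on the classpath:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()                  // picks up hdfs-site.xml from the classpath
conf.set("dfs.replication", "5")                // default replication for new files written by this client
val fs = FileSystem.get(conf)
fs.setReplication(new Path("/user/hadoop/sample.txt"), 5.toShort)  // change an existing file's replication
fs.close()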

Design Rules Of Replication In Hadoop:

1. In Hadoop, replication applies only to data in the Hadoop Distributed File System (HDFS), not to metadata.

2. By design, keep one replica per slave node.

3. Replication happens only on Hadoop slave nodes, never on the Hadoop master node (the master node manages only metadata; it does not hold the data).

Only storage is duplicated in Hadoop, not processing, because processing is always unique.

Summary: In Hadoop, the replication factor plays a major role as the data backup mechanism. The default replication factor is always 3, except in a single-node cluster environment.

Basic Terminology in Hadoop

Big Data Solutions:




1. NoSQL database (non-relational database) – only for structured and semi-structured data

2. Hadoop – implementation – structured, semi-structured, and unstructured data

3. Hadoop ecosystem and its components – for everything

Hadoop:

Hadoop is a parallel system for large-scale data storage and processing. It is a solution for Big Data.

For storage: HDFS (Hadoop Distributed File System).

For processing: MapReduce.

In Hadoop, some key terms are important to learn.

Hadoop Basic Terminology:

1. Cluster

2. Clustered Node

3. Hadoop Cluster Node

4. Hadoop Cluster

5. Hadoop Cluster Size

1. Cluster:

A cluster is a group of nodes that belong to one common network.

2. Clustered Node:

A clustered node is an individual machine that belongs to such a group (a cluster) in Hadoop.

3. Hadoop Cluster Node:

A Hadoop cluster node is a node used for the basic storage and processing work of a cluster.

For storage, it uses the Hadoop Distributed File System (HDFS).

For processing, it uses MapReduce.

4. Hadoop Cluster:

A Hadoop cluster is a collection of Hadoop cluster nodes on a common network.

5. Hadoop Cluster Size:

The Hadoop cluster size is the total number of nodes in a Hadoop cluster.

Hadoop Ecosystem:

1. Apache Pig       – Processing          – Pig scripting

2. Hive             – Processing          – HiveQL (a SQL-like query language)

3. Sqoop            – Integration tool    – import and export data

4. ZooKeeper        – Coordination        – distributed coordinator

5. Apache Flume     – Streaming           – log data streaming

6. Oozie            – Scheduling          – open-source job scheduler

7. HBase            – Random access       – Hadoop + dataBASE

8. NoSQL            – Not Only SQL        – MongoDB, Cassandra

9. Apache Kafka     – Messaging           – distributed messaging

10. YARN            – Resource manager    – Yet Another Resource Negotiator

Note: Apache Spark is not part of Hadoop but is commonly included nowadays. It is used for data processing and can be up to 100 times faster than Hadoop MapReduce.

Compatible Operating System for Hadoop Installation:

1. Linux

2. Mac OS

3. Sun Solaris

4. Windows

Hadoop Versions:

Hadoop 1.x

Hadoop 2.x




Hadoop 3.x

Different Distributions of Hadoop

1. Cloudera Distribution for Hadoop (CDH)

2. Hortonworks

3. MapR