This post explains how to balance HDFS block data in Hadoop.
Balancing HDFS Block Data
Although the name node attempts to distribute blocks evenly across data nodes when data is written, it is still possible for HDFS to become unbalanced.
What causes poor distribution of data?
- Addition of new data nodes
- Mass deletion of data
- Unevenly colocated clients
What does an unbalanced cluster cause?
- Reduced data locality in MapReduce
- Increased network utilisation
- Reduced job performance
- More wear on some disks than on others
How does the balancer help?
- Running the balancer regularly might be sufficient, or might at least stave off the problem until other accommodations can be made.
- The balancer works by first calculating the average utilization across the data nodes (the share of storage capacity in use) and then examining each data node's deviation from that average.
- If a node is below the average by more than some percentage, it is said to be under-utilized.
- A node above the average by more than that percentage is over-utilized.
- This percentage is called the threshold; it is passed to the balancer command described below.
- The rate at which the balancer transfers data over the network is capped by a configurable bandwidth limit.
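The threshold check described above can be sketched in a few lines of Python. This is a simplified illustration, not the balancer's actual implementation. It assumes utilization is measured as used space divided by capacity, and the function name classify_nodes, the node names, and the sample figures are all hypothetical.

```python
def classify_nodes(nodes, threshold):
    """Classify data nodes against the cluster-wide average utilization.

    nodes: dict mapping a node name to a (used, capacity) pair in the same unit.
    threshold: allowed deviation from the average, in percentage points.
    Returns (average_percent, {node_name: label}).
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    # Cluster-wide utilization, expressed as a percentage.
    average = 100.0 * total_used / total_capacity

    labels = {}
    for name, (used, capacity) in nodes.items():
        utilization = 100.0 * used / capacity
        if utilization > average + threshold:
            labels[name] = "over-utilized"
        elif utilization < average - threshold:
            labels[name] = "under-utilized"
        else:
            labels[name] = "balanced"
    return average, labels


# Hypothetical three-node cluster, sizes in GB.
cluster = {
    "dn1": (900, 1000),
    "dn2": (500, 1000),
    "dn3": (100, 1000),
}
average, labels = classify_nodes(cluster, threshold=10)
print(average, labels)
```

With a threshold of 10, a node counts as balanced while its utilization stays within 10 percentage points of the cluster average. A smaller threshold gives a more evenly balanced cluster but requires moving more data.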
Like other administrative commands, balancing HDFS block data requires administrator privileges:
1. Become the HDFS superuser or a user with equivalent privileges (or use sudo -u username when executing commands).
2. Execute hadoop balancer -threshold N to run the balancer in the foreground, where N is the threshold percentage within which data nodes should be of one another. To stop the process prematurely, press Ctrl+C or run kill with its process ID from another terminal. Alternatively, Apache Hadoop users can run the process in the background using the start-balancer.sh script; CDH users should use the hadoop-0.20-balancer init script.
3. Monitor the output (or log file, if you choose to run the balancer in the background) to track progress.
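Steps 2 and 3 could be wrapped in a small Python script along these lines. This is only a sketch: it assumes the hadoop executable is on the PATH and that the script is started by a user with the required privileges. The helper names build_balancer_command and run_balancer are hypothetical, and the range check on the threshold is an assumption of this example rather than documented behaviour.

```python
import subprocess


def build_balancer_command(threshold):
    """Build the argument list for: hadoop balancer -threshold N."""
    value = float(threshold)
    # Assumption of this sketch: a threshold is a percentage above 0 and at most 100.
    if not 0 < value <= 100:
        raise ValueError("threshold must be a percentage in (0, 100]")
    return ["hadoop", "balancer", "-threshold", f"{value:g}"]


def run_balancer(threshold):
    """Run the balancer in the foreground and echo its output as it arrives."""
    command = build_balancer_command(threshold)
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    # Relay each line of balancer output so progress can be monitored.
    for line in process.stdout:
        print(line, end="")
    return process.wait()
```

Calling run_balancer(10) would start the balancer with a 10 percent threshold and return its exit code when it finishes. Interrupting the script with Ctrl+C also interrupts the balancer when both run in the same terminal.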