How to Balance HDFS Block Data in Hadoop

This post explains how to balance HDFS block data in Hadoop.

Balancing HDFS Block Data

Although the name node attempts to distribute blocks evenly among data nodes as data is written, it is still possible for HDFS to become unbalanced over time.
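One quick way to see whether this has happened is to compare per-node utilization with dfsadmin; for example:

# Print cluster and per-data-node statistics, including DFS Used% for each node;
# widely varying percentages across nodes indicate an unbalanced cluster
hadoop dfsadmin -report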

Why does data become poorly distributed?

  • Addition of new data nodes (they start empty)
  • Mass deletion of data
  • Unevenly colocated clients (the first replica of each block is written to the client's local data node, so nodes that host busy clients fill faster)

Consequences:

  • Reduced data locality in MapReduce
  • Increased network utilisation
  • Reduced job performance
  • Uneven wear on disks

Possible solution:

  • Running the balancer regularly might be sufficient, or might at least stave off the problem until other accommodations can be made.
  • The balancer works by first calculating the average utilization across the cluster (space used as a percentage of capacity) and then examining each data node's deviation from that average.
  • A node whose utilization falls below the average by more than a given percentage is said to be underutilized.
  • A node above the average by more than that percentage is overutilized.
  • This percentage is called the threshold. For example, with an average utilization of 60% and a threshold of 10, nodes below 50% are underutilized and nodes above 70% are overutilized.
  • Separately, the rate at which the balancer transfers data over the network is capped by the dfs.balance.bandwidthPerSec property, as shown below.
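A minimal sketch of that setting in hdfs-site.xml, assuming a 0.20/1.x-era release where the property carries this name; the value is in bytes per second, and the 10 MB/s shown here is purely illustrative:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- Bytes per second each data node may spend on balancing; 10485760 = 10 MB/s (illustrative) -->
  <value>10485760</value>
</property>

In later releases the equivalent property is dfs.datanode.balance.bandwidthPerSec.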

As with other administrative commands, balancing HDFS block data requires administrator privileges.

1. Become the HDFS superuser or a user with equivalent privileges (or use sudo -u username when executing commands).
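For example, assuming the superuser account is named hdfs (the usual name in packaged installations; adjust for your cluster):

# Run the balancer as the hdfs superuser without switching users
sudo -u hdfs hadoop balancer -threshold 10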

2. Execute hadoop balancer -threshold N to run the balancer in the foreground, where N is the threshold percentage: when the balancer finishes, every data node's utilization should be within N percent of the cluster average. To stop the process prematurely, press Ctrl-C or kill its process ID from another terminal. Alternatively, Apache Hadoop users can run the process in the background using the start-balancer.sh script, as shown below; CDH users should use the hadoop-0.20-balancer init script.
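A sketch of the background option, assuming a standard Apache Hadoop layout under $HADOOP_HOME; both scripts ship with Apache Hadoop:

# Start the balancer as a background daemon with a 10% threshold
$HADOOP_HOME/bin/start-balancer.sh -threshold 10

# Stop a background balancer cleanly
$HADOOP_HOME/bin/stop-balancer.sh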

3. Monitor the output (or log file, if you choose to run the balancer in the background) to track progress.
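When the balancer runs in the background, progress is written to a log file rather than the terminal. A sketch, assuming logs land in $HADOOP_HOME/logs and follow the usual hadoop-<user>-balancer-<host>.log naming:

# Follow balancer progress as blocks are moved between data nodes
tail -f $HADOOP_HOME/logs/hadoop-*-balancer-*.log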