Blocksize in Hadoop

How data is stored on HDFS:




BLOCK:

A block is the individual unit of storage on the Hadoop Distributed File System (HDFS).

In Hadoop 1.x the default block size is 64 MB.

In Hadoop 2.x the default block size is 128 MB.

When a file request comes to the Hadoop cluster, these are the steps:

Step 1: Only the Hadoop master node receives the file request.

Step 2: Based on the block size configuration at that time, the data is divided into a number of blocks (see the sketch below).
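
A minimal Java sketch of this write path, assuming the standard Hadoop FileSystem client API; the path /data/A.log, the 64 MB value, and the class name are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        // The client picks up hdfs-site.xml from its classpath; the block size
        // in effect at this moment is what is used to divide the data into blocks.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 64L * 1024 * 1024; // 64 MB, overriding the cluster default for this file
        short replication = 3;
        int bufferSize = 4096;

        try (FSDataOutputStream out =
                fs.create(new Path("/data/A.log"), true, bufferSize, replication, blockSize)) {
            out.writeBytes("sample payload"); // everything written here is split into 64 MB blocks
        }
    }
}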

How to configure “Blocksize” in Hadoop?

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
</configuration>
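
The value is given in bytes (134217728 bytes = 128 MB). As an alternative sketch, assuming the same Hadoop Configuration API, the property can also be set per client program without editing hdfs-site.xml; it only applies to files written afterwards:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Programmatic equivalent of the hdfs-site.xml property above.
        // "dfs.block.size" is the Hadoop 1.x name; Hadoop 2.x prefers "dfs.blocksize".
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        // Read the effective value back; the second argument is a fallback default.
        long blockSize = conf.getLong("dfs.blocksize", 134217728L);
        System.out.println("Effective block size: " + blockSize + " bytes");
    }
}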

How to store data in HDFS:

Assume that we have A.log, B.log, and C.log files:

Scenario 1:

A.log -> 200 MB -> 200/64 -> 64 MB + 64 MB + 64 MB + 8 MB (the remaining 8 MB goes into a fourth, smaller block)

Scenario 2:

B.log -> 192 MB -> 192/64 -> 64 MB + 64 MB + 64 MB (three equal blocks, no remainder)
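
The arithmetic behind both scenarios can be sketched as follows (the helper methods and class name are made up for illustration):

public class BlockMath {
    // Number of blocks = ceil(fileSize / blockSize)
    static long numBlocks(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
    }

    // Size of the last block = whatever remains after the full blocks
    static long lastBlockMb(long fileSizeMb, long blockSizeMb) {
        long remainder = fileSizeMb % blockSizeMb;
        return remainder == 0 ? blockSizeMb : remainder;
    }

    public static void main(String[] args) {
        // A.log: 200 MB / 64 MB -> 4 blocks, last one 8 MB
        System.out.println(numBlocks(200, 64) + " blocks, last = " + lastBlockMb(200, 64) + " MB");
        // B.log: 192 MB / 64 MB -> 3 blocks, all 64 MB
        System.out.println(numBlocks(192, 64) + " blocks, last = " + lastBlockMb(192, 64) + " MB");
    }
}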

Design Rules of Blocksize:

1. Irrespective of the file size: every file in Hadoop gets its own dedicated set of blocks; blocks are not shared across files.




2. Except for the last block: all the other blocks of a file hold an equal volume of data (the configured block size).

The Hadoop master node looks at the block size only at the time of blocking the data (dividing it into blocks), not at the time of reading it, because at read time only the metadata matters (which blocks make up the file and where they live).
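
As a rough sketch of that read-side view, assuming the standard FileSystem API, a client can list a file's block metadata (offsets, lengths, and the DataNodes holding each block) without consulting the dfs.blocksize setting at all; the path /data/A.log is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/A.log"));

        // Block metadata kept by the master node: offset, length, and hosting DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}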