Replication Factor in Hadoop

How the Replication Factor comes into the picture:

The backup mechanism in a traditional distributed system:

Before Hadoop, the backup mechanism in a traditional distributed system did not provide high availability; such a system followed a shared architecture.

The file request first goes to the master node, where the file is divided into blocks based on the block size. Backup is a continuous copy process: if node 1 (Slave 1) fails, its data has to be copied over to another node (Slave 2).

Replication Factor:

The replication factor is the number of times data is duplicated on different slave machines to achieve high availability.

Replication acts as a backup mechanism, a failover mechanism, and a fault-tolerance mechanism.

In Hadoop, the default replication factor is 3, so there is no need to configure it.

Hadoop 1.x: the Replication Factor is 3.
Hadoop 2.x: the Replication Factor is also 3.

In Hadoop, the minimum replication factor is 1, which is possible for a single-node Hadoop cluster.

In Hadoop, the maximum replication factor is 512.

If the replication factor is 3, then a minimum of 3 slave nodes is required.

If the replication factor is 10, then 10 slave nodes are required.

Here is the simple rule for the replication factor:

'N' Replication Factor = 'N' Slave Nodes

Note: If the configured replication factor is 3 but only 2 slave machines are in use, the actual replication factor is 2.

How to configure Replication in Hadoop?

It is configured in the hdfs-site.xml file.

/usr/local/hadoop/conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>5</value>
  </property>
</configuration>
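
Beyond the cluster-wide setting above, the replication factor can also be changed for a single existing file through the HDFS Java API. The sketch below is illustrative only; the NameNode URI hdfs://localhost:9000 and the file path are hypothetical placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with the cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);

        // Ask HDFS to keep 5 replicas of this one file (hypothetical path).
        Path file = new Path("/user/hadoop/A.log");
        boolean changed = fs.setReplication(file, (short) 5);
        System.out.println("Replication change accepted: " + changed);

        fs.close();
    }
}

The same one-off change can also be made from the command line with hdfs dfs -setrep; the hdfs-site.xml value remains the cluster-wide default for new files.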

Design Rules Of Replication In Hadoop:

1. In Hadoop, replication applies only to the data in the Hadoop Distributed File System (HDFS), not to the metadata.

2. By design, keep one replica per slave node.

3. Replication happens only on the Hadoop slave nodes, not on the Hadoop master node (because the master node is only for metadata management; it does not hold the data itself).

In Hadoop, only storage is duplicated, not processing, because processing is always unique.

Summary: In Hadoop, the replication factor has played a major role as the data backup mechanism since the early days. The default replication factor is always 3, except in a single-node cluster environment.

Blocksize in Hadoop

How data is stored on HDFS:

BLOCK:

A block is the individual storage unit on the Hadoop Distributed File System.

In Hadoop 1.x, the default block size is 64 MB.

In Hadoop 2.x, the default block size is 128 MB.

When a file request comes to the Hadoop cluster, these are the steps:

Step 1: Only the Hadoop master node receives the file request.

Step 2: Based on the block size configured at that time, the data is divided into a number of blocks.

How to configure “Blocksize” in Hadoop?

/usr/local/hadoop/conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB -->
  </property>
</configuration>
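
The block size can also be chosen per file at write time through the HDFS Java API. As before, this is only a sketch; the NameNode URI and the file path are hypothetical.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with the cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);

        // Write a file with an explicit 128 MB block size and replication factor 3.
        Path file = new Path("/user/hadoop/B.log");   // hypothetical path
        long blockSize = 128L * 1024 * 1024;          // must be a multiple of 512 bytes
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize);
        out.writeUTF("sample data");
        out.close();

        fs.close();
    }
}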

How to store data in HDFS:

Assume that we have A.log, B.log, and C.log files:

Scenario 1:

A.log -> 200 MB -> 200/64 -> 64 MB + 64 MB + 64 MB + 8 MB (remaining)

Scenario 2:

B.log -> 192 MB -> 192/64 -> 64 MB + 64 MB + 64 MB
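
To make the arithmetic above concrete, here is a small self-contained Java sketch (independent of any Hadoop API) that splits a file size into 64 MB blocks plus a smaller final block:

public class BlockSplit {

    // Print how a file of the given size (in MB) splits into blocks of blockSizeMb.
    static void split(String name, long fileSizeMb, long blockSizeMb) {
        long fullBlocks = fileSizeMb / blockSizeMb;
        long remainder = fileSizeMb % blockSizeMb;
        StringBuilder layout = new StringBuilder(name + " (" + fileSizeMb + " MB): ");
        for (long i = 0; i < fullBlocks; i++) {
            layout.append(blockSizeMb).append(" MB ");
        }
        if (remainder > 0) {
            layout.append(remainder).append(" MB (smaller last block)");
        }
        System.out.println(layout.toString().trim());
    }

    public static void main(String[] args) {
        split("A.log", 200, 64);   // 64 MB, 64 MB, 64 MB, 8 MB
        split("B.log", 192, 64);   // 64 MB, 64 MB, 64 MB
    }
}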

Design Rules of Blocksize:

1. Irrespective of the file size: every file in Hadoop gets its own dedicated set of blocks.

2. Except for the last block: all the remaining blocks of a file hold an equal volume of data.

The Hadoop master node looks at the block size only when blocking (dividing) the data, not when reading it, because at read time only the metadata matters.
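
To see that reads are driven purely by metadata, a client can ask the master node (NameNode) for a file's block locations without touching the data itself. A minimal sketch, again assuming a hypothetical NameNode URI and file path:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMetadataExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with the cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);

        // Fetch block metadata for an existing file (hypothetical path) from the NameNode.
        Path file = new Path("/user/hadoop/A.log");
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }

        fs.close();
    }
}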

MapR Architecture

Before Hadoop was introduced in 2007, there was not a single data platform that could provide a scalable architecture to handle fast-growing data with a unified security model.

There are four important pillars of a data platform:

1. Distributed Metadata

2. Variety of Protocols and API support

3. Variety of Data persistence, like objects, files, tables, and event queues

4. Security

Distributed Metadata:

A centralized metadata service, as opposed to distributed metadata, leads to a number of restrictions:

1. Creates a single point of failure

2. Creates a hotspot that limits the scalability of the cluster

3. Limits sharing of data artifacts

4. Limits the number of data artifacts that can be stored in the cluster

MapR has built a distributed metadata service from the ground up that removes all of these restrictions.

The CLDB (Container Location Database) serves as MapR's level-I metadata service and maintains metadata about the volumes, containers, and nodes in the entire cluster.

The metadata about data artifacts such as objects, files, tables, topics, and directories forms the level-II metadata and is stored in the name container.

Variety of APIs and Protocol Support:

The MapR Data Platform makes the same data available through different APIs, so that different applications can use different APIs:

1. HDFS API

2. S3 API

3. NFS

4. POSIX

5. OJAI API

6. CDC API

Variety of Data persistence:

The MapR data container is the unit of storage allocation and management. Each container stores a variety of data elements, such as objects, files, tables, and directories.

It supports two types of data elements:

1. File chunks

2. Key-value stores

These two data elements are the building blocks in MapR: file chunks are spread (threaded) across containers, directories are built over key-value stores, and tables are built on top of files and key-value stores with an index.

The MapR Data Platform was architected in such a way as to solve most enterprise data problems and eliminate the need for separate data tools.

The heart of the MapR data platform is the Data Container.

The Data Container provides:

1. Different data persistence models, such as files, tables, objects, etc.

2. Distributed scale-out storage

3. Data loss prevention

4. Failure resilience and disaster recovery

What is Big Data?

Big Data means:

Big Data is a term for data sets so large or complex that traditional data processing applications are insufficient to deal with them. The challenges include analysis, analytics, data streaming, capture, search, storage, visualization, querying, updating, and information privacy. The term Big Data often refers simply to the use of predictive analytics, user behavior analytics, and other advanced data analytics methods that extract value from data.

Big Data and analytics require different types of techniques and technologies, with new forms of integration, to reveal insights from data sets that are diverse and complex.

Facts of Big Data:

A) Data is growing faster than ever before, and by the year 2020 around 2.0 megabytes of new information will be created every second.

B) Large data volumes are exploding; more data has been created in the past two years than ever before.

C) We are seeing massive growth in video and photo data, with huge volumes of data uploaded and downloaded on social media.

D) Social media users send on average around 50 million messages and view around 5 million videos every minute.

E) Distributed computing is very real: Google, for example, uses it every day, involving about 1,000 computers in answering a single search query.

Uses of Big Data:

A) Organizations are increasingly turning to big data to discover new ways to improve decision-making, opportunities, and performance.

B) Operational insights may depend upon machine data, which can include anything from computers and sensors to meters and GPS devices.

C) Cyber security, identification, and fraud detection are another use of big data. With access to real-time data, a business can enhance its security and intelligence analysis platforms.

Finally, Big Data is a problem of storing and processing very large data sets, and Hadoop is a simple solution for Big Data.