Replication Factor in Hadoop




How the Replication Factor comes into the picture:

The backup mechanism in a traditional distributed system:

A plain backup mechanism does not provide high availability; such systems follow a shared architecture.

A file request first goes to the master node, which divides the file into blocks of the configured block size. Writing is a continuous process, so if node 1 (Slave 1) fails, the work must continue on another node (Slave 2); that is where replication comes in.

Replication Factor:

The replication factor determines how many copies of the data are kept on different slave machines to achieve highly available processing.

Replication is a backup mechanism, failover mechanism, or fault-tolerance mechanism.

In Hadoop, the default replication factor is 3; it does not need to be configured explicitly.

Hadoop 1.x:
Replication Factor is 3
Hadoop 2.x:
Replication Factor is also 3.

In Hadoop, the minimum replication factor is 1. This is possible for a single-node Hadoop cluster.

In Hadoop, the maximum replication factor is 512.

If the replication factor is 3, then a minimum of 3 slave nodes is required.

If the replication factor is 10, then 10 slave nodes are required.



Here is a simple rule for the replication factor:

'N' Replication Factor = 'N' Slave Nodes

Note: If the configured replication factor is 3 but only 2 slave machines are in use, then the actual replication factor is also only 2.

How to configure Replication in Hadoop?

It is configured in the hdfs-site.xml file.

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>5</value>
</property>
</configuration>
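The replication factor of data that is already stored in HDFS can also be changed from the command line. A minimal sketch using the standard HDFS shell (the path /user/data/sample.txt is only an illustrative placeholder):

hdfs dfs -setrep -w 5 /user/data/sample.txt
hdfs dfs -ls /user/data/sample.txt

The -w flag waits until the re-replication completes, and the ls listing shows the current replication factor in its second column.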

Design Rules Of Replication In Hadoop:

1. In Hadoop, replication is applicable only to data in the Hadoop Distributed File System (HDFS), not to the metadata.

2. Only one replica of a block is kept per slave node, as per the design.

3. Replication happens only on the Hadoop slave nodes, never on the Hadoop master node (because the master node only manages the metadata; it does not hold the data).

Only storage is duplicated in Hadoop, not processing, because processing is always unique.

Summary: In Hadoop, the replication factor has played a major role as the data backup mechanism since the early days. The default replication factor is always 3, except in a single-node cluster environment.


Blocksize in Hadoop



How data is stored on HDFS:

BLOCK:

A block is the individual storage unit on the Hadoop Distributed File System.

In Hadoop 1.x, the default block size is 64 MB.

In Hadoop 2.x, the default block size is 128 MB.

When a file request comes to the Hadoop cluster, these are the steps:

Step 1: Only the Hadoop master node receives the file request.

Step 2: Based on the block size configured at that time, the data is divided into a number of blocks.

How to configure “Blocksize” in Hadoop?

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.block.size</name>
<value>14323883</value>
</property>
</configuration>
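The block size can also be overridden for a single file at upload time, without touching hdfs-site.xml. A small sketch, assuming the Hadoop 2.x property name dfs.blocksize and an illustrative target directory:

hdfs dfs -D dfs.blocksize=134217728 -put A.log /user/data/

Here 134217728 bytes equals 128 MB, so only this upload uses that block size instead of the configured default.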

How data is stored in HDFS:

Assume that we have A.log, B.log, and C.log files:

Scenario 1:

A.log -> 200 MB -> 200/64 -> 64 MB + 64 MB + 64 MB + 8 MB (the last block holds the remaining 8 MB)

Scenario 2:

B.log -> 192 MB -> 192/64 -> 64 MB + 64 MB + 64 MB

Design Rules of Blocksize:

1. Irrespective of the file size: a dedicated set of blocks is created for each and every file in Hadoop.

2. Except for the last block: all the other blocks of a file hold an equal volume of data.

The Hadoop master node looks at the block size only at the time of dividing the data into blocks, not at the time of reading the data, because at read time only the metadata matters.
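To verify how a stored file was actually split into blocks and where the replicas live, the fsck utility can be used. A minimal sketch with the A.log file from Scenario 1 (the HDFS path is illustrative):

hdfs fsck /user/data/A.log -files -blocks -locations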


Apache SQOOP in Hadoop



Apache Sqoop:

Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases. It is mostly used to import/export data from an RDBMS to HDFS and vice versa. Sqoop works with relational databases such as Teradata, Oracle, MySQL, etc.

In short, Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Where is Sqoop used?

Developers find the transferring of data between relational database systems and HDFS uninteresting; the interesting work starts after the data is loaded into HDFS. Without a dedicated tool, they end up writing custom scripts to transfer data in and out of Hadoop.

If MapReduce programs had to do similar jobs directly, the database server would experience very high load from the large number of concurrent connections opened while the MapReduce programs were running, causing performance issues.

Apache Sqoop makes this possible with a single command line. Internally, Sqoop mostly uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
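For illustration, a sketch of a typical Sqoop import invocation; the JDBC connection string, credentials, table name, and target directory below are placeholder values, not taken from any particular setup:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

The --num-mappers option controls how many parallel map tasks (and therefore concurrent database connections) are used.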

What Does Sqoop Do?

1. Sqoop imports sequential data sets from a mainframe – addressing the growing need to move data from the mainframe to HDFS.

2. Data import – moves certain data from external stores into Hadoop to optimize the cost-effectiveness of combined data storage and processing.

3. Fast data copies – from external systems into Hadoop.

4. Parallel data transfer – faster performance and optimal system utilization.

5. Load balancing – offloads excessive storage and processing loads to other systems.

Apache Sqoop latest version:

Latest stable release 1.4.7


Sqoop Architecture:

The Sqoop command submitted by the end user is parsed by Sqoop, which launches a Hadoop MapReduce job (a map-only job) to import or export the data. Sqoop just imports and exports the data; it does not do any aggregations. The map job launches multiple mappers, depending on the number defined by the user on the command line. Each mapper creates a connection with the database using JDBC and fetches the part of the data assigned to it.

SQOOP Architecture diagram:

 

Sqoop is only for importing and exporting the data.
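The export direction is symmetrical. A minimal sketch with placeholder database, table, and HDFS directory names:

sqoop export \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table order_summary \
  --export-dir /user/hadoop/order_summary \
  --num-mappers 2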


Hive: SortBy Vs OrderBy Vs DistributeBy Vs ClusterBy



SortBy:

Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: if the column is numeric, the sort order is numeric; if the column is of string type, the sort order is lexicographical. SORT BY orders the data at each of the 'N' reducers, but each reducer can have overlapping ranges of data.

Output: N or more sorted files with overlapping ranges.

Example Query for SortBy

SELECT key, value FROM source SORT BY key ASC, value DESC

Order By:

This is similar to ORDER BY in the SQL language. In Hive, ORDER BY guarantees total ordering of the data, but to achieve that all the data has to be passed through a single reducer, which is normally unacceptable for large datasets. Therefore, in strict mode, Hive makes it compulsory to use LIMIT with ORDER BY so that the single reducer does not get exhausted.

Ordering: Total ordering of the data.

Output: A single output file, i.e. fully ordered.

Example Query for OrderBy

SELECT key, value FROM source ORDER BY key ASC, value DESC

Distribute By:

Apache Hive uses the columns in DISTRIBUTE BY to distribute the rows between reducers. All rows with the same DISTRIBUTE BY column values go to the same reducer.

DISTRIBUTE BY ensures that each of the N reducers gets non-overlapping ranges of the column, but it does not sort the output of each reducer.
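An example query for Distribute By, following the same pattern and source table as the SortBy and OrderBy queries above:

SELECT key, value FROM source DISTRIBUTE BY key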

Example:

Using DISTRIBUTE BY x on the following 5 rows sent to 2 reducers:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Cluster By:

Cluster By is a combination of both Distribute By and Sort By. CLUSTER BY x ensures that each of the N reducers gets non-overlapping ranges of x, then sorts by those ranges at each reducer.

Ordering: Global ordering between multiple reducers.

Output: N or more sorted files with non-overlapping ranges.
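An example query for Cluster By, again using the same source table as above:

SELECT key, value FROM source CLUSTER BY key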

Example:

Referring to the same example as above, if we use CLUSTER BY x, the two reducers will further sort the rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4



Basic Terminology in Hadoop




Big Data Solutions:

1. NoSQL – database (non-relational database) – only for structured and semi-structured data

2. Hadoop – implementation – structured, semi-structured, and unstructured data

3. Hadoop ecosystem and its components – for everything

Hadoop:

Hadoop is a parallel system for large-scale data storage and processing. It is a solution for Big Data.

For storage purposes it uses HDFS – the Hadoop Distributed File System.

For processing purposes it simply uses MapReduce.

In Hadoop, some keywords are very important to learn.

Hadoop Basic Terminology:

1. Cluster

2. Clustered Node

3. Hadoop Cluster Node

4. Hadoop Cluster

5. Hadoop Cluster Size

1. Cluster:

A cluster is a group of nodes that all belong to one common network.

2. Clustered Node:

A clustered node is an individual machine that is part of such a group of machines in Hadoop.

3. Hadoop Cluster Node:

A Hadoop cluster node is a node that serves the basic storage and processing purpose of a cluster.

For storage purposes, we use the Hadoop Distributed File System.

For processing purposes, we use MapReduce.

4. Hadoop Cluster:

A Hadoop cluster is a collection of Hadoop cluster nodes on a common network.

5. Hadoop Cluster Size:

The Hadoop cluster size is the total number of nodes in a Hadoop cluster.
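As a quick check, the cluster size and the state of each data node can be inspected from the command line. A minimal sketch using the standard HDFS admin tool:

hdfs dfsadmin -report

The report lists the live and dead datanodes, which corresponds to the slave-node count of the cluster.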


Hadoop Ecosystem:

1. Apache Pig – Processing – Pig scripting

2. Hive – Processing – HiveQL (query language like SQL)

3. Sqoop – Integration tool – import and export data

4. ZooKeeper – Coordination – distributed coordinator

5. Apache Flume – Streaming – log data streaming

6. Oozie – Scheduling – open-source job scheduling

7. HBase – Random access – Hadoop + dataBASE

8. NoSQL – Not Only SQL – MongoDB, Cassandra

9. Apache Kafka – Messaging – distributed messaging

10. YARN – Resource manager – Yet Another Resource Negotiator

Note: Apache Spark is not a part of Hadoop, but it is commonly included nowadays. It is used for data processing. Spark can be up to 100 times faster than Hadoop MapReduce.

Compatible Operating Systems for Hadoop Installation:

1. Linux

2. Mac OS

3. Sun Solaris

4. Windows

Hadoop Versions:

Hadoop 1.x

Hadoop 2.x

Hadoop 3.x

Different Distributions of Hadoop:

1. Cloudera Distribution for Hadoop (CDH)

2. Hortonworks

3. MapR



Latest Hadoop Admin Interview Questions with Answers



Latest Hadoop admin interview questions and answers:

1. What is Edge Node? Why choose two edge nodes in a cluster?

Basically, edge nodes serve end-user connectivity purposes; they act as an interface between the cluster and the clients.

A single edge node is a single point of failure: if that edge node goes down, clients can still connect through the other edge node, and that is why we use two edge nodes.

2. If you have four master nodes, what services are installed on them?

Master node 1: NameNode, Secondary NameNode, Hive Server, Resource Manager, and one ZooKeeper

Master node 2: HBase Master and Oozie server

Master node 3: Hue, Spark, and the third ZooKeeper

Master node 4: high-availability (standby) services

3. What are the default block sizes of Hadoop and Unix?

The default block size of HDFS is 128 MB.

The default block size of Unix is 4 KB.
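If required, the effective HDFS block size of a running cluster can be confirmed rather than assumed. A small sketch using the getconf utility (dfs.blocksize is the Hadoop 2.x property name; the value is printed in bytes, so 134217728 means 128 MB):

hdfs getconf -confKey dfs.blocksize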



4. What security measures are implemented in the Hadoop cluster?

LDAP for first-level authentication

Kerberos for second-level authentication

Sentry for role-based authorization to data and metadata stored on the Hadoop cluster

Knox acts as a gateway, controlling who can access the cluster

Ranger provides security across the Hadoop ecosystem for folder access and data authorization

5. How do you secure data in transit, i.e. data transmitted over the network?

By encrypting the data transmitted over the network, and by using SSL certificates, HTTPS, and other secure protocols as well.

6. What are the types of accounts used in the Hadoop cluster?

Service account: an account created in Active Directory and used within the Hadoop cluster to run jobs and applications.

Technical account: an account used by outside clients for application access, for example a Java client accessing Hive.

Business user account: an account belonging to business users who want to access the Hadoop cluster.

Admin account: a highly privileged account used to grant credentials to users from Active Directory.

Local account: a Unix-based local account corresponding to Active Directory principals.


MapR



What is MapR?

MapR is one of the Big Data distributions. It is a complete enterprise distribution for Apache Hadoop, designed to improve Hadoop's reliability, performance, and ease of use.

Why MapR?

1. High Availability:

MapR provides high-availability features such as self-healing, which means there is no NameNode architecture.

It also has JobTracker high availability and NFS support. MapR achieves this by distributing its file system metadata.

2. Disaster Recovery:

MapR provides a mirroring facility which allows users to enable policies and mirror data automatically, within a multi-node or single-node cluster, or between on-premise and cloud infrastructure.

3. Record Performance:

MapR holds a world performance record, at a cost of only $9 compared to an earlier cost of $5M, at a speed of 54 seconds, and it can handle large clusters of around 2,200 nodes.

4. Consistent Snapshots:

MapR is the only big data distribution which provides consistent, point-in-time recovery, because of its unique read-write storage architecture.

5. Complete Data Protection:

MapR has its own security system for data protection at the cluster level.



6. Compression:

MapR provides automatic, behind-the-scenes compression of data; it applies compression automatically to files in the cluster.

7. Unbiased Open Source:

MapR is a completely unbiased open-source distribution.

8. Real multitenancy, including YARN

9. Enterprise-grade NoSQL

10. Read-Write File System:

MapR has a full read-write file system.

MapR Ecosystem Packs (MEP):

The “MapR Ecosystem” is the set of open-source projects included in the MapR Platform, and a “pack” is a bundled set of MapR Ecosystem projects with specific versions.

MapR Ecosystem Packs are mostly released every quarter, with yearly releases as well.

A single version of MapR may support multiple MEPs, but only one at a time.

Typically, Hadoop ecosystem and open-source components such as Spark and Hive are included, and the MapR Ecosystem Packs also bundle tools like the ones below:

Collectd
Elasticsearch
Grafana
Fluentd
Kibana
OpenTSDB



MapR Vs Cloudera Vs Hortonworks



In Big Data, three distributions are the most familiar in the present market:

1. Cloudera

2. Hortonworks

3. MapR

 

Cloudera and HDP (Hortonworks Data Platform) offer open-source as well as enterprise editions, but MapR is a complete enterprise distribution for Apache Hadoop, designed to improve Hadoop's reliability, performance, and ease of use.

                        Hortonworks        Cloudera              MapR

Manageability:

Management Tools        Ambari             Cloudera Manager      MapR CS
Volume Support          No                 No                    Yes
Heat map, Alarms        Yes                Yes                   Yes
Alerts                  Yes                Yes                   Yes
REST API                Yes                Yes                   Yes

High Availability:

Hortonworks  - Single failure recovery
Cloudera     - Single failure recovery
MapR         - Self healing across multiple failures

Replication:

Hortonworks - Data
Cloudera    - Data
MapR        - Data + Metadata

Disaster Recovery:

Hortonworks - No
Cloudera    - File Copy Scheduling
MapR        - Mirroring

Upgrading:

Hortonworks - Planned downtime
Cloudera    - Rolling Upgrades
MapR        - Rolling Upgrades



Summary: Nowadays, Big Data and analytics are among the most emerging technologies, and the main Big Data distributions are Cloudera, HDP, and MapR, each with its own special features and open-source and enterprise editions. MapR is used mostly in the banking and finance sectors. Cloudera is used everywhere, with both enterprise and open-source editions. Hortonworks is much the same as Cloudera.

How to Install Hadoop on Windows



Hadoop Installation on Windows 10:

Prerequisite: Java 1.7 or a later version is mandatory for Hadoop installation on Windows.

Use the javac -version and java -version commands to check the Java version and confirm that the installation is complete.

Step 1: Go to the Apache mirrors for the Hadoop tarball and download it.

Step 2: After downloading the tarball, extract it into your chosen path, or create a separate folder such as Hadoop_Installation, and then proceed.

Step 3: After extracting the Hadoop files, put them into the Program Files path.

Step 5: After that, go to the Hadoop bin folder and copy its path, like below:

C:\Program Files\hadoop-2.8.0\hadoop-2.8.0\bin

Step 6: Open the Environment Variables dialog and create a new user variable:

Variable Name: HADOOP_HOME

Variable Value: C:\Program Files\hadoop-2.8.0\hadoop-2.8.0\bin

Step 7: Check whether JAVA_HOME is set. If it is not, create it as a system variable.

Step 8: If JAVA_HOME is set in the system variables, move it to the top position using the Move Up button.

Step 9: After completing the environment variable setup, go to the Hadoop etc folder to configure all the XML files, such as core-site.xml, yarn-site.xml, etc. Simply check the path below and configure them.

C:\Program Files\hadoop-2.8.0\hadoop-2.8.0\etc\hadoop

Then copy the XML code below into the core-site.xml file.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>




Step 10: Then edit the hdfs-site.xml file for storage configurations such as replication.

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>

Step 11: Configure the yarn-site.xml file using the XML code below.

<configuration>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

</configuration>

Step 12: After completing all the XML file configurations, create the namenode and datanode folders referenced in hdfs-site.xml inside the Hadoop folder.

Step 13: Go to the Hadoop path, open a command prompt, and format the NameNode using the command below:

hadoop namenode -format
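Before the web UI in the next step is reachable, the HDFS and YARN daemons have to be running. A minimal sketch, assuming the Windows .cmd start scripts shipped in the sbin folder of this Hadoop build (run them from C:\Program Files\hadoop-2.8.0\hadoop-2.8.0\sbin):

start-dfs.cmd
start-yarn.cmd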

Step 14: Open the NameNode web UI, e.g. http://localhost:50070, for NameNode information.

Step 15: Finally, after the successful installation of a single-node Hadoop cluster on a Windows machine, it can be used straightforwardly by both Hadoop developers and administrators.

 


Connection refused error while running Hive in Hadoop




When running Hive on a single-node Hadoop cluster setup, you may sometimes see an error like the one below:

Connection refused error in Hive

Exception in thread "main" java.lang.RuntimeException: Call From your domain/127.0.1.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused;

For more details see:

http://wiki.apache.org/hadoop/ConnectionRefused

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke(Native Method)
...more

Caused by: java.net.ConnectException: Call From slthupili/127.0.1.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused;

at sun.reflect.NativeConstructorAccessorImpl.newInstance0
(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance
(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance
(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
...more

Caused by: java.net.ConnectException: Connection refused

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)

...more

Solution:


First, stop all Hadoop services using the command below:

$ stop-all.sh

This command stops all services, such as the NameNode, DataNode, Secondary NameNode, YARN, etc.

As a second step, back up the data, and then use the command below:

$ hadoop namenode -format

The above command formats the NameNode and clears the stale metadata.
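The Hadoop services then have to be started again before launching Hive; a short sketch using the counterpart of the stop script above:

$ start-all.sh

After the services are back up, enter the hive command: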

$ hive
