HBase Table (Single & Multiple) Data Migration from One Cluster to Another Cluster

HBase single table migration from one cluster to another cluster:

The following simple steps migrate a single HBase table from an existing cluster to a new cluster:

Step 1: Export the HBase table data to an HDFS (Hadoop Distributed File System) path on the source cluster.

Step 2: Copy the exported data from the source cluster to the destination cluster with the distcp command (distcp is a distributed copy tool, mostly used to copy data from one cluster to another).

Step 3: Create the HBase table in the destination (target) cluster.

Step 4: Import the copied data from the destination HDFS path into the HBase table in the destination cluster.

Source Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Driver export <hbase_table_name> <source_hdfs_path>

2. hadoop distcp hdfs://<source_cluster_ipaddress>:8020/<source_hdfs_path> hdfs://<destination_cluster_ipaddress>:8020/<destination_hdfs_path>

Destination Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Import <hbase_table_name> <hbase_table_hdfs_path>
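The export, copy and import commands for one table can be assembled programmatically, which helps avoid typos in the long class names and HDFS URIs. The Python sketch below only builds and prints the command lines; it does not run them. The function name, table name, IP addresses and paths are hypothetical, and port 8020 is assumed to be the NameNode port on both clusters.

```python
import shlex


def single_table_migration(table, src_host, dst_host, src_path, dst_path, port=8020):
    """Build the export, distcp and import commands for one HBase table."""
    # Runs on the source cluster: dump the table into an HDFS directory.
    export_cmd = ["hbase", "org.apache.hadoop.hbase.mapreduce.Driver",
                  "export", table, src_path]
    # Copies the exported files from the source HDFS to the destination HDFS.
    distcp_cmd = ["hadoop", "distcp",
                  f"hdfs://{src_host}:{port}{src_path}",
                  f"hdfs://{dst_host}:{port}{dst_path}"]
    # Runs on the destination cluster, after the table has been created there.
    import_cmd = ["hbase", "org.apache.hadoop.hbase.mapreduce.Import",
                  table, dst_path]
    return [export_cmd, distcp_cmd, import_cmd]


# Hypothetical values, for illustration only.
commands = single_table_migration(
    "customers", "10.0.0.1", "10.0.0.2",
    "/tmp/hbase-export/customers", "/tmp/hbase-import/customers")
for cmd in commands:
    print(shlex.join(cmd))
```

Each list could be passed to subprocess.run on the appropriate cluster once the printed commands have been reviewed.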

HBase multiple table migration from one cluster to another cluster:

Once single table migration is understood, multiple tables can be migrated from one cluster to another in a similarly simple manner.

With the script files in place, multiple HBase tables are migrated through the steps below:

Step 1: Place hbase-export.sh and hbase-table.txt in the source cluster.

Step 2: Place hbase-import.sh and hbase-table.txt in the destination cluster.

Step 3: List all the tables in the hbase-table.txt file.

Step 4: Create all the HBase tables on the destination cluster.

Step 5: Execute the hbase-export-generic.sh in the source cluster

Step 6: Execute the hbase-import.sh in the destination cluster.
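The contents of the export and import scripts are not shown here, so the following Python sketch is only a guess at their core loop: read the table names from the table list and produce one export command per table. It assumes one table name per line in hbase-table.txt, with blank lines and lines starting with # ignored; the function names and the export root directory are hypothetical.

```python
def read_table_list(text):
    """Return the table names found in the text of a table-list file."""
    tables = []
    for line in text.splitlines():
        name = line.strip()
        # Assumption: blank lines and '#' comment lines are not table names.
        if name and not name.startswith("#"):
            tables.append(name)
    return tables


def export_commands(tables, export_root):
    """Build one HBase Export command per table, each with its own HDFS directory."""
    return [
        ["hbase", "org.apache.hadoop.hbase.mapreduce.Export",
         table, f"{export_root}/{table}"]
        for table in tables
    ]


# Hypothetical file contents, for illustration only.
sample = "orders\n\n# archived tables are skipped\ncustomers\n"
for cmd in export_commands(read_table_list(sample), "/tmp/hbase-export"):
    print(" ".join(cmd))
```

The import side would follow the same pattern with the Import class and the destination HDFS directory.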

Summary: I tried this HBase data migration from one cluster to another in a Cloudera Distribution of Hadoop (CDH) environment. Both single table and multiple table migration are very simple for Hadoop administrators as well as Hadoop developers. The same steps apply to the Hortonworks distribution.

Replication Factor in Hadoop

How the replication factor comes into the picture:

The backup mechanism in a traditional distributed system:

In a traditional distributed system, the backup mechanism did not provide high availability. Such a system follows a sharded architecture.

A file request first goes to the master node, and the file is then divided into blocks according to the block size. Processing continues, but if node 1 (slave 1) fails, the work has to move to another node (slave 2).

Replication Factor:

Replication is the process of duplicating data on different slave machines to achieve high availability; the replication factor is the number of copies kept.

Replication is a backup mechanism, a failover mechanism and a fault-tolerance mechanism.

In Hadoop, the default replication factor is 3, so no configuration is needed to get it.

Hadoop 1.x:
Replication factor is 3.
Hadoop 2.x:
Replication factor is also 3.

In Hadoop, the minimum replication factor is 1. This is what a single-node Hadoop cluster uses.

In Hadoop, the maximum replication factor is 512 (the default value of dfs.replication.max).

If the replication factor is 3, then at least 3 slave nodes are required.

If the replication factor is 10, then at least 10 slave nodes are required.

The simple rule for the replication factor:

'N' Replication Factor = at least 'N' Slave Nodes

Note: If the configured replication factor is 3 but only 2 slave machines are in use, then the actual replication factor is 2.
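The rule and the note above can be expressed as a small calculation. The Python sketch below is an illustration rather than Hadoop code: the function names are hypothetical, and the bounds 1 and 512 are the minimum and maximum quoted above.

```python
def effective_replication(configured, slave_nodes, minimum=1, maximum=512):
    """Return the number of copies that can actually be placed."""
    if not minimum <= configured <= maximum:
        raise ValueError("replication factor must be between 1 and 512")
    # One replica per slave node, so the node count caps the real factor.
    return min(configured, slave_nodes)


def raw_storage_mb(file_size_mb, configured, slave_nodes):
    """Return the total disk space used by a file across the cluster."""
    return file_size_mb * effective_replication(configured, slave_nodes)


print(effective_replication(3, 2))   # configured 3, only 2 slaves
print(raw_storage_mb(200, 3, 5))     # 200 MB file stored 3 times
```

The first call mirrors the note above: with only 2 slave machines, a configured factor of 3 yields 2 actual copies.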

How to configure Replication in Hadoop?

It is configured in the hdfs-site.xml file:

<property>
  <name>dfs.replication</name>
  <value>5</value>
</property>

Design Rules Of Replication In Hadoop:

1. In Hadoop, replication applies only to the data stored in the Hadoop Distributed File System (HDFS), not to the metadata.

2. Keep one replica of a block per slave node, as per the design.

3. Replication happens only on the Hadoop slave nodes, not on the Hadoop master node (the master node only manages metadata; it does not hold the data).

Only storage is duplicated in Hadoop, not processing, because processing is always unique.

Summary: In Hadoop, the replication factor has played a major role as the data backup mechanism since the early days. The default replication factor is always 3, except in a single-node cluster environment.

Blocksize in Hadoop

How data is stored on HDFS:

A block is the individual storage unit of the Hadoop Distributed File System.

In Hadoop 1.x the default block size is 64 MB.

In Hadoop 2.x the default block size is 128 MB.

When a file request comes to the Hadoop cluster, the steps are:

Step 1: Only the Hadoop master node receives the file request.

Step 2: Based on the block size configured at that time, the data is divided into a number of blocks.

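How to configure “Blocksize” in Hadoop?

It is configured in the hdfs-site.xml file through the dfs.blocksize property (dfs.block.size in Hadoop 1.x).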


How to store data in HDFS:

Assume that we have A.log, B.log, and C.log files:


A.log -> 200 MB -> 200/64 -> 64 MB, 64 MB, 64 MB, 8 MB (the remainder)


B.log -> 192 MB -> 192/64 -> 64 MB, 64 MB, 64 MB
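The division shown for A.log and B.log can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only, with a hypothetical function name and sizes in whole megabytes; the 64 MB default matches the Hadoop 1.x block size used in the example.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes of the blocks a file of the given size is divided into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    # Only the last block may be smaller than the configured block size.
    if remainder:
        blocks.append(remainder)
    return blocks


print(split_into_blocks(200))  # A.log
print(split_into_blocks(192))  # B.log
```

With block_size_mb=128 (the Hadoop 2.x default) the same 200 MB file would occupy two blocks, of 128 MB and 72 MB.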

Design Rules of Blocksize:

1. Irrespective of the file size: every file gets its own dedicated set of blocks in Hadoop.

2. Except for the last block: all the other blocks of a file hold an equal volume of data.

The Hadoop master node looks at the block size only when dividing the data into blocks, not when reading it, because at read time only the metadata matters.

Deep Learning Overview

Deep Learning:

Deep learning is also known as hierarchical learning or deep structured learning. It is part of the broader family of machine learning methods and is based on learning data representations, as opposed to task-specific algorithms. The learning can be supervised, semi-supervised or unsupervised.

Deep learning methods aim at learning feature hierarchies, in which higher-level features are formed by composing lower-level features. Automatically learning features at multiple levels of abstraction allows a system to learn complex functions that map the input to the output directly from data, without depending completely on human-crafted features.

Deep learning architectures, mostly neural networks such as deep belief networks and recurrent neural networks, have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design and board game programs.

Deep learning models are loosely inspired by information processing and communication patterns in biological nervous systems, yet they differ in various ways from the structural and functional properties of biological brains.

Apache SQOOP in Hadoop

Apache Sqoop:

Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases. It is mostly used to import data from an RDBMS into HDFS and to export it back. Sqoop works with relational databases such as Teradata, Oracle and MySQL.

It is designed for efficiently transferring bulk data between Hadoop and structured datastores.

Where is Sqoop used?

Developers find transferring data between relational database systems and HDFS uninteresting; the interesting work starts after the data is loaded into HDFS. Without a dedicated tool, they end up writing custom scripts to move data in and out of Hadoop.

If MapReduce programs did this work by reading from the database directly, the large number of concurrent connections would put the database server under very high load and cause performance issues while the jobs were running.

Apache Sqoop makes the transfer possible with a single command line. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

What does Sqoop do?

1. Imports sequential data sets from the mainframe – meets the growing need to move data from the mainframe to HDFS.

2. Data import – moves certain data from external stores into Hadoop to optimize the cost-effectiveness of combined data storage and processing.

3. Fast data copies – from external systems into Hadoop.

4. Parallel data transfer – faster performance and optimal system utilization.

5. Load balancing – moves excessive storage and processing loads to other systems.

Apache Sqoop latest version:

Latest stable release 1.4.7

Sqoop Architecture:

A Sqoop command submitted by the end user is parsed by Sqoop, which launches a map-only Hadoop MapReduce job to import or export the data. A reduce phase would be needed only for aggregations, and Sqoop does not do any aggregation; it just imports and exports the data. The map job launches multiple mappers, depending on the number defined by the user on the command line. Each mapper creates a connection to the database using JDBC and fetches the part of the data assigned to it.
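To make "the part of the data assigned" concrete, the Python sketch below divides the value range of an integer split column evenly among the mappers, which is the general idea behind Sqoop's --split-by and --num-mappers options. It is a simplified illustration with a hypothetical function name, not Sqoop's actual splitter, whose exact boundaries may differ.

```python
def split_ranges(min_value, max_value, num_mappers=4):
    """Divide the inclusive range [min_value, max_value] into one slice per mapper."""
    if max_value < min_value:
        raise ValueError("max_value must not be smaller than min_value")
    total = max_value - min_value + 1
    # Never create more slices than there are values to fetch.
    mappers = min(num_mappers, total)
    base, extra = divmod(total, mappers)
    ranges = []
    start = min_value
    for i in range(mappers):
        # The first 'extra' mappers take one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each tuple would become a "WHERE id >= start AND id <= end" query in one mapper.
for low, high in split_ranges(1, 100, 4):
    print(low, high)
```

Because every mapper opens its own JDBC connection, the number of mappers is also the number of concurrent connections the database has to serve.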

SQOOP Architecture diagram:


Sqoop only imports and exports the data.

From Human Neurons to Artificial Neurons (continuation)

After the simple neuron and the firing rules, the remaining topics are below:

3. Pattern Recognition:

After the simple neuron and firing rules, an important application of neural networks is pattern recognition. Pattern recognition can be implemented by using a feed-forward neural network that has been trained accordingly. During training, the network is trained to associate outputs with input patterns. When the network is used, it identifies the input pattern and tries to output the associated output pattern. The power of neural networks comes to life when a pattern that has no output associated with it is given as an input. In this case, the network gives the output that corresponds to the taught input pattern that is least different from the given pattern.

For example, the network is trained to recognize the patterns T and H. The associated output patterns are all black and all white respectively.

Input                          Output                   Input                  Output

If white squares are represented by 1 and black squares by 0, the truth tables for the 3 neurons after generalization are as follows.

Top Neuron:

Middle Neuron :

Bottom Neuron:

From the above tables, the following associations can be extracted:

Input                               Output

In this case, it is obvious that the output should be all blacks since the input pattern is almost the same as the “T” pattern.

Input                     Output

In this case, it is obvious that the output should be all whites since the input pattern is almost the same as the ‘H’ Pattern.

Input                                        Output

In this case, the top row is 2 errors away from a T and 3 away from an H, so the top output is black. The middle row is 1 error away from both T and H, so the output is random.

The bottom row is 1 error away from T and 2 away from H. Therefore the output is black. The total output of the network is still in favor of the T shape.

4. A more complicated Neuron:

A more sophisticated neuron is the McCulloch and Pitts model. It differs from the previous model in that the inputs are ‘weighted’: the effect that each input has on decision making depends on the weight of that particular input.
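A weighted neuron of this kind can be sketched in a few lines of Python. The sketch assumes the usual threshold formulation, in which the neuron fires when the weighted sum of its inputs exceeds a threshold; the function name, weights and threshold values are chosen for illustration only.

```python
def mcp_neuron(inputs, weights, threshold):
    """Return 1 (fire) if the weighted sum of the inputs exceeds the threshold, else 0."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0


# With equal weights and a threshold of 1.5 the neuron behaves like a logical AND.
for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pattern, mcp_neuron(pattern, weights=(1, 1), threshold=1.5))

# A negative weight lets one input inhibit the neuron.
print(mcp_neuron((1, 1), weights=(1, -1), threshold=0.5))
```

Changing the weights or the threshold changes which input patterns make the neuron fire, which is what training adjusts.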

From Human Neurons to Artificial Neurons

Moving from human neurons to artificial neurons is a little tough to understand. We build these neural networks by first trying to deduce the essential features of neurons and their interconnections. We then typically program a computer to simulate these features. However, because our knowledge of neurons is incomplete and our computing power is limited, our models are necessarily gross idealizations of real networks of neurons.

The Neuron model:

An Engineering Approach

1. A Simple Neuron:

An artificial neuron is a device with many inputs and only one output. The neuron has two modes of operation: the training mode and the using mode. In the training mode, the neuron can be trained to fire (or not) for particular input patterns. In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output. If the input pattern does not belong to the taught list of input patterns, the firing rule is used to determine whether to fire or not.

2. Firing rules:

The firing rule is an important concept in neural networks and accounts for their high adaptability. A firing rule determines whether a neuron should fire for any input pattern. It relates to all the input patterns, not only the ones on which the node was trained.

A simple firing rule can be implemented by using the Hamming distance technique.

The rule goes as follows: take a collection of training patterns for a node, some of which cause it to fire (the 1-taught set) and others which prevent it from doing so (the 0-taught set). Then the patterns not in the collection cause the node to fire if, on comparison, they have more input elements in common with the nearest pattern in the 1-taught set than with the nearest pattern in the 0-taught set. If there is a tie, then the pattern remains in the undefined state.

Example: a 3-input neuron is trained to output 1 when the input (X1, X2 and X3) is 111 or 101 and to output 0 when the input is 000 or 001. Before applying the firing rule, the truth table is:

X1:    0    0    0    0    1    1    1    1
X2:    0    0    1    1    0    0    1    1
X3:    0    1    0    1    0    1    0    1
OUT:   0    0   0/1  0/1  0/1   1   0/1   1

As an example of the way the firing rule is applied, take the pattern 010. It differs from 000 in 1 element, from 001 in 2 elements, from 101 in 3 elements and from 111 in 2 elements. Therefore, the nearest pattern is 000, which belongs to the 0-taught set, and the firing rule requires that the neuron should not fire when the input is 010. On the other hand, 011 is equally distant from two taught patterns that have different outputs, and consequently its output stays undefined (0/1). Applying the firing rule to every column gives the following truth table:

X1:    0    0    0    0    1    1    1    1
X2:    0    0    1    1    0    0    1    1
X3:    0    1    0    1    0    1    0    1
OUT:   0    0    0   0/1  0/1   1    1    1

The difference between the two truth tables is called the generalization of the neuron. The firing rule gives the neuron a sense of similarity and enables it to respond sensibly to patterns not seen during training.
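The Hamming distance firing rule is short enough to write out in Python. The sketch below uses the taught sets from the example (111 and 101 fire, 000 and 001 do not) and prints the output for all eight input patterns; the function and variable names are illustrative.

```python
from itertools import product


def hamming(a, b):
    """Count the positions in which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def firing_rule(pattern, one_taught, zero_taught):
    """Return '1', '0' or '0/1' (undefined) for the given input pattern."""
    nearest_one = min(hamming(pattern, taught) for taught in one_taught)
    nearest_zero = min(hamming(pattern, taught) for taught in zero_taught)
    if nearest_one < nearest_zero:
        return "1"
    if nearest_zero < nearest_one:
        return "0"
    return "0/1"  # tie: equally close to both taught sets


ONE_TAUGHT = ["111", "101"]
ZERO_TAUGHT = ["000", "001"]

truth_table = {
    "".join(bits): firing_rule("".join(bits), ONE_TAUGHT, ZERO_TAUGHT)
    for bits in product("01", repeat=3)
}
for pattern, output in truth_table.items():
    print(pattern, output)
```

Taught patterns are at distance 0 from their own set, so they keep their trained output, and only the untrained patterns are generalized.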

Hive: SortBy Vs OrderBy Vs DistributeBy Vs ClusterBy


Sort By:

Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: if the column is of numeric type, the sort order is numeric; if the column is of string type, the sort order is lexicographical. SORT BY orders the data at each of the N reducers, but the reducers can have overlapping ranges of data.

Output: N or more sorted files with overlapping ranges.

Example Query for SortBy

SELECT key, value FROM source SORT BY key ASC, value DESC

Order By:

This is similar to ORDER BY in SQL. In Hive, ORDER BY guarantees total ordering of the data, but for that all the data has to pass through a single reducer, which is normally unacceptable for large outputs. Therefore, in strict mode, Hive makes it compulsory to use LIMIT with ORDER BY so that the reducer doesn't get overloaded.

Ordering: Total ordering of the data.

Output: Single output, i.e. fully ordered.

Example Query for OrderBy

SELECT key, value FROM source ORDER BY key ASC, value DESC

Distribute By:

Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY columns go to the same reducer.

DISTRIBUTE BY ensures that each of the N reducers gets a non-overlapping set of values of the column, but it doesn't sort the output of each reducer.


Example: DISTRIBUTE BY x applied to 5 rows with 2 reducers (the rows received by Reducer 1 and Reducer 2 are not shown).


Cluster By:

CLUSTER BY is a combination of DISTRIBUTE BY and SORT BY on the same columns. CLUSTER BY x ensures that each of the N reducers gets a non-overlapping set of values of x and then sorts the rows by x at each reducer.

Ordering: Rows are sorted within each reducer; there is no total order across reducers.

Output: N or more sorted files with non-overlapping sets of values.


In the same example, CLUSTER BY x makes the two reducers additionally sort their rows on x (the rows received by Reducer 1 and Reducer 2 are not shown).
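The behaviour of the four clauses can be imitated in plain Python with lists standing in for reducers. This is a toy model, not Hive: the row values are made up, and "x modulo the number of reducers" is used here as a stand-in for Hive's hash-based assignment of rows to reducers.

```python
def distribute_by(rows, num_reducers):
    """Send each row to a reducer chosen from its value; equal values share a reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for x in rows:
        # Assumption: modulo stands in for Hive's hash partitioning.
        reducers[x % num_reducers].append(x)
    return reducers


def sort_by(reducers):
    """Sort the rows inside each reducer independently."""
    return [sorted(rows) for rows in reducers]


def cluster_by(rows, num_reducers):
    """DISTRIBUTE BY followed by SORT BY on the same column."""
    return sort_by(distribute_by(rows, num_reducers))


def order_by(rows):
    """Total ordering: everything goes through a single reducer."""
    return sorted(rows)


rows = [3, 1, 4, 2, 1]           # hypothetical values of column x
print(distribute_by(rows, 2))    # grouped per reducer, unsorted
print(cluster_by(rows, 2))       # grouped per reducer, sorted inside each
print(order_by(rows))            # one fully ordered output
```

In this model both copies of the value 1 always land in the same reducer, while only order_by produces a single globally sorted list.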


MapR Architecture

MapR Architecture:

Before Hadoop was introduced in 2007, no single data platform could provide a scalable architecture to handle fast-growing data with a unified security model.

There are four important pillars of a data platform

1.Distributed Metadata

2.Variety of Protocols and API support

3.Variety of Data persistence like objects, files, tables and event queues.


Distributed Metadata:

A centralized metadata service leads to a number of restrictions:

1.Creates a single point of failure

2.Creates a hotspot that limits the scalability of the cluster

3.Limits sharing of data artifacts

4. Limits the number of data artifacts that can be stored in the cluster.

MapR has built a distributed metadata service from the ground up that removes all these restrictions.

CLDB (Container Location Database) serves as MapR’s level-I metadata service and maintains metadata about the volumes, containers and nodes in the entire cluster.

The metadata about data artifacts such as objects, files, tables, topics and directories is the level-II metadata and is stored in the name container.

Variety of APIs and Protocol Support:

MapR Data Platform provides data interoperability among different APIs, so different applications can use different APIs:


2.S3 API





Variety of Data persistence:

MapR data container is the unit of storage allocation and management. Each container stores a variety of data elements such as objects, files, tables, and directories.

It supports two types of data elements:

1.File chunks

2.Key – Value stores

Files are built from file chunks spread across containers. Directories are built over key-value stores. Tables are built on top of files and key-value stores with an index.

MapR Data Platform was architected in such a way as to solve most data problems for the enterprise and eliminate data tools.

The heart of the MapR data platform is the Data Container.

The Data Container provides:

1.Different data persistence models, such as files, tables, objects etc.

2.Distributed scale-out storage

3.Data loss prevention

4.Failure resilience and disaster recovery

Apache Pig In Hadoop


Apache Pig, a project of the Apache Software Foundation, is one of the components of the Hadoop ecosystem and is built on top of HDFS.

Apache Pig uses Hadoop and lets users focus on analyzing large data sets, spending less time writing mapper and reducer programs. The Pig programming language is designed to handle any kind of data.

Pig is made of two components: Pig Latin, and the runtime environment in which Pig Latin programs are executed.

It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

Pig Latin:

Pig Latin is the high-level programming language provided by Pig. It can be used in frameworks including Hadoop, and Java is not required. It contains all the usual data processing features such as group by, joins and order by.

Pig Execution modes:

1.Local mode

Input: LFS Path

Output: LFS Path

2. HDFS mode (MapReduce mode)

Input: HDFS Path

Output: HDFS Path

Data Types in Pig

Simple Types:

int, long, float, double, boolean, chararray, bytearray etc.

Complex Types:

bag, tuple, map etc.

When to use MapReduce and Pig in real-time projects:

In the use cases below, MapReduce is recommended over Pig:

1. Unstructured data processing.

2. When we are aiming at high performance.

3. When the processing involves hierarchical jobs.

Running Pig Programs

1. Grunt shell:

This is an interactive shell and is the default mode of Pig execution. Whether a command succeeds or fails, we come to know the result right there in the shell.

2. Script mode:

Instead of typing each command in the Grunt shell, we can write a bunch of Pig commands in a single file and execute just that script.


If we cannot achieve the desired functionality with the predefined transformations of Pig, we can go ahead with Pig UDFs.