HBase Table (Single & Multiple) Data Migration from One Cluster to Another Cluster



HBase single table migration from one cluster to another cluster:

Here are the simple steps to migrate a single HBase table from an existing cluster to a new cluster:

Step 1: First, export the HBase table data to an HDFS (Hadoop Distributed File System) path on the source cluster.

Step 2: Copy the exported data from the source cluster to the destination cluster using the distcp command (distcp is the standard tool for copying data from one cluster to another).

Step 3: Create the HBase table on the destination (target) cluster.

Step 4: Import the copied data from the HDFS path into the HBase table on the destination cluster.

Source Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Driver export <hbase_table_name> <source_hdfs_path>

2. hadoop distcp hdfs://<source_cluster_ipaddress>:8020/<source_hdfs_path> hdfs://<destination_cluster_ipaddress>:8020/<destination_hdfs_path>

Destination Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Import <hbase_table_name> <hbase_table_hdfs_path>
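For illustration, here is how the full sequence might look with a hypothetical table named employee and a hypothetical HDFS staging path /backup/employee (both are placeholders, not values from the steps above):

On the source cluster:

hbase org.apache.hadoop.hbase.mapreduce.Export employee /backup/employee
hadoop distcp hdfs://source-nn:8020/backup/employee hdfs://dest-nn:8020/backup/employee

On the destination cluster (the table and its column family must already exist):

echo "create 'employee', 'cf'" | hbase shell
hbase org.apache.hadoop.hbase.mapreduce.Import employee /backup/employee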

HBase multiple table migration from one cluster to another cluster:

Now that we know how to migrate a single HBase table, multiple-table migration from one cluster to another can be done in a simple manner with the steps below.

With a couple of script files in place, migrating multiple HBase tables comes down to the following steps:


Step 1: Place hbase-export.sh and hbase-table.txt on the source cluster.

Step 2: Place hbase-import.sh and hbase-table.txt on the destination cluster.

Step 3: List all the tables to be migrated in the hbase-table.txt file.

Step 4: Create all the HBase tables on the destination cluster.

Step 5: Execute hbase-export.sh on the source cluster (a minimal sketch of the scripts is shown after these steps).

Step 6: Execute hbase-import.sh on the destination cluster.
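The article does not include the contents of these scripts, so the following is only a minimal sketch of what they could look like, assuming hbase-table.txt holds one table name per line and /backup/hbase is a hypothetical HDFS staging path.

hbase-export.sh (run on the source cluster):

#!/bin/bash
# Export every table listed in hbase-table.txt to an HDFS staging directory
EXPORT_BASE=/backup/hbase
while read -r table; do
  [ -z "$table" ] && continue    # skip empty lines
  hbase org.apache.hadoop.hbase.mapreduce.Export "$table" "$EXPORT_BASE/$table"
done < hbase-table.txt

hbase-import.sh (run on the destination cluster, after the staging directory has been copied over with distcp and the tables have been created):

#!/bin/bash
# Import every table listed in hbase-table.txt from the copied staging directory
EXPORT_BASE=/backup/hbase
while read -r table; do
  [ -z "$table" ] && continue    # skip empty lines
  hbase org.apache.hadoop.hbase.mapreduce.Import "$table" "$EXPORT_BASE/$table"
done < hbase-table.txt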



Summary: I tried HBase data migration from one cluster to another in a Cloudera Distribution of Hadoop (CDH) environment. Both single-table and multiple-table data migration are very simple for Hadoop administrators as well as Hadoop developers. The procedure is the same on the Hortonworks distribution as well.

Replication Factor in Hadoop




How the replication factor comes into the picture:

The backup mechanism in a traditional distributed system:

In a traditional distributed system, the backup mechanism did not provide high availability; such a system followed a shared architecture.

A file request first goes to the master node, which divides the file into blocks and stores them on the slave nodes. This is a continuous process, and if node 1 (slave 1) fails, the data must be served from another node (slave 2); this is where replication comes in.

Replication Factor:

The replication factor is the number of times data is duplicated across different slave machines to achieve high availability.

Replication acts as a backup, failover, and fault-tolerance mechanism.

In Hadoop, the default replication factor is 3; it does not need to be configured explicitly.

Hadoop 1.x: replication factor is 3.
Hadoop 2.x: replication factor is also 3.

In Hadoop, the minimum replication factor is 1, which is possible for a single-node Hadoop cluster.

In Hadoop, the maximum replication factor is 512.

If the replication factor is 3, then a minimum of 3 slave nodes is required.

If the replication factor is 10, then 10 slave nodes are required.



Here is a simple rule of thumb for the replication factor:

'N' Replication Factor = 'N' Slave Nodes

Note: If the configured replication factor is 3 but only 2 slave machines are available, then the actual replication factor is 2.

How to configure Replication in Hadoop?

It is configured in the hdfs-site.xml file.

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>5</value>
</property>
</configuration>
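The replication factor of files that already exist in HDFS can also be checked or changed from the command line; for example (the path /data/sample.log is just a placeholder):

hdfs dfs -setrep -w 3 /data/sample.log      (changes the replication of an existing file and waits until it is applied)
hdfs dfs -stat %r /data/sample.log          (prints the current replication factor of the file)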

Design Rules Of Replication In Hadoop:

1. In Hadoop, replication applies only to HDFS data, not to the metadata.

2. By design, keep only one replica of a block per slave node.

3. Replication happens only on the Hadoop slave nodes, never on the Hadoop master node (the master node only manages metadata; it does not hold the data itself).

Only storage is duplicated in Hadoop, not processing, because processing is always unique.

Summary: In Hadoop, the replication factor plays a major role as the data backup mechanism. The default replication factor is always 3, except in a single-node cluster environment.


Blocksize in Hadoop



How data is stored on HDFS:

BLOCK:

A block is the individual storage unit on the Hadoop Distributed File System.

In Hadoop 1.x the default block size is 64 MB.

In Hadoop 2.x the default block size is 128 MB.

When a file request comes to the Hadoop cluster, these are the steps:

Step 1: Only the Hadoop master node receives the file request.

Step 2: Based on the block size configured at that time, the data is divided into a number of blocks.

How to configure “Blocksize” in Hadoop?

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>
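The value is given in bytes, so 134217728 corresponds to 128 MB. The block size can also be overridden per file at write time; a hypothetical example (the file name and target path are placeholders, and dfs.blocksize is the Hadoop 2.x name of the same property):

hdfs dfs -D dfs.blocksize=67108864 -put A.log /data/      (writes A.log with a 64 MB block size)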

How data is stored in HDFS:

Assume that we have A.log, B.log, and C.log files:

Scenario 1:

A.log -> 200 MB -> 200/64 -> 64 MB + 64 MB + 64 MB + 8 MB (the last block holds only the remaining 8 MB)

Scenario 2:

B.log -> 192 MB -> 192/64 -> 64 MB + 64 MB + 64 MB
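Once a file has been written, the way it was split into blocks can be verified with fsck; for example, assuming A.log was uploaded to a hypothetical /data directory:

hdfs fsck /data/A.log -files -blocks -locations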

Design Rules of Blocksize:

1. Irrespective of the file size: every file gets its own dedicated set of blocks in Hadoop.

2. Except for the last block: all other blocks of a file hold an equal volume of data.

The Hadoop master node only looks at the block size when it divides the data into blocks, not when the data is read, because at read time only the metadata matters.


Apache SQOOP in Hadoop



Apache Sqoop:

Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases. It is mostly used to import/export data from an RDBMS to HDFS and vice versa. Sqoop works with relational databases such as Teradata, Oracle, MySQL, etc.

In short, Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores.

Where is Sqoop used?

Developers feel that transferring data between relational database systems and HDFS is not interesting; the interesting work starts after the data is loaded into HDFS. Yet they often end up writing custom scripts to move data in and out of Hadoop.

If MapReduce programs were written to do the same job, the database server would experience very high load from a large number of concurrent connections while the MapReduce programs were running, leading to performance issues.

Apache Sqoop makes this possible with a single command line. Internally, Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

What does Sqoop do?

1. Imports sequential data sets from the mainframe – addressing the growing need to move data from mainframes to HDFS.

2. Data import – moves certain data from external stores into Hadoop to optimize the cost-effectiveness of combined data storage and processing.

3. Fast Data copies –  from external systems into Hadoop

4. Parallel data transfer – faster performance and optimal system utilization

5. Load balancing – offloads excessive storage and processing loads to other systems.

Apache Sqoop latest version:

Latest stable release 1.4.7


Sqoop Architecture:

The Sqoop command submitted by the end user is parsed by Sqoop, which launches a Hadoop MapReduce job to import or export the data. Sqoop just imports and exports the data; it does not do any aggregations. The map-only job launches multiple mappers, depending on the number defined by the user on the command line. Each mapper creates a connection with the database using JDBC and fetches the part of the data assigned to it.
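As a hypothetical illustration (the connection string, credentials, table name, and paths below are placeholders), an import from a MySQL database could look like this, with --num-mappers controlling how many parallel mappers fetch the data:

sqoop import \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

The reverse direction uses sqoop export, with --export-dir pointing at the HDFS directory whose contents should be written back to the database table.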

SQOOP Architecture diagram:

 

Remember: Sqoop only imports and exports the data.


Hive: SortBy Vs OrderBy Vs DistributeBy Vs ClusterBy



SortBy:

Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: if the column is numeric, the sort order is numeric; if the column is a string, the sort order is lexicographical. SORT BY orders the data at each of the 'N' reducers, but each reducer can have overlapping ranges of data.

Output: N or more sorted files with overlapping ranges.

Example Query for SortBy

SELECT key, value FROM source SORT BY key ASC, value DESC

Order By:

This is similar to ORDER BY in SQL. In Hive, ORDER BY guarantees total ordering of the data, but to do so everything has to be passed to a single reducer, which is normally intolerable for large datasets. Therefore, in strict mode, Hive makes it compulsory to use LIMIT with ORDER BY so that the single reducer does not get exhausted.

Ordering: total ordering of the data.

Output: a single output file, i.e. fully ordered.

Example Query for OrderBy

SELECT key, value FROM source ORDER BY key ASC, value DESC

Distribute By:

Apache Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY column values go to the same reducer.



DISTRIBUTE BY ensures that each of the N reducers gets non-overlapping values of the column, but it does not sort the output of each reducer.
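Following the pattern of the earlier examples, a query on the same source table would be:

Example Query for DistributeBy

SELECT key, value FROM source DISTRIBUTE BY key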

Example:

Running DISTRIBUTE BY x on the following 5 rows with 2 reducers:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Cluster By:

CLUSTER BY is a combination of DISTRIBUTE BY and SORT BY. CLUSTER BY x ensures that each of the N reducers gets non-overlapping ranges of x, and then sorts by those ranges at each reducer.

Ordering: Global ordering between multiple reducers.

Output: N or more sorted files with non-overlapping ranges.
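In the same style as the earlier examples, a CLUSTER BY query on the source table would be:

Example Query for ClusterBy

SELECT key, value FROM source CLUSTER BY key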

Example:

Referring to the same example as above, if we use CLUSTER BY x, the two reducers will further sort the rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4



Apache Pig In Hadoop



PIG:

Pig, developed under the Apache Software Foundation, is a component of the Hadoop ecosystem built on top of HDFS.

Apache Pig lets you focus on analyzing large data sets on Hadoop without spending time writing mapper and reducer programs. The Apache Pig programming language is designed to handle any kind of data.

Pig is made of two components: Pig Latin, and the runtime environment in which Pig Latin programs are executed.

It is used for analyzing large data sets and provides a high-level language for expressing data analysis programs.

Pig Latin:

Pig Latin is the high-level programming language provided by Pig. It can be used on Hadoop without requiring Java knowledge, and it contains all the common data processing features such as GROUP BY, joins, and ORDER BY.


Pig Execution modes:

1. Local mode

Input: LFS Path

Output: LFS Path

2. HDFS mode (also called MapReduce mode)

Input: HDFS Path

Output: HDFS Path
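The mode is selected when launching Pig; for example:

pig -x local          (local mode: input and output are local file system paths)
pig -x mapreduce      (HDFS/MapReduce mode, the default: input and output are HDFS paths)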

Data Types in Pig

Simple Types:

int, long, float, double, boolean, chararray, bytearray, etc.

Complex Types:

tuple, bag, map, etc. (a field is an individual piece of data inside a tuple)

When to use MapReduce and Pig in Real-time projects:

In the below use-case scenarios, MapReduce is recommended over Pig:

1. Unstructured data processing.

2. When we are aiming at very high performance.

3. When hierarchical data and complex job processing are involved.

Running Pig Programs

1. Grunt shell:

An interactive shell, which is the default mode of Pig execution. Whether a statement succeeds or fails, we see the result right there.

2. Script mode:

Instead of typing each command in the Grunt shell, we can write a set of Pig commands in a single file and execute that script on its own.

3. Embedded mode:

If the desired functionality cannot be achieved with Pig's predefined transformations, we can go ahead with Pig UDFs (user-defined functions).


Apache Flume in Hadoop



Flume:

Apache Flume is a data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files, from various sources to a centralized data store. It is a distributed system that gets logs from their sources and aggregates them to where you want to process them. Flume is a highly reliable, distributed, and configurable tool.

Advantages of Flume:

  1. Using Apache Flume we can store the data in any centralized store, such as HDFS or HBase.
  2. Flume provides the features of contextual routing
  3. Flume acts as a mediator between data producers and the centralized stores and provides a basic flow of data between them.

Features of Flume:

  • Using Flume we can get data from multiple servers into Hadoop immediately.
  • Apache Flume supports a large set of source and destination types.
  • Flume supports multi-hop flows, contextual routing, etc.
  • Flume can be scaled horizontally.

Core Concepts in Flume:

Event

An Event is the Fundamental unit of data transported by flume from its point of origination to its final destination.

An event can also carry headers, specified as an unordered collection of string key-value pairs. Headers are used for contextual routing.

Client

A client is an entity that generates events and sends them to one or more agents.

Agent:

An Agent is a container for hosting sources, channels, sinks and other components that enable the transportation of events from one place to another.
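The article does not include a configuration, so the following is only a minimal hypothetical agent definition (the agent name a1, the netcat source on port 44444, the memory channel, and the logger sink are all placeholder choices) to illustrate how sources, channels, and sinks are wired together:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Assuming this is saved as example.conf, the agent can be started with:

flume-ng agent --conf conf --conf-file example.conf --name a1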



 

Flume Streaming:

In general, a large amount of the data to be analyzed is produced by various data sources such as application servers, social networking websites, and cloud servers. This data comes in the form of log files and events.

Log File:

A log file is a file that lists actions or events that occur in an operating system or application. Analyzing log files helps to:

  • Track application performance and locate various software and hardware failures.
  • Understand user behavior and derive better business decisions.



Basic Terminology in Hadoop




Big Data Solutions:

1. NoSQL databases (non-relational databases) – only for structured and semi-structured data

2. Hadoop – the implementation – for structured, semi-structured, and unstructured data

3. The Hadoop ecosystem and its components – for everything else

Hadoop:

Hadoop is a parallel system for large data storage and processing. It is a solution for Big Data.

For storage, it uses HDFS (Hadoop Distributed File System).

For processing, it simply uses MapReduce.

In Hadoop, some keywords are very important to learn:

Hadoop Basic Terminology:

1. Cluster

2. Clustered Node

3. Hadoop Cluster Node

4. Hadoop Cluster

5. Hadoop Cluster Size

1. Cluster:

A cluster is a group of nodes that all belong to one common network.

2. Clustered Node:

A clustered node is an individual machine that is part of such a group.

3. Hadoop Cluster Node:

A Hadoop cluster node is a node that provides the basic storage and processing for a Hadoop cluster.

For storage purpose, we are using the Hadoop Distributed File System.

For processing purpose, we are using MapReduce

4. Hadoop Cluster:

A Hadoop cluster is a collection of Hadoop cluster nodes on a common network.

5. Hadoop Cluster Size:

The Hadoop cluster size is the total number of nodes in a Hadoop cluster.
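On a running cluster, the number of live data nodes (and hence the effective cluster size for storage) can be checked from the command line; the exact wording of the report varies a little between Hadoop versions:

hdfs dfsadmin -report | grep "Live datanodes"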


Hadoop Ecosystem:

1. Apache Pig – Processing – Pig Latin scripting

2. Hive – Processing – HiveQL (a query language like SQL)

3. Sqoop – Integration tool – import and export data

4. ZooKeeper – Coordination – distributed coordinator

5. Apache Flume – Streaming – streaming of log data

6. Oozie – Scheduling – open-source job scheduler

7. HBase – Random access – Hadoop + dataBASE

8. NoSQL – Not Only SQL – MongoDB, Cassandra

9. Apache Kafka – Messaging – distributed messaging

10. YARN – Resource manager – Yet Another Resource Negotiator

Note: Apache Spark is not part of Hadoop, but it is commonly included nowadays. It is used for data processing and can be up to 100 times faster than Hadoop MapReduce.

Compatible Operating System for Hadoop Installation:

1. Linux

2. Mac OS

3. Sun Solaris

4. Windows

Hadoop Versions:

Hadoop 1.x

Hadoop 2.x

Hadoop 3.x

Different Distributions of Hadoop

1. Cloudera Distribution for Hadoop (CDH)

2. Hortonworks

3. MapR



Latest Hadoop Admin Interview Questions with Answers



Latest Hadoop admin interview questions and answers:

1. What is an edge node? Why choose two edge nodes in a cluster?

Basically, edge nodes are for end-user connectivity; they act as an interface between the cluster and the clients.

A single edge node is a single point of failure; if that edge node goes down, clients can connect through the other edge node, which is why we use two edge nodes.

2. If you have four master nodes, what services are installed on them?

On master node 1: NameNode, Secondary NameNode, Hive Server, Resource Manager, and one ZooKeeper server

On master node 2: HBase Master and Oozie server

On master node 3: Hue, Spark, and the third ZooKeeper server

On master node 4: high-availability (standby) services

3. What are the default block sizes in Hadoop and Unix?

The default block size of HDFS is 128 MB (Hadoop 2.x).

The default block size in Unix is 4 KB.



4. What security measures are implemented in the Hadoop cluster?

LDAP for first-level authentication.

Kerberos for second-level authentication.

Sentry for role-based authorization to data and metadata stored on the Hadoop cluster.

Knox, which acts as a gateway and controls who can access the cluster.

Ranger, which provides security across the Hadoop ecosystem for folder access and data authorization.

5. How do you secure data in transit over the network?

Data transmitted over the network is encrypted, using SSL certificates, HTTPS, and other secure protocols.

6. What types of accounts are used in the Hadoop cluster?

Service account: created in Active Directory; used within the Hadoop cluster to run jobs and applications.

Technical account: used by outside clients to access applications, for example a Java client accessing Hive.

Business user account: belongs to business users who want to access the Hadoop cluster.

Admin account: a highly privileged account for managing credentials for users from Active Directory.

Local account: a Unix-based local account corresponding to Active Directory principals.


MapR



What is MapR?

MapR is one of the Big Data distributions. It is a complete enterprise distribution of Apache Hadoop, designed to improve Hadoop's reliability, performance, and ease of use.

Why MapR?

1. High Availability:

MapR provides high-availability features such as self-healing, meaning there is no single NameNode architecture.

It also offers JobTracker high availability and NFS access. MapR achieves this by distributing its file system metadata.

2. Disaster Recovery:

MapR provides a mirroring facility that allows users to define policies and mirror data automatically, within a multi-node or single-node cluster, and between on-premise and cloud infrastructure.

3. Record Performance:

MapR holds a world performance record: a sort completed in 54 seconds at a cost of only $9, compared with an earlier cost of $5M, and it handles large clusters of around 2,200 nodes.

4. Consistent Snapshots:

MapR is the only big data distribution that provides consistent, point-in-time recovery, thanks to its unique read-write storage architecture.

5. Complete Data Protection:

MapR has its own security system for data protection at the cluster level.



6. Compression:

MapR provides automatic, behind-the-scenes compression, applying it automatically to files in the cluster.

7. Unbiased Open Source:

MapR is a completely unbiased open-source distribution.

8. Real Multi-tenancy, Including YARN

9. Enterprise-grade NoSQL

10. Read-and-Write File System:

MapR has a fully read-and-write file system.

MapR Ecosystem Packs (MEP):

The “MapR Ecosystem” is the set of open-source projects included in the MapR Platform, and a “pack” is a bundled set of MapR Ecosystem projects with specific versions.

MapR Ecosystem Packs are mostly released every quarter, with yearly releases as well.

A single version of MapR may support multiple MEPs, but only one at a time.

Alongside familiar Hadoop ecosystem and open-source components such as Spark and Hive, MapR Ecosystem Packs include tools such as the following:

Collectd
Elasticsearch
Grafana
Fluentd
Kibana
OpenTSDB