HBase error: KeeperErrorCode = ConnectionLoss for /hbase in Cluster

HBase (a NoSQL database within the Hadoop ecosystem) is installed on top of the Hadoop cluster to provide real-time random reads and writes, as opposed to the sequential file access of the Hadoop Distributed File System (HDFS).

HBase is used for better storage, but we can't use HBase by itself to process data with business logic; for that we use other services such as Hive, MapReduce, Pig, Sqoop, etc.

After installing the Spark server, the following error appeared while working with an HBase snapshot from the Hadoop cluster CLI.

Below is the error on the HBase node:

at org.jruby.Ruby.runScript(Ruby.java:697)
at org.jruby.Ruby.runNormally(Ruby.java:597)
at org.jruby.Ruby.runFromMain(Ruby.java:446)
at org.jruby.Main.internalRun(Main.java:258)
ERROR [main] client.ConnectionManager$HConnectionImplementation: Can't get connection to ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase
Error: KeeperErrorCode = ConnectionLoss for /hbase
Here is some help for this command:
List all tables in hbase. Optional regular expression parameter could be used to filter the output. Examples:

How do we resolve this error on the HBase Master node?

Resolutions for KeeperErrorCode = ConnectionLoss for /hbase in the cluster:

The above error code means the HBase Master is not running on the Hadoop cluster:

Resolution 1:

Step 1: First, check whether the HBase Master is running by using the "jps" command.
Step 2: Use the "stop-all.sh" command to stop all running services on the Hadoop cluster.
Step 3: Use the "start-all.sh" command to start all the services again.
Step 4: Use the "jps" command to check the services. If it shows the HBase Master running, we are done; otherwise, continue with the steps below:
Step 5: Switch to the root user using "sudo su".
Step 6: Go to the HBase bin directory, e.g. "cd /usr/lib/hbase-1.2.6-hadoop/bin", and run "./start-hbase.sh".
Step 7: Open the HBase shell using the "hbase shell" command.
Step 8: Use the "list" command.
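
A condensed command-line sketch of the steps above (the HBase path from Step 6 is only an example and may differ in your installation):

jps                                    # Step 1: check whether HMaster is listed
stop-all.sh                            # Step 2: stop all Hadoop services
start-all.sh                           # Step 3: start them again
jps                                    # Step 4: check the services once more
sudo su                                # Step 5: switch to root if HMaster is still missing
cd /usr/lib/hbase-1.2.6-hadoop/bin     # Step 6: adjust to your HBase install path
./start-hbase.sh
hbase shell                            # Step 7: open the HBase shell
list                                   # Step 8: run inside the shell to list tables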

Resolution 2:

It may also be a ZooKeeper issue: when the HBase Master tries to get the table list from ZooKeeper, the call fails.

Step 1: First, check whether the ZooKeeper service is running using "ps -ef | grep zookeeper".
Step 2: Use the "sudo service zookeeper stop" command to stop the ZooKeeper service on the Hadoop cluster, and stop the HBase service as well.
Step 3: In the HBase configuration file (hbase-site.xml), increase the number of allowed connections to ZooKeeper using the "hbase.zookeeper.property.maxClientCnxns" property.
Step 4: Start the ZooKeeper service first, then start the HBase service.
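
A rough sketch of the same sequence on the command line; the service names, the hbase-site.xml location, and the connection limit value are assumptions that vary by distribution:

ps -ef | grep zookeeper                 # Step 1: is ZooKeeper running?
sudo service zookeeper stop             # Step 2: stop ZooKeeper and HBase
sudo service hbase-master stop          # service name is an assumption; use your distribution's
# Step 3: raise the connection limit in hbase-site.xml, for example:
#   <property>
#     <name>hbase.zookeeper.property.maxClientCnxns</name>
#     <value>300</value>                <!-- 300 is an example value -->
#   </property>
sudo vi /etc/hbase/conf/hbase-site.xml  # path is an assumption; adjust to your install
sudo service zookeeper start            # Step 4: start ZooKeeper first, then HBase
sudo service hbase-master start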

Connection issues in Cassandra and HBase

What is Apache Cassandra?

Cassandra is an open-source, distributed, NoSQL (Not Only SQL) database management system designed to handle large amounts of data across many servers.

How to install Cassandra? Cassandra is simple to install on Ubuntu/Linux with step-by-step instructions, and there are good reasons to use Apache Cassandra for data handling (a rough install sketch follows the link):

Install Cassandra on Ubuntu Linux
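
For reference, a minimal install sketch on Ubuntu using the Apache apt repository (the repository series "40x" is an assumption; pick the Cassandra release series you need, and see the linked guide for the full walkthrough):

echo "deb https://debian.cassandra.apache.org 40x main" | sudo tee /etc/apt/sources.list.d/cassandra.sources.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra          # installs and registers the Cassandra service
sudo service cassandra start
nodetool status                         # verify the node reports UN (Up/Normal)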

What is Apache HBase?

HBase (Hadoop + DataBase) runs on top of the Hadoop ecosystem. It is an open-source, distributed, NoSQL database. It provides random access to data stored in HDFS files, indexed by key/value pairs.

How to install Apache HBase on Linux/Ubuntu system?

It is simple to install HBase on the Linux operating system by following the step-by-step guide below (a rough install sketch follows the link):
Installation of HBase on Ubuntu
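
And a minimal sketch of a standalone HBase install on Ubuntu (the 1.2.6 version number matches the path used earlier in this article; the download URL and the JAVA_HOME path are assumptions):

wget https://archive.apache.org/dist/hbase/1.2.6/hbase-1.2.6-bin.tar.gz
tar -xzf hbase-1.2.6-bin.tar.gz
cd hbase-1.2.6
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # HBase requires JAVA_HOME to be set
bin/start-hbase.sh                                   # starts HBase in standalone mode
bin/hbase shell                                      # then try: list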

Cassandra Connection error:

Error: Exception encountered during startup

java.lang.IllegalArgumentException: (Username) is already in reverseMap

at org.apache.cassandra.utils.ConcurrentBiMap.put(ConcurrentBiMap.java:97)

at org.apache.cassandra.config.Schema.load(Schema.java:406)

at org.apache.cassandra.config.Schema.load(Schema.java:117)

HBase Connection Error:

client.ConnectionManager$HConnectionImplementation: Can't get connection to ZooKeeper: ConnectionLoss for /hbase

After installing the Cassandra and HBase services on top of the Hadoop ecosystem, I got these errors. If anyone has found a resolution, please post it here.

Spark Lazy Evaluation and Advantages with Example

Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data (Big Data). Spark provides a simpler programming model than MapReduce, so developing a distributed data processing application with Apache Spark is a lot easier than developing the same application with MapReduce. Hadoop MapReduce provides only two operations for processing data, "Map" and "Reduce", whereas Spark comes with 80-plus data processing operations to work with big data applications.

When processing data from source to destination, Spark can be up to 100 times faster than Hadoop MapReduce because it allows in-memory cluster computing and implements an advanced execution engine.

What is meant by Apache Spark Lazy Evaluation?

In Apache Spark, there are two types of RDD operations:

I)Transformations

II) Actions.

We can define new RDDs at any time, but Apache Spark computes them only lazily, that is, the first time they are used in an action. Lazy evaluation seems unusual at first, but it makes a lot of sense when you are working with large data (Big Data).

Simply put, lazy evaluation in Spark means that execution does not start until an action is triggered; the lazy-evaluation picture comes into play whenever a Spark transformation occurs.

Consider an example where we define a text file RDD and then filter the lines that include the client name "CISCO". If Apache Spark were to load and store all the lines in the file as soon as we wrote lines = sc.textFile(filePath), the Spark context would waste a lot of storage space, given that we immediately filter out many lines. Instead, once Spark sees the whole chain of transformations, it computes only the data needed for the result. For the first() action, Apache Spark scans the file only until it finds the first matching line; it doesn't even read the whole file.

Advantages Of Lazy Evaluation in Spark Transformations:

Some advantages of lazy evaluation in Spark are listed below:

  • Increases Manageability: With lazy evaluation, users can divide the program into smaller operations, and Spark reduces the number of passes over the data by grouping transformations.
  • Increases Speed: Lazy evaluation saves round trips between the driver and the cluster, speeding up the process.
  • Reduces Complexity: Any operation has two kinds of complexity, time and space; Spark's lazy evaluation helps reduce both, because an action is triggered only when the data is actually required.

Simple Example:

The code below is written in Scala; normally Scala evaluates an expression as soon as it is declared, but marking a value lazy defers evaluation until it is first used.

With Lazy:

scala> val sparkList = List(1, 2, 3, 4)

scala> lazy val output = sparkList.map(x => x * 10)

scala> println(output)

Output:

List(10, 20, 30, 40)

Hadoop Architecture vs MapR Architecture

Basically, in a Big Data environment Hadoop plays a major role in storage and processing, while MapR is a distribution that provides services for the ecosystem. The Hadoop architecture and the MapR architecture differ at the storage level and in their naming conventions.

For example, in Hadoop the basic storage unit is called a Block, but in MapR it is called a Container.

Hadoop vs MapR

Architecture-wise, there are some differences between the two:
The Hadoop architecture is based on the master (NameNode) and slave (DataNode) concept, using HDFS for storage and MapReduce for processing.

The MapR architecture takes a native approach, meaning it can use SAN, NAS, or HDFS-style approaches to store the metadata. It accesses the SAN directly, with no need to go through a JVM. If a hypervisor or virtual machine crashes, data is pushed directly to the hard disk, which means that if a server goes down, the entire cluster re-syncs that data node's data. MapR has its own filesystem, called the MapR File System (MapR-FS), for storage, and it uses MapReduce in the background for processing.

There is no NameNode concept in the MapR architecture; it relies entirely on the CLDB (Container Location DataBase). The CLDB contains a lot of information about the cluster and is installed on one or more nodes for high availability.

This is very useful for the failover mechanism, reducing recovery time to just a few seconds.

In the Hadoop architecture, the cluster size is specified in terms of master and slave machine nodes, whereas in MapR the CLDB's default size in a cluster is 32 GB.

In Hadoop Architecture:

NameNode
Blocksize
Replication

 

In MapR Architecture:

Container Location DataBase
Containers
Mirrors

Summary: The MapR architecture follows the same overall architecture as Apache Hadoop, including all the core components of the distribution. The Big Data environment has different distributions, such as Cloudera and Hortonworks, but MapR is an enterprise edition. MapR is a stable distribution compared to the others and provides default security for all services.

Talend Installation on Windows with Pictures

First, we need the prerequisites for installing Talend on the Windows operating system.

Prerequisites for Talend Installation:

Talend's memory usage heavily depends on the size and nature of your Talend projects. If your jobs include many transformation components, you should increase the total amount of memory allocated to your servers.

The following recommendations apply to a Talend installation:

Product - Studio
Client/Server - Client
Memory - 4GB recommended
Disk space for installation - 6 GB

Before installing Talend, we need to configure the JAVA_HOME environment variable so that it points to the JDK directory.
Example: C:\Program Files\Java\jdk1.x.x_x\bin

Simple Steps to Talend Installation:

After completing the above steps, download the Talend Studio files, such as Data Integration, Big Data, and the cloud trial version:

Talend Studio Download

Step 1: Go to the official Talend website; the Talend Studio zip file contains binaries for all platforms (Windows/Mac).

Once the download is complete, extract the zip file on your Windows operating system.

Step 2: In the extracted Talend folder, double-click the TOS_DI_win_x86_64.exe executable file to launch Talend Studio, as shown in the image below.

It is recommended to allocate enough memory and disk space, and to avoid spaces in the target installation directory path.

Step 3: After clicking the Talend icon, accept the user license agreement.

Step 4: After accepting the agreement, start your journey with Talend Studio. A first-time user needs to set up a new project, or you can import a demo project, and then click the Finish button.

 

After the successful completion of the Talend Studio Data Integration installation on Windows, a welcome window opens and you can simply launch the studio.

Talend Studio requires different types of libraries, such as Java (.jar) files and database drivers, to be installed in order to connect a source to a target.
These libraries are also known as external modules in Talend Studio. If you want the cloud trial version or the Big Data version for more features, simply install it.

What is Data Mart? Types and Implementation Steps:

What is Data Mart?

In a data warehouse system, the data mart plays a major role. A data mart contains a subset of an organization's data; in other words, a data mart contains only the data that is specific to a particular group. For example, a marketing data mart may contain only data related to items, customers, and sales.

  • Data Marts improve end-user response time by allowing users to access the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users.
  • Data Marts are confined to subjects.

Three Types of Data Mart:

Based on the data source, data marts are divided into three types:
1. Dependent: A dependent data mart draws its data from a central data warehouse that has already been created. This allows the organization's data to be united in one centralized data warehouse system.

2. Independent: An independent data mart is a standalone system created without the use of a central data warehouse, typically for smaller groups within an organization, by drawing data directly from internal or external sources, or both.

3. Hybrid: A hybrid data mart can draw data from operational systems or data warehouses; it is a combination of the dependent and independent approaches.

Implementing Datamart with simple steps:

Implementing a data mart is a somewhat complex procedure. Here are the detailed steps for implementing a data mart:

Designing:

This is the first phase of data mart implementation, in which tasks are assigned, information about the requirements is gathered, and the physical and logical design of the data mart is created.

Constructing:

Constructing is the second phase of implementation. It involves creating the physical database and the logical structures in the data warehouse system, covering storage management, fast data access, data protection, and security for the database structure.

Populating: 

Populating is the third phase of implementation. It involves mapping data from source to destination, extracting the source data, and loading the data into the data mart.

Accessing: 

Accessing is the fourth phase of implementation. It involves querying data, creating reports and charts, and so on.

Managing:

Managing is the final phase of data mart implementation. It involves access management, tuning the data in the required database, and managing the loading of fresh data into the data mart.

What is Data Warehouse? Types with Examples

First, we need basic knowledge of databases; then we will move on to the data warehouse and the different types of data warehouse systems.

What is a Database?

A database is a collection of related data, and data is a collection of characteristics and figures that can be processed to produce information. Mostly, data represents recordable facts. Data aids in producing information, which is based on facts. For example, if we have data about the salaries obtained by all students, we can then draw conclusions about the highest salary, and so on.

What is a Data Warehouse?

A data warehouse, also called an enterprise data warehouse, is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making.

Data is populated into the data warehouse through the processes of extraction, transformation, and loading (ETL), and it contributes to future decision making. For example, data from different sources is extracted into a single staging area, transformed according to the requirements, and then loaded into the storage system.

Subject-Oriented: A data warehouse is used to analyze a particular subject area.

Integrated: A data warehouse integrates data from multiple data sources.

Time-Variant: Historical data is kept in the data warehouse.

Non-Volatile: Once data is in the data warehouse, it will not change.

Types of Data Warehouse

A data warehouse system is mainly of three types:

1. Information Processing: A data warehouse allows the data stored in it to be processed by means of querying, basic statistical analysis, and reporting using charts or graphs.

2. Analytical Processing: A data warehouse supports analytical processing of the information stored in it, basically OLAP operations including drill-up, drill-down, and pivoting.

3. Data Mining: In a data warehouse system, data mining supports knowledge discovery by finding hidden patterns and associations and constructing analytical models, with the results presented using visualization tools.

Sample Talend Questions – MCQs

1. Which of the following in the design workspace indicates an error with a component in Talend

A) A red ‘X’                                                                      B) A red exclamation point

C)A green ‘ I’                                                                    D)A yellow  exclamation point

2. Which of the following components can be used to implement lookup in Talend

A)tJoin                                                                         B) tLookUp

C)tMap                                                                        D)tUnite

3. tMap offers the following match modes for a lookup in Talend Studio:

A)Unique match                                             B)Only unique match

C)First match                                                   D)All matches

4. tMap offers the following join models in the Talend tool:

A)Left Outer Join                                     B)Right Outer Join

C)Inner Join                                                D)Full Outer Join

5. Which of the following components is used to execute a job an infinite number of times in Talend

A)tInfiniteLoop                                          B) tFileWatcher

C) tForEach                                                  D)tRunJob

6.How to access parameters in the Global Map in Talend ETL tool

A)globalMap.put(“Key”, Object)

B)globalMap.get(“Key”, Object)

C)globalMap.put(“key”)

D)globalMap.get(“key”)

7. How do you reference the value of the context variable FileName in a component configuration while programming in Talend

A)Context.FileName                                B)context.FileName

C)FileName.value                                     D)$context.FileName

8. While installing your Talend solutions, which of the following variables is mandatory to set?

A)JAVA_HOME                           B)TALEND_HOME

C)TIS_HOME                                 D)JRE_HOME

9. What is the use of the tReplicate component? Choose the one best answer.

A)To duplicate the configuration of an existing component

B)To copy the input row to an output row without processing it

C)To duplicate a sub job

D)To send duplicates of an output row to multiple target components

10. How do you see the configuration of an error message for a component in Talend studio?

A)Right-click the component and then click show problem

B)From the errors view

C)Place the mouse pointer over error symbol in the design workspace

D)From the problems view

11. How do you create a row between two components in Talend

A) Drag the target component to source component

B)Right-click the source component click Row followed by the row type and then click the target component

C)Drag the source component onto target components

D)Right-click the source component and then click the target component

12. How do you ensure that a subjob completes before a second subjob runs in Talend?

A)Using RunIf trigger

B)Using the main connection

C)Using onComponentOk or OnComponentError trigger

D)Using onSubJobOk or onSubJobError trigger

13. Which of the following components would be used to load a JSON file into a MySQL database in Talend?

A)tMySQLInput

B)tFileInputJSON

C)tMySQLOutput

D)tMap

14. How do you run a job in Talend Studio?

A)Click the Run button in the Run view

B)Click the Run button in the Job view

C)Click the Run button in the File Menu

D)Click the Start button in the Run view

15. What is the best practice for arranging components on the design workspace in Talend Studio?

A)Bottom to Top

B)Right to Left

C)Top to Bottom

D)Matching the flow of data

16. From which tab in component view would you specify the component label in Talend

A)View

B)Advanced settings

C)Basic settings

D)Documentation

17. How to place your component in a job in Talend Studio?

A) Click it on Edit Menu

B) Click it in the Repository and then click in the design workspace

C) Click it from Repository to the design workspace

D)Click it in the Palette and then click in the design workspace

HBase Table (Single & Multiple) data migration from one cluster to another cluster

HBase single table migration from one cluster to another cluster:

Here we show how to migrate a single HBase table from an existing cluster to a new cluster in a few simple steps:

Step 1: First, export the HBase table data to an HDFS path (Hadoop Distributed File System) on the source cluster.

Step 2: Then copy the HBase table data from the source cluster to the destination cluster using the distcp command (distcp copies data from one cluster to another).

Step 3: Create the HBase table in the destination (target) cluster.

Step 4: Finally, import the copied data from HDFS into the HBase table on the destination cluster.

Source Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Export <hbase_table_name> <source_hdfs_path>

2. hadoop distcp hdfs://<source_cluster_ipaddress>:8020/<source_hdfs_path> hdfs://<destination_cluster_ipaddress>:8020/<destination_hdfs_path>

Destination Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Import <hbase_table_name> <hbase_table_hdfs_path>
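
Putting these together, here is a hedged end-to-end example with a hypothetical table named employee and placeholder NameNode addresses (adjust names, ports, and paths to your clusters):

# On the source cluster: export the table to an HDFS staging path
hbase org.apache.hadoop.hbase.mapreduce.Export employee /tmp/employee-export

# Copy the exported data to the destination cluster
hadoop distcp hdfs://source-nn:8020/tmp/employee-export hdfs://dest-nn:8020/tmp/employee-export

# On the destination cluster: create the table with matching column families, then import
hbase org.apache.hadoop.hbase.mapreduce.Import employee /tmp/employee-export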

HBase multiple table migration from one cluster to another cluster:

We know how to migrate a single HBase table; multiple-table migration from one cluster to another can be done in a simple manner with the steps below.

If we have the script files, multiple HBase table migrations simply go through the following steps (a rough sketch of such scripts appears after the step list):

Step 1: First, place hbase-export.sh and hbase-table.txt on the source cluster.

Step 2: Then place hbase-import.sh and hbase-table.txt on the destination cluster.

Step 3: List all the tables in the hbase-table.txt file.

Step 4: Create all the HBase tables on the destination cluster.

Step 5: Execute hbase-export-generic.sh on the source cluster.

Step 6: Execute hbase-import.sh on the destination cluster.
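
For illustration, a minimal sketch of what hbase-export.sh and hbase-import.sh could look like; the actual scripts referenced above may differ, and the HDFS staging path /tmp/hbase-migration is an assumption:

# hbase-export.sh -- run on the source cluster, reads table names from hbase-table.txt
while read table; do
  hbase org.apache.hadoop.hbase.mapreduce.Export "$table" "/tmp/hbase-migration/$table"
done < hbase-table.txt

# hbase-import.sh -- run on the destination cluster after creating the tables
while read table; do
  hbase org.apache.hadoop.hbase.mapreduce.Import "$table" "/tmp/hbase-migration/$table"
done < hbase-table.txt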
Summary: I tried HBase data migration from one cluster to another in a Cloudera Distribution Hadoop environment. Both single-table and multiple-table HBase data migration are very simple for Hadoop administrators as well as Hadoop developers. It works the same way on the Hortonworks distribution as well.

MapR

What is MapR?

MapR is one of the Big Data distributions. It is a complete enterprise distribution for Apache Hadoop, designed to improve Hadoop's reliability, performance, and ease of use.

Why MapR?

1. High Availability:

MapR provides high-availability features such as self-healing, which means there is no NameNode architecture.

It has JobTracker high availability and NFS support. MapR achieves this by distributing its file system metadata.

2. Disaster Recovery:

MapR provides a mirroring facility that allows users to enable policies and mirror data automatically, within a multi-node or single-node cluster and between on-premise and cloud infrastructure.

3.Record Performance:

MapR holds a world performance record, at a cost of only $9 compared with an earlier cost of $5M, with a run time of 54 seconds, and it handles large clusters of around 2,200 nodes.

4.Consistent Snapshots:

MapR is the only Big Data distribution that provides consistent, point-in-time recovery, because of its unique read-write storage architecture.

5. Complete Data Protection:

MapR has its own security system for data protection at the cluster level.

6.Compression:

MapR provides automatic, behind-the-scenes compression of data; it applies compression automatically to files in the cluster.

7.Unbiased Open Source:

MapR is a completely unbiased open-source distribution.

8. Real Multitenancy, Including YARN

9. Enterprise-Grade NoSQL

10. Read and Write File System:

MapR has a read-write file system.

MapR Ecosystem Packs (MEP):

The "MapR Ecosystem" is the set of open-source projects included in the MapR Platform, and a "pack" is a bundled set of MapR Ecosystem projects with specific versions.

MapR Ecosystem Packs are mostly released every quarter, and yearly as well.

A single version of MapR may support multiple MEPs, but only one at a time.

In a typical case, Hadoop ecosystem and open-source components such as Spark, Hive, etc. are included in the MapR Ecosystem Packs, along with tools like the ones below:

Collectd
Elasticsearch
Grafana
Fluentd
Kibana
Open TSDB