Connection issues in Cassandra and HBase

What is Apache Cassandra?




Cassandra is an open-source, distributed, NoSQL (Not Only SQL) database management system designed to handle large amounts of data across many servers.

How to install Cassandra? Cassandra is simple to install on Ubuntu/Linux with the step-by-step instructions below, which also cover why you should use Apache Cassandra for data handling:

Install Cassandra on Ubuntu/Linux

What is Apache HBase?

HBase (Hadoop + DataBase) runs on top of the Hadoop ecosystem. It is an open-source, distributed, NoSQL database. It provides random access to data stored in HDFS files, indexed by key/value.

How to install Apache HBase on Linux/Ubuntu system?

It is simple to install HBase on the Linux operating system with the step-by-step instructions below.
Installation of HBase on Ubuntu

Cassandra Connection error:

Error: Exception encountered during startup

java.lang.IllegalArgumentException: is already in reverseMap to (Username)

at org.apache.cassandra.utils.ConcurrentBiMap.put(ConcurrentBiMap.java:97)

at org.apache.cassandra.config.Schema.load(Schema.java:406)

at org.apache.cassandra.config.Schema.load(Schema.java:117)

HBase Connection Error:

client.ConnectionManager$HConnectionImplementation: Can't get connection to ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase

After installing the Cassandra and HBase services on top of the Hadoop ecosystem, I got these errors. If anyone has found a resolution, please post it here.

Permission Denied error in Hive while creating a database in the Hadoop eco-system

I installed the Hive service on top of the Hadoop ecosystem and then tried to create a database, but I got the error below; the solution I found follows as well.



Permission Denied Error in Hive:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> set hive.auto.convert.join.noconditionaltask=false;
hive> create database myhive;
FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=hadoop, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:224)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:149)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:149)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4891)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:669)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

The above error is a permission issue in the Hive component.

Resolution:

To resolve the permission issue, grant the hadoop user access in HDFS. Please follow the steps below; the solution is simply to change permissions for the user with chown/chmod commands:

Step 1: Log in as hduser, then execute the below commands one by one.
Step 2: sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
Step 3: sudo -u hdfs hadoop fs -chmod g+w /tmp
Step 4: sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
Step 5: sudo -u hdfs hadoop fs -chown -R hadoop /user/hive/warehouse
Step 6: sudo chmod 777 /var/lib/hive/metastore
Step 7: cd /var/lib/hive/metastore/metastore_db/
Step 8: sudo rm *.lck

Summary: I tried the above resolution and the Hive error above is now resolved.
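As a quick check (a minimal sketch, assuming the same hduser/hdfs setup as above), you can verify the new ownership and retry the database creation:

$ sudo -u hdfs hadoop fs -ls /user/hive
$ hive -e "create database myhive;"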

How to Install Kafka on Linux/Ubuntu (Single-Node Cluster)

Apache Kafka is a distributed messaging system. Here is the step-by-step process to install Apache Kafka on the Linux/Ubuntu operating system.



Prerequisites:

Kafka requires ZooKeeper and Java to run. JDK 1.7 or above is mandatory for the Kafka installation; install it using the commands below:

$ sudo apt update
$ sudo apt install default-jdk

Step 1: Download the Kafka binaries from the official Apache archive:

https://archive.apache.org/dist/kafka/
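For example, you could download the tarball with wget (a sketch; the kafka_2.12-0.10.1.1 build is the one used later in this post, so pick whatever version matches your setup from the archive listing):

$ wget https://archive.apache.org/dist/kafka/0.10.1.1/kafka_2.12-0.10.1.1.tgz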

Step 2:  Extract the tarball using the below command:

$ tar -xzvf kafka_2.12-0.10.1.1.tgz


Step 3: After extraction, we can see the Kafka directory.

Step 4: Update the KAFKA_HOME and PATH variables in the ~/.bashrc file:

export KAFKA_HOME=/home/your_path/INSTALL/kafka_2.12-0.10.1.1
export PATH=$PATH:$KAFKA_HOME/bin


Step 5: After the .bashrc changes, open a new terminal and verify them using the command below:

$ echo $KAFKA_HOME

After installing Apache Kafka on Linux/Ubuntu, start the Kafka server. Before starting the Kafka server, start the ZooKeeper server on your single-node cluster using the commands below:

$ cd /usr/local/kafka
$ bin/zookeeper-server-start.sh config/zookeeper.properties

After starting the ZooKeeper server, start the Kafka server:

$ bin/kafka-server-start.sh config/server.properties

After starting the Kafka server, create topics and then pass messages from a producer to a consumer (a quick sketch follows below). The above steps cover Kafka installation for a single/pseudo-node cluster setup on top of the Hadoop ecosystem.
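As a quick smoke test (a minimal sketch, assuming a single-node setup with ZooKeeper on localhost:2181 and the broker on localhost:9092; the topic name "test" is just an example), create a topic and pass a few messages from the console producer to the console consumer:

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

Type a few lines into the producer terminal and they should appear in the consumer terminal.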

Summary: Apache Kafka installation on the Linux/Ubuntu operating system is very simple to do and use. If you need to install it on Cloudera, you have to download a separate bundle in Cloudera Manager to set it up in a multi-node cluster. On Hortonworks, you install Kafka through Ambari.

 

Unable to Integrate Hive with Spark and different resolutions




How to integrate (connect) Hive and Spark:

Here are solutions for how to integrate (connect) the Hive DB with Spark in Hadoop development.
The first time we tried to connect Hive and Spark, we got the error below, and we found different types of resolutions for different modes.

Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke
the "BONECP" plugin to create a ConnectionPool gave an error: The specified
datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please
check your CLASSPATH specification, and the name of the driver.

Different types of solutions for the above error:

Resolution 1:

1. Download the MySQL Connector/J jar file from the official Maven repository, e.g. the link below:
https://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.21
2. Copy the jar file into the jars folder inside the Spark installation directory.
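Alternatively (a sketch, with a hypothetical path to the downloaded jar), you can pass the connector jar when launching Spark instead of copying it into the installation directory:

$ spark-shell --driver-class-path /path/to/mysql-connector-java-5.1.21.jar --jars /path/to/mysql-connector-java-5.1.21.jar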

Resolution 2:

Without JDBC driver:

1. Go to hive-site.xml and set the hive.metastore.uris property in that file.
2. Import org.apache.spark.sql.hive.HiveContext, as it can run SQL queries over Hive tables, then define the sqlContext value as in the code below:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. Finally, verify the tables in Spark SQL (see the sketch below).
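For step 3, a minimal verification sketch (assuming a Spark 1.x spark-shell, where sc is already defined) looks like this:

$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("show tables").show()

If the Hive metastore is reachable, the existing Hive tables should be listed.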

Resolution 3:





Go with Beeline for the Hive and Spark connection instead of the Hive CLI. Beeline provides higher security and connects directly to a remote HiveServer2; check the two commands below for Beeline with HiveServer2 configurations.

Step 1: ./bin/beeline
Step 2: !connect jdbc:hive2://remote_hive:10000

Hadoop Cluster Interview Questions and Answers Updated

1. In which directory is Hadoop installed?




Apache Hadoop and Cloudera have the same directory structure. Hadoop is installed in /usr/lib/hadoop.

2. What are the three modes in which Hadoop can be run?
The three different modes in which Hadoop can run are:
A. Standalone (local) mode
B. Pseudo-distributed mode
C. Fully distributed mode

3. What are the features of Standalone mode in a Hadoop environment?
In the Hadoop ecosystem, standalone mode is the mode in which there are no daemons and everything runs in a single JVM (Java Virtual Machine). There is no DFS (Distributed File System); it utilizes the local file system. Standalone mode is suitable only for running MapReduce programs during development.

4. Does Hadoop follow the UNIX pattern?
Yes, Hadoop follows the UNIX pattern.

5. What are the features of Pseudo distributed mode in a Hadoop environment?
In the Hadoop ecosystem, pseudo-distributed mode is used for both the QA and development environments. In pseudo-distributed mode, all the daemons run on the same machine.




6. Can we call VMs as Pseudos?
No, VMs are not pseudo-distributed mode, because a VM is a different thing; pseudo-distributed mode applies only to the Hadoop environment.

7. What are the features of Fully distributed mode in a Hadoop environment?
In the Hadoop ecosystem, fully distributed mode is used for the production, development, and QA environments, where a number of machines form the Hadoop cluster.

8. What are the default port numbers of the Namenode, JobTracker, and TaskTracker in the Hadoop eco-system?
The default port of the Namenode is 50070, the JobTracker is 50030, and the TaskTracker is 50060.

9. What are the Hadoop configuration files for a Hadoop installation?
There are three files to configure for Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
These files are located in the hadoop/conf/ directory.

10. What happens if you get a "connection refused" Java exception in Hadoop when you check with hadoop fsck?
It means that the Namenode is not running on your machine.
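A quick way to confirm this (a minimal sketch, run on the Namenode host) is to list the running Hadoop daemons and then run fsck against the root path:

$ jps
$ hadoop fsck /

If the NameNode process is missing from the jps output, the fsck call will fail with the connection refused exception.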

11. What does /etc/init.d do?
/etc/init.d is where the daemon (service) scripts are placed; it can also be used to check the status of these daemons.

Most Typical Hive Interview Questions and Answers




Hive Interview Questions and Answers

1. Does Hive support record level Insert, delete or Update?

Hive does not support record-level insert, delete, or update. It does not provide transactions either. However, the user can go with CASE statements and built-in functions of Hive to satisfy insert, update, and delete behavior.

2. What kind of data warehouse applications is suitable for Hive?

Basically, Hive is not a full database; it is a data summarization tool in the Hadoop ecosystem. Hive suits applications where:

I) Fast response times are not required
II) The data is not changing rapidly
III) Relatively static data is analyzed

3. How can the columns of a table in Hive be written to a File?

In Hive, using the awk command in the Hive shell, the output from HiveQL can be written to a file.

Example: hive -S -e "describe table_name" | awk -F" " '{print $1}' > ~/output

4. Difference between ORDER BY and SORT BY in Hive?

In Hive, SORT BY sorts the data within each reducer; it can use any number of reducers.
ORDER BY sorts all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer and guarantees total order in the output, while SORT BY only guarantees ordering of the rows within each reducer.
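For example (a sketch using a hypothetical employees table with a salary column), the first query returns one totally ordered result through a single reducer, while the second only orders rows within each reducer:

hive> SELECT * FROM employees ORDER BY salary;
hive> SELECT * FROM employees SORT BY salary;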




5. Whichever directory I run a Hive query from, it creates a new metastore_db; please explain the reason for it.

Whenever you run Hive in embedded mode, it creates a local metastore in the current directory. Before creating the metastore, it checks whether a metastore already exists or not. This behavior is defined by the following property in the hive-site.xml configuration file:

"javax.jdo.option.ConnectionURL" with default value
"jdbc:derby:;databaseName=metastore_db;create=true"

6. Is it possible to use the same metastore by multiple users, in case of embedded Hive?

No, in embedded mode it is not possible for multiple users to share the metastore; it supports only a single user. To share a metastore among multiple users, use a standalone database such as MySQL or PostgreSQL.

Kafka Interview Questions and Answers

Kafka Interview Questions and Answers:

1. What is Kafka?

Kafka is an open-source message broker project written in Scala and Java. Kafka was originally developed by LinkedIn and was subsequently open-sourced.




2. Which are the components of Kafka?

The major components of Kafka are:

Topic: a group of messages that belong to the same type

Producer: publishes messages to a topic

Consumer: pulls data from the brokers

Broker: the server where the published messages are stored

3. What role does Zookeeper play in a cluster of Kafka?

Kafka is an open-source, distributed system that is built to use ZooKeeper. The basic responsibility of ZooKeeper is to coordinate the different nodes in a cluster. Offsets are periodically committed to ZooKeeper, so that if any node fails, consumption can recover from the previously committed offset. ZooKeeper is also responsible for configuration management, leader detection, detecting whether any node leaves or joins the cluster, and synchronization.

4. Distinguish between the Kafka and Flume?

Flume's major use case is ingestion into Hadoop: it is incorporated with Hadoop's monitoring system, file formats, file systems, and utilities, and is used for Hadoop integration. Flume is the best option when you have non-relational data sources. Kafka, on the other hand, is a distributed publish-subscribe messaging system. Kafka was not developed specifically for Hadoop, and using Kafka to read and write data to Hadoop is considerably trickier than with Flume. Kafka is a highly reliable and scalable enterprise messaging system used to connect multiple systems.




5. Is it possible to use Kafka without Zookeeper?

It is not possible to use Kafka without ZooKeeper, because it is not possible to bypass ZooKeeper and connect directly to the Kafka server. If ZooKeeper is down, we will not be able to serve any client requests.

6. How to start a Kafka Server?

Since Kafka uses ZooKeeper, we have to start the ZooKeeper server first. One can use the convenience script packaged with Kafka to start a single-node ZooKeeper instance:
> bin/zookeeper-server-start.sh config/zookeeper.properties
Now the Kafka server can be started:
> bin/kafka-server-start.sh config/server.properties

What are the different Hadoop Components and Definitions

What are the Different Hadoop Components in Hadoop Eco-System





HDFS – Filesystem of Hadoop ( Hadoop Distributed File System)
MapReduce – Processing of Large Datasets

HBase – Database (Hadoop+dataBase)

Apache Oozie – Workflow Scheduler

Apache Mahout – Machine learning and Data mining

Apache Hue – Hadoop user interface, Browser for HDFS, HBase, Query editors for Hive, etc.
Flume – To integrate other data sources

Sqoop – Export / Import data from RDBMS to HDFS and HDFS to RDBMS

What is HDFS?

HDFS (Hadoop Distributed File System) is a filesystem that can store very large data sets by scaling out across a cluster of hosts.

What is Map Reduce?

MapReduce is a programming model implemented for processing and generating large data sets. The user specifies a map function that processes a (key, value) pair to generate a set of intermediate (key, value) pairs, and a reduce function that merges the intermediate values for each key.
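As a rough illustration of the model only (not an actual Hadoop job; input.txt is a hypothetical text file), the classic word count can be simulated with shell pipes: tr acts as the map step emitting word keys, sort plays the role of the shuffle, and uniq -c is the reduce step producing (word, count) pairs:

$ cat input.txt | tr -s '[:space:]' '\n' | sort | uniq -c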

What is Hive?




A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

What is  Pig?

Pig is a platform for analyzing large data sets; it consists of a high-level (scripting) language for expressing data analysis programs.

What is Flume?

Flume runs on top of Hadoop; for Hadoop applications, we use it to get data from a source into HDFS.

What is Sqoop?

Apache Sqoop is a tool designed for transferring bulk data between Hadoop and structured data stores; that is, it exports/imports data from an RDBMS to HDFS and vice versa (see the sketch below).
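For example, a minimal import sketch (the MySQL host db_host, database mydb, user dbuser, and table employees are all hypothetical):

$ sqoop import --connect jdbc:mysql://db_host/mydb --username dbuser -P --table employees --target-dir /user/hadoop/employees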

What is HBase?

HBase ( Hadoop + dataBase) is a column-oriented store database layered on top of HDFS.

What is NoSQL database?

NoSQL means "Not Only SQL": a database that does not rely solely on the traditional relational database management system (RDBMS) model.

What is Heartbeat in Hadoop? How to resolve Heartbeat lost in Cloudera and Hortonworks

Heartbeat in Hadoop:





In the Hadoop ecosystem, the heartbeat is the communication between the Namenode and the Datanodes. It is the signal sent by a Datanode to the Namenode at a regular interval. If a Datanode in HDFS does not send a heartbeat to the Namenode for around 10 minutes (by default), the Namenode considers the Datanode unavailable.

The default heartbeat interval is 3 seconds. It is set with the dfs.heartbeat.interval property in the hdfs-site.xml file in the Hadoop installation directory.
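You can check the effective value on a running cluster with the getconf utility:

$ hdfs getconf -confKey dfs.heartbeat.interval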

What is Heartbeat lost:

In the Hadoop ecosystem, when a Datanode does not send a heartbeat to the Namenode for around 10 minutes (by default), the Namenode considers that Datanode unavailable; this is known as "heartbeat lost".

How to resolve Heartbeat lost:

In a Big Data distribution environment, take Hortonworks (HDP) first.

In Hortonworks:
1. In HDP, check whether the Ambari agent is running or not by using "ambari-agent status".
2. If it is not running, check the log files for the Ambari server and the Ambari agent in the directories /var/log/ambari-server and /var/log/ambari-agent.

3. Follow the below steps (see the commands after this list):

A) Stop the ambari-server
B) Stop the ambari-agent service on all nodes
C) Start the ambari-agent service on all nodes
D) Start the ambari-server
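The corresponding commands for step 3 might look like this (a minimal sketch, assuming sudo/root access on the Ambari server and agent hosts):

$ sudo ambari-server stop
$ sudo ambari-agent stop    # run on all nodes
$ sudo ambari-agent start   # run on all nodes
$ sudo ambari-server start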

Cloudera:

1. First, check whether the Cloudera SCM agent is running or not by using "sudo service cloudera-scm-agent status".

2. Check the agent log files in the /var/log/cloudera-scm-agent/ directory.

3. Then follow the below commands as the root user:

sudo service cloudera-scm-agent status
sudo service cloudera-scm-agent stop
sudo service cloudera-scm-agent start

Summary: Hadoop follows a master/slave architecture. The master node stores the metadata and the slave nodes store the actual data. The periodic communication between the Namenode and the Datanodes is called the "heartbeat"; if it fails, it is simply called "heartbeat lost", which means the Datanode is unavailable. The resolution steps above cover the Big Data distributions Hortonworks (HDP) and Cloudera (CDH) step by step.

Hadoop job (YARN Staging) error while executing simple job

In a Hadoop ecosystem, a number of jobs are executed at any given time. I was trying to execute a Hive job for data validation on the Hive server in production, and while executing it on the Hive command line I got this type of error:



at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
22:33:33 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging//.staging/job_1562044010976_0003
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hadoop-yarn/staging//.staging/job_1562044010976_0003/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

The above error is a Datanode connection error that occurred while executing the job; at the time, the Datanode was not running properly. Follow the resolution below for this issue.

Stop and restart all services:

stop-all.sh
start-all.sh

This restarts all services, including the Namenode, Secondary Namenode, DataNodes, and the remaining services like Hive, Spark, etc.

If it still shows this type of error, then start the distributed file system:

start-dfs.sh

Check all the Hadoop daemons, such as the Namenode, Secondary Namenode, Datanode, ResourceManager, and NodeManager, by using the command below:

jps

Then check all node information by using "hadoop dfsadmin -report" to see whether the Datanodes are running fine or not.

The above steps are for local (standalone) and pseudo-distributed modes only in the Hadoop ecosystem.

For the Cloudera, Hortonworks, and MapR distributions, simply restart the DataNodes and services like Hive, Spark, etc.




Summary: In a Big Data environment we execute many Hadoop/Spark/Hive jobs, and sometimes they fail with the above error. When that happens, the simple resolution above fixes it.