How to Install Kafka on Linux/Ubuntu (Single-Node Cluster)

Apache Kafka is a distributed messaging system. Here is the step-by-step process to install Apache Kafka on a Linux/Ubuntu operating system.



Prerequisites:

Kafka requires ZooKeeper and Java to run. JDK 1.7 or above is mandatory for a Kafka installation; install it using the commands below:

$ sudo apt update
$ sudo apt install default-jdk
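
To confirm the JDK is available (a quick check, assuming the default-jdk package installed correctly), verify the Java version:

$ java -version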

Step 1: Download the Kafka binaries from the official Apache archive:

https://archive.apache.org/dist/kafka/

Step 2:  Extract the tarball using the below command:

$ tar -xzvf kafka_2.12-0.10.1.1.tgz


Step 3: After extraction, you will see the Kafka directory (kafka_2.12-0.10.1.1).

Step 4: Update the KAFKA_HOME and PATH variables in your ~/.bashrc file:

 export KAFKA_HOME=/home/your_path/INSTALL/kafka_2.12-0.10.1.1
 export PATH=$PATH:$KAFKA_HOME/bin
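
After editing, you can reload the file in the current shell (or open a new terminal, as described in Step 5):

$ source ~/.bashrc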


Step 5: After making the .bashrc changes, open a new terminal and verify them using the command below:

$ echo $KAFKA_HOME






After installing Apache Kafka on Linux/Ubuntu, start the Kafka server. Before starting the Kafka server, start the ZooKeeper server on your single-node cluster using the commands below:

$ cd $KAFKA_HOME
$ bin/zookeeper-server-start.sh config/zookeeper.properties

After the ZooKeeper server has started, start the Kafka server:

$ bin/kafka-server-start.sh config/server.properties
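
Both start scripts also accept a -daemon flag if you prefer to run ZooKeeper and Kafka in the background (same configuration files assumed):

$ bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
$ bin/kafka-server-start.sh -daemon config/server.properties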

After starting the Kafka server, create topics and then pass messages from a producer to a consumer. The above steps cover a Kafka installation for a single/pseudo-cluster setup on top of the Hadoop ecosystem.
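
As a quick sanity check (a minimal sketch assuming ZooKeeper on localhost:2181, a broker on localhost:9092, and a hypothetical topic named test), you can create a topic and pass a few messages from a console producer to a console consumer:

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

Type a few lines into the producer terminal; they should appear in the consumer terminal.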

Summary: Installing Apache Kafka on a Linux/Ubuntu operating system is very simple, and it is easy to use. If you need to install it on Cloudera, you need to download a separate bundle in Cloudera Manager to set up a multi-node cluster; on Hortonworks, Kafka is installed through Ambari.

 

Unable to Integrate Hive with Spark and different resolutions




How to integrate (connect) Hive and Spark:

Here are solutions for how to integrate (connect) the Hive DB with Spark in Hadoop development.
The first time we tried to connect Hive and Spark, we got the error below; here are different resolutions for different modes.

Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke
the "BONECP" plugin to create a ConnectionPool gave an error: The specified
datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please
check your CLASSPATH specification, and the name of the driver.

Different resolutions for the above error:

Resolution 1:

1. Download the MySQL connector Java JAR file from the Maven repository, e.g.:
https://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.21
2. Copy the JAR file into the jars folder of the Spark installation directory (see the example commands below).
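
For example, a rough command-line sketch of the two steps (assuming Maven Central hosts version 5.1.21 at the usual path and that $SPARK_HOME points to the Spark installation directory):

$ wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.21/mysql-connector-java-5.1.21.jar
$ cp mysql-connector-java-5.1.21.jar $SPARK_HOME/jars/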

Resolution 2:

Without JDBC driver:

1. Go to hive-site.xml and set the hive.metastore.uris property in that file.
2. Import org.apache.spark.sql.hive.HiveContext, as it can run SQL queries over Hive tables, then define the sqlContext as shown below:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. Finally, verify the tables in Spark SQL (see the session sketch below).
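
A minimal session sketch of steps 2 and 3 (assuming the Spark 1.x-style HiveContext shown above, spark-shell on the PATH, and a metastore URI configured in hive-site.xml):

$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("show tables").show()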

Resolution 3:





Use beeline for the Hive and Spark connection instead of the Hive CLI. Beeline provides better security and connects directly to a remote HiveServer2; check the two commands below for beeline with HiveServer2 configuration.

Step 1: ./bin/beeline
Step 2: !connect jdbc:hive2://remote_hive:10000
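
Alternatively, the connection string and user can be passed on the command line (remote_hive and hive_user here are placeholders):

$ beeline -u jdbc:hive2://remote_hive:10000 -n hive_user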

Hadoop Cluster Interview Questions and Answers Updated

1. In which directory is Hadoop installed?




Apache Hadoop and Cloudera have the same directory structure. Hadoop is installed in /usr/lib/hadoop.

2. Which are the three modes in which Hadoop can be run?
In the Hadoop eco-system, the three different modes in which Hadoop can be run are:
A. Standalone (local) mode
B. Pseudo-distributed mode
C. Fully distributed mode

3. What are the features of Standalone mode in a Hadoop environment?
In the Hadoop eco-system, Standalone mode is the mode in which there are no daemons and everything runs in a single JVM (Java Virtual Machine). There is no DFS (Distributed File System); it uses the local file system. Standalone mode is suitable only for running MapReduce programs during development.

4. Does Hadoop follow the UNIX pattern?
Yes, Hadoop follows UNIX Pattern.

5. What are the features of Pseudo distributed mode in a Hadoop environment?
In the Hadoop eco-system, Pseudo-distributed mode is used for both QA and development environments. In Pseudo-distributed mode, all the daemons run on the same machine.




6. Can we call VMs as Pseudos?
No, VMs are not Pseudos, because a VM is a general virtualization concept, while pseudo-distributed mode applies only to the Hadoop environment (all Hadoop daemons on one machine).

7. What are the features of Fully distributed mode in a Hadoop environment?
In the Hadoop eco-system, Fully distributed mode is used for Production, Development and QA environments, where a number of machines form the Hadoop cluster.

8. What are the default port numbers of Namenode, job tracker and task tracker in Hadoop eco-system?
The port number of the Namenode is 50070, for the job tracker 50030, and for the task tracker 50060.

9. What are the Hadoop configuration files for a Hadoop installation?
There are three files to configure for Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
These files are located in the hadoop/conf/ directory.

10. What does it mean if you get a "connection refused" Java exception when you check with hadoop fsck?
It means that the Namenode is not running on your machine (or is unreachable).
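
As a quick check (assuming the Hadoop binaries are on your PATH), confirm whether the NameNode process is actually up and whether the Datanodes are reporting:

$ jps
$ hdfs dfsadmin -report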

11. What does /etc/init.d do?
/etc/init.d is where daemon (service) scripts are placed; it is used to start, stop, and check the status of these daemons.

Most Typical Hive Interview Questions and Answers





1. Does Hive support record level Insert, delete or Update?

Hive does not support record-level insert, delete or update, and it does not provide transactions either. However, the user can use CASE statements and built-in functions of Hive to approximate insert, update and delete.

2. What kind of data warehouse applications are suitable for Hive?

Basically, Hive is not a full database; it is a data summarization tool in the Hadoop eco-system. Hive suits applications where:

I) Fast response times are not required
II) The data is not changing rapidly
III) Relatively static data is analyzed

3. How can the columns of a table in Hive be written to a File?

Using the awk command in the Hive shell, the output from HiveQL can be written to a file:

Example: hive -S -e "describe table_name" | awk -F" " '{print $1}' > ~/output

4. Difference between ORDER BY and SORT BY in Hive?

In Hive, SORT BY sorts the data within each reducer, and any number of reducers can be used for the operation.
ORDER BY, on the other hand, sorts all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer and guarantees total order in the output, while SORT BY only guarantees ordering of the rows within each reducer.
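
For illustration, here is a hedged sketch run from the shell against a hypothetical sales table with id and amount columns:

$ hive -e "SELECT id, amount FROM sales ORDER BY amount DESC"   # total order, single reducer
$ hive -e "SELECT id, amount FROM sales SORT BY amount DESC"    # ordered only within each reducer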




5. Whenever I run a Hive query from a different directory, it creates a new metastore_db; please explain the reason for it.

Whenever you run Hive in embedded mode, it creates a local metastore in the current working directory. Before creating the metastore, it checks whether a metastore already exists there. This behaviour is controlled by the following property in hive-site.xml:

"javax.jdo.option.ConnectionURL" with default value
"jdbc:derby:;databaseName=metastore_db;create=true"

6. Is it possible to use the same metastore by multiple users, in case of embedded Hive?

No, it is not possible to share the embedded metastore between multiple users; it is only for a single user. To share a metastore between users, configure a standalone database such as PostgreSQL or MySQL.

Kafka Interview Questions and Answers


1. What is Kafka?

Kafka is an open-source message broker project written in Scala and Java. Kafka was originally developed at LinkedIn and was open-sourced in early 2011.




2. Which are the components of Kafka?

The major components of Kafka are:

Topic: A group of messages belonging to the same type

Producer: Publishes messages to a topic

Consumer: Pulls data from the brokers

Broker: The server where published messages are stored

3. What role does Zookeeper play in a cluster of Kafka?

Kafka is an open-source distributed system and it is built to use Zookeeper. The basic responsibility of Zookeeper is to coordinate the different nodes in a cluster. Zookeeper also stores periodically committed offsets, so that if any node fails, consumption can recover from the previously committed offset. Zookeeper is also responsible for configuration management, leader detection, detecting whether any node leaves or joins the cluster, and synchronization.

4. Distinguish between Kafka and Flume.

Flume's major use case is integration with Hadoop: its monitoring system, file formats, file systems, and utilities. It is used for Hadoop ingestion and is the best option when you have non-relational data sources. Kafka, on the other hand, is a distributed publish-subscribe messaging system. Kafka was not developed specifically for Hadoop, and using Kafka to read and write data to Hadoop is considerably more involved than using Flume. Kafka is a highly reliable and scalable enterprise messaging system for connecting multiple systems.




5. Is it possible to use Kafka without Zookeeper?

No, it is not possible to use Kafka without Zookeeper, because you cannot bypass Zookeeper and connect directly to the Kafka server. If Zookeeper is down, we will not be able to serve any client request.

6. How do you start a Kafka server?

Since Kafka uses Zookeeper, we have to start the Zookeeper server first. One can use the convenience script packaged with Kafka to get a single-node Zookeeper instance:

> bin/zookeeper-server-start.sh config/zookeeper.properties

Now the Kafka server can be started:

> bin/kafka-server-start.sh config/server.properties

What are the different Hadoop Components and Definitions

What are the Different Hadoop Components in Hadoop Eco-System





HDFS – Filesystem of Hadoop ( Hadoop Distributed File System)
MapReduce – Processing of Large Datasets

HBase – Database (Hadoop+dataBase)

Apache Oozie – Workflow Scheduler

Apache Mahout – Machine learning and Data mining

Apache Hue – Hadoop user interface, Browser for HDFS, HBase, Query editors for Hive, etc.
Flume – To integrate other data sources

Sqoop – Export / Import data from RDBMS to HDFS and HDFS to RDBMS

What is HDFS?

HDFS (Hadoop Distributed File System) is a filesystem that can store very large data sets by scaling out across a cluster of hosts.

What is Map Reduce?

MapReduce is a programming model implemented for processing and generating large data sets. It specifies a map function that processes a (key, value) pair to generate a set of intermediate (key, value) pairs.

What is Hive?




A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

What is Pig?

Pig is a platform for analyzing large data sets that consists of a high-level (scripting) language for expressing data analysis programs.

What is Flume?

Flume sits on top of Hadoop applications; it is used when we need to get data from a source into HDFS.

What is Sqoop?

Apache Sqoop is a tool designed for transferring bulk data between Hadoop and structured data stores; that is, it exports/imports data from an RDBMS to HDFS and vice versa (see the example below).
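
For example, a minimal import sketch (the MySQL host, database, user, table, and target directory below are placeholders):

$ sqoop import --connect jdbc:mysql://db_host/db_name --username db_user -P --table employees --target-dir /user/hadoop/employees -m 1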

What is HBase?

HBase ( Hadoop + dataBase) is a column-oriented store database layered on top of HDFS.

What is NoSQL database?

NoSQL means "Not Only SQL": a database that does not rely solely on the traditional relational database management system (RDBMS) model.

What is Heartbeat in Hadoop? How to resolve Heartbeat lost in Cloudera and Hortonworks

Heartbeat in Hadoop:





In the Hadoop eco-system, a Heartbeat is the communication between the Namenode and a Datanode: it is the signal sent by the Datanode to the Namenode at a regular interval. If a Datanode in HDFS does not send a heartbeat to the Namenode for around 10 minutes (by default), the Namenode considers that Datanode unavailable.

The default heartbeat interval is 3 seconds; it is set with the dfs.heartbeat.interval property in the hdfs-site.xml file in the Hadoop installation directory.

What is Heartbeat lost:

In the Hadoop eco-system, if a Datanode does not send a heartbeat to the Namenode for around 10 minutes (by default), the Namenode considers that Datanode unavailable; this is known as "Heartbeat lost".

How to resolve Heartbeat lost:

In a Big Data distribution environment, take Hortonworks (HDP) first. In Hortonworks:
1. In HDP, check the Ambari agent status, whether it is running or not, by using "ambari-agent status".
2. If it is not running, check the log files for the Ambari server and the Ambari agent in the directories /var/log/ambari-server and /var/log/ambari-agent.

3. Follow the steps below (example commands after the list):

A) Stop the ambari-server
B) Stop the ambari-agent service on all nodes
C) Start the ambari-agent service on all nodes
D) Start the ambari-server
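
The corresponding commands (run ambari-server on the Ambari host and ambari-agent on every node, as root or with sudo) would look roughly like this:

ambari-server stop
ambari-agent stop
ambari-agent start
ambari-server start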

In Cloudera:

1. First check the Cloudera SCM agent status, whether it is running or not, by using "sudo service cloudera-scm-agent status".





2. Check the agent log files in the directory /var/log/cloudera-scm-agent/.

3. Then run the below commands as the root user:

sudo service cloudera-scm-agent status
sudo service cloudera-scm-agent stop
sudo service cloudera-scm-agent start

Summary: Hadoop follows a master/slave architecture. The master node stores the metadata and the slave nodes store the actual data. The periodic communication between the Namenode and a Datanode is called a "Heartbeat"; if it fails, it is called "Heartbeat lost", which means the Datanode is unavailable. The resolution steps above cover the Big Data distributions Hortonworks (HDP) and Cloudera (CDH) step by step.

Hadoop job (YARN Staging) error while executing simple job

In a Hadoop eco-system, a number of jobs are executing at any given time. I was trying to execute a Hive job for data validation on the Hive server in production, and while executing the Hive job in the Hive command line I got this type of error:



at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
22:33:33 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging//.staging/job_1562044010976_0003
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hadoop-yarn/staging//.staging/job_1562044010976_0003/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

The above error is a connection error to the Datanode while executing the job; at the time, the Datanode was not running properly. Find the resolution for this issue below.

Stop and restart all services:

stop-all.sh
start-all.sh

This restarts all services, including the Namenode, Secondary Namenode, Datanodes, and remaining services like Hive, Spark, etc.

If it still shows this type of error, then start the distributed file system:

start-dfs.sh

Check all the Hadoop daemons like the Namenode, Secondary Namenode, Datanode, Resource Manager, Node Manager, etc. by using the below command:

jps

Then check all node information by using "hadoop dfsadmin -report" to see whether the Datanodes are running fine or not.

The above steps apply to local/standalone and pseudo-distributed modes in the Hadoop eco-system.

For the Cloudera, Hortonworks, and MapR distributions, simply restart the Datanodes and services like Hive, Spark, etc.




Summary: In a Big Data environment we execute many jobs like Hadoop/Spark/Hive, and sometimes they fail with the above error. When that happens, the simple solution above resolves it.

Most frequently Asked Hadoop Admin interview Questions for Experienced


1. Difference between missing and corrupt blocks in Hadoop 2.0, and how to handle it?

Missing block: A missing block means that there are blocks with no replicas anywhere in the cluster.

Corrupt block: A corrupt block means that HDFS cannot find any replica containing valid data; all replicas are corrupted.




How to handle:
Use the below commands to find out which file is corrupted and then remove it:

A) hdfs fsck /
B) hdfs fsck / | egrep -v '^\.+$' | grep -v eplica
C) hdfs fsck /path/to/corrupt/file -locations -blocks -files
D) hdfs dfs -rm /path/to/file/

2. What is the reason behind the odd number for the Zookeeper count?

Because Zookeeper elects a leader based on the agreement of more than half of the nodes in the cluster. With an even number of Zookeeper nodes it is harder to form such a majority, so the Zookeeper count should be an odd number.

3. Why is Zookeeper required for Kafka?

Apache Kafka uses Zookeeper, so you need to start the Zookeeper server first. Zookeeper elects the controller and stores topic configuration.

4. What is the retention period of Kafka logs?

When a message is sent to a Kafka cluster, it is appended to the end of a log. The message remains in the topic for a configurable period of time; this period, during which Kafka keeps the log, is called the retention period of the Kafka log.
It is defined by log.retention.hours.
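
For example, the default value in config/server.properties is typically seven days:

log.retention.hours=168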

5. What is the block size in your cluster, and why is a 54 MB block not recommended?

It depends on your cluster: the Hadoop default block size is 64 MB (128 MB in Hadoop 2.x), so an arbitrary size such as 54 MB is not recommended.

6. Suppose a file is 270 MB and the block size on your cluster is 128 MB; it then occupies 3 blocks (128 + 128 + 14 MB). Is the remaining space of the 3rd block (14 MB) wasted, or can other data be appended?
The last block only occupies 14 MB of physical storage, since HDFS blocks are not padded to the full block size, so no space is wasted; however, a block belongs to a single file, so data from other files is not appended to it.

7. What are the FS image and edit logs?

FS image: In a Hadoop cluster, the entire file system namespace, file system properties and block mapping of files are stored in one image called the FS image (File System image). The edit logs record all changes made to the file system after the last FS image checkpoint.

8. What is your action plan if PostgreSQL or MySQL goes down on your cluster?

First, check the log file, identify the error, and then find the solution.

For example: if PostgreSQL reports a bad connection
Solution: First check the status of the PostgreSQL service:

sudo systemctl status postgresql

Then stop the PostgreSQL service:

sudo systemctl stop postgresql

Then, once permissions and ownership are fixed (pg_ctlcluster can be used to manage the cluster as the right user), enable and start it again:

sudo systemctl enable postgresql
sudo systemctl start postgresql

9. If both Namenodes are in standby state, will running jobs fail?
If both Namenodes are in standby, there is no active Namenode to serve client requests, so running jobs will fail (or hang) until one of the Namenodes becomes active.

10. What is the Ambari port number?

By default, the Ambari port number is 8080, for access to the Ambari web UI and the REST API.




11. Does your Kerberized cluster use LDAP or Active Directory?

It depends on your project: explain whether it is integrated with LDAP or Active Directory.

Hadoop Architecture vs MapR Architecture





Basically, in a Big Data environment Hadoop plays a major role in storage and processing, while MapR is a distribution that provides services for the eco-system. The Hadoop architecture and the MapR architecture have some differences at the storage level and in naming conventions.

For example, in Hadoop the single storage unit is called a block, but in MapR it is called a container.

Hadoop VS MapR

Coming to the architecture, the differences between the two are as follows:
The Hadoop architecture is based on the master node (Namenode) and slave node (Datanode) concept. It uses HDFS for storage and MapReduce for processing.




The MapR architecture takes a native approach: it can use SAN, NAS or HDFS-style approaches to store the metadata, and it accesses SAN storage directly with no JVM in the storage path. If a hypervisor or virtual machine crashes, data is written directly to the hard disk, and if a server goes down the entire cluster re-syncs that data node's data. MapR has its own filesystem, the MapR File System, for storage, and uses MapReduce in the background for processing.

There is no Namenode concept in the MapR architecture; it relies entirely on the CLDB (Container Location DataBase). The CLDB contains a lot of information about the cluster and is installed on one or more nodes for high availability.

This is very useful for the failover mechanism, bringing recovery time down to just a few seconds.

In the Hadoop architecture, the cluster size is defined by the master and slave machine nodes, but in MapR the CLDB default size is 32 GB in a cluster.




 

In Hadoop Architecture:

NameNode
Blocksize
Replication

 

In MapR Architecture:

Container Location DataBase
Containers
Mirrors

Summary: The MapR architecture is built on the same core concepts as Apache Hadoop, including all the core components of the distribution. The Big Data environment has different distributions such as Cloudera and Hortonworks, while MapR is an enterprise edition. MapR is a stable distribution compared to the others, and it provides default security for all services.