What is Hive and Architecture of Hive

What is the HIVE?




Apache Hive is a data warehousing infrastructure built on top of Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable data summarization, ad-hoc querying, and analysis of large volumes of data. At the same time, Hive's SQL dialect gives users multiple ways to integrate their own functionality for custom analysis, such as user-defined functions (UDFs).
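For example, a simple ad-hoc summarization query can be run straight from the shell with the Hive CLI (the employees table used here is purely illustrative):

hive -e "SELECT department, COUNT(*) FROM employees GROUP BY department;"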

Architecture of HIVE

Hive exposes three main interfaces: the CLI (Command Line Interface), JDBC (Java Database Connectivity), and a web GUI (Graphical User Interface). A user working through the CLI connects directly to the Hive driver, while a user coming through JDBC connects to the driver via its API. When the Hive driver receives a query from the user, it submits the work to the Hadoop layer, which uses the name node, data node, job tracker, and task tracker to read and process the data.

How to Install Flume on Ubuntu/Linux in Hadoop

Flume Installation on Ubuntu/Linux:





Step 1: Download Apache flume tarball from Apache Mirrors

Step 2: Extract the Downloaded Tarball using below command

tar -xzvf apache-flume-1.7.0-bin.tar.gz

Step 3: Update the FLUME_HOME & PATH variables in the .bashrc file
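The exact lines depend on where you extracted the tarball; assuming it sits under your home directory, they would look like:

export FLUME_HOME=/home/<user>/INSTALL/apache-flume-1.7.0-bin
export PATH=$PATH:$FLUME_HOME/bin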

Step 4: To check the bashrc changes, open a new terminal and type ‘echo $FLUME_HOME‘ command

Step 5: To check the Flume version
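Assuming $FLUME_HOME/bin is on the PATH, the command is:

flume-ng version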

How to Install MongoDB on Ubuntu/Linux in Hadoop

MongoDB is a NoSQL database that is popular among many enterprises. It is a purely open-source document database. MongoDB stores data as documents in a format called BSON, a binary representation similar to JSON that makes applications faster to build and run.

Why MongoDB?

Full index support: you can use indexes just as you would in an RDBMS.

Replication & High Availability: MongoDB supports replication of data between servers for failover.

Querying: if you already know a query language, querying in MongoDB is easy to pick up.

Step 1: Download the tarball from the MongoDB website. Select the version that matches your Ubuntu release and download the tarball.

 

Step 2: Extract the tarball using the below command:

tar -xzvf <your-tarball-name>.tgz

Step 3: After that, update the MONGODB_HOME & PATH variables in the .bashrc file using the below command

nano ~/.bashrc
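Append lines like the following to the end of the file (the install path is an assumption; adjust it to wherever you extracted the tarball):

export MONGODB_HOME=/home/<user>/INSTALL/mongodb-linux-x86_64-<version>
export PATH=$PATH:$MONGODB_HOME/bin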

Step 4:  To check the bashrc changes, open a new terminal and type ‘echo $MONGODB_HOME‘ command.

After that, check the exact version of MongoDB:
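Assuming $MONGODB_HOME/bin is now on the PATH:

mongod --version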

After installing and configuring MongoDB, we will start the services.

Step 5: Before starting the MongoDB service for the first time, we need to create the data directory:
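MongoDB uses /data/db as its default data directory; a typical way to create it (assuming mongod will run as your own user) is:

sudo mkdir -p /data/db
sudo chown -R $USER /data/db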

 

 

Step 6: To start the MongoDB service use the  below command

mongod

After the service starts, open the mongo shell to write and read queries. Non-relational databases do not offer all the features of relational ones; in particular, they usually cannot provide full ACID guarantees. So MongoDB will not replace an RDBMS where strict business consistency is required.

How to Install HIVE with MySQL on Ubuntu/Linux in Hadoop

Apache Hive is a data warehouse system mostly used for summarizing structured data. Hive is one of the components of Hadoop, built on top of HDFS, and acts as a data-warehouse-style system inside Hadoop. It works with tabular (structured) data rather than flat files.

Step 1: Download the hive-1.2.2 tarball from the Apache mirrors official website

http://apache.mirrors.tds.net/hive/hive-1.2.2

Step 2: Extract the tarball into your path using the below command:




tar -xzvf apache-hive-1.2.2-bin.tar.gz

Step 3: Update the HIVE_HOME & PATH variables in the .bashrc file

export HIVE_HOME=/home/sreekanth/Big_Data/apache-hive-1.2.2-bin

export PATH=$PATH:$HIVE_HOME/bin

Step 4: After updating, run ‘source ~/.bashrc’ (or open a new terminal) so the changes take effect, then go to the next step.

Step 5: To check the bashrc changes, open a new terminal and type the command

echo $HIVE_HOME

Step 6: Remove the jline-0.9.94.jar file from the path below to avoid incompatibility issues between this Hive version and hadoop-2.6.0.
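The exact path was shown as a screenshot in the original post; on a typical Hadoop 2.6.0 layout (an assumption about your install) the conflicting jar sits in the YARN lib directory, so the command would look like:

rm $HADOOP_HOME/share/hadoop/yarn/lib/jline-0.9.94.jar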

Step 7: There are two types of metastores we can configure in Hive to store metadata:

An embedded metastore using Derby, which supports only a single user.

An external metastore using MySQL, which supports multiple users.

If your conf directory does not contain a hive-site.xml file, create one.

Step 8: Configure the hive-site.xml file with the MySQL settings by adding the below content:
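The full content was shown as a screenshot; a minimal sketch for a MySQL-backed metastore is given below, where the database name ‘metastore’, user ‘hiveuser’, and password ‘hivepassword’ are assumptions to replace with your own MySQL details:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>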

Step 9: For the external ‘MySQL’ metastore, we need the MySQL connector JAR file.

Step 10: Copy the MySQL connector JAR file into the $HIVE_HOME/lib path.
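For example (the connector version and its location are assumptions):

cp mysql-connector-java-5.1.38.jar $HIVE_HOME/lib/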

Step 11: Run the hive command in a terminal; if the Hadoop daemons are not running, it will show a ‘connection refused’ error.

Because the daemons are not running, it is necessary to start all of them; otherwise Hive will not work.

Step 12: First start all the daemons using the start-all.sh command.

Step 13: Now Hive runs successfully on your machine.


Step 14: Check the Hive version using the below command:

hive --version

Why do we use Hive?

Because it enables data summarization and querying of tabular data in the Hadoop system. The default Hive metastore database, Derby, supports only one user; MySQL is mostly used for larger data sets and multiple users.

How to Install Hadoop in Ubuntu/Linux in Single Node Cluster

Hadoop is one of today's most prominent emerging technologies: a big data solution for storing and processing large amounts of data. HDFS handles storage and MapReduce handles processing, although MapReduce is used less these days; many workloads have moved to Apache Spark, which performs far better than MapReduce.

Step 1: First, we need to update the system software repositories using the below command:
sudo apt-get update

Step 2: Next, install the Java 1.8 version using the below command:

sudo apt-get install openjdk-8-jdk

Step 3: After that check Java Version using below command:

java -version

Step 4: Install ssh using the below command:

sudo apt-get install ssh

For passwordless SSH communication, enter the below commands in any terminal:

ssh localhost

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

 

STEP 5: Download the Hadoop-2.6.0 tarball from the Apache mirrors on the official Apache website.

STEP 6: Extract the copied tar ball using below command:

tar -xzvf hadoop-2.6.0.tar.gz

Below are the configuration files in the Hadoop configuration directory (etc/hadoop).

STEP 7: We must edit the below 8 configuration files as part of the Hadoop installation:
1. core-site.xml

2. mapred-site.xml

3. mapred-env.sh

4. yarn-site.xml

5. hdfs-site.xml

6. hadoop-env.sh

7. yarn-env.sh

8. slaves

 

STEP 8: Open the core-site.xml file and add the below properties.
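The properties were shown as a screenshot in the original post; a minimal single-node sketch (hdfs://localhost:9000 is an assumed value for the default filesystem) is:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>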

STEP 9: Open “hadoop-env.sh” file and update JAVA_HOME path

 

STEP 10: Open mapred-env.sh and update JAVA_HOME

STEP 11: Open hdfs-site.xml  file and add the below properties:
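A minimal sketch, assuming a single-node cluster with a replication factor of 1, is:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>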

STEP 12: Open mapred-site.xml and set the framework to “yarn”.
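In Hadoop 2.6.0 this file may first have to be copied from mapred-site.xml.template; the minimal content is:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>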

STEP 13: Open yarn-env.sh and update JAVA_HOME path in that file

STEP 14: Open yarn-site.xml and add the below properties to configure “Resource Manager”.
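A minimal sketch for a single-node setup (localhost as the ResourceManager host is an assumption) is:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>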

STEP 15: Open the slaves file and check whether the hostname is localhost or not.

STEP 16: Update and Set JAVA_HOME, HADOOP_HOME & PATH variables:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export HADOOP_HOME=/home/gopalkrishna/INSTALL/hadoop-2.6.0

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

STEP 17: To check the bashrc changes, open a new terminal and type the below command:

echo $HADOOP_HOME

STEP 18: Before starting the NameNode, we must format it using the below command:

hadoop namenode -format

STEP 19: To start all the daemons of hadoop in 2.X.X use “start-all.sh” command

Step 20: To check whether the NameNode, NodeManager, and DataNode are running or not, use the below command:

jps

STEP 21: To access the NameNode information in the GUI, use the below link on your system:

http://localhost:50070

STEP 22: To Start Job History Server in Hadoop Cluster using below command

mr-jobhistory-daemon.sh start historyserver

STEP 23: To Access Resource Manager in Hadoop cluster:

localhost:8088

STEP 24: To Access Job History Server in Hadoop Cluster

localhost:19888

STEP 25: To stop all the daemons of hadoop in 2.X.X use “stop-all.sh” command

STEP 26: To Stop Job History Server in 2.x.x.

mr-jobhistory-daemon.sh stop historyserver

How to Install Scala on Hadoop in Linux

Scala is one of the most popular functional programming languages today. Scala is similar to Java but a little bit different, and it becomes especially important when Apache Spark enters the picture. Here are some steps for the Scala installation.

Step 1: Download the Scala tarball from scala official website in your machine.

After downloading, put the tarball into your Hadoop-related path, then follow the step below.
Step 2: Extract the tar ball using below command:

tar -xzvf scala-2.11.8.tgz

Step 3: Check whether the extracted Scala files are there or not, then go to the next step.

Step 4: Update the SCALA_HOME & PATH variable in bashrc file
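Assuming the tarball was extracted under your home directory (an assumption), the lines would look like:

export SCALA_HOME=/home/<user>/INSTALL/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin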

After the update, the SCALA_HOME and PATH environment variables are automatically picked up by the .bashrc file.

Step 5: After the bashrc changes, open a new terminal and check them using the ‘echo $SCALA_HOME’ command.

This confirms whether SCALA_HOME has been updated or not.

Step 6: After that, check the Scala version:
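Assuming $SCALA_HOME/bin is on the PATH:

scala -version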

How to Install Kafka on Ubuntu/Linux in Hadoop

Apache Kafka is an open-source stream-processing platform developed by the Apache Software Foundation. Here are simple steps for installing it on an Ubuntu/Linux operating system in the Hadoop ecosystem.

Step 1: Download the Kafka tarball from the Apache mirrors on the official Apache website:

http://apache.mesi.com.ar/kafka/0.10.1.1/

If you want it in a specific Hadoop-related directory, create that directory and copy the Kafka tarball into it.

Step 2: After downloading the Kafka tarball, extract it using the below command:

tar -xzvf kafka_2.10-0.10.2.1.tgz

After extracting Apache Kafka you get the Kafka folder (directory), including the lib files and configuration files.

Step 3: After extracting the Kafka tarball we can see the Kafka directory.

Next we need to update the KAFKA_HOME and PATH variables in the .bashrc file; simply follow the step below.

Step 4: Update the KAFKA_HOME & PATH variable in bashrc file:
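Assuming the tarball was extracted under your home directory (an assumption), the lines would look like:

export KAFKA_HOME=/home/<user>/INSTALL/kafka_2.10-0.10.2.1
export PATH=$PATH:$KAFKA_HOME/bin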

Step 5: After bashrc changes, open a new terminal and check the bashrc changes using ‘ echo $KAFKA_HOME  ‘ command

Apache Kafka is mainly a messaging system: producers are processes that publish data into Kafka topics via the brokers, and consumers pull messages off those topics.

After Kafka has been installed successfully, we can start and stop the Kafka broker using the simple commands below.

Go to the Apache Kafka home directory and execute the command:

./bin/kafka-server-start.sh config/server.properties

To stop the Kafka broker, use the below command:

./bin/kafka-server-stop.sh

How to Set Up a Cloudera Multi Node Cluster with Pictures

Cloudera Installation and Multi Node Cluster Configuration



1. Open PuTTY.

2. Type your machine's IP address and then click on Open.

3. Then log in with your username & password.

4. Type: vi /etc/hosts, then add the remaining hosts.

5. Edit: vi /etc/sysconfig/network

6. Type: vi /etc/selinux/config

Replace SELINUX=enforcing with SELINUX=disabled.

7. Type: setenforce 0

8. Type: yum install ntp ntpdate ntp-doc to install NTP (Network Time Protocol).

9. After installing NTP, enable it at boot. Type: chkconfig ntpd on

10. Type: vi /etc/ntp.conf

11. Type: ntpq -p

12. Then start the service: service ntpd start

13. Then run ntpq to verify.

14. Then generate an RSA key pair with ssh-keygen -t rsa on the remaining machines.

15. Save the file as id_rsa (the default).

16. cd /root/.ssh

17. Run ll and check whether id_rsa.pub is there or not.

18. cat id_rsa.pub > authorized_keys

19. Type: scp authorized_keys root@<machine>.localdomain:/root/.ssh

20. Then type: yum install openssl python perl
21. yum clean all
22. yum repolist

23. Then download Cloudera Manager using the below command:

wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin

24. chmod 700 cloudera-manager-installer.bin

25. Then type ./cloudera-manager-installer.bin and click on Next.

 

26. After that, accept the license.

27. It will automatically install the JDK.

28. It automatically installs the embedded database.

29. The Cloudera Manager server installs.

30. The installation completes successfully.

31. Click on “OK”.

32. If you get any error, disable the firewall and iptables.
33. To disable the firewall, type: systemctl disable firewalld

34. To disable IPv6, type: vi /etc/sysctl.conf

35. Browse to your machine IP: xxx.xxx.xx.xxx:7180

36. Log in with Username: admin

Password: admin

37. Accept the end user license “Terms and Conditions”.

38. Select Cloudera Express “Free”.

39. Then search for the host machines by their domain names.

40. Select the repository.

41. If you need any proxy settings, select and fill them in; otherwise leave them empty.

42. Click on Continue for the three-machine cluster installation. If there is any issue, try Mozilla Firefox.

43. Click on “Continue” and check the CDH version.

44. When it is 100% completed, click on “Continue”.

45. After “Continue”, check the validations.

46. Here two validations typically show warnings; type the below commands and then click Run Again:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
sysctl vm.swappiness=10

47. Click on “Finish”

48. It shows the version summary.

49. The HDFS NameNode and ResourceManager must be on different hosts.


50. Select “Core with Spark”, then Continue.

 




51. Click on “Test Connection” when using the embedded database.

52. The cluster is set up successfully.

How to Install SQOOP on Ubuntu

Apache Sqoop Installation on Ubuntu

Apache Sqoop is one of the Hadoop components. It is mainly used for transferring data between HDFS and an RDBMS in either direction, i.e. moving bulk data between Hadoop and data stores such as relational databases.

Prerequisites :

Before you can install Sqoop, you need Hadoop 2.x.x, which is compatible with Sqoop 1.x.x.

Step 1: Download SQOOP 1.x.x tar ball from below website:

http://redrockdigimark.com/apachemirror/sqoop/1.4.6/

Step 2: After downloading extract the SQOOP tar ball using below  command:

tar -xzvf sqoop-1.x.x.bin-hadoop-2.x.x-alpha.tar.gz

Step 3: Update the bashrc file with SQOOP_HOME & PATH variables

export SQOOP_HOME=/home/slthupili/INSTALL/sqoop-1.x.x.bin-hadoop-2.x.x

export PATH=$PATH:$SQOOP_HOME/bin

Step 4: To check the bashrc changes, open a new terminal and type ‘echo $SQOOP_HOME’

 

Step 5: To integrate Hadoop with a MySQL database using Sqoop, we must place the respective JAR file (mysql-connector-java-5.1.38.jar) in the $SQOOP_HOME/lib path.

Step 6: To check the version of SQOOP using below command:

sqoop version

The above steps are a simple way to install Sqoop on top of Hadoop on Ubuntu.


Sqoop imports data from a relational database management system (RDBMS) such as MySQL into the Hadoop Distributed File System, and can also export it back. Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported, and it uses MapReduce to import and export the data.
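As an illustration (not part of the original post), a typical import command looks like the following; the connection string, credentials, table name, and target directory are all placeholders:

sqoop import --connect jdbc:mysql://localhost/mydb --username dbuser --password dbpass --table employees --target-dir /user/<user>/employees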

How to Install Hadoop Single Node Cluster




How to Install Hadoop Single Node Cluster on Ubuntu.

Step 1: Update the “System Software Repositories” using sudo apt-get update.

The first step updates the package lists of Ubuntu.

Step 2: Install the Java 1.8 JRE using the below command.

Java is a prerequisite for the installation, so first install the JRE and then the JDK.
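Assuming the standard Ubuntu OpenJDK packages, the command would be:

sudo apt-get install openjdk-8-jre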

 

Step 3: Install the Java 1.8 JDK using the below command:
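Again assuming the Ubuntu OpenJDK packages:

sudo apt-get install openjdk-8-jdk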

 

Step 4: Check the Java version on Linux using the below command:
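The command is the same as on any Linux machine:

java -version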

Step 5: After that, we must install SSH (Secure Shell) using the below command:

SSH enables passwordless communication between the NameNode and the Secondary NameNode, which communicate frequently.
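On Ubuntu this is typically:

sudo apt-get install ssh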

Step 6: Check the SSH installation using the below command.

After installing SSH, check whether the communication works using the ssh localhost command.

Step 7: Download the Hadoop-2.6.0 tarball from the Apache mirrors.

Once the Hadoop prerequisites are in place, download the Hadoop tarball.

Step 8: Extract the tarball using the same tar -xzvf hadoop-2.6.0.tar.gz command as before.

 

Step 9: Update Environment variables and Path for HADOOP_HOME and JAVA_HOME:

 

 

Step 10: Check that the PATH variable is set; after that, edit the configuration files as part of the Hadoop installation.

 

 

Step 11: First open the “core-site.xml” file and add the properties.

The core-site file holds the NameNode information.

Step 12: Open the “hdfs-site.xml” file and add the properties.

The hdfs-site.xml file holds the replication factor and DataNode information.

Step 13: Open the “yarn-site.xml” file and add the properties to configure the ‘Resource Manager’ & ‘Node Manager’ details:

Step 14: Update the JAVA_HOME path in the ‘hadoop-env.sh’ file.




Step 15: Update the JAVA_HOME path in the ‘mapred-env.sh’ file.

Step 16: Open ‘mapred-site.xml’ and set the framework to yarn in that file.

Step 17: Open slaves file and check whether the hostname is localhost or not

Step 18: Before starting the NameNode, we must format it using the below command: hadoop namenode -format

Step 19: To start all the daemons of hadoop using below command:

start-all.sh

Step 20: Check whether the daemons are running or not using the jps command.


Step 21: After all that, access the NameNode information in the GUI:

http://localhost:50070