Basics of Big Data and Hadoop for Beginners

Big Data Solutions:

1. NoSQL – non-relational databases – suited only to structured and semi-structured data

2. Hadoop – implementation – handles structured, semi-structured and unstructured data

3. Hadoop ecosystem – Hadoop together with its related components, covering everything else

Hadoop:

Hadoop is a distributed, parallel system for storing and processing large volumes of data. It is one solution to the Big Data problem.

For storage, Hadoop uses HDFS (Hadoop Distributed File System).
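To give a feel for what "distributed" means for storage, here is a rough sketch in Python. HDFS stores a large file by cutting it into fixed-size blocks and keeping several copies of each block on different nodes. The 128 MB block size and the replication factor of 3 are the usual defaults in Hadoop 2.x and later. The function name plan_blocks, the node names, and the round-robin placement are assumptions made for illustration only; real HDFS uses its own rack-aware placement policy.

```python
import math
from typing import Dict, List

BLOCK_SIZE_MB = 128   # usual HDFS default block size in Hadoop 2.x and later
REPLICATION = 3       # usual HDFS default replication factor


def plan_blocks(
    file_size_mb: int,
    nodes: List[str],
    block_size_mb: int = BLOCK_SIZE_MB,
    replication: int = REPLICATION,
) -> Dict[int, List[str]]:
    """Return a mapping of block id -> nodes holding a copy of that block.

    Placement here is simple round-robin (an illustrative assumption),
    not the rack-aware policy that real HDFS applies.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")

    # A file is split into ceil(size / block size) blocks.
    block_count = math.ceil(file_size_mb / block_size_mb)

    placement: Dict[int, List[str]] = {}
    for block_id in range(block_count):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    # Hypothetical 4-node cluster storing a 300 MB file.
    for block, holders in plan_blocks(300, ["node1", "node2", "node3", "node4"]).items():
        print(block, holders)
```

A 300 MB file needs three blocks here, and each block lives on three of the four nodes, so losing one machine does not lose any data.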

For processing, Hadoop uses MapReduce.
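The MapReduce idea can be sketched in plain Python with the classic word-count example. This is not the Hadoop API, only a single-machine imitation of the three steps: map each line to (word, 1) pairs, group the pairs by word, then reduce each group to a total. The function names map_phase, shuffle, reduce_phase and word_count are made up for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, the job the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: add up all the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on a single machine."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

In a real Hadoop job the map and reduce steps run in parallel on many nodes, each working on the part of the data stored near it; the logic of each step stays the same.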

A few terms are essential when learning Hadoop.

Hadoop Basic Terminology:

1. Cluster

2. Clustered Node

3. Hadoop Cluster Node

4. Hadoop Cluster

5. Hadoop Cluster Size

1. Cluster:

A cluster is a group of nodes that belong to one common network.

2. Clustered Node:

A clustered node is an individual machine that belongs to a cluster.

3. Hadoop Cluster Node:

A Hadoop cluster node is a clustered node that provides the basic storage and processing of a Hadoop cluster.

For storage, it uses the Hadoop Distributed File System (HDFS).

For processing, it uses MapReduce.

4. Hadoop Cluster:

A Hadoop cluster is a collection of Hadoop cluster nodes in a common network.

5. Hadoop Cluster Size:

The Hadoop cluster size is the total number of nodes in a Hadoop cluster.
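The five terms above fit together as a small data model. The sketch below is only a way to picture them; the class names HadoopClusterNode and HadoopCluster, the hostnames, and the network address are hypothetical and do not correspond to any Hadoop API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HadoopClusterNode:
    """One machine in the cluster that both stores and processes data."""
    hostname: str
    storage: str = "HDFS"          # storage layer used on the node
    processing: str = "MapReduce"  # processing layer used on the node


@dataclass
class HadoopCluster:
    """A collection of Hadoop cluster nodes on one common network."""
    network: str
    nodes: List[HadoopClusterNode] = field(default_factory=list)

    def add_node(self, hostname: str) -> None:
        """Join one more machine to the cluster."""
        self.nodes.append(HadoopClusterNode(hostname))

    @property
    def size(self) -> int:
        """Hadoop cluster size: the total number of nodes."""
        return len(self.nodes)


if __name__ == "__main__":
    cluster = HadoopCluster(network="10.0.0.0/24")  # hypothetical network
    for name in ("node1", "node2", "node3"):
        cluster.add_node(name)
    print(cluster.size)
```

Here the cluster size is simply the length of the node list, which matches the definition in item 5.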

Hadoop Ecosystem:

1. Apache Pig – Processing – Pig Latin scripting

2. Apache Hive – Processing – HiveQL (a query language similar to SQL)

3. Sqoop – Integration tool – Imports and exports data

4. ZooKeeper – Coordination – Distributed coordination service

5. Apache Flume – Streaming – Collects and streams log data

6. Oozie – Scheduling – Open-source scheduler for Hadoop jobs

7. HBase – Random access – Hadoop + dataBASE

8. NoSQL – Not Only SQL – MongoDB, Cassandra

9. Apache Kafka – Messaging – Distributed messaging

10. YARN – Resource management – Yet Another Resource Negotiator

Note: Apache Spark is not part of Hadoop, but nowadays it is commonly used alongside it for data processing. Spark is often cited as being up to 100 times faster than Hadoop MapReduce for in-memory workloads; the actual speedup depends on the job.

Compatible Operating Systems for Hadoop Installation:

1. Linux

2. Mac OS

3. Sun Solaris

4. Windows

Hadoop Versions:

Hadoop 1.x

Hadoop 2.x

Hadoop 3.x

Different Distributions of Hadoop:

1. Cloudera Distribution for Hadoop (CDH)

2. Hortonworks

3. MapR