HBASE in Hadoop

HBASE – Hadoop dataBASE 





Apache HBase runs on top of Hadoop. It is an open-source, distributed, NoSQL database.

Hadoop is designed for batch processing, and data in HDFS is accessed only sequentially, which results in high latency. HBase, by contrast, internally uses hash tables to provide random access, and it stores its data in HDFS files indexed by key for faster lookups, thus providing much lower latency than plain HDFS.
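To make the random-access point concrete, below is a minimal sketch using the HBase Java client API. It is an illustration only: the table name users, the column family info, and the values are hypothetical, and it assumes such a table already exists and that hbase-site.xml is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user100", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user100"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row directly by its row key.
            Get get = new Get(Bytes.toBytes("user100"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}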

Here is a comparison of HBase and an RDBMS.

Some key points of comparison between HBase and an RDBMS:

HBase does not directly support joins or aggregations, but it can handle very large amounts of data.

An RDBMS supports joins and aggregations through SQL, but it can handle only a limited amount of data at a time.

Comparison of HBase and HDFS:

HBase is a distributed, column-oriented database that stores data as key-value pairs. HDFS is a distributed file system that stores data in the form of flat files.

HBase supports random reads and writes, whereas HDFS offers only sequential file access; random writes are not possible because HDFS follows a write-once, read-many model.

HBase is suitable for low-latency operations, providing access to specific rows within a large volume of data. HDFS is suitable for high-latency operations through batch processing.

Comparison of HBase and other NoSQL databases:

HBase is itself a NoSQL database and stores data as <key, value> pairs, which is also the default storage model in most NoSQL databases.




HBase offers horizontal scalability, and other NoSQL databases are also horizontally scalable.

HBase can use MapReduce for processing data, whereas most NoSQL databases can perform only basic CRUD operations; complex aggregations are hard for them to handle, so they need to be integrated with solutions like Hadoop for complex processing.

HBase follows a master-slave model to achieve parallel processing.

HBase permits two types of access: random access to rows through their row keys, and offline or batch access through MapReduce jobs. In other NoSQL databases, random access to data is also possible.
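As a further sketch of key-based access, the HBase client API (assuming the HBase 2.x Scan builder methods) can read a bounded range of row keys; the table name and key range below are again hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserRangeScan {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Scan only the rows whose keys fall in the range [user100, user200).
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user100"))
                    .withStopRow(Bytes.toBytes("user200"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}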

MapReduce in Hadoop





MapReduce:

MapReduce (MR) is the core processing component of Hadoop, meant for processing huge volumes of data in parallel on commodity hardware. It is a programming model built around two important tasks: Map and Reduce.

Map: takes a set of data and converts it into another set of data, where individual elements are broken into tuples in the form of key-value pairs.

Reduce: takes the output from the map tasks as input and combines those data tuples into a smaller set of tuples.

MapReduce Life Cycle:

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Both the input and output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
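To illustrate these <key, value> transformations, here is a minimal word-count sketch against the Hadoop MapReduce Java API. The class names are our own invention, and the input is assumed to be plain text read with the default TextInputFormat (keys are byte offsets, values are lines).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: break each input line into (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: sum the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}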





MapReduce Programming Model:

  • The input data is split into independent chunks and converted into key-value pairs. This is done by the map tasks in parallel.
  • The output of the map tasks is sorted by key.
  • The sorted output is the input to the reduce tasks, which produce the final output and return it to the client (a minimal driver wiring these steps together is sketched right after this list).
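A minimal driver for this model might look like the sketch below. It reuses the WordCountMapper and WordCountReducer classes from the previous sketch and takes hypothetical HDFS input and output paths as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // splits lines into (word, 1) pairs
        job.setReducerClass(WordCountReducer.class); // sums the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input (hypothetical)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output (hypothetical)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}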

 

Hadoop Distributed File System – HDFS in Hadoop

HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. It is a distributed file system that provides high-performance access to data across Hadoop clusters. When HDFS takes in data, it breaks the information into smaller parts called blocks, allowing for parallel processing.
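The block structure can also be inspected from code. Below is a rough sketch using the Hadoop FileSystem Java API; it assumes the cluster configuration (core-site.xml/hdfs-site.xml) is on the classpath, and the file path passed as an argument is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]); // e.g. /data/sample.txt (hypothetical)
        FileStatus status = fs.getFileStatus(file);

        // Print each block of the file and the DataNodes that hold a copy of it.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}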




HDFS is built to support applications with large data sets, including individual files that can run into the gigabytes or terabytes.

It uses a master/slave architecture.

 

Features of HDFS:

A)Fault Tolerance

Fault tolerance refers to the working strength of a system in unfavorable conditions and how it handles such situations. HDFS is highly fault tolerant: data is divided into blocks, and multiple copies of each block are created on different machines in the cluster.

B)High Availability

HDFS is a highly available file system: data is replicated among the nodes in the HDFS cluster by creating replicas of the blocks on the other slaves in the cluster. When a node fails, users can still access their data from other nodes, because duplicate copies of the blocks containing that data exist elsewhere in the cluster.

C)Replication

Data replication is one of the most important and distinctive features of HDFS. Data is replicated to avoid data loss in unfavorable conditions such as a node crash or hardware failure. Blocks of data are replicated across a number of machines in the cluster.




D)Reliability

HDFS is a distributed file system that provides reliable data storage and can store data in the range of hundreds of petabytes. It divides data into blocks, stores those blocks on the nodes of the HDFS cluster, and creates a replica of every block on other nodes, thereby providing fault tolerance.

E)Distributed Storage

All of the above features are achieved through distributed storage and replication: data is stored in a distributed manner across the nodes of the HDFS cluster.

What is Big Data?




Big Data means:

Big Data is a term for data sets that are so large or complex that traditional data processing applications are insufficient to deal with them. Challenges include analysis, analytics, data capture, data streaming, search, storage, visualization, querying, updating, and information privacy. The term often refers simply to the use of predictive analytics, user behavior analytics, and other advanced data analytics methods that extract value from data.

Big data and analytics require different techniques and technologies, with new forms of integration, to reveal insights from data sets that are diverse, complex, and large in scale.

Facts of Big Data:

A)Data is growing faster than ever before; by the year 2020, around 2.0 megabytes of new information will be created every second for every person on the planet.

B)Data volumes are exploding: more data has been created in the past two years than in the entire previous history of data collection.

C)We are seeing massive growth in video and photo data, with huge volumes uploaded to and downloaded from social media.

D)Social media users send on average around 50 million messages and view around 5 million videos every minute.

E)Distributed computing is very real: Google, for example, uses it every day, involving about 1,000 computers in answering a single search query.

Uses of Big Data:

A)Nowadays organizations are increasingly turning to big data to discover new ways to improve decision-making, opportunities, and performance.

B)Operational insights might depend upon machine data, which can include anything from computers and sensors to meters and GPS devices.

C)Cyber security, identity protection, and fraud detection are other uses of big data. With access to real-time data, a business can enhance its security and intelligence analysis platforms.




Finally, big data poses a storage and processing problem for very large data sets, and Hadoop is a simple and widely used solution to it.

Hadoop Admin Commands with Examples

Apache Hadoop is an open-source framework for storage and processing: HDFS (Hadoop Distributed File System) is used for storage and MapReduce for processing. Here are some Hadoop admin commands for beginners.



Hadoop Admin Commands:

hadoop namenode -format:

This command formats the HDFS file system from the NameNode of a cluster.

hadoop dfsadmin -report:

This command shows a report on the overall HDFS file system. It is very useful for seeing how much disk space is available, NameNode information, how many DataNodes are running, and whether there are any corrupted blocks in the cluster.

hadoop dfsadmin -refreshNodes:

This command is used to commission or decommission nodes.

Safe mode commands in Hadoop:

Safe mode is a state of the NameNode in which no changes to the file system are allowed; the Hadoop cluster is effectively read-only. Three commands are related to safe mode.

hadoop dfsadmin -safemode get:

safemode get reports the current status of safe mode (maintenance mode).

hadoop dfsadmin -safemode enter:

This command turns safe mode ON.

hadoop dfsadmin -safemode leave:

The leave command turns safe mode OFF.

start-all.sh:

Starts all the daemons: NameNode, Secondary NameNode, DataNodes, ResourceManager, NodeManagers, and so on.

stop-all.sh:

Stops all the daemons: NameNode, DataNodes, Secondary NameNode, ResourceManager, NodeManagers, and so on.

hadoop fs -copyFromLocal <file 1> <file 2>:

Use this command when you need to copy files from the local file system to HDFS.

hadoop fs -copyToLocal <file 1> <file 2>:

Copies files from HDFS to the local file system in a Hadoop cluster.

hadoop fs -put <file 1> <file 2>:

This command works like copyFromLocal, with the small difference that the source for -put is not restricted to the local file system.

hadoop fs -get <file 1> <file 2>:

This command works like copyToLocal, with the small difference that the destination for -get is not restricted to the local file system.

hadoop fs -setrep -w 5 <file>:

Use this command to set the replication factor of a file manually; here it is set to 5, and -w waits until replication is complete.
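The copy and replication commands above can also be issued programmatically through the Hadoop FileSystem Java API. This is only a rough sketch with hypothetical paths; it assumes the cluster configuration is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Equivalent of: hadoop fs -copyFromLocal /tmp/sample.txt /data/sample.txt
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path("/data/sample.txt"));

        // Equivalent of: hadoop fs -copyToLocal /data/sample.txt /tmp/sample-copy.txt
        fs.copyToLocalFile(new Path("/data/sample.txt"), new Path("/tmp/sample-copy.txt"));

        // Equivalent of: hadoop fs -setrep 5 /data/sample.txt
        boolean changed = fs.setReplication(new Path("/data/sample.txt"), (short) 5);
        System.out.println("Replication factor updated: " + changed);

        fs.close();
    }
}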

hadoop distcp hdfs://<ip1>/input hdfs://<ip2>/output :

Use this command to copy files from one cluster to another.

hadoop job -status <job-id>:

Use this command to check the status of a Hadoop job.

hadoop job -submit <job-file>:

Use this command to submit a Hadoop job file.

hadoop job -list all:

Use this command to list all Hadoop jobs.

hadoop job -kill-task <task-id>:

Sometimes a task must be killed urgently while a job is running; this command is useful in that situation.




How to Install JAVA 11 latest version on Windows





Java remains one of the most popular programming languages; here is a simple guide for beginners on installing the latest version, Java 11, on the Windows 10 operating system. Simply download JDK 11 from the official Java website, then install it with the steps below.

Step 1: Download JDK 11 from the website; after downloading the .exe file, run it as administrator.

Running this JDK 11 .exe file opens the installation wizard.

Step 2: Click “Next” for a complete installation; when the wizard prompts you one by one, simply keep clicking the Next button.

Step 3: Once JDK 11 is successfully installed, configure it as follows.

Step 4: Go to Local Disk -> Program Files and check whether JDK 11 is there.

Step 5: Go to the JDK 11 bin directory and copy its path for further configuration.

Step 6: Right-click on This PC (Computer) and click Properties.

 

Step 7: Choose Advanced system settings

Step 8: On the Advanced tab, click Environment Variables.

Step 9: Under System variables, select Path and then click Edit.

Step 10: Paste your JDK 11 bin path (directory) here




Step 11: Click Move Up until the JDK path reaches the first place, for convenience. (This list-style Path editor is only in the Windows 10 operating system.)

Step 12: Check whether the JDK directory is now in the top place.

 

Step 13: Open a command prompt and check whether Java 11 is installed by querying the Java version with the command below.

java -version

Step 14: Check the Java compiler version with the command below on Windows.

        javac -version 

 

 

JAVA 11 Features:

1. Java 11 upgrades Unicode support to version 10 of the Unicode Standard (which incorporates the version 9 and 10 updates).

2. Single-file source-code programs: a .java source file can be launched directly with the java command, without a separate compilation step.

3. Introduces the new standard HTTP Client module and package, java.net.http.*; its main types are (see the sketch after this list):

A) HttpClient

B) HttpRequest

C) HttpResponse

D) WebSocket.

4. Some modules are removed: the Java EE and CORBA modules.
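As a small sketch that exercises features 2 and 3 together, the single-file program below uses the new java.net.http client; the URL is only an example. With Java 11 it can be launched directly, without a separate compile step, as: java HttpGetExample.java

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpGetExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Build a simple GET request against an example URL.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com"))
                .GET()
                .build();

        // Send it synchronously and print the status code and body.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}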

Summary: Java recently released its latest version; installing the Java packages on the Windows operating system has changed a little, and the new Java 11 features are explained above.

Difference between Managed and External Tables with Syntax in HIVE


Apache Hive is mainly used for data summarization and querying. HiveQL is similar to SQL, with some differences in how data is summarized and processed through the query language.




Here we explain the difference between the two kinds of Hive tables, with syntax.

A Hive table is logically made up of the data itself plus metadata describing the layout of that data in the table.

There are two types of Hive table:

1.Managed or Default or Internal Table

2.External Table

1.Managed Table:

When you create a table in Hive, by default Hive manages the data, which means that Hive moves the data into its warehouse directory.

The default warehouse location in HDFS:

“/user/hive/warehouse/<table name>/<table data>”

 

Syntax to create a managed table in Hive

CREATE TABLE <TableName>

ROW FORMAT DELIMITED

FIELDS TERMINATED BY <DELIMITER>

LINES TERMINATED BY <DELIMITER>

STORED AS <FILE FORMAT>;





2.External Table:

External tables are external to the Hive warehouse path. External table data is stored at an external HDFS location, which we specify when creating the table schema.

Unlike a managed table, an external table's data is not stored under the default warehouse location; it lives at the HDFS path given in the LOCATION clause of its DDL.

 

CREATE EXTERNAL TABLE <TableName>

ROW FORMAT DELIMITED

FIELDS TERMINATED BY <DELIMITER>

LINES TERMINATED BY <DELIMITER>

STORED AS <FILE FORMAT> LOCATION '<HDFS PATH>';

Difference between Managed table and External Table:

Managed (default) tables keep their data inside the Hive warehouse, while external tables keep their data at an HDFS path outside the warehouse. For a managed table, no extra keyword is needed at creation time, but an external table requires the EXTERNAL keyword. Most importantly, when a managed table is dropped, both the data and the metadata are lost, whereas dropping an external table removes only the metadata.
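As a hedged sketch of this difference in DDL, the Java program below creates one managed and one external table through the Hive JDBC driver. The connection URL, credentials, table names, and HDFS location are hypothetical; it assumes a reachable HiveServer2 instance and the hive-jdbc driver on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateHiveTables {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default"; // hypothetical HiveServer2 endpoint

        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {

            // Managed table: no EXTERNAL keyword, data lives under the Hive warehouse directory.
            stmt.execute("CREATE TABLE IF NOT EXISTS employees_managed (id INT, name STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE");

            // External table: EXTERNAL keyword plus an explicit LOCATION; dropping it
            // removes only the metadata, while the files at the location are kept.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS employees_external (id INT, name STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE "
                    + "LOCATION '/data/employees_external'");
        }
    }
}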

See the accompanying video for how to create managed tables and load data into them in Hive, and likewise for external tables.

Meta Store in APACHE HIVE

In the Hadoop ecosystem, the Hive metastore holds all the structural information about the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers needed to read and write data, and the corresponding HDFS files where the data is stored. It is the central repository of Hive metadata.

The metastore is the internal database of Hive that is responsible for managing this metadata.




The metadata stored in the metastore includes things like:

1. IDs of databases

2. IDs of tables and indexes

3. Time of creation of indexes

4. Time of creation of tables

5. Input format used for tables

6. Output format used for tables

Hive has three major modes of metastore deployment:

1.Embedded Metastore

2.Local Metastore

3.Remote Metastore

1.Embedded Metastore:

The embedded metastore runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk; this is why it is called the embedded metastore.

Default configuration for Embedded Metastore

Below are the metastore configuration details in “hive-site.xml”:

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>

<description>JDBC connect string for the embedded metastore</description>

</property>

<property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>org.apache.derby.jdbc.EmbeddedDriver</value>

<description>Driver class name for the embedded metastore</description>

</property>

 

2. Local Metastore:

The local metastore supports multiple users by using a standalone database. The metastore service still runs in the same JVM as Hive, but it connects to a database running in a separate process.

If we use a MySQL database as the local metastore, then we need the following configuration in “hive-site.xml”:

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://host/dbname?createDatabaseIfNotExist=true</value>

<description>JDBC connect string for a JDBC metastore</description>

</property>

<property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>com.mysql.jdbc.Driver</value>

<description>Driver class name for a JDBC metastore</description>

</property>

 

3.Remote Metastore: 

In the remote metastore mode, one or more metastore servers run in processes separate from the Hive service.

This mode offers better manageability and security.

 

What is Hive and Architecture of Hive

What is Hive?




Apache Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable data summarization, ad-hoc querying, and analysis of large volumes of data. At the same time, Hive’s SQL gives users multiple places to integrate their own functionality for custom analysis, such as user-defined functions (UDFs).

Architecture of HIVE

Hive can be accessed through the CLI (Command Line Interface), JDBC (Java Database Connectivity), or a web GUI (Graphical User Interface). A user coming through the CLI connects directly to the Hive driver, while a user coming through JDBC connects to the Hive driver via the client API. The Hive driver receives the queries from the user and sends them to the Hadoop architecture, which uses the NameNode, DataNodes, JobTracker, and TaskTrackers to read the data.

Most frequently asked Interview questions for experienced

Interviewers typically ask candidates with 2-8 years of experience these kinds of questions about big data and analytics, especially the Hadoop ecosystem, in interview panels.
The questions focus mostly on hands-on Hadoop experience and project work.




1. What properties did you change in the Hadoop configuration files for your project?
Be ready to explain this in the context of your own project.
2. Where are the NameNode and DataNode directory paths configured?
In /etc/hadoop/hdfs-site.xml (dfs.namenode.name.dir and dfs.datanode.data.dir); /etc/hadoop/core-site.xml holds fs.defaultFS.
3. How do you handle incremental loads in your project?
By using Sqoop incremental imports.
4. Can you create dynamic Hive partitions through Sqoop?
Yes, dynamic Hive partitioning can be driven through Sqoop.
5. In which scenarios would you use Parquet versus Avro?
It depends on the client's requirements; be ready to explore it (columnar Parquet suits analytical scans, while row-oriented Avro suits write-heavy workloads and schema evolution).
6. How do you handle authentication and authorization in your project?
Be ready to explain whether you use Kerberos and AD/LDAP; it depends entirely on your project.
7. How do you handle the situation when all Spark jobs fail?