Hive: SortBy Vs OrderBy Vs DistributeBy Vs ClusterBy

SortBy:

Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: if the column is of a numeric type, the sort order is numeric; if the column is of string type, the sort order is lexicographical. SORT BY orders the data at each of the N reducers, but each reducer can receive overlapping ranges of data.

Output: N or more sorted files with overlapping ranges.

Example Query for SortBy

SELECT key, value FROM source SORT BY key ASC, value DESC

Order By:

This is similar to ORDER BY in SQL. In Hive, ORDER BY guarantees total ordering of the data, but to achieve this all of the data has to pass through a single reducer, which is normally unacceptable for large datasets. For this reason, in strict mode (hive.mapred.mode=strict) Hive makes it compulsory to use LIMIT with ORDER BY so that the single reducer is not overwhelmed.

Ordering: Total ordering of the data.

Output: A single sorted output, i.e. fully ordered.

Example Query for OrderBy

SELECT key, value FROM source ORDER BY key ASC, value DESC
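In strict mode, the same query must include a LIMIT; a minimal sketch (the limit value is arbitrary):

SET hive.mapred.mode=strict;
SELECT key, value FROM source ORDER BY key ASC, value DESC LIMIT 100;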

Distribute By:

Apache Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY columns go to the same reducer.
DISTRIBUTE BY ensures that each of the N reducers gets non-overlapping ranges of the column, but it does not sort the output of each reducer.
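
Example Query for DistributeBy (standard HiveQL, reusing the column names from the earlier examples):

SELECT key, value FROM source DISTRIBUTE BY key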

Example:

Using DISTRIBUTE BY x on the following 5 rows sent to 2 reducers:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Cluster By:

Cluster By is a combination of both Distribute By and Sort By. CLUSTER BY x ensures that each of the N reducers gets non-overlapping ranges of x, and then sorts by those ranges at each reducer.

Ordering: Global ordering across multiple reducers.

Output: N or more sorted files with non-overlapping ranges.
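
Example Query for ClusterBy (reusing the same columns; CLUSTER BY key behaves like DISTRIBUTE BY key combined with SORT BY key):

SELECT key, value FROM source CLUSTER BY key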

Example:

Referring to the same example as above, if we use CLUSTER BY x, the two reducers will additionally sort the rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4

Connection refused error while running Hive in Hadoop

When Hive is installed on a single-node Hadoop cluster, it sometimes fails with an error like the one below:


Exception in thread "main" java.lang.RuntimeException: Call From your-domain/127.0.1.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused;

For more details see:

http://wiki.apache.org/hadoop/ConnectionRefused

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke(Native Method)
...more

Caused by: java.net.ConnectException: Call From slthupili/127.0.1.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused;

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
...more

Caused by: java.net.ConnectException: Connection refused

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
...more

Solution:

First, stop all the Hadoop services using the command below:

$ stop-all.sh

This command stops all the Hadoop services: NameNode, DataNode, SecondaryNameNode, YARN, etc.
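
As a quick diagnostic (an optional aside), the jps command lists the running Hadoop daemons; if NameNode is missing from its output, connections to localhost:8020 will be refused:

$ jps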

Second, back up the data, then format the NameNode with the command below:

$ hadoop namenode -format

The above command removes the NameNode's metadata, which is why the data must be backed up first.
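
After formatting, restart the Hadoop services:

$ start-all.sh

Then enter the hive command: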

$ hive


Hive Vs Impala

Hive:

Hive is a component of Hadoop built on top of the Hadoop Distributed File System (HDFS). It is a data-warehouse-style system in Hadoop, used mainly for data summarization and for advanced querying of tabular data.

Impala:

Existing query engines such as Apache Hive have high runtime overhead, high latency, and low throughput. To avoid this latency, Impala bypasses MapReduce and accesses the data directly using a specialized distributed query engine, similar to an RDBMS.

Hive Vs Impala:

1. UDFs: Hive supports User Defined Functions, but Impala does not.

2. Speed: Compared with Impala, Hive is slower at query execution.

3. Syntax: Hive uses an SQL-like syntax (HiveQL), and Impala follows the same syntax.

4. Complex Types: Complex types of queries are supported in Hive, but Impala does not support them.

And some more important differences:

I) Impala can query data from HDFS or Apache HBase, like Hive, but in real time.

II) Impala uses the same SQL syntax (HiveQL), user interface, and metadata as Hive.

III) Queries that require more than one MapReduce phase in Hive run faster in Impala.

Difference between Managed and External Tables with Syntax in HIVE

Apache Hive is mainly used for data summarization and querying. HiveQL is much like SQL, but differs slightly in how data is summarized and processed through the query language.

This section explains the difference between the two Hive table types, with the syntax for each.

A Hive table is logically made up of the data being stored, plus metadata describing the layout of that data in the table.

There are two types of Hive table:

1. Managed or Default or Internal Table

2. External Table

1. Managed Table:

When we create a table in Hive, by default Hive manages the data, which means that Hive moves the data into its warehouse directory.

The default warehouse location in HDFS:

“/user/hive/warehouse/<<table name>>/<<table data>>”

Syntax to create a managed table in Hive:

CREATE TABLE <TableName> (<ColumnName> <DataType>, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<DELIMITER>'
LINES TERMINATED BY '<DELIMITER>'
STORED AS <FILE FORMAT>;
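
As a concrete illustration, here is a minimal sketch with hypothetical table and column names:

CREATE TABLE employees (id INT, name STRING, salary DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;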

 

2. External Table:

External tables are external to the Hive warehouse path. The table data is stored at an external HDFS location that we specify in the table schema.

Unlike a managed table, the data does not go under “/user/hive/warehouse/<<table name>>”; it stays at the HDFS path given in the LOCATION clause.

Syntax to create an external table in Hive:

CREATE EXTERNAL TABLE <TableName> (<ColumnName> <DataType>, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<DELIMITER>'
LINES TERMINATED BY '<DELIMITER>'
STORED AS <FILE FORMAT>
LOCATION '<HDFS PATH>';
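
Again a minimal sketch, with a hypothetical table name and HDFS path:

CREATE EXTERNAL TABLE ext_employees (id INT, name STRING, salary DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/employees';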

Difference between Managed table and External Table:

Managed (default) tables live within the Hive warehouse directory, inside the database location. External tables live outside the Hive warehouse, at an HDFS path of our choosing. Creating a managed table needs no extra keyword, but creating an external table requires the EXTERNAL keyword. Most importantly, if a managed table is dropped, the entire data is lost; if an external table is dropped, only the metadata is lost.
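
A short sketch of the drop semantics, with hypothetical table names:

-- Managed: removes the metadata AND the data files under the warehouse directory
DROP TABLE managed_logs;

-- External: removes only the metadata; the files at LOCATION are untouched
DROP TABLE ext_logs;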


Meta Store in APACHE HIVE

In the Hadoop ecosystem, the Hive metastore holds all the structural information for the various tables and partitions in the warehouse, including column and column-type information, the details necessary to read and write data, and the corresponding HDFS files where the data is stored. It is the central repository of Hive metadata.

The metastore is essentially Hive's internal database, responsible for managing all of this metadata.

The metadata which the metastore stores includes things like:

1. IDs of databases

2. IDs of tables and indexes

3. Time of creation of indexes

4. Time of creation of tables

5. Input format used for tables

6. Output format used for tables
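
Much of this metadata can be inspected from the Hive shell; for example, DESCRIBE FORMATTED prints a table's creation time, location, input format, and output format:

DESCRIBE FORMATTED <TableName>;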

 

Hive has three major modes of metastore:

1. Embedded Metastore

2. Local Metastore

3. Remote Metastore

1. Embedded Metastore:

The embedded metastore runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk; hence the name “embedded metastore”.

Default configuration for the Embedded Metastore

Below are the metastore configuration details in “hive-site.xml”:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
  <description>JDBC connect string for an embedded metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for an embedded metastore</description>
</property>

2. Local Metastore:

The local metastore allows multiple users to share a standalone database. The metastore still runs in the same JVM as the Hive service, but it connects to a database running in a separate process.

If we use a MySQL database as the local metastore, we need the following configuration in “hive-site.xml”:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://host/dbname?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
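
In practice the database credentials are set alongside these properties; a sketch using the standard Hive property names (the values are placeholders):

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
  <description>Username for the metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
  <description>Password for the metastore database</description>
</property>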

 

3. Remote Metastore:

In remote metastore mode, one or more metastore servers run in processes separate from the Hive service.

This mode gives better manageability and security.
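
Clients find a remote metastore through the hive.metastore.uris property in “hive-site.xml”; a minimal sketch (the host is a placeholder; 9083 is the conventional metastore port):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
  <description>Thrift URI of the remote metastore server</description>
</property>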

 

What is Hive and Architecture of Hive

What is HIVE?

Apache Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable data summarization, ad-hoc querying, and analysis of large volumes of data. At the same time, Hive’s SQL gives users multiple places to integrate their own functionality to do custom analysis, such as UDFs.

Architecture of HIVE

Here CLI stands for Command Line Interface, JDBC for Java Database Connectivity, and Web GUI for Graphical User Interface. A user who comes in through the CLI connects directly to the Hive driver; a user who comes in through JDBC connects to the Hive driver through its API. When the Hive driver receives query tasks from the user, it sends them to the Hadoop architecture, which uses the NameNode, DataNodes, JobTracker, and TaskTrackers to read and process the data.