SortBy:

Hive uses the column in SortBy to sort the rows before sustaining the rows to a reducer in Hive environment. The sort order will be dependent on the column types especially for the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order in Hive Query. It orders data at each of ‘N’ reducers, but each reducer can have overlapping ranges of data in Hive.

Output: N or more sorted files with overlapping ranges.

Example Query for SortBy

SELECT key, value FROM source SORTBY key ASC, value DESC

Order By:

This is similar to ORDER BY in SQL language. In Hive, ORDER BY guarantees total ordering of data, but for that, it has to be passed on to a single reducer which is normally intolerable and therefore in inflexible mode, in hive makes it compulsory to use LIMIt with ORDER BY so that reducer doesn’t get exhausted.

Ordering: Total Order DATA.

Output: Single output i.e fully ordered.

Example Query for OrderBy

SELECT key, value FROM source ORDER BY key ASC, value DESC

Distribute By:

Apache Hive uses the columns in Distribute By to distribute the rows between reducers in a query language. All rows with the same Distribute By columns will go to the same reducer.
Distribute By protecting each of N reducers gets non-overlapping ranges of the column but doesn’t sort the output of each reducer.

Example:

In Distribute By x on the following 5 rows to 2 reducers:

x1
x2
x4
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Cluster By:

Cluster By is a combination of both Distribute By and Sort By. CLUSTER BY x protecting each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers.

Ordering: Global ordering between multiple reducers.

Output: N or more sorted files with non-overlapping ranges.

Example:

Refer to same example as above, if we use Cluster By x, the two reducers will further sort rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4

Tag: distributeby

Hive: SortBy Vs OrderBy Vs DistributeBy Vs ClusterBy

SortBy:

Order By:

Distribute By:

Cluster By: