SortBy:
Hive uses the column in SortBy to sort the rows before sustaining the rows to a reducer in Hive environment. The sort order will be dependent on the column types especially for the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order in Hive Query. It orders data at each of ‘N’ reducers, but each reducer can have overlapping ranges of data in Hive.
Output: N or more sorted files with overlapping ranges.
Example Query for SortBy
SELECT key, value FROM source SORTBY key ASC, value DESC
Order By:
This is similar to ORDER BY in SQL language. In Hive, ORDER BY guarantees total ordering of data, but for that, it has to be passed on to a single reducer which is normally intolerable and therefore in inflexible mode, in hive makes it compulsory to use LIMIt with ORDER BY so that reducer doesn’t get exhausted.
Ordering: Total Order DATA.
Output: Single output i.e fully ordered.
Example Query for OrderBy
SELECT key, value FROM source ORDER BY key ASC, value DESC
Distribute By:
Apache Hive uses the columns in Distribute By to distribute the rows between reducers in a query language. All rows with the same Distribute By columns will go to the same reducer.
Distribute By protecting each of N reducers gets non-overlapping ranges of the column but doesn’t sort the output of each reducer.
Example:
In Distribute By x on the following 5 rows to 2 reducers:
x1 x2 x4 x1
Reducer 1 got
x1 x2 x1
Reducer 2 got
x4 x3
Cluster By:
Cluster By is a combination of both Distribute By and Sort By. CLUSTER BY x protecting each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers.
Ordering: Global ordering between multiple reducers.
Output: N or more sorted files with non-overlapping ranges.
Example:
Refer to same example as above, if we use Cluster By x, the two reducers will further sort rows on x:
Reducer 1 got
x1 x1 x2
Reducer 2 got
x3 x4