A Complete Guide to Writing Hive UDFs

In Hive, users can define their own functions to satisfy specific requirements. These are referred to as User Defined Functions (UDFs), and they are written in Java.

Some UDFs are designed specifically for code reuse across application frameworks: the developer writes the function once in Java and then integrates it with Hive.

During query execution, the developer can use the code directly, and the UDF returns output according to the user-defined task. This is efficient both in coding and in execution.

For instance, Hive has no predefined function for string stemming, so we can write a Stem UDF in Java. Wherever we need stemming functionality, we can then call this Stem UDF directly from Hive.

Here, stemming means reducing a word to its root form: a stemmer reduces the words “wishing”, “wished”, and “wishes” to the root word “wish”. To get this kind of functionality, we can write a UDF in Java and integrate it with Hive.
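As a first taste of what this looks like, here is a minimal sketch of such a Stem UDF. The class name and the naive suffix-stripping rule are illustrative only; a real implementation would delegate to a proper stemming library such as a Porter stemmer.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/** A hypothetical Stem UDF: naive suffix stripping, for illustration only. */
public class Stem extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        String word = input.toString();
        // Strip a few common English suffixes: "wishing", "wished",
        // and "wishes" all reduce to "wish".
        for (String suffix : new String[] { "ing", "ed", "es" }) {
            if (word.endsWith(suffix)) {
                return new Text(word.substring(0, word.length() - suffix.length()));
            }
        }
        return new Text(word);
    }
}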

Depending on the use case, UDFs can be written to accept and produce different numbers of input and output values.

The most common form of UDF accepts a single input value and produces a single output value. If the UDF is used in a query, it will be called once for every row in the result data set.

Conversely, a function can also accept a group of values as input and return a single output value.

UDF

A UDF processes one or several columns of one row and outputs one value. For example:

SELECT lower(str) from table

For each row in “table”, the “lower” UDF takes one argument, the value of “str”, and outputs one value, the lowercase representation of “str”.

SELECT datediff(date_begin, date_end) from table

For each row in “table”, the “datediff” UDF takes two arguments, the values of “date_begin” and “date_end”, and outputs one value, the difference in time between these two dates.

Each argument of a UDF can be any of the following, all of which appear in the example after the list:

  • A column of the table
  • A constant value
  • The result of another UDF
  • The result of an arithmetic computation
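
For instance (the column names below are hypothetical):

-- a column, a constant, and the result of another UDF (concat):
SELECT lower(concat(first_name, ' ', last_name)) from table
-- the result of an arithmetic computation:
SELECT abs(price * quantity - discount) from table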

In Hive, you can write UDFs in two ways: “simple” and “generic”.

Simple Hive UDFs

“Simple” UDFs, as the name suggests, are truly simple to write. One can be as easy as:

import org.apache.hadoop.hive.ql.exec.UDF;

/** A simple UDF to convert Fahrenheit to Celsius. */
public class ConvertToCelsius extends UDF {
    public double evaluate(double value) {
        return (value - 32) / 1.8;
    }
}

Once compiled and packaged into a jar, you can invoke the UDF like this:

hive> add jar my-udf.jar;
hive> create temporary function fahrenheit_to_celsius as "com.mycompany.hive.udf.ConvertToCelsius";
hive> SELECT fahrenheit_to_celsius(temp_fahrenheit) from temperature_data;

A simple UDF can also handle multiple types by providing several overloaded versions of the “evaluate” method.

import org.apache.hadoop.hive.ql.exec.UDF;

/** A simple UDF to compute the absolute value of a number. */
public class AbsValue extends UDF {
    public double evaluate(double value) {
        return Math.abs(value);
    }

    public long evaluate(long value) {
        return Math.abs(value);
    }

    public int evaluate(int value) {
        return Math.abs(value);
    }
}
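
When registered, Hive picks the right overload based on the column type. For example (the table and column names here are hypothetical):

hive> add jar my-udf.jar;
hive> create temporary function abs_value as "com.mycompany.hive.udf.AbsValue";
hive> SELECT abs_value(balance) from accounts;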

In short, to write a simple UDF:

  • Extend the org.apache.hadoop.hive.ql.exec.UDF class
  • Write an “evaluate” method whose signature matches the signature of your UDF in HiveQL.

Types

A simple UDF can accept a large variety of types to represent the column types. In particular, it accepts both Java primitive types and Hadoop IO types.

Hive column type        UDF types

string                  java.lang.String, org.apache.hadoop.io.Text
int                     int, java.lang.Integer, org.apache.hadoop.io.IntWritable
boolean                 boolean, java.lang.Boolean, org.apache.hadoop.io.BooleanWritable
array<type>             java.util.List<Java type for type>
map<ktype, vtype>       java.util.Map<Java type for ktype, Java type for vtype>
struct                  Don't use a simple UDF, use a GenericUDF
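
For example, an array<string> column arrives as a java.util.List. Here is a small illustrative sketch (the class name is ours, and we assume the elements arrive in their Text writable form) that returns the first element of such an array:

import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/** Hypothetical example: return the first element of an array<string> column. */
public class FirstElement extends UDF {
    public Text evaluate(List<Text> list) {
        // Handle both NULL arrays and empty arrays.
        if (list == null || list.isEmpty()) {
            return null;
        }
        return list.get(0);
    }
}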

Generic Hive UDFs

A generic UDF is written by extending the GenericUDF class, whose contract (slightly simplified) is:

public abstract class GenericUDF {
    public abstract ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException;
    public abstract Object evaluate(DeferredObject[] args) throws HiveException;
    public abstract String getDisplayString(String[] args);
}

A key concept when working with generic UDFs and UDAFs is the ObjectInspector.

In generic UDFs, all objects are passed around using the Object type. Hive is structured this way so that all code handling records and cells is generic, and to avoid the cost of instantiating and deserializing objects when it isn't needed.

Therefore, all interaction with the data passed in to UDFs is done via ObjectInspectors. They allow you to read values from a UDF parameter, and to write output values.

ObjectInspectors belong to one of the following categories (a small sketch after the list shows how code can tell them apart):

  • Primitive, for primitive types (all numerical values, string, boolean, …)
  • List, for Hive arrays
  • Map, for Hive maps
  • Struct, for Hive structs
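
Here is a tiny illustrative helper (not part of the Hive API) showing how code can dispatch on an inspector's category:

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

/** Illustrative helper: describe the category of an ObjectInspector. */
public class InspectorCategories {
    public static String describe(ObjectInspector oi) {
        // getCategory() returns the ObjectInspector.Category enum.
        switch (oi.getCategory()) {
            case PRIMITIVE: return "primitive (numerical, string, boolean, ...)";
            case LIST:      return "Hive array";
            case MAP:       return "Hive map";
            case STRUCT:    return "Hive struct";
            default:        return oi.getCategory().name();
        }
    }
}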

When Hive analyzes the query, it computes the actual types of the parameters passed in to the UDF, and calls

public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException;

This method receives one ObjectInspector for each argument of the call, and must return an ObjectInspector for the return type.

Later, rows are passed in to the UDF, which must use the ObjectInspectors it received in initialize() to read the deferred objects.
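
To make this concrete, here is a sketch of a “multiply by two” generic UDF (the class and function names are ours): it reads an int column and returns twice its value as an IntWritable.

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.IntObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;

/** A sketch of a generic "multiply by two" UDF for int columns. */
public class MultiplyByTwo extends GenericUDF {
    private IntObjectInspector intInspector;

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("multiply_by_two takes exactly one argument");
        }
        if (!(args[0] instanceof IntObjectInspector)) {
            throw new UDFArgumentException("multiply_by_two only accepts an int argument");
        }
        // Remember how to read the input, and declare that we return an int.
        intInspector = (IntObjectInspector) args[0];
        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        Object value = args[0].get();
        if (value == null) {
            return null;
        }
        // Read the int through the inspector; return a Writable, not an Integer.
        return new IntWritable(2 * intInspector.get(value));
    }

    @Override
    public String getDisplayString(String[] args) {
        return "multiply_by_two(" + args[0] + ")";
    }
}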

Some Traps of Generic UDFs

Everything in the generic UDF stack is processed through Object, so you will definitely have a hard time keeping the object types straight. Almost no type checking can be done at compile time; you will have to do it all at runtime.

It is important to know that the object returned by the evaluate method must be a Writable object. For instance, in the “multiply by two” example above, we didn't return an Integer, but an IntWritable. Failing to do so will result in cast exceptions.

Debugging generic UDFs isn't trivial. You will often need to peek at the execution logs.

  • When running Hive in full map-reduce mode, use the task logs from your JobTracker interface
  • When running Hive in local mode (which I recommend for development purposes), look for the following lines in the Hive output:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there is no reduce operator
Execution log at: /tmp/clement/clement_20130219103434_08913263-5a10-496f-8ddd-408b9c2ff0af.log
Job running in-process (local Hadoop)

Here, Hive tells you where the logs for this query will be stored. If the query fails, you will have the complete stack trace there.