Spark SQL (Dataset) Joins with Scala Examples

Spark joins combine two datasets by key. The Spark API provides a join() function on pair RDDs in Scala, designed for joining large datasets.

Here are different types of Spark join() functions in Scala:

1. join()
2. rightOuterJoin()
3. leftOuterJoin()

1. join(): In Spark, the plain join() performs an inner join between two pair RDDs; only keys present in both RDDs appear in the result.

Example: rdd.join(other)
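To make this concrete, here is a minimal spark-shell sketch; the two pair RDDs and their sample values are illustrative:

val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25), ("carol", 41)))
val cities = sc.parallelize(Seq(("alice", "Pune"), ("bob", "Delhi")))
ages.join(cities).collect()
// Array((alice,(30,Pune)), (bob,(25,Delhi))) -- carol is dropped; order may vary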

2. rightOuterJoin(): It joins two pair RDDs, keeping every key from the other (right-hand) RDD. Keys that exist only in the first RDD are dropped, and missing left-side values are returned as None.

Example: rdd.rightOuterJoin(other)
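A minimal sketch of the right outer join, again with illustrative sample data:

val emps  = sc.parallelize(Seq((1, "Ann"), (2, "Raj")))
val depts = sc.parallelize(Seq((2, "Sales"), (3, "HR")))
emps.rightOuterJoin(depts).collect()
// Array((2,(Some(Raj),Sales)), (3,(None,HR))) -- key 1 (left only) is dropped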

3. leftOuterJoin(): It joins two pair RDDs, keeping every key from the source (left-hand) RDD. Keys that exist only in the other RDD are dropped, and missing right-side values are returned as None.

Example: rdd.leftOuterJoin(other)
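Reusing emps and depts from the sketch above, leftOuterJoin keeps every left-side key instead:

emps.leftOuterJoin(depts).collect()
// Array((1,(Ann,None)), (2,(Raj,Some(Sales)))) -- key 3 (right only) is dropped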

Here is a brief Spark join example in Scala (run in spark-shell, where sc is the SparkContext; each line of Datalog.txt is assumed to hold tab-separated key and value fields):

val linesData = sc.textFile("Datalog.txt").map(_.split("\t")).map(f => (f(0), f(1)))
val lineLengths = linesData.mapValues(_.length)   // a second pair RDD with the same keys
linesData.join(lineLengths).collect()             // inner join on the shared key

In most cases, Spark SQL joins are used with structured, RDBMS-style data, such as employee or customer records.

For example, take an employee database with the following schemas:

Employee Schema:                        Job Schema:    
emp id, emp name, emp sal              emp id, company name

Here is partial Spark-with-Scala code for joining these two datasets (the sample rows are illustrative):

case class Employee(id: Int, name: String, sal: Float)
case class Job(id: Int, c_name: String)
val emp = sc.parallelize(Seq(Employee(1, "Ann", 5000f))).map(e => (e.id, e))   // key by emp id
val job = sc.parallelize(Seq(Job(1, "Acme"))).map(j => (j.id, j))              // key by emp id
emp.join(job).collect()   // inner join on the shared id key
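Since these joins can also be expressed with the Spark SQL Dataset API, here is a minimal sketch of the same employee/job join using typed Datasets. It assumes a SparkSession named spark (predefined in spark-shell) and illustrative sample rows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinExample").getOrCreate()
import spark.implicits._

val empDS = Seq(Employee(1, "Ann", 5000f), Employee(2, "Raj", 6000f)).toDS()
val jobDS = Seq(Job(1, "Acme")).toDS()

// Typed inner join on the shared id column; the result is a Dataset[(Employee, Job)]
empDS.joinWith(jobDS, empDS("id") === jobDS("id")).show()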