Spark provides the join() function to combine datasets. In the Spark API, join() and its variants are defined on pair RDDs (RDDs of key-value tuples) and can be called from Scala to join large datasets.
Here are the common Spark join functions available in Scala:
1. join()
2. rightOuterJoin()
3. leftOuterJoin()
1. join(): In Spark, a plain join is an inner join between two pair RDDs; only keys present in both RDDs appear in the result.
example: rdd.join(other)
2. rightOuterJoin(): performs a right outer join between two pair RDDs. Every key in the other (right-hand) RDD appears in the result, whether or not it also exists in the first RDD.
example: rdd.rightOuterJoin(other)
3. leftOuterJoin(): performs a left outer join between two pair RDDs. Every key in the first (left-hand) RDD appears in the result, whether or not it also exists in the other RDD.
example: rdd.leftOuterJoin(other)
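The behavior of the three joins above can be sketched on two small pair RDDs (a sketch, assuming an existing SparkContext named sc; the data and variable names are made up for illustration):

```scala
// Two hypothetical pair RDDs that share only key 2
val rdd   = sc.parallelize(Seq((1, "a"), (2, "b")))
val other = sc.parallelize(Seq((2, "x"), (3, "y")))

// Inner join: only key 2, which is present in both RDDs
rdd.join(other).collect()           // contains (2, ("b", "x")) only

// Left outer join: all keys of rdd; missing right values become None
rdd.leftOuterJoin(other).collect()  // contains (1, ("a", None)) and (2, ("b", Some("x")))

// Right outer join: all keys of other; missing left values become None
rdd.rightOuterJoin(other).collect() // contains (2, (Some("b"), "x")) and (3, (None, "y"))
```

Note that the missing side is wrapped in Option, so downstream code must handle None explicitly.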
A brief Spark join programming example in Scala (note that join() works on pair RDDs, so each line must first be mapped to a key-value tuple):
val linesData = sc.textFile("Datalog.txt")
val keyed = linesData.map { line => val fields = line.split("\t"); (fields(0), fields(1)) }
val lengths = keyed.mapValues(_.length)
keyed.join(lengths).collect()
In most cases, Spark SQL joins are used with structured, RDBMS-style data, such as employee or customer tables.
For example, take an employee database with the following schemas:
Employee schema: emp_id, emp_name, emp_sal
Job schema: emp_id, company_name
Here is partial Spark-with-Scala code for this join (the RDDs must be keyed by the employee id before calling join(); empData and jobData are assumed to be existing RDDs of Employee and Job):
case class Employee(id: Int, name: String, sal: Float)
case class Job(id: Int, cName: String)
val emp = empData.map(e => (e.id, e))  // RDD[(Int, Employee)]
val job = jobData.map(j => (j.id, j))  // RDD[(Int, Job)]
emp.join(job).collect()
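Since Spark SQL is mentioned above, the same employee/job join can also be sketched with DataFrames (a sketch, assuming Spark 2.x with a local SparkSession; the data, application name, and variable names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EmployeeJoins")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative rows matching the Employee and Job schemas above
val empDF = Seq((1, "Alice", 50000f), (2, "Bob", 40000f))
  .toDF("emp_id", "emp_name", "emp_sal")
val jobDF = Seq((1, "Acme"))
  .toDF("emp_id", "company_name")

// Inner join on the shared emp_id column; only employee 1 survives
empDF.join(jobDF, Seq("emp_id"), "inner").show()
```

Passing the join column as Seq("emp_id") avoids a duplicate emp_id column in the output, which is a common pitfall when joining DataFrames on equality.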