What is Lineage Graph in Spark with Example | What is DAG | Lineage Graph vs DAG




What is Lineage Graph in Spark?

In Spark, the lineage graph is a graph of the dependencies between existing RDDs and the new RDDs derived from them. Spark records how each RDD was produced from its parents (the transformations applied), rather than storing a copy of the data itself.

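The lineage graph is needed in two situations: when Spark computes a new RDD, and when it has to recover lost data, such as lost partitions of a persisted RDD. In both cases Spark replays the recorded transformations, starting from the parent RDDs, to produce the data.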
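
The idea of rebuilding data from recorded dependencies can be sketched in plain Scala. This is only a toy model to illustrate the concept: the names ToyRDD, Source, Mapped and ToyLineage are hypothetical and are not Spark classes, and real Spark works per partition across a cluster.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's classes.
sealed trait ToyRDD[A] {
  def compute(): Seq[A]      // rebuild the data by replaying the lineage
  def lineage: List[String]  // this dataset's name followed by its ancestors
}

// A source dataset: it has no parent, so its lineage is just itself.
final case class Source[A](name: String, data: Seq[A]) extends ToyRDD[A] {
  def compute(): Seq[A] = data
  def lineage: List[String] = List(name)
}

// A derived dataset: it stores its parent and the function, never the result.
final case class Mapped[A, B](name: String, parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  def compute(): Seq[B] = parent.compute().map(f)
  def lineage: List[String] = name :: parent.lineage
}

object ToyLineage {
  val cities: ToyRDD[String] = Source("cities", Seq("Hyderabad", "Bangalore", "Chennai"))
  val lengths: ToyRDD[Int]   = Mapped("lengths", cities, (s: String) => s.length)

  def main(args: Array[String]): Unit = {
    println(lengths.lineage.mkString(" <- "))  // lengths <- cities
    println(lengths.compute())                 // List(9, 9, 7)
  }
}
```

Because `lengths` keeps only its parent and its function, its contents can be recomputed at any time by calling `compute()` again, which is the same principle Spark relies on when a partition is lost.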

Here is how an RDD lineage graph builds up in a program, and how to print it:

Lineage Graph Example in Spark with Scala

scala> val rdd1= sc.parallelize(List("Hyderabad","Bangalore","Chennai","Hyderabad"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val count = rdd1.flatMap(rec=>rec.split(" ")).map(word=>(word,1)).reduceByKey(_+_)
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[3] at reduceByKey at <console>:25

scala> count.toDebugString
res0: String =
(4) ShuffledRDD[3] at reduceByKey at <console>:25 []
 +-(4) MapPartitionsRDD[2] at map at <console>:25 []
    |  MapPartitionsRDD[1] at flatMap at <console>:25 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
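
The same lineage can also be walked in code through the `dependencies` method of an RDD. The following is a minimal sketch, assuming a local Spark installation with spark-core on the classpath; the object name LineageWalk, the helper `walk`, the app name and the `local[4]` master are choices made for this illustration only.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageWalk {
  // Label an RDD with the kind of dependency it has on its parents,
  // then do the same for every parent, back to the source RDD.
  def walk(rdd: RDD[_]): List[String] = {
    val kinds = rdd.dependencies.map {
      case _: ShuffleDependency[_, _, _] => "shuffle"
      case _: NarrowDependency[_]        => "narrow"
      case _                             => "other"
    }
    val kindLabel = if (kinds.isEmpty) "source" else kinds.mkString(",")
    val label = s"${rdd.getClass.getSimpleName} [$kindLabel]"
    label :: rdd.dependencies.toList.flatMap(dep => walk(dep.rdd))
  }

  // Build the word-count RDD and return its lineage as a list of labels.
  def run(): List[String] = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("lineage-walk")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(List("Hyderabad", "Bangalore", "Chennai", "Hyderabad"))
      val count = rdd1
        .flatMap(rec => rec.split(" "))  // each city name is a single word
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      walk(count)                        // no action is called: lineage exists without running a job
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The walk should list the same four RDDs that `toDebugString` prints. The only shuffle dependency belongs to the RDD created by `reduceByKey`, which is why the debug string starts a new indentation level at that point.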

What is DAG in Spark?

DAG stands for Directed Acyclic Graph, that is, a directed graph with no cycles. It is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations applied to those RDDs. The DAG covers all the transformations together with the actions that trigger them.
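
A small experiment can make the role of the action visible. The sketch below assumes a local Spark installation with spark-core on the classpath; the object name LazyDagDemo, the accumulator name and the `local[2]` master are illustrative choices. It counts how many times the `map` function runs before and after an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDagDemo {
  // Returns (map calls before the action, map calls after it, the final counts).
  def run(): (Long, Long, Map[String, Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-dag-demo")
    val sc = new SparkContext(conf)
    try {
      val mapCalls = sc.longAccumulator("mapCalls")
      val cities = sc.parallelize(List("Hyderabad", "Bangalore", "Chennai", "Hyderabad"))

      // Transformations only add vertices and edges to the graph; nothing runs yet.
      val pairs = cities.map { city => mapCalls.add(1); (city, 1) }
      val counts = pairs.reduceByKey(_ + _)
      val before: Long = mapCalls.value

      // The action submits a job, and only now is the map function executed.
      val result = counts.collect().toMap
      val after: Long = mapCalls.value

      (before, after, result)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after, result) = run()
    println(s"map calls before the action: $before")
    println(s"map calls after the action:  $after")
    println(result)
  }
}
```

Until `collect()` is called, the accumulator stays at zero, because the transformations have only described the work. The action is what turns the graph into a job that actually runs.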

Lineage Graph vs DAG:




  • The lineage graph deals only with RDDs, so it applies to transformations.
  • The DAG (Directed Acyclic Graph) deals with both transformations and actions.
  • The DAG lets the user drill into the stages and expand the details of any stage.