Why Spark Uses Lazy Evaluation and How RDDs Are Fault Tolerant

Why Spark Uses Lazy Evaluation:

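Spark is a "lazily evaluated" system because of the way it computes RDDs. Although you can define new RDDs at any time, Spark computes them lazily, that is, only the first time they are used in an action. This approach might seem unusual at first, but it makes a lot of sense when you are working with big data: because Spark sees the whole chain of transformations before it runs anything, it can compute just the data needed for the result instead of loading and storing every intermediate dataset.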
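The following is a minimal sketch of this behavior, written to be pasted into the Spark shell. The names numbers, doubled, and calls are made up for illustration, and the accumulator is used only as a way to observe when the map function actually runs.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; inside spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()
val sc = spark.sparkContext

// Accumulator used only to observe how often the map function is invoked.
// (Accumulator updates inside transformations can be repeated if a task is
// retried, so this is a demonstration device, not a reliable counter.)
val calls = sc.longAccumulator("map calls")

val numbers = sc.parallelize(1 to 10)

// Transformation: Spark only records it, nothing is computed yet.
val doubled = numbers.map { n => calls.add(1); n * 2 }

// No action has run so far, so the map function has not been called.
val callsBeforeAction = calls.sum

// Action: this is what triggers the computation of `doubled`.
val total = doubled.sum()

// After the action, the map function has run once per element.
val callsAfterAction = calls.sum

println(s"before action: $callsBeforeAction, after action: $callsAfterAction, total: $total")
```

The point of the sketch is the difference between the two counter readings: defining doubled costs nothing, and the work happens only when sum() is called.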

How RDDs Are Fault Tolerant:

An RDD is designed to be fault tolerant and represents data distributed across a cluster of nodes. The probability of a node failing grows with the number of nodes in a cluster: the larger the cluster, the higher the probability that some node will fail. An RDD handles node failures automatically. When a node fails and the partitions stored on that node become inaccessible, Spark reconstructs the lost RDD partitions on another node.

Spark stores lineage information for each RDD. Using this lineage information, it can recover parts of an RDD, or even an entire RDD, in the event of node failures.
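One way to look at the lineage Spark keeps is the toDebugString method on an RDD. The sketch below assumes a small in-memory dataset; the names words, pairs, and counts are hypothetical, and the exact text printed depends on the Spark version.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; inside spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("lineage-demo").getOrCreate()
val sc = spark.sparkContext

// Build a small chain of RDDs: each one remembers how it was derived.
val words = sc.parallelize(Seq("spark", "rdd", "lineage", "spark"))
val pairs = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// toDebugString describes the chain of parent RDDs behind `counts`.
// This recorded chain is what Spark replays to rebuild lost partitions.
val lineage = counts.toDebugString
println(lineage)
```

The printed description lists the parent RDDs that counts depends on, which is the information Spark uses to recompute a lost partition from its source data.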

RDD.persist():

Spark's RDDs are by default recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().

To try it, open the Spark shell, create an RDD (called filteredName here), and call its persist() method.

ex: scala> filteredName.persist()
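To see what persist() changes, the sketch below runs the same two actions on an RDD without and with persistence and counts how often the filter predicate is evaluated. The sample names and the two accumulators are assumptions made for this illustration; filteredName is the RDD name used above.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; inside spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("persist-demo").getOrCreate()
val sc = spark.sparkContext

val names = sc.parallelize(Seq("alice", "bob", "anna", "carol"))

// Accumulators are used only to observe how often the predicate runs.
val evalsNoPersist = sc.longAccumulator("evaluations without persist")
val evalsPersist = sc.longAccumulator("evaluations with persist")

// Without persist(): every action recomputes the filter from scratch.
val filteredNoPersist = names.filter { n => evalsNoPersist.add(1); n.startsWith("a") }
filteredNoPersist.count()
filteredNoPersist.collect()

// With persist(): the first action computes and stores the partitions,
// and the second action reads the stored result instead of recomputing.
val filteredName = names.filter { n => evalsPersist.add(1); n.startsWith("a") }
filteredName.persist()
filteredName.count()
filteredName.collect()

println(s"without persist: ${evalsNoPersist.sum} evaluations")
println(s"with persist:    ${evalsPersist.sum} evaluations")

// Release the stored partitions once they are no longer needed.
filteredName.unpersist()
```

On a small local run like this one, the persisted RDD should evaluate the predicate once per element, while the non-persisted RDD evaluates it once per element for each action.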

Every Spark program and shell session will work as follows:

1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Ask Spark to persist() any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.
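Put together, the steps above might look like the following sketch. The temporary file and its contents are hypothetical stand-ins for real external data, and the variable names are chosen only for this example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration; inside spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("workflow-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical input: a small temporary file standing in for external data.
val path = Files.createTempFile("names", ".txt")
Files.write(path, "alice\nbob\nanna\ncarol\n".getBytes("UTF-8"))

// Create an input RDD from external data.
val lines = sc.textFile(path.toUri.toString)

// Transform it to define a new RDD (nothing is computed yet).
val filteredName = lines.filter(_.startsWith("a"))

// Ask Spark to persist the intermediate RDD because it is reused below.
filteredName.persist()

// Launch actions; these trigger the actual computation.
val howMany = filteredName.count()
val firstName = filteredName.first()

println(s"$howMany matching names, first one: $firstName")
```

Only the last two calls cause Spark to read the file and apply the filter; everything before them just describes the computation.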

In Spark, calling cache() is the same as calling persist() with the default storage level, which for RDDs is MEMORY_ONLY.
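This equivalence can be checked with getStorageLevel, as in the short sketch below. The RDD names are made up for illustration, and MEMORY_AND_DISK is shown only as one example of a non-default level.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for illustration; inside spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("cache-demo").getOrCreate()
val sc = spark.sparkContext

// cache() and persist() without arguments both use the default level.
val cached = sc.parallelize(1 to 5).cache()
val persisted = sc.parallelize(1 to 5).persist()

// persist() also accepts an explicit storage level.
val onDiskToo = sc.parallelize(1 to 5).persist(StorageLevel.MEMORY_AND_DISK)

println(s"cache():            ${cached.getStorageLevel.description}")
println(s"persist():          ${persisted.getStorageLevel.description}")
println(s"persist(explicit):  ${onDiskToo.getStorageLevel.description}")
```

Use persist() with an explicit level when the default in-memory storage is not what you want; otherwise cache() is just the shorter spelling.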