Fault Tolerance in Spark | RDD | Transformations | Actions




What exactly is Spark?

  • Spark is an open-source, in-memory cluster computing framework for processing huge volumes of data.
  • Spark is not meant for storage; it is only a processing framework.
  • Spark does not enforce the data-locality design rule, i.e. Spark accepts input data from any legacy system: LFS (Local File System), HDFS, NoSQL stores, RDBMS tables, etc. (see the sketch below).
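
A minimal sketch of reading input from different storage systems (assuming a SparkContext named sc is already available, e.g. in spark-shell; the paths, host name, and port are placeholders):

```scala
// Assumes sc: SparkContext is available (e.g. in spark-shell).
// Paths, host name, and port below are placeholders, not real endpoints.
val localData = sc.textFile("file:///home/user/input.txt")         // Local File System (LFS)
val hdfsData  = sc.textFile("hdfs://namenode:8020/data/input.txt") // HDFS
```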

Important Spark Modules or Components:

1. Spark Core

2. Spark SQL – structured data

3. Spark Streaming – real-time processing

4. MLlib – machine learning

5. GraphX – graph processing

RDD:

Resilient – Fault Tolerant

Distributed – Spans Across the Cluster

Dataset – Collection of huge data.

What is RDD?

The main abstraction Apache Spark provides is the RDD: a collection of elements partitioned across the nodes of the cluster (single-node or multi-node) that can be operated on in parallel (see the sketch below). The chance of a node failing grows in proportion to the number of nodes in the cluster, which is why fault tolerance is essential.
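
A minimal sketch of a partitioned RDD (again assuming sc is available; the partition count 8 is arbitrary):

```scala
// Assumes sc: SparkContext. The partition count 8 is an arbitrary example.
val rdd = sc.parallelize(1 to 1000, numSlices = 8) // split into 8 partitions
println(rdd.getNumPartitions)                      // => 8
// Each partition can be processed in parallel by a separate task,
// potentially on a different node of the cluster.
```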



Fault Tolerance in Spark with RDDs:

RDDs are designed to be fault tolerant and automatically handle node failures. When a node fails and the partitions stored on that node become inaccessible, Spark reconstructs the lost RDD partitions on another node.

Spark stores lineage information for each RDD (lineage here means the chain of transformations that produced the RDD). Using this lineage information, it can recover parts of an RDD, or even an entire RDD, in the event of node failures.
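
As a minimal sketch, the lineage of an RDD can be inspected with toDebugString (assuming sc as before):

```scala
// Assumes sc: SparkContext. Build an RDD through a chain of transformations.
val numbers = sc.parallelize(1 to 100, numSlices = 4)
val evens   = numbers.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// toDebugString prints the lineage: the chain of parent RDDs and
// dependencies. Spark uses exactly this information to recompute
// lost partitions on another node after a failure.
println(doubled.toDebugString)
```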

Major RDD operations:

Two kinds of operations can be applied to RDDs to drive Spark processing:

  • Transformations or Transformed RDD
  • Actions or Action RDD

Transformations:

A transformation converts a source RDD into a new RDD.

  • Source ——> Transformation ——> New RDD

Below are the most used transformations in Spark; a short sketch follows the note below:

map
filter
flatMap
reduceByKey
groupByKey

Note: A transformation never returns a value to the driver program; instead, it only produces a new RDD in Spark processing.
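
A minimal sketch of these transformations (assuming sc as before; the input lines are made-up sample data):

```scala
// Assumes sc: SparkContext. The input lines are made-up sample data.
val lines = sc.parallelize(Seq("spark is fast", "spark is fault tolerant"))

// map: exactly one output element per input element
val lengths = lines.map(line => line.length)

// filter: keep only the elements matching a predicate
val sparkLines = lines.filter(line => line.contains("fault"))

// flatMap: zero or more output elements per input element
val words = lines.flatMap(line => line.split(" "))

// reduceByKey / groupByKey: work on (key, value) pair RDDs
val pairs      = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _) // (word, total count)
val grouped    = pairs.groupByKey()       // (word, Iterable[Int])

// Nothing has executed yet: each step only returned a new RDD and
// extended the lineage. An action is needed to trigger computation.
```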

Actions: 

An action reduces an RDD to a value that is returned to the driver program. It does not produce another RDD.

  • Source ——> Action ——> Return a value to the driver program

Below are the most used actions in Spark processing, with a sketch after the list:

collect
count
take
top
saveAsTextFile
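
A minimal sketch of these actions, continuing from the wordCounts RDD in the transformation sketch above (the output path is a placeholder):

```scala
// Continuing the sketch above: wordCounts is an RDD[(String, Int)].

// collect: bring every element back to the driver (small results only!)
val allCounts: Array[(String, Int)] = wordCounts.collect()

// count: number of elements in the RDD
val n: Long = wordCounts.count()

// take / top: the first k elements / the largest k elements by ordering
val firstTwo = wordCounts.take(2)
val topTwo   = wordCounts.top(2) // uses the implicit ordering on (String, Int)

// saveAsTextFile: write one part file per partition (path is a placeholder)
wordCounts.saveAsTextFile("/tmp/word-counts-output")
```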