What exactly is Spark?
- Spark is an open-source, in-memory cluster computing framework for processing huge volumes of data.
- Spark is not meant for storage; it is only a processing framework.
- Spark is not bound by the data-locality design rule, i.e. Spark can accept input data from many legacy systems: the local file system (LFS), HDFS, NoSQL stores, RDBMS tables, etc.
Important Spark Modules or Components:
1. Spark Core – basic engine for scheduling and RDD processing
2. Spark SQL – structured data
3. Spark Streaming – real-time processing
4. MLlib – machine learning
5. GraphX – graph processing
Resilient – fault tolerant
Distributed – spans across the nodes of a cluster
Dataset – a collection of huge data
What is RDD?
The main abstraction Apache Spark provides is the RDD, which is a collection of elements partitioned across the nodes of the cluster (single-node or multi-node) that can be operated on in parallel. The chance of some node failing grows with the number of nodes in the cluster, which is why fault tolerance matters.
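The partition-and-process idea can be sketched locally in plain Python (this is an illustration of the concept, not the Spark API): the dataset is split into partitions and the same function runs on every partition in parallel, as it would across nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split `data` into roughly equal chunks, one per (simulated) node."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(1, 11))
partitions = partition(data, 4)

with ThreadPoolExecutor() as pool:
    # Each partition is squared independently, as if on a separate node.
    results = list(pool.map(lambda p: [x * x for x in p], partitions))

# Merge the per-partition results back into one collection.
flat = [x for part in results for x in part]
print(flat)  # squares of 1..10
```

In real Spark the partitioning, shipping of functions to nodes, and merging of results are all handled by the framework.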
Fault Tolerance in Spark with RDDs:
RDDs are designed to be fault tolerant; Spark automatically handles node failures. When a node fails and the partitions stored on that node become inaccessible, Spark reconstructs the lost RDD partitions on another node.
Spark stores lineage information for each RDD (here, lineage means the hierarchy of transformations that produced it). Using this lineage information, Spark can recover parts of an RDD, or even an entire RDD, in the event of node failures.
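The lineage idea can be sketched as follows. The class and method names here are illustrative only, not the Spark API: instead of replicating computed data, we remember the chain of transformations and replay it from the source to rebuild a lost partition.

```python
class LineageRDD:
    """Hypothetical sketch of lineage-based recovery (not the Spark API)."""

    def __init__(self, source_partitions, transformations=None):
        self.source = source_partitions        # original input partitions
        self.lineage = transformations or []   # ordered list of functions

    def map(self, fn):
        # Record the transformation instead of materialising a result.
        return LineageRDD(self.source, self.lineage + [fn])

    def compute_partition(self, index):
        # Rebuild one partition from scratch by replaying the lineage;
        # this models what happens when the node holding it is lost.
        part = self.source[index]
        for fn in self.lineage:
            part = [fn(x) for x in part]
        return part

rdd = LineageRDD([[1, 2], [3, 4]]).map(lambda x: x + 1).map(lambda x: x * 10)
# Partition 1 was "lost": reconstruct it from lineage alone.
print(rdd.compute_partition(1))  # [40, 50]
```

Because only the source data and the function chain are needed, recovery costs recomputation time rather than the storage overhead of keeping replicas.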
Major RDD operations:
Spark processing is driven by two kinds of operations that can be applied to RDDs:
Transformations (transformed RDDs)
Actions (action RDDs)
A transformation converts the source RDD into a new RDD:
- Source ——-> Transformation ——->New RDD
Below are the most used transformations in Spark:
map filter flatMap reduceByKey groupByKey
Note: A transformation never returns a value to the driver program; it only produces a new RDD in Spark processing.
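The semantics of the transformations listed above can be shown with plain-Python equivalents (this is not the Spark API; each line mirrors what the corresponding transformation does to a distributed collection):

```python
data = [1, 2, 3, 4]

mapped = [x * 2 for x in data]                 # map: apply a function to every element
filtered = [x for x in data if x % 2 == 0]     # filter: keep elements that pass a test
flat = [y for x in [[1, 2], [3]] for y in x]   # flatMap: map, then flatten the results

# reduceByKey: merge the values that share a key with a binary function.
pairs = [("a", 1), ("b", 2), ("a", 3)]
reduced = {}
for key, value in pairs:
    reduced[key] = reduced.get(key, 0) + value  # {'a': 4, 'b': 2}
```

In Spark each of these would return a new RDD and nothing would be computed until an action is called; here the results are materialised immediately for illustration.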
An action computes a value from an RDD and returns it to the driver program. It does not produce another RDD.
- Source —> Action —> Return a value to the driver program
Below are the most used actions in Spark processing:
collect count take top saveAsTextFile
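The semantics of these actions can likewise be shown with plain-Python equivalents (not the Spark API; each line mirrors the value the corresponding action would hand back to the driver program):

```python
import os
import tempfile

data = [5, 1, 4, 2, 3]

collected = list(data)                  # collect: all elements as a list
counted = len(data)                     # count: number of elements
taken = data[:3]                        # take(3): first three elements
top2 = sorted(data, reverse=True)[:2]   # top(2): two largest elements

# saveAsTextFile: persist the elements as lines of text
# ("part-00000" mimics Spark's per-partition output file naming).
path = os.path.join(tempfile.mkdtemp(), "part-00000")
with open(path, "w") as f:
    f.write("\n".join(str(x) for x in data))
```

Unlike transformations, every one of these produces a concrete result (a list, a number, or a file on disk) rather than a new RDD.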