What is Apache Spark? Features of Spark and Difference between Hadoop and Spark

Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data. Spark provides a simple programming interface that lets an application developer easily use the memory, CPU, and storage resources across a cluster of servers to process large data sets.

Spark is also an open source distributed framework for big data processing. Spark applications can be written in Java, Python, Scala, or R.

Key Features of Spark:

1. Fast
2. General Purpose
3. Scalable
4. Fault-Tolerant

1. Fast:

When the data fits in memory, Spark can be up to 100 times faster than Hadoop MapReduce.
Spark is faster than Hadoop MapReduce for two reasons:
1. It implements an advanced execution engine
2. It allows in-memory cluster computing
Spark does not automatically cache input data in memory. A common misconception is that Spark cannot be used if the input data does not fit in memory. This is not true.
Spark can process terabytes of data on a cluster that has only 100 GB of total memory.
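
To make the point about memory concrete, here is a minimal sketch in plain Python (it does not use the Spark API) of processing a dataset one partition at a time, so that only one partition has to be held in memory at any moment. The names `DATASET`, `read_partitions`, and `count_words` are hypothetical, and the tiny in-line list only stands in for partitions that would normally live on disk.

```python
from collections import Counter
from typing import Iterable, Iterator, List

# Hypothetical stand-in for a large dataset stored as partitions on disk.
# It is a small in-memory list here only to keep the sketch runnable.
DATASET: List[List[str]] = [
    ["spark is fast", "spark is scalable"],
    ["hadoop map reduce", "spark is general purpose"],
    ["spark is fault tolerant"],
]


def read_partitions(dataset: Iterable[List[str]]) -> Iterator[List[str]]:
    """Yield one partition at a time instead of loading everything at once."""
    for partition in dataset:
        yield partition


def count_words(partitions: Iterable[List[str]]) -> Counter:
    """Aggregate word counts while working on a single partition at a time."""
    totals: Counter = Counter()
    for partition in partitions:
        for line in partition:
            totals.update(line.split())
    return totals


word_counts = count_words(read_partitions(DATASET))
print(word_counts["spark"])
```

Only the running totals and the current partition are needed at each step, which is the general reason a job can work through far more data than fits in memory.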

2. General Purpose:

Apache Spark supports different types of data processing jobs. It can be used for:

1. Stream Processing
2. Batch Processing
3. Machine Learning
4. Graph Computing
5. Interactive Processing

3. Scalable:

Spark is scalable because the data processing capacity of a Spark cluster can be increased simply by adding more nodes to it. No code change is required when you add a node to a Spark cluster.
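
The idea that capacity grows without code changes can be illustrated with a small sketch in plain Python (not the Spark API). Threads stand in for cluster nodes here, which is an assumption made only to keep the example self-contained. The names `process_partition` and `run_job` are hypothetical. The job logic stays the same, and only the `num_workers` setting changes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def process_partition(partition: List[int]) -> int:
    """Per-partition work: here, a sum of squares."""
    return sum(x * x for x in partition)


def run_job(data: List[int], num_workers: int) -> int:
    """Split the data into one partition per worker and combine the results."""
    partitions = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    return sum(partial_results)


data = list(range(1, 101))

# Same job code, different "cluster" sizes.
result_small_cluster = run_job(data, num_workers=2)
result_large_cluster = run_job(data, num_workers=8)
print(result_small_cluster == result_large_cluster)
```

Both runs produce the same answer. The worker count is a deployment setting rather than part of the job logic.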

4. Fault-Tolerant:

Apache Spark is fault tolerant. In a cluster of a few hundred nodes, the probability of a node failing on any given day is high: a hard disk may crash, or some other hardware problem may occur.
Spark therefore automatically handles the failure of a node in a cluster.
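
As a rough illustration of automatic failure handling, the sketch below uses plain Python (not the Spark API, and much simpler than Spark's actual scheduler) to reassign a task to a healthy node when its original node is down. The function `run_with_rescheduling` and the node names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def run_with_rescheduling(
    tasks: Dict[str, Callable[[], int]],
    nodes: List[str],
    failed_nodes: Set[str],
) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Assign tasks round-robin and move a task if its node has failed."""
    healthy = [node for node in nodes if node not in failed_nodes]
    if not healthy:
        raise RuntimeError("no healthy nodes left in the cluster")

    results: Dict[str, int] = {}
    placement: Dict[str, str] = {}
    for index, (task_id, task) in enumerate(tasks.items()):
        node = nodes[index % len(nodes)]
        if node in failed_nodes:
            # The original node is down, so pick a healthy node instead.
            node = healthy[index % len(healthy)]
        placement[task_id] = node
        results[task_id] = task()
    return results, placement


tasks = {"t0": lambda: 1 + 1, "t1": lambda: 2 * 3, "t2": lambda: 10 - 4}
nodes = ["node-a", "node-b", "node-c"]

results, placement = run_with_rescheduling(tasks, nodes, failed_nodes={"node-b"})
print(results)
print(placement)
```

The job still completes with every task's result even though one node is unavailable, and the caller does not have to handle the failure.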

Here is the basic difference between Spark and Hadoop MapReduce:
