Basic Differences between Spark and MapReduce, with Examples

Big data analytics environments mostly rely on two data processing frameworks: Hadoop MapReduce and Apache Spark. Here we explain what Hadoop MapReduce is and how it processes data in different phases, then what Spark is, with a full explanation. Finally, we discuss the basic differences between Spark and Hadoop MapReduce.

Hadoop MapReduce:

Basically, MapReduce is a core component of Hadoop, meant for distributed, parallel processing of large volumes and varieties of data on commodity hardware.

MapReduce processing phases (illustrated with a word-count sketch after the list):

1. Mapper phase: developer-driven – transformation phase

2. Sort & shuffle phase: framework-driven – synchronization phase

3. Reduce phase: developer-driven – computation phase
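To make these phases concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API. It is only an illustration: the class names are our own, and a complete job would also need a driver (a main method) that configures input/output paths and submits it to the cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // 1. Mapper phase (developer-driven): transform each input line
    //    into (word, 1) key-value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // 2. Sort & shuffle phase (framework-driven): no developer code;
    //    Hadoop groups all values for the same word between the two classes.

    // 3. Reduce phase (developer-driven): compute the total count per word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

Note that the framework-driven phase needs no code at all: between the mapper and the reducer, Hadoop sorts the pairs and delivers all counts for each word to a single reduce() call.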

Spark:

Spark is an open-source, in-memory cluster computing framework for processing large volumes and varieties of data. Spark is not tied to Hadoop's data-locality design rule; it can accept input data from many systems, such as HDFS, the local file system (LFS), or NoSQL stores, as the sketch below shows.
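For instance, here is a minimal sketch using Spark's Java API that reads input from either HDFS or the local file system; the paths are hypothetical and only illustrate the point.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkInputSources {
    public static void main(String[] args) {
        // local[*] is for local testing; on a cluster the master
        // is normally supplied by spark-submit.
        SparkConf conf = new SparkConf().setAppName("input-sources").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Input can come from HDFS...
        JavaRDD<String> fromHdfs = sc.textFile("hdfs:///data/events.log");
        // ...or from the local file system (LFS).
        JavaRDD<String> fromLocal = sc.textFile("file:///tmp/events.log");

        System.out.println("HDFS lines:  " + fromHdfs.count());
        System.out.println("Local lines: " + fromLocal.count());

        sc.stop();
    }
}
```

NoSQL sources such as Cassandra or HBase are read the same way through their own Spark connectors.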

Differences between Spark and MapReduce:

  • Hadoop MapReduce:

1. Batch processing (OLAP)

2. Disk-based processing

3. Top-to-bottom approach

4. HDFS – high latency

5. Needs third-party tools such as Sqoop and Flume for data ingestion

  • Spark:

1. Streaming processing

2. Cache-based processing

3. Bottom-to-top approach

4. Low latency

The points above are the basic differences between Hadoop MapReduce and Spark.

Coming to jobs and tasks: in MapReduce, the unit of work is the mapper/reducer task. In Spark, a job is expressed as a DAG (Directed Acyclic Graph) of tasks, as illustrated below.
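A rough sketch of what that means in practice (Java API, made-up data): each transformation adds a node to the DAG, and Spark only executes the graph when an action runs; toDebugString prints the lineage.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DagLineage {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dag-lineage").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        JavaRDD<Integer> result = numbers
                .map(n -> n * n)       // transformation: adds a node to the DAG
                .filter(n -> n > 4);   // transformation: adds another node

        // Prints the RDD's lineage, i.e. the DAG of transformations
        // Spark will schedule as tasks when an action runs.
        System.out.println(result.toDebugString());
        System.out.println("count = " + result.count()); // action triggers execution

        sc.stop();
    }
}
```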

Latency in MapReduce is high because temporary files are read from and written to disk on the local system. In Spark, latency is low because intermediate results are read and written in memory.

Spark gives higher throughput than MapReduce.

Iterative algorithms are difficult to implement in MapReduce, since every iteration runs as a separate job with its own disk I/O. In Spark, iterative algorithms are easy to implement, as the sketch below shows.
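Here is a rough sketch of why (Java API, made-up numbers): the data is cached in memory once, and a simple gradient-descent loop that converges to the mean re-scans it on every iteration. In MapReduce, each iteration would be a separate job re-reading its input from disk.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Double> sample = Arrays.asList(1.0, 4.0, 9.0, 16.0);
        long n = sample.size();

        // cache() pins the RDD in memory, so every iteration below
        // reuses it instead of re-reading input from disk.
        JavaRDD<Double> data = sc.parallelize(sample).cache();

        double estimate = 0.0;
        for (int i = 0; i < 20; i++) {
            final double current = estimate;
            // Mean error of the current estimate over the cached data.
            double meanError = data.map(x -> x - current).reduce(Double::sum) / n;
            estimate = current + 0.5 * meanError; // gradient step toward the mean
        }
        System.out.println("Converged estimate (the mean): " + estimate);

        sc.stop();
    }
}
```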

Input/output types in MapReduce are key-value pairs. In Spark, the input/output type is the RDD (Resilient Distributed Dataset).

Finally, data processing in MapReduce means hand-written Java code. In Spark, data processing uses readily available transformations and actions, as the sketch below shows.
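For comparison with the MapReduce word count earlier, here is a rough sketch of the same job built entirely from Spark's ready-made transformations (flatMap, mapToPair, reduceByKey) and an action (saveAsTextFile); the paths are hypothetical.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // The master is normally supplied by spark-submit on a cluster.
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

        // Transformations only build the DAG; nothing executes yet.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // The action triggers execution of the whole graph.
        counts.saveAsTextFile("hdfs:///data/output");

        sc.stop();
    }
}
```

The whole MapReduce example above collapses into three chained transformations and one action.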

Here is a simple side-by-side comparison of Hadoop MapReduce and Spark in table form.

Aspect               | MapReduce                                                  | Apache Spark
Job/Task             | Mapper/Reducer task                                        | DAG task
Latency              | High latency (reads/writes temporary files on local disk) | Low latency (reads/writes results in memory)
Throughput           | Lower throughput compared to Spark                         | Higher throughput compared to MapReduce
Iterative algorithms | Difficult to implement                                     | Easy to implement
Input/Output types   | Key-value pairs                                            | RDDs
Data processing      | Hand-written Java code                                     | Readily available transformations and actions

Summary: The table above makes the major differences between Apache Spark and Hadoop MapReduce easy to understand. MapReduce takes more time to process data because it works from disk-based storage, which adds delay on large data sets; Spark can be up to 100 times faster because it processes data in cache (memory). Spark Streaming also handles live data faster than Flume-based ingestion. Nowadays machine learning is available in Spark as well, through its MLlib component, and Spark's GraphX is widely used in industry for graph processing.