Big data analytics environments mostly rely on two data processing frameworks: Hadoop Map Reduce and Apache Spark. Here we explain what Hadoop Map Reduce is and how it processes data in different phases, then what Spark is, with a full explanation. Finally, we discuss the basic differences between Spark and Hadoop Map Reduce.
Hadoop Map Reduce:
Basically, Map Reduce is a core component of Hadoop, meant exclusively for distributed parallel processing of large volumes and varieties of data on commodity hardware.
Map Reduce processing phases:
1. Mapper Phase: developer driven – transformation phase
2. Sort & Shuffle Phase: framework driven – synchronization phase
3. Reduce Phase: developer driven – computation phase
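The three phases above can be sketched in a single process. This is only a toy simulation of the data flow (word count as the classic example), not real Hadoop code, which runs mappers and reducers as distributed Java tasks:

```python
from collections import defaultdict

def mapper(line):
    # Mapper phase (developer driven): transform each input record
    # into intermediate (key, value) pairs.
    for word in line.split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    # Sort & shuffle phase (framework driven): group all values
    # that share the same key, in sorted key order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase (developer driven): compute one result per key.
    return (key, sum(values))

lines = ["spark and hadoop", "hadoop and mapreduce"]
intermediate = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, vs) for k, vs in shuffle_and_sort(intermediate))
print(result)  # {'and': 2, 'hadoop': 2, 'mapreduce': 1, 'spark': 1}
```

The developer supplies only `mapper` and `reducer`; the framework owns the shuffle in between, which is why that middle phase is described as framework driven.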
Spark:
Spark is an open-source, in-memory cluster computing framework for processing large volumes and varieties of data. Spark does not depend on the data locality design rule, i.e. Spark can accept input data from any legacy system such as HDFS, LFS, or NoSQL stores.
Difference between Spark and Map Reduce:
- Hadoop Map Reduce:
1. Batch processing (OLAP)
2. Disk-based processing
3. Top-to-bottom approach
4. HDFS – high latency
5. Needs third-party tools such as Sqoop and Flume for data ingestion
- Spark:
1. Streaming processing
2. Cache-based (in-memory) processing
3. Bottom-to-top approach
4. Low latency
The points above are the basic differences between Hadoop Map Reduce and Spark.
Coming to the task/job model: in Map Reduce it is the Mapper/Reducer task; in Spark it is the DAG (Directed Acyclic Graph) task.
Latency in Map Reduce is high because it reads/writes temporary files to disk on the local system. In Spark, latency is low because it reads/writes in memory.
Spark gives higher throughput compared to Map Reduce.
Iterative algorithms are difficult to implement in Map Reduce; in Spark, iterative algorithms are easy to implement.
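A toy sketch of why iteration hurts in Map Reduce: each pass must round-trip its output through the file system, while Spark can keep (cache) the working set in memory between passes. This is a plain-Python illustration of the I/O pattern, not real framework code:

```python
import json
import os
import tempfile

data = list(range(5))

# Map Reduce style: every iteration writes its result to disk and the
# next iteration reads it back (simulating HDFS round-trips).
path = os.path.join(tempfile.mkdtemp(), "step.json")
current = data
for _ in range(3):
    current = [x + 1 for x in current]   # one "job"
    with open(path, "w") as f:
        json.dump(current, f)            # write intermediate result to disk
    with open(path) as f:
        current = json.load(f)           # read it back for the next job
mr_result = current

# Spark style: the intermediate list simply stays cached in memory.
cached = data
for _ in range(3):
    cached = [x + 1 for x in cached]
spark_result = cached

print(mr_result == spark_result)  # True: same answer, very different I/O cost
```

Both loops compute the same result; the difference is that the first pays a disk read and write on every iteration, which is exactly what makes iterative algorithms slow on Map Reduce.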
Input/output types in Map Reduce are key-value pairs; in Spark, they are RDDs.
Finally, data processing in Map Reduce means writing Java code, while in Spark it uses readily available transformations and actions.
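The split between lazy transformations and eager actions can be sketched with a toy class (plain Python, not the real Spark API; `ToyRDD` is a made-up name for illustration):

```python
class ToyRDD:
    """Toy stand-in for a Spark RDD: transformations are lazy, actions are eager."""

    def __init__(self, data_fn):
        self._data_fn = data_fn   # nothing is computed yet

    # Transformations: return a new ToyRDD, still lazy.
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._data_fn()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._data_fn() if pred(x)])

    # Actions: trigger the whole chain and return a concrete value.
    def collect(self):
        return self._data_fn()

    def count(self):
        return len(self._data_fn())

rdd = ToyRDD(lambda: [1, 2, 3, 4, 5])
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [4, 16]
print(evens_squared.count())    # 2
```

Notice that building `evens_squared` does no work at all; only the actions (`collect`, `count`) run the chain. In real Spark this laziness is what lets the DAG scheduler optimize the whole pipeline before executing it.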
Here is a simple side-by-side comparison of Hadoop Map Reduce and Spark in table form.
| Feature | Map Reduce | Apache Spark |
| --- | --- | --- |
| Job/Task | Mapper/Reducer task | DAG task |
| Latency | High (reads/writes temporary files to disk on the local system) | Low (reads/writes results in memory) |
| Throughput | Low compared to Spark | High compared to Map Reduce |
| Iterative algorithms | Difficult to implement | Easy to implement |
| Input/output types | Key-value pairs | RDDs |
| Data processing | Hand-written Java code | Readily available transformations and actions |
Summary: The table above makes it simple to understand the major differences between Apache Spark and Hadoop Map Reduce. Hadoop Map Reduce takes more time for data processing, while Spark can be up to 100 times faster because it is cache (memory) based; Map Reduce relies on disk-based storage, so processing large data sets is delayed. Spark Streaming handles live data faster compared to Flume. Nowadays machine learning is also available in Spark through its components, and Spark GraphX is widely used in industry for graph processing.