Big data analytics environments mostly rely on two data processing frameworks: Hadoop Map Reduce and Apache Spark. Here we explain what Hadoop Map Reduce is and how it processes data in different phases, then what Spark is, with a full explanation. Finally, we discuss the basic differences between Spark and Hadoop Map Reduce.
Hadoop Map Reduce:
Basically, Map Reduce is a core component of Hadoop, meant exclusively for distributed parallel processing of large volumes and varieties of data on commodity hardware.
Map Reduce processing phases:
1. Mapper Phase: developer driven – transformation phase
2. Sort & Shuffle Phase: framework driven – synchronization phase
3. Reduce Phase: developer driven – computation phase
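The three phases above can be sketched in a single process. This is only a toy simulation of the data flow (word count as the classic example), not real Hadoop code, which runs mappers and reducers as distributed Java tasks:

```python
from collections import defaultdict

def mapper(line):
    # Mapper phase (developer driven): transform each input record
    # into intermediate (key, value) pairs.
    for word in line.split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    # Sort & shuffle phase (framework driven): group all values
    # that share the same key, in sorted key order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase (developer driven): compute one result per key.
    return (key, sum(values))

lines = ["spark and hadoop", "hadoop and mapreduce"]
intermediate = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, vs) for k, vs in shuffle_and_sort(intermediate))
print(result)  # {'and': 2, 'hadoop': 2, 'mapreduce': 1, 'spark': 1}
```

The developer supplies only `mapper` and `reducer`; the framework owns the shuffle in between, which is why that middle phase is described as framework driven.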
Spark:
Spark is an open-source, in-memory cluster computing framework for processing large volumes and varieties of data. Spark does not depend on the data locality design rule, i.e. Spark can accept input data from any legacy system such as HDFS, LFS, or NoSQL stores.
Difference between Spark and Map Reduce:
- Hadoop Map Reduce:
1. Batch processing (OLAP)
2. Disk-based processing
3. Top-to-bottom approach
4. HDFS – high latency
5. Needs third-party tools such as Sqoop and Flume for data ingestion
- Spark:
1. Streaming processing
2. Cache-based (in-memory) processing
3. Bottom-to-top approach
4. Low latency
The points above are the basic differences between Hadoop Map Reduce and Spark.
Coming to the task/job model: in Map Reduce it is the Mapper/Reducer task; in Spark it is the DAG (Directed Acyclic Graph) task.
Latency in Map Reduce is high because it reads/writes temporary files to disk on the local system. In Spark, latency is low because it reads/writes in memory.
Spark gives higher throughput compared to Map Reduce.
Iterative algorithms are difficult to implement in Map Reduce; in Spark, iterative algorithms are easy to implement.
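A toy sketch of why iteration hurts in Map Reduce: each pass must round-trip its output through the file system, while Spark can keep (cache) the working set in memory between passes. This is a plain-Python illustration of the I/O pattern, not real framework code:

```python
import json
import os
import tempfile

data = list(range(5))

# Map Reduce style: every iteration writes its result to disk and the
# next iteration reads it back (simulating HDFS round-trips).
path = os.path.join(tempfile.mkdtemp(), "step.json")
current = data
for _ in range(3):
    current = [x + 1 for x in current]   # one "job"
    with open(path, "w") as f:
        json.dump(current, f)            # write intermediate result to disk
    with open(path) as f:
        current = json.load(f)           # read it back for the next job
mr_result = current

# Spark style: the intermediate list simply stays cached in memory.
cached = data
for _ in range(3):
    cached = [x + 1 for x in cached]
spark_result = cached

print(mr_result == spark_result)  # True: same answer, very different I/O cost
```

Both loops compute the same result; the difference is that the first pays a disk read and write on every iteration, which is exactly what makes iterative algorithms slow on Map Reduce.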
Input/output types in Map Reduce are key-value pairs; in Spark, they are RDDs.
Finally, data processing in Map Reduce means writing Java code, while in Spark it uses readily available transformations and actions.
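The split between lazy transformations and eager actions can be sketched with a toy class (plain Python, not the real Spark API; `ToyRDD` is a made-up name for illustration):

```python
class ToyRDD:
    """Toy stand-in for a Spark RDD: transformations are lazy, actions are eager."""

    def __init__(self, data_fn):
        self._data_fn = data_fn   # nothing is computed yet

    # Transformations: return a new ToyRDD, still lazy.
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._data_fn()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._data_fn() if pred(x)])

    # Actions: trigger the whole chain and return a concrete value.
    def collect(self):
        return self._data_fn()

    def count(self):
        return len(self._data_fn())

rdd = ToyRDD(lambda: [1, 2, 3, 4, 5])
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [4, 16]
print(evens_squared.count())    # 2
```

Notice that building `evens_squared` does no work at all; only the actions (`collect`, `count`) run the chain. In real Spark this laziness is what lets the DAG scheduler optimize the whole pipeline before executing it.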
Here is a simple side-by-side comparison of Hadoop Map Reduce and Spark in table form.
| Feature | Map Reduce | Apache Spark |
| --- | --- | --- |
| Job/Task | Mapper/Reducer task | DAG task |
| Latency | High (reads/writes temporary files to disk on the local system) | Low (reads/writes results in memory) |
| Throughput | Low compared to Spark | High compared to Map Reduce |
| Iterative algorithms | Difficult to implement | Easy to implement |
| Input/output types | Key-value pairs | RDDs |
| Data processing | Hand-written Java code | Readily available transformations and actions |
Summary: The table above makes it simple to understand the major differences between Apache Spark and Hadoop Map Reduce. Hadoop Map Reduce takes more time for data processing, while Spark can be up to 100 times faster because it is cache (memory) based; Map Reduce relies on disk-based storage, so processing large data sets is delayed. Spark Streaming handles live data faster compared to Flume. Nowadays machine learning is also available in Spark through its components, and Spark GraphX is widely used in industry for graph processing.