Hadoop is a framework for storing large datasets in a distributed file system (HDFS) and processing them with MapReduce in the Big Data environment. It is one of the standard solutions for Big Data storage and processing. The Hadoop ecosystem contains many services for data processing, such as Hive and Apache Pig.
Within the Hadoop ecosystem, Apache Sqoop plays a major role in importing and exporting large volumes of data between a source and a destination, typically between relational databases and HDFS.
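As a rough illustration of how Sqoop moves data from a source to a destination, a typical import of a relational table into HDFS might look like the following. The connection string, database name, table name, and target directory are placeholders for this sketch, not values from the original text:

```shell
# Hypothetical Sqoop import: copy the "orders" table from a MySQL
# database into HDFS. All host names, credentials, and paths here
# are illustrative placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```

A corresponding `sqoop export` command moves data in the opposite direction, from HDFS back into a relational table.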
Spark is an open-source, in-memory cluster-computing framework for large-scale data processing.
Spark is used for Big Data processing, not for data storage.
Major differences between Hadoop and Spark:
- Hadoop is a batch-processing framework, suited to OLAP (Online Analytical Processing) style workloads.
- Hadoop uses disk-based processing: intermediate results are written to disk between stages. It follows a top-to-bottom processing approach.
- In Hadoop, HDFS (Hadoop Distributed File System) is a high-latency file system, optimized for throughput rather than fast responses.
- MapReduce runs in the background as the execution engine for large-data workloads from services such as Hive queries and Pig scripts.
- Third-party tools such as Sqoop and Flume help with data ingestion.
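To make the MapReduce model behind Hadoop concrete, here is a minimal pure-Python sketch of the classic word-count job. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative labels for the three stages, not part of any Hadoop API; in a real cluster the framework runs these stages across many machines and writes intermediate results to disk:

```python
from collections import defaultdict

def map_phase(line):
    # Map step: emit a (word, 1) pair for each word in a line of input.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle step: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce step: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big storage", "data processing"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'storage': 1, 'processing': 1}
```

The same map/shuffle/reduce structure underlies the jobs that Hive and Pig generate behind the scenes.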
- Spark is an open-source, in-memory cluster-computing framework that drives large-scale data processing.
- It is not meant for storage; it is only a processing framework.
- Spark performs in-memory processing.
- It follows a bottom-to-top processing approach.
- Spark is not bound to a single storage system by a data-locality design rule: it can accept input data from systems such as HDFS (Hadoop Distributed File System), a local file system (LFS), or NoSQL (Not only SQL) stores.
- There is no need for third-party tools: Spark ships its own modules, such as Spark Streaming for stream processing and Spark MLlib for machine learning.
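The in-memory, chained style of Spark processing can be sketched with plain Python generators. This is only a conceptual illustration under the assumption that each step feeds the next without writing intermediate files to disk; the comments compare each step to the analogous Spark operation, but none of this is Spark API code:

```python
# A small in-memory dataset standing in for a distributed one.
data = range(1, 11)

# Transformations are lazy: nothing is computed until a result is asked for,
# similar to Spark building up a lineage of transformations.
squared = (x * x for x in data)             # like rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)  # like rdd.filter(lambda x: x % 2 == 0)

# The "action" materialises the result entirely in memory, with no
# intermediate files written between the two steps.
result = list(evens)
print(result)  # [4, 16, 36, 64, 100]
```

Contrast this with the MapReduce model, where each stage's output is persisted to disk before the next stage reads it, which is a major reason Spark is faster for iterative workloads.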
Summary: Hadoop and Spark are different frameworks in the Big Data environment. The points above cover the major differences between Hadoop and Spark in terms of processing model and performance. In the Hadoop ecosystem, different services are available, such as Hive, Flume, and Pig. In Spark, different modules are available, such as Spark Core, Spark SQL, Spark Streaming, and Spark MLlib.