- Scenario: We have 100 crores (1 billion) of 1 TB log files that contain error records, and we need to find those error records.
Hadoop follows a top-to-bottom processing approach: large data files move from source to destination through a fixed sequence of phases.
Hadoop top-to-bottom processing approach:
Step 1: Start with a storage system, HDFS or the local file system (LFS), that holds the 100 crores of 1 TB log files.
Step 2: The log files are divided into splits for further processing.
Step 3: Each split is passed to the Mapper phase.
Step 4: The Sort & Shuffle phase runs on the mapper output.
Step 5: After the Sort & Shuffle phase completes, its output is passed to the Reducer phase.
Step 6: The output of the steps above is the set of error records.
Here the error records pass through every step, from step 1 to step 6, so finding them takes a fair amount of time.
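The six steps above can be sketched in plain Scala to make the data flow concrete. This is only a stand-in built on standard collections, not the Hadoop API: the names MapReduceSketch, toSplits, mapper, shuffleAndSort, reducer and the splitSize parameter are assumptions made for this illustration.

```scala
// Plain-Scala stand-in for the MapReduce phases; this is not the Hadoop API.
// All names below are hypothetical and exist only for this sketch.
object MapReduceSketch {
  // Step 2: divide the input records into fixed-size splits (splitSize must be > 0).
  def toSplits(lines: Seq[String], splitSize: Int): Seq[Seq[String]] =
    lines.grouped(splitSize).toList

  // Step 3: the mapper emits an ("error", 1) pair for every matching record in a split.
  def mapper(split: Seq[String]): Seq[(String, Int)] =
    split.filter(x => x.contains("error")).map(_ => ("error", 1))

  // Step 4: shuffle and sort brings all values for the same key together, ordered by key.
  def shuffleAndSort(pairs: Seq[(String, Int)]): Seq[(String, Seq[Int])] =
    pairs.groupBy(_._1).toSeq.sortBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Step 5: the reducer sums the values collected for one key.
  def reducer(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

  // Steps 1 to 6 chained: every phase runs over the data in order.
  def countErrors(lines: Seq[String], splitSize: Int): Map[String, Int] = {
    val mapped = toSplits(lines, splitSize).flatMap(mapper)
    shuffleAndSort(mapped).map { case (key, values) => reducer(key, values) }.toMap
  }
}
```

For example, MapReduceSketch.countErrors(Seq("ok", "error: disk", "info", "error: net"), 2) returns Map("error" -> 2). In a real job each phase runs on the cluster and the phases hand data to each other through storage, which this in-memory sketch does not model.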
This is where Spark comes into the picture. Spark uses a bottom-to-top processing approach with in-memory caching, so it finds the error records in fewer steps and in less time.
Spark bottom-to-top processing approach:
Step 1: Spark creates a base RDD that points to the location of the 100 crores of files, which can be HDFS, LFS, NoSQL, an RDBMS, etc.
Step 2: A transformed RDD filters the records, keeping only those that contain the keyword "error":
file.filter(x => x.contains("error"))
Step 3: An action on the transformed RDD counts how many error records are in the storage location:
file.filter(x => x.contains("error")).count()
These three steps are enough for Spark to find the error records in the location.
How Spark processes the above steps:
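Put together, the three steps might look like the following small Spark program. It is a minimal sketch under stated assumptions: the object name ErrorCount, the helper errorRecords, the local[*] master and the hdfs:///logs/*.log path are placeholders, not details from the scenario, and the Spark libraries must be on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// ErrorCount and errorRecords are hypothetical names used only for this sketch.
object ErrorCount {
  // Step 2: transformation. It is lazy: it returns a new RDD and reads nothing yet.
  def errorRecords(file: RDD[String]): RDD[String] =
    file.filter(x => x.contains("error"))

  def main(args: Array[String]): Unit = {
    // The master URL and the input path are placeholders (assumptions).
    val spark = SparkSession.builder().appName("ErrorCount").master("local[*]").getOrCreate()

    // Step 1: base RDD pointing at the log location; one element per line.
    val file = spark.sparkContext.textFile("hdfs:///logs/*.log")

    // Step 3: action. Calling count() is what actually triggers the job.
    val total = errorRecords(file).count()
    println(s"error records: $total")

    spark.stop()
  }
}
```

Keeping the filter in its own function makes it easy to try on a small in-memory RDD created with parallelize before pointing it at the real log location.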
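- First, the action in step 3 triggers the job. Until the action is called, Spark reads nothing from the 100 crores of 1 TB files.
- Second, Spark works back from the action to the transformation in step 2, the filter that keeps the records containing the "error" keyword, which the action then counts.
- Third, Spark reaches the base RDD in step 1 and only then reads the files from HDFS, LFS, NoSQL, etc.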
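The order described in these three points can be observed with a small plain-Scala stand-in, where a lazy Iterator plays the role of the RDD. This is not Spark code: CountingSource, recordsRead, LazySketch and errorsOnly are hypothetical names introduced only to show that nothing is read from the source until the final counting step asks for it.

```scala
// Plain-Scala stand-in for lazy evaluation; Iterator plays the role of an RDD here.
// CountingSource and recordsRead are hypothetical names used only for this sketch.
final class CountingSource(lines: Seq[String]) {
  // Number of records actually pulled from the "storage location" so far.
  var recordsRead: Int = 0

  // "Base RDD": yields a record only when a consumer asks for one.
  def records: Iterator[String] =
    lines.iterator.map { line => recordsRead += 1; line }
}

object LazySketch {
  // "Transformed RDD": Iterator.filter is lazy, so calling this reads nothing.
  def errorsOnly(records: Iterator[String]): Iterator[String] =
    records.filter(x => x.contains("error"))

  def main(args: Array[String]): Unit = {
    val source = new CountingSource(Seq("ok", "error: disk full", "info", "error: timeout"))
    val errors = errorsOnly(source.records)
    println(s"read before the action: ${source.recordsRead}")

    // "Action": size consumes the iterator, pulling every record through the filter.
    val total = errors.size
    println(s"error records: $total, read after the action: ${source.recordsRead}")
  }
}
```

Running main reports 0 records read before the action, then 2 error records and 4 records read after it: the request starts at the count, passes back through the filter, and only then reaches the source.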
Summary: Spark uses a bottom-to-top approach, which makes it very fast on large data compared with Hadoop's top-to-bottom approach. Spark also uses its in-memory cache for fast processing: only the data that is required is kept in the cache.
Spark first triggers the action, then the transformed RDD, and only after that goes to the storage location. As a result, Spark can be up to 100 times faster than Hadoop MapReduce for large data processing when the working data fits in memory.
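The caching mentioned in the summary could be used as in the sketch below, which keeps only the filtered error records in memory and runs two actions against them. CachedErrors, summarize and sampleSize are hypothetical names for this illustration; cache(), count(), take() and unpersist() are standard RDD methods, and the Spark libraries are assumed to be on the classpath.

```scala
import org.apache.spark.rdd.RDD

// CachedErrors and summarize are hypothetical names used only for this sketch.
object CachedErrors {
  // Returns the number of error records plus a small sample of them.
  def summarize(file: RDD[String], sampleSize: Int): (Long, Array[String]) = {
    // Mark only the filtered records for caching; this is still lazy.
    val errors = file.filter(x => x.contains("error")).cache()

    // First action: reads the source, applies the filter, and fills the cache.
    val total = errors.count()

    // Second action: can reuse the cached records instead of rereading the source.
    val sample = errors.take(sampleSize)

    // Release the cached data once it is no longer needed.
    errors.unpersist()
    (total, sample)
  }
}
```

Caching the filtered RDD rather than the base RDD matches the point that only the required data is kept in memory; whether it pays off depends on how many actions reuse the same records.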