Spark Performance Tuning with pictures





Spark Execution with a simple program:

Basically, a Spark program consists of a single driver process and a set of executor processes distributed across the nodes of the cluster.
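
As a minimal sketch of this layout (the application name and resource sizes below are illustrative placeholders, not values from this article), the driver runs the application's main program and requests executor processes through the configuration passed to the SparkContext:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical configuration: the application name and resource sizes are placeholders.
val conf = new SparkConf()
  .setAppName("TuningExample")
  .set("spark.executor.instances", "4")  // number of executor processes across the cluster
  .set("spark.executor.memory", "4g")    // memory per executor process
  .set("spark.executor.cores", "2")      // CPU cores per executor process

val sc = new SparkContext(conf)          // the driver coordinates the job; executors run the tasks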

Performance tuning in Spark starts with finding the bottleneck, typically by measuring big data environment metrics such as blocked-time analysis. Because Spark relies heavily on in-memory caching, avoiding unnecessary network and disk I/O plays a key role in performance.
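
As a minimal sketch of keeping reused data in memory (the path and variable names are illustrative, and sc is assumed to be an existing SparkContext), an RDD that is used by more than one action can be cached so it is not re-read from disk each time:

import org.apache.spark.storage.StorageLevel

// Hypothetical example: the path and names are placeholders.
val lines = sc.textFile("/hdfs/path/sample.txt")
lines.persist(StorageLevel.MEMORY_ONLY)                  // keep partitions in memory across actions
val total  = lines.count()                               // first action materializes and caches the RDD
val errors = lines.filter(_.contains("ERROR")).count()   // reuses the cached data instead of re-reading from disk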

For example, take two clusters: one with 25 machines processing 750 GB of data, and a second with 75 machines processing 4.5 TB of raw data. For these workloads, network communication is largely irrelevant to performance; optimizing the network reduces job completion time by only about 5%. A more effective optimization is to keep the data serialized and compressed.
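
One way to keep cached data serialized and compressed is through Spark's serializer and compression settings; the sketch below uses real configuration keys, but the specific values are illustrative assumptions to tune for the actual workload:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("SerializedCacheExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // more compact than Java serialization
  .set("spark.rdd.compress", "true")                                     // compress serialized cached partitions

val sc = new SparkContext(conf)
val data = sc.textFile("/hdfs/path/sample.txt")
data.persist(StorageLevel.MEMORY_ONLY_SER)   // store partitions as serialized (and here compressed) bytes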

Apache Spark supports transformations such as groupByKey and reduceByKey that create shuffle dependencies: to evaluate them, Spark executes a shuffle, which transfers data around the cluster. By contrast, the three transformations in the chain below (map, flatMap, and filter) do not require a shuffle:

sc.textFile("/hdfs/path/sample.txt")
  .map(mapFunc)         // apply mapFunc to each line
  .flatMap(flatMapFunc) // flatMap can emit zero or more records per input
  .filter(filterFunc)   // keep only records that pass filterFunc
  .count()              // action: triggers execution and returns the number of records

The code above executes in a single stage, because none of the outputs of these three transformations depends on data from partitions other than their own inputs; the whole chain is a sequence of transformations on an RDD derived from the sample.txt file.
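
To make the shuffle dependency mentioned above concrete, here is a small sketch (the pair RDD is purely illustrative, and sc is assumed to be an existing SparkContext) contrasting reduceByKey, which combines values on the map side before shuffling, with groupByKey, which shuffles every value across the network:

// Hypothetical pair RDD; the data is purely illustrative.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

// reduceByKey combines values per key within each partition before the shuffle,
// so less data crosses the network.
val summed = pairs.reduceByKey(_ + _)

// groupByKey ships every (key, value) pair across the network and only then aggregates,
// which is usually more expensive for the same result.
val grouped = pairs.groupByKey().mapValues(_.sum)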

Suppose instead we want to count how many times each character appears across all the words that occur more than 1,000 times in a given text file. The Scala code below does this:

val token = sc.textFile(args(0)).flatMap(_.split(' '))    // split each line into words
val wc = token.map((_, 1)).reduceByKey(_ + _)             // word counts (first shuffle)
val filtered = wc.filter(_._2 >= 1000)                    // keep words appearing at least 1,000 times
val charCount = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _) // character counts (second shuffle)
charCount.collect()                                        // action: return the results to the driver

The code above breaks down into three stages: the two reduceByKey operations each require a shuffle and therefore create stage boundaries.
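
A simple way to see these stage boundaries (assuming the charCount RDD from the snippet above) is to print the RDD's lineage; the shuffle dependencies introduced by reduceByKey show up as ShuffledRDDs at indentation breaks in the output:

// Prints the lineage; each ShuffledRDD marks a stage boundary introduced by reduceByKey.
println(charCount.toDebugString)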