In this article, we will explain how to minimize data transfer when working with large data sets in Spark. Nowadays, interviewers often ask this type of question to Hadoop and Spark developers and admins alike.
How can we minimize data transfers using Spark?
To minimize data transfer across the cluster, Spark provides shared variables. There are two types of shared variables in Spark:
1. Broadcast variable:
With a broadcast variable, the driver ships a read-only copy of a data set to each node in the cluster exactly once, and every task running on that node reuses it. Without broadcasting, Spark serializes and sends a separate copy of the data set with each task in the cluster, which multiplies the transfer cost.
How to create a broadcast shared variable?
We create a broadcast variable in the driver program using SparkContext.broadcast (sc.broadcast), and Spark distributes it to all nodes. Tasks then read the shared data through the variable's value attribute. If you instead need worker tasks to send values back to the driver program, use an accumulator.
2. Accumulator:
Here we first define the variable in the driver program; each task then updates its local copy, and Spark merges those updates into a single aggregated result. In short, an accumulator is a shared variable that tasks can only add to, while only the driver reads the final value.
How to create an accumulator shared variable?
Accumulators are used to aggregate values (for example, counters and sums) across the tasks of a Spark job. We create one through SparkContext, using sc.accumulator.
Summary: Spark has many operations that move data between nodes, such as repartition, reduceByKey, and groupByKey, and we should keep such shuffles to a minimum. When the same data set must reach every node, or when tasks must report values back to the driver, we use the two shared variables discussed above: the broadcast variable and the accumulator. These are the main tools Spark developers, admins, and data engineers rely on to reduce data transfer in a cluster.