Spark Core programming in Scala with Examples

Spark Core is the foundation of the entire Spark stack. It can replace MapReduce for many batch workloads and is typically faster because it keeps intermediate data in memory where possible.




How to find out the number of records in a dataset using Spark?

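Here is a Spark program in Scala that loads a dataset and counts how many times each distinct record appears in it. If only the total number of records is needed, lines.count() returns it directly.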

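// Load the dataset into an RDD, one element per line
val lines = sc.textFile("Datalog.txt")
// Pair each record with 1, then sum the 1s for each distinct record
val recordCounts = lines.map(x => (x, 1)).reduceByKey(_ + _)
// Write the (record, count) pairs to the output directory
recordCounts.saveAsTextFile("/home/Spark/Data")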

A short explanation of the code above:

"textFile" loads the dataset into an RDD, one element per line.
"map" is a transformation that applies a function to every record in the dataset.
"split(delimiter)" is not used above, but it is commonly called inside "map" to break each record into fields at the delimiter.
"reduceByKey" is a transformation that merges the values of each key using the function passed to it.
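
The operations described above can be sketched together as follows. This is a minimal illustration, not a definitive implementation: the object name RecordCountSketch, the in-memory sample records (standing in for Datalog.txt), the comma delimiter, and the local[*] master are all assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RecordCountSketch {
  // Returns (total records, count per distinct record, count per first field).
  def run(sc: SparkContext): (Long, Map[String, Int], Map[String, Int]) = {
    // Hypothetical sample data standing in for Datalog.txt.
    val lines = sc.parallelize(Seq("INFO,start", "ERROR,disk", "INFO,start", "WARN,slow"))

    // count() is an action that returns the total number of records.
    val total = lines.count()

    // Pair each record with 1, then sum the 1s for each key.
    val perRecord = lines.map(x => (x, 1)).reduceByKey(_ + _).collect().toMap

    // split(",") breaks a record into fields; here the first field is the key.
    val perLevel = lines.map(x => (x.split(",")(0), 1)).reduceByKey(_ + _).collect().toMap

    (total, perRecord, perLevel)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val conf = new SparkConf().setAppName("RecordCountSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (total, perRecord, perLevel) = run(sc)
      println(s"total records: $total")
      println(s"per record: $perRecord")
      println(s"per first field: $perLevel")
    } finally {
      sc.stop()
    }
  }
}
```

With this sample, count() reports 4 records, while reduceByKey reports that "INFO,start" occurs twice. The two answers differ because count() looks at every record, whereas reduceByKey groups identical keys first.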

Note: Use the map transformation when each input record must produce exactly one output record, that is, a one-to-one relation from input to output.

How to load a programmatic array into an RDD in Spark?

SparkContext provides a method called parallelize that loads an array into an RDD, as in the code below:

val sparray = Array(10, 20, 30, 40)
val rdd = sc.parallelize(sparray, 3) // load the array into an RDD with 3 partitions

Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. The optional second argument (3 in the code above) sets the number of partitions the dataset is split into.
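
To make the effect of the partition argument concrete, here is a small sketch that reuses the array from above. The object name ParallelizeSketch and the local[*] master are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeSketch {
  // Returns (number of partitions, sum of all elements).
  def run(sc: SparkContext): (Int, Int) = {
    val sparray = Array(10, 20, 30, 40)

    // The second argument asks Spark to split the data into 3 partitions.
    val rdd = sc.parallelize(sparray, 3)

    // reduce is an action: each partition is summed in parallel,
    // then the partial sums are combined on the driver.
    (rdd.getNumPartitions, rdd.reduce(_ + _))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val conf = new SparkConf().setAppName("ParallelizeSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (partitions, sum) = run(sc)
      println(s"partitions: $partitions, sum: $sum") // prints "partitions: 3, sum: 100"
    } finally {
      sc.stop()
    }
  }
}
```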

val file = sc.textFile("Datalog.txt")
val errors = file.filter(x => x.contains("error")) // keep only the lines containing "error"

The filter transformation returns a new RDD containing only the records for which the given function returns true. Here, contains checks whether each line of the file includes the text "error".
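
One detail worth illustrating is that contains is case-sensitive, so a line reading "ERROR" would not match "error". The sketch below shows both a case-sensitive and a case-insensitive filter. The object name FilterSketch, the sample log lines (standing in for Datalog.txt), and the local[*] master are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterSketch {
  // Returns (case-sensitive matches, case-insensitive matches).
  def run(sc: SparkContext): (Long, Long) = {
    // Hypothetical sample lines standing in for Datalog.txt.
    val file = sc.parallelize(Seq(
      "error: disk full",
      "info: started",
      "ERROR: timeout",
      "warn: retry after error"
    ))

    // Keeps only lines that contain the exact lowercase text "error".
    val exact = file.filter(x => x.contains("error"))

    // Lowercasing each line first makes the match case-insensitive.
    val anyCase = file.filter(x => x.toLowerCase.contains("error"))

    (exact.count(), anyCase.count())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val conf = new SparkConf().setAppName("FilterSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (exact, anyCase) = run(sc)
      println(s"case-sensitive: $exact, case-insensitive: $anyCase")
    } finally {
      sc.stop()
    }
  }
}
```

With these four sample lines, the case-sensitive filter keeps two of them and the case-insensitive filter keeps three.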

How to find out distinct elements in the source RDD?




The distinct method of an RDD returns a new RDD containing only the distinct elements of the source RDD, with duplicates removed.

Example:

val filetech = sc.parallelize(List("Bigdata", "Hadoop", "Spark"))
val disttech = filetech.distinct
disttech.collect()
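
Because the list above has no repeated values, distinct returns the same three elements. The following sketch adds duplicates so the effect is visible. The object name DistinctSketch, the extra duplicate entries, and the local[*] master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctSketch {
  // Returns the distinct elements, sorted so the result is deterministic.
  def run(sc: SparkContext): Array[String] = {
    // Same values as above, with hypothetical duplicates added.
    val filetech = sc.parallelize(List("Bigdata", "Hadoop", "Spark", "Hadoop", "Spark"))

    // distinct() removes duplicates; the order of the result is not guaranteed.
    val disttech = filetech.distinct()

    // collect() brings the elements back to the driver as an Array.
    disttech.collect().sorted
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val conf = new SparkConf().setAppName("DistinctSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(run(sc).mkString(", ")) // prints "Bigdata, Hadoop, Spark"
    } finally {
      sc.stop()
    }
  }
}
```

The result is sorted here only because distinct shuffles data between partitions, so the order in which elements come back can vary from run to run.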

In short, Spark's Scala API offers many functional operators, such as map, filter, reduceByKey, and distinct, that let you express distributed computations in a few lines of code.