Spark & Scala Interview Questions and Answers

1. What is Scala, what is its importance, and how does it differ from other programming languages (Java/Python)?

Scala is a powerful language for developing big data applications. It provides several features that yield significant productivity and helps developers write robust code with fewer bugs. Apache Spark itself is written in Scala, so Scala is a natural fit for developing Spark applications.

2. What is an RDD? Explain briefly.

An RDD (Resilient Distributed Dataset) is the primary abstraction in the Spark API. An RDD is a collection of partitioned data elements that can be operated on in parallel. RDDs have properties such as immutability, partitioning, fault tolerance, lazy evaluation, and in-memory caching.

Immutable: RDDs are immutable data structures. Once created, an RDD cannot be modified.

Partitioned: The data in an RDD is partitioned across a distributed cluster of nodes. Multiple Cassandra partitions, for example, can be mapped to a single RDD partition.

Fault tolerant: RDDs are designed to be fault-tolerant. Because RDD data is stored across a large distributed cluster, a node failure can cause the loss of the partitions stored on that node.

RDDs handle node failure automatically. Spark maintains lineage metadata for each RDD, and using that information the lost partitions can be recomputed from data on the surviving nodes.

Interface: RDDs provide a uniform interface for processing data from a variety of data sources, such as HDFS, HBase, Cassandra, MongoDB, and others. The same interface can also be used to process data stored in memory across a cluster of nodes.

In-memory: The RDD class provides the API for in-memory cluster computing; Spark allows RDDs to be cached or persisted in memory.

3. How do you register a temporary table in Spark SQL?

When we create a DataFrame by loading data through the SQLContext object, it can be registered as a temporary table. The table is temporary because its scope is limited to the session in which the DataFrame was created.
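A minimal sketch of registering and querying a temporary table (assumes a running SparkSession named `spark` and an illustrative file path; on the older SQLContext API, `registerTempTable` plays the same role as `createOrReplaceTempView`):

```scala
// Load data into a DataFrame, then register it as a session-scoped temporary view
val df = spark.read.json("/tmp/people.json")   // illustrative path
df.createOrReplaceTempView("people")           // visible only in this Spark session

// Query the temporary table with Spark SQL
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()
```

The view disappears when the session ends, which is why it is called a temporary table.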

4. How do you count the number of lines in a file in Scala?

In Scala, the lines read from a source can be counted with getLines.size:

Example: val countLines = source.getLines.size
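A complete, runnable version of the above (the sample file is created on the fly; note that getLines returns an Iterator, so calling size consumes it):

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Create a small temporary sample file to count
val file = File.createTempFile("sample", ".txt")
file.deleteOnExit()
val writer = new PrintWriter(file)
writer.write("line one\nline two\nline three\n")
writer.close()

// Count the lines: getLines returns an Iterator[String], size consumes it
val source = Source.fromFile(file)
val countLines = source.getLines.size
source.close()

println(countLines)   // 3
```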


java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver in Spark Scala

While writing Apache Spark applications in Scala or Python (PySpark) to read data from an Oracle database on Linux or Amazon Web Services, you will sometimes get the error below. It occurs when the Oracle JDBC driver is missing from spark.driver.extraClassPath or spark.executor.extraClassPath.

Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver

at java.lang.ClassLoader.loadClass(
at java.lang.ClassLoader.loadClass(
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:35)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1(JdbcUtils.scala)
at <init>(<console>:46)
at <init>(<console>:52)
at <clinit>(<console>)
at <init>(<console>:7)
at <clinit>(<console>)
at $print(<console>)


After getting this error, the simple solution below applies to Spark with Scala; the same error sometimes appears in Python (PySpark) as well.

Add the related jars to both the executor and driver classpaths. First, edit the Spark defaults configuration file (spark-defaults.conf), adding the two jar file paths below:

spark.driver.extraClassPath /home/hadoop/ojdbc7.jar
spark.executor.extraClassPath /home/hadoop/ojdbc7.jar

The jar file in these configurations must exactly match the version compatible with your Spark installation; otherwise you will get compatibility issues.

If these two classpath settings still produce an error, pass the jar files in your Scala or PySpark program with --conf (or the equivalent driver and executor options) when launching Spark, as in the example at the bottom of the page.

If you get the same issue again, follow the solution below:

Step 1: Download the Oracle JDBC jar files from the official Maven repository.

Step 2: Copy the downloaded jar files into the shared jars location of your Spark installation.


For example, a PySpark code snippet:


pyspark --driver-class-path /home/hadoop/ojdbc7.jar --jars /home/hadoop/ojdbc7.jar

from pyspark import SparkContext, SparkConf  # import Spark context and configuration

from pyspark.sql import SQLContext  # SQL context

sqlContext = SQLContext(sc)

# Database connection to read data into Spark via JDBC
dbconnection = sqlContext.read.format("jdbc").options(url="Give your jdbc connection url path").load()


Pattern Matching in Scala with Examples

A pattern match includes a sequence of alternatives, each starting with the keyword case. Each alternative includes a pattern and one or more expressions.

Example :

object PatternDemo {
  def main(args: Array[String]) {
    val months = List("Jan", "Feb", "Mar", "Apr", "May")
    for (month <- months) {
      println(month)
      month match {
        case "Jan" => println("First Month of the Year")
        case "Feb" => println("Second Month of the Year")
        case "Mar" => println("Third Month of the Year")
        case "Apr" => println("Fourth Month of the Year")
        case "May" => println("Fifth Month of the Year")
      }
    }
  }
}

Output:

Jan
First Month of the Year
Feb
Second Month of the Year
Mar
Third Month of the Year
Apr
Fourth Month of the Year
May
Fifth Month of the Year

Example 2:

object techPatternMatch {
  def main(args: Array[String]) {
    val technologies = List("Java", "Python", "Hadoop", "Scala")
    for (tech <- technologies) {
      println(tech)
      tech match {
        case "Hadoop" | "Spark" => println("Big data technologies")
        case "Java" | "C++" => println("OOPS")
        case "Python" | "Go" => println("Advanced Tech")
        case "Scala" => println("Functional Programming")
      }
    }
  }
}

Output:

Java
OOPS
Python
Advanced Tech
Hadoop
Big data technologies
Scala
Functional Programming


object letterMatch_Case {
  def main(args: Array[String]) {
    for (x <- List(1, 2)) {
      println(x)
      x match {
        case x if x % 2 == 0 => println("Number is Even")
        case x if x % 2 == 1 => println("Number is Odd")
      }
    }
  }
}

Output:

1
Number is Odd
2
Number is Even
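A pattern match is also an expression, so it can return a value directly; the wildcard `_` serves as a default case (the function name here is illustrative):

```scala
// match as an expression: the body of the selected case becomes the result
def describe(n: Int): String = n match {
  case 0          => "zero"
  case x if x < 0 => "negative"
  case _          => "positive"   // wildcard: the default alternative
}

println(describe(0))    // zero
println(describe(-4))   // negative
println(describe(7))    // positive
```

Without a matching case (and no wildcard), a match throws a MatchError at runtime, so a default case is usually good practice.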



List:

Scala Lists are similar to arrays in that all elements of a list have the same type, but there are two important differences: first, Lists are immutable (they cannot be changed), and second, Lists are represented internally as linked lists.

scala> val names: List[String] = List("Sreekanth", "Vijay", "Vinay");
names: List[String] = List(Sreekanth, Vijay, Vinay)

scala> val names: List[String] = List("Alien", "Bob", "Carey");
names: List[String] = List(Alien, Bob, Carey)

scala> println(names(0));
O/P: Alien

scala> println(names(1));
O/P: Bob

scala> val marks: List[Int] = List(10, 20, 30, 40)
marks: List[Int] = List(10, 20, 30, 40)

scala> println(marks.head)
O/P: 10

scala> println(marks.tail)
O/P: List(20, 30, 40)

To concatenate two lists, use ::: (note that :: builds a list element by element and must end with Nil):

val names = "Sreekanth" :: ("Vijay" :: ("Vinay" :: Nil))
names: List[String] = List(Sreekanth, Vijay, Vinay)

val address = "Hyd" :: ("Banglore" :: ("Chennai" :: Nil))
address: List[String] = List(Hyd, Banglore, Chennai)

val name_address = names ::: address
name_address: List[String] = List(Sreekanth, Vijay, Vinay, Hyd, Banglore, Chennai)
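Because a List is a linked list ending in Nil, it is naturally processed by deconstructing it with the :: pattern; a small sketch (the helper name is illustrative):

```scala
// Recursively sum a list by splitting it into head :: tail
def sumList(xs: List[Int]): Int = xs match {
  case Nil          => 0                      // empty list: nothing to add
  case head :: tail => head + sumList(tail)   // add the head, recurse on the tail
}

val marks = List(10, 20, 30, 40)
println(sumList(marks))   // 100
```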

Set:

A Set is a collection that contains no duplicate elements. There are two kinds of Sets, immutable and mutable. The difference is that when an object is immutable, the object itself cannot be changed.

By default, Scala uses the immutable Set.

scala> val ranks = List(1, 2, 3, 4, 2, 2, 3, 2)
ranks: List[Int] = List(1, 2, 3, 4, 2, 2, 3, 2)

scala> val ranks = Set(1, 2, 3, 4, 2, 2, 3, 2)
ranks: scala.collection.immutable.Set[Int] = Set(1, 2, 3, 4)

Set Operations in Scala:

Example:

object setOperations {
  def main(args: Array[String]) {
    val marks = Set(10, 20, 30, 40);
    val updateMarks = Set(15, 25, 35, 45);
    println("Max marks: " + marks.max);
    println("Min marks: " + marks.min);
    println("marks.intersect(updateMarks): " + marks.intersect(updateMarks));
  }
}
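Besides max, min, and intersect, Sets support union and difference; a small sketch (the values are chosen so the results are non-empty):

```scala
val marks = Set(10, 20, 30, 40)
val updateMarks = Set(15, 25, 35, 40)

println(marks.union(updateMarks))   // every element from both sets, duplicates removed
println(marks.diff(updateMarks))    // elements in marks but not in updateMarks
println(marks.contains(20))         // membership test
```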





Map:

  1. A Scala Map is a collection of key/value pairs.
  2. Any value can be retrieved based on its key.
  3. Keys are unique in a Map, but values need not be unique.
  4. There are two kinds of Maps, immutable and mutable.

// mapProgram.scala

object mapPrg {
  def main(args: Array[String]) {
    val technology = Map("Java" -> "OOPS", "ML" -> "AI", "Hadoop" -> "Big Data");
    println("Keys in Technologies: " + technology.keys);
    println("Values in Technologies: " + technology.values)
  }
}

$ scalac mapProgram.scala
$ scala mapPrg

Keys in Technologies: Set(Java, ML, Hadoop)
Values in Technologies: MapLike(OOPS, AI, Big Data)
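Values are retrieved by key; getOrElse avoids an exception when the key is absent, and + on an immutable Map returns a new Map with the extra pair (the sample pairs are illustrative):

```scala
val technology = Map("Java" -> "OOPS", "Hadoop" -> "Big Data")

println(technology("Java"))                     // direct lookup by key
println(technology.getOrElse("Go", "Unknown"))  // safe lookup with a default

// Immutable update: returns a new Map, the original is unchanged
val extended = technology + ("Scala" -> "Functional Programming")
println(extended.size)    // 3
println(technology.size)  // still 2
```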

SCALA Iterators & Arrays With Examples


A Scala iterator is not a collection, but rather a way to access the elements of a collection one by one.

There are two basic operations on iterators:

  1. next
  2. hasNext

A call to it.next() will return the next element of the iterator and advance its state.

A call to it.hasNext() will return true if there are more elements in the iterator.


scala> val itObject = Iterator("Hadoop", "Scala", "Spark");
itObject: Iterator[String] = non-empty iterator

scala> while (itObject.hasNext) {
println(" " + itObject.next())
}
 Hadoop
 Scala
 Spark

scala> val itObject2 = Iterator("Hadoop", "Scala", "Spark");

scala> println("Size of Iterator: " + itObject2.size)
Size of Iterator: 3

(size is called here on a fresh iterator, because the first one was already consumed by the while loop.)
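Because an iterator is single-pass, any full traversal, including a call to size, exhausts it; a small sketch:

```scala
val it = Iterator("Hadoop", "Scala", "Spark")

val asList = it.toList   // the traversal consumes the iterator
println(asList)          // List(Hadoop, Scala, Spark)
println(it.hasNext)      // false: nothing left to read

val it2 = Iterator(1, 2, 3)
println(it2.size)        // 3, but this call also exhausts it2
println(it2.hasNext)     // false
```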


A Scala array is a normal array. To create an array of integers:

scala> var arrayEle = Array(1, 2, 3, 4, 5, 6, 7, 8);
arrayEle: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8)

scala> for (x <- arrayEle) {
println("Array elements: " + x)
}

scala> println(arrayEle.length)
O/p: 8
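Unlike Lists, arrays are mutable: an element can be updated in place by index; a small sketch:

```scala
val arrayEle = Array(1, 2, 3, 4)

arrayEle(0) = 10                  // in-place update by index
println(arrayEle.mkString(","))   // 10,2,3,4
println(arrayEle.sum)             // 19
```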

SCALA LOOPS With Examples


For Loop:

A for loop is a repetition control structure that allows you to efficiently write a loop that needs to execute a specific number of times.

Different types of for loop in SCALA:

  1. For Loop with Collections
  2. For Loop with Range
  3. For Loop with Filters
  4. For Loop with Yield.

For Loop with Collections:

for (x <- list) {
  statements
}

Here the list variable is a collection containing a list of elements, and the for loop iterates through all the elements, returning one element in the x variable at a time.


object ForLoopCollections {
  def main(args: Array[String]) {
    val numList = List(1, 2, 3, 4, 5);
    for (x <- numList) {
      println("Value of X: " + x)
    }
  }
}

For Loop with Range:

for (x <- range) {
  statements
}

Here range is a sequence of numbers, written as i to j, and "<-" is called the generator operator because it generates individual values from the range.


object ForLoopRange {
  def main(args: Array[String]) {
    for (x <- 1 to 100) {
      println("Value of X: " + x);
    }
  }
}

For Loop with Filters:

for (x <- list if condition1; if condition2; if condition3) {
  statements
}


object ForLoopFilterDemo {
  def main(args: Array[String]) {
    val numList = List(1, 2, 3, 4, 5, 6, 7, 8, 9);
    for (x <- numList if x != 5; if x < 9) {
      println("Value of X: " + x);
    }
  }
}

For Loop with Yield:

val norVal = for { x <- list if condition1; if condition2 } yield x

Here the values returned from the for loop are collected into a variable, or can be returned through a function.


object ForLoopYieldDemo {
  def main(args: Array[String]) {
    val numList = List(1, 2, 3, 4, 5, 6, 7, 8, 9);
    val norVal = for { x <- numList if x != 3; if x < 9 } yield x
    for (x <- norVal) {
      println("Value of X: " + x);
    }
  }
}

While Loop

A while loop statement repeatedly executes a target statement as long as a given condition is true.

while (condition) {
  statements
}

object WhileLoopDemo {
  def main(args: Array[String]) {
    var x = 100;
    while (x < 300) {
      println("Value of X: " + x);
      x = x + 1;
    }
  }
}

Do While Loop

Unlike the while loop, which tests the loop condition at the top of the loop, the do while loop checks its condition at the bottom of the loop.

A do while loop is similar to a while loop, except that it is guaranteed to execute at least one time.

do {
  statements
} while (condition);


object DoWhileLoopDemo {
  def main(args: Array[String]) {
    var x = 100;
    do {
      println("Value of X: " + x);
      x = x + 1;
    } while (x > 300)   // the condition is false, yet the body executed once
  }
}

Functions in SCALA:

1. Higher Order Functions in Scala:

I) foreach()

II) map()

III) reduce()

I) foreach():

scala> val technologies = List("Hadoop", "Java", "Salesforce");
technologies: List[String] = List(Hadoop, Java, Salesforce)

scala> technologies.foreach((t: String) => println(t));
Hadoop
Java
Salesforce

II) map():

scala> val technologies = List("Hadoop", "Java", "Salesforce");
technologies: List[String] = List(Hadoop, Java, Salesforce)

scala> val tSize = technologies.map((c) => c.size);
tSize: List[Int] = List(6, 4, 10)

scala> println(tSize);
List(6, 4, 10)

III) reduce():

scala> val score = List(183, 186, 190, 191)
score: List[Int] = List(183, 186, 190, 191)

scala> val totalScore = score.reduce((a: Int, b: Int) => a + b);
totalScore: Int = 750

scala> val totalScore = score.reduce((a, b) => a + b);
totalScore: Int = 750

Anonymous Functions in SCALA:

Anonymous functions are essentially the combination of function literals and function values.

Function literals: an anonymous function in source code is called a function literal.

Function values: at runtime, function literals are instantiated into objects called function values.

Example of Anonymous Function:


object Anonymous_Funct {
  def main(args: Array[String]) {
    var add = (x: Int) => x + 1;
    var mul = (a: Int, b: Int) => a * b;
    println("Addition value: " + add(10));
    println("Multiplied value: " + mul(9, 10));
    println("New Addition Value: " + (add(10) - 5));
    println("New Multiplied Value: " + (mul(10, 9) + 50));
  }
}

$ scalac Anonymous_Funct.scala
$ scala Anonymous_Funct

Addition value: 11
Multiplied value: 90
New Addition Value: 6
New Multiplied Value: 140

Spark Streaming Twitter Example


//Using Scala Program

package org.apache.spark.demo.streaming

import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf

object TwitterTags {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: TwitterTags <consumer key> <consumer secret> " +
        "<access token> <access token secret> [<filters>]")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = args.take(4)
    val filters = args.takeRight(args.length - 4)

    // Set the system properties so that the Twitter4j library used by the twitter stream
    // can use them to generate OAuth (Open Authentication) credentials
    System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
    System.setProperty("twitter4j.oauth.accessToken", accessToken)
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

    val sparkConf = new SparkConf().setAppName("TwitterTags")
    val scc = new StreamingContext(sparkConf, Seconds(3))
    val stream = TwitterUtils.createStream(scc, None, filters)

    val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))

    val topCounts = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
      .map { case (topic, count) => (count, topic) }
      .transform(_.sortByKey(false))
    val topCounts1 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(30))
      .map { case (topic, count) => (count, topic) }
      .transform(_.sortByKey(false))

    // Print popular hashtags
    topCounts.foreachRDD(rdd => {
      val topList = rdd.take(30)
      println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
      topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
    })

    topCounts1.foreachRDD(rdd => {
      val topList = rdd.take(60)
      println("\nPopular topics in last 30 seconds (%s total):".format(rdd.count()))
      topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
    })

    scc.start()
    scc.awaitTermination()
  }
}

Spark Streaming Use Case


Using the Scala streaming imports:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds

Spark Streaming Context:

This also sets up the underlying SparkContext that it will use to process the data. It takes as input a batch interval specifying how often to process new data.


We use socketTextStream() to create a DStream based on text data received on the local machine.

Then we transform the DStream with filter() to get only the lines that contain "error". The output operation print() prints some of the filtered lines.

// Create a StreamingContext with a 1-second batch size from a SparkConf

val scc = new StreamingContext(conf, Seconds(1))

// Create a DStream using data received after connecting to the default port on the local machine
val lines = scc.socketTextStream("localhost", 9000)

// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))

// Print out the lines with errors
errorLines.print()

In the example of converting a stream of lines to words, the flatMap operation is applied to each RDD in the lines DStream to generate the RDDs of the words DStream. Input DStreams are DStreams representing the stream of input data received from the streaming source; in that example, lines was an input DStream, as it represented the stream of data received from the server.

Every input DStream is associated with a Receiver object (whether implemented in Java, Scala, etc.), which receives the data from a source and stores it in Spark's memory for processing. Spark Streaming provides two categories of sources:

  1. Basic Sources: sources directly available in the StreamingContext API, for example file systems and socket connections.
  2. Advanced Sources: sources available indirectly, like Flume, Kafka, Twitter, etc., through extra utility classes.

How to Install SpringToolSuite(STS) with Scala on Windows 7/8/10

Java is a widely used programming language for software developers, and Spring Boot is an advanced framework in the Spring ecosystem, commonly developed using the Spring Tool Suite (STS) IDE. Here is how to install STS in a few simple steps, and how to install Scala support in the STS IDE: mainly, how to integrate Scala with STS for Scala developers who do not have the Scala IDE.

STS can thus be used by both Java developers and Scala developers.

Prerequisites: Java (JDK 1.7 or later) is mandatory to install STS along with Scala.

How to Install STS with SCALA on Windows:

Step 1: Download the STS zip file from the official Spring website.

Step 2: Extract the downloaded Spring Tool Suite zip file, then open the STS directory and click on the STS icon.

Step 3: Run the file, then select a directory as your workspace by browsing to it.

Step 4: After the workspace opens, see the Spring Dashboard for more information.

Step 5: Go to the top menu bar, click on the Help option, then click on Eclipse Marketplace.

Step 6: In the Eclipse Marketplace, search for Scala IDE and click the Install option.

Step 7: Click "I Accept" for the Apache License agreement, then click the Finish button.

Step 8: Restart the Spring Tool Suite IDE, then start working with Scala.

These simple steps integrate Scala into Eclipse or Spring Tool Suite for developers. If you do not need this type of Scala installation, use the Scala IDE from the official Scala website instead.