What is Apache Spark Eco-System | Spark SQL | Spark Streaming | GrapX

What is Apache Spark?
Spark is a fast, easy to use and flexible data processing and in-memory compute framework. It can run on top of Hadoop eco-system, and Cloud accessing diverse data sources including HDFS, HBase, and other services.

Different Key Features of Spark:

2.In General Purpose



What is the Spark Engine?

Spark engine is for scheduling, distributing and monitoring the large data applications.

What is RDD?

RDD means that Resilient Distribution DataSets. Designed to be fault-tolerant and represents data distributed across the cluster. If node failing is proportional to the number of nodes in a cluster.

RDD supports two operations:
1. Transformations


What is Hive on Spark?

Hive support for Apache Spark, wherein Hive execution is configured to Spark below configurations:

hive > set spark.home=/location/ to /Spark_Home

hive > set hive.execution.engine=spark;

Hive on Spark supports Spark on yarn mode by default

Spark Eco-System

1.Spark SQL – For developing

2.Spark Streaming – For live data streaming

3.GraphX for computing graphs

4.MLib for Machine learning

5.SparkR for Spark engine.

What is Spark SQL?

Spark SQL called as a Shark is a novel module. It introduced that Spark with structured data and processing. Spark executes relational SQL queries on data. The core of the Spark SQL is to supports the RDDs.

What is Spark Streaming?

Apache Spark streaming supports live data processing. It is an extension to the Spark API, allowing stream processing of continuous live data streams. For example data from different sources like HDFS, Flume services are streamed and finally processed to file systems.

What is Spark GraphX?

Spark GraphX means that processing the graphs to build and transform capable graphs. And its component enables programmers to reason about structured data at small.

What is Spark MLib?

Spark MLib is a scalable machine learning library provided by an Apache Spark. It provides easy to understand with algorithms and use different use cases like clustering, filtering, etc.


Spark & Scala Interview Questions and Answers

1. What is Scala what are the importance of it and the difference between Scala and other Programming languages (Java/Python)?

Scala is the most powerful language for developing big data environment applications. Scala provides several benefits to achieve significant productivity. It helps to write robust code with fewer bugs.  Apache Spark is written in Scala, so Scala is a natural fit for the developing Spark applications.

2. What is RDD tell me in brief?

Spark RDD is a primary abstract class in Spark API. RDD is a collection of partitioned data elements that can be operated in parallel. Normally, RDD is supporting properties like Immutable, Cacheable, Type Infer, and Lazy evaluation.

Immutable: RDD’s are Immutable data structures. Once created, it cannot be modified

Partitioned: The Data in RDD’s are partitioned across the distributed cluster of nodes. However, multiple Cassandra partition can be mapped to one single RDD partition

Fault Tolerance: RDD is designed to be a fault – tolerant. Because the RDD data is stored across the large distributed cluster. So there is a chance for node failure in that cluster by this we can lose the Partitioned data in that node.

RDD automatically handles the node failure. Spark will maintain the metadata of each RDD and details about the RDD. So by using that information, we can get that data from other nodes.

Interface: RDD provides a uniform interface for processing data from a variety of data sources such as HDFS, HBase, Cassandra, MongoDB, and others. The same interface can also be used to process data stored in memory across a cluster of nodes.

InMemory: The RDD class provides the API for enabling in-memory cluster computing. Spark allows RDDs to be cached or persisted in memory

3. How to register a temporary table in Spark SQL?

When we creating the data frame by loading the data into it using SQL Context object. This is treated a temporary table. Because the scope of the data frame is to a particular session

4. How to count the number of lines in Scala?

In Scala programming language using getLines.size property we can count

Example: Val countlines = source.getLines.size


Spark Lazy Evaluation and Advantages with Example

Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data (Bigdata). Spark provides a simple programming model than that provided by Map Reduce. Developing a distributed data processing application with Apache Spark is a lot easier than developing the same application with Map Reduce. In Hadoop MapReduce provides only two operations for processing the data like “Map” & “Reduce”, whereas Spark comes with 80 plus data processing operations to work with big data application.

While data processing from source to destination. Spark is 100 times faster than Hadoop Map Reduce because it allows in-memory clustering computing, it implements an advanced execution engine.

What is meant by Apache Spark Lazy Evaluation?

In Apache Spark, two types of RDD operations are


II) Actions.

We can define new RDDs any time, Apache Spark computes them only in a lazy evaluation. That is, the first time they are used in an action. The Lazy evaluation seems unusual at first but makes a lot of sense when you are working with large data(BigData).

Simply Lazy Evaluation in Spark means that the execution will not start until an action is triggered. In Apache Spark, the picture of lazy evaluation comes when Spark transformation occurs”.

Consider where we defined a text file and then filtered the lines that include “CISCO” client name if Apache Spark were to load and store all the lines in the file as soon as we wrote like lines = sc.text( file path ). Here Spark Context would waste a lot of o storage space, given that we then immediately filter out many lines. Instead, once Spark seems that whole chain transformation. It can compute the data needed for its result. Hence first() action, Apache Spark scans the file only until it finds the first matching line; it doesn’t even read the whole file.

Advantages Of Lazy Evaluation in Spark Transformations:

Some advantages of Lazy evaluation in Spark in below:

  • Increase Manageability: The Spark Lazy evaluation, users can divide into smaller operations. It reduces the number of passes on data by transformation grouping operation.
  • Increases Speed: By lazy evaluation in Spark to saves the trip between driver and cluster, speed up the process.
  • Reduces Complexities: There are two types of complexities of any operations are Time and Space complexity using Spark lazy evaluation we can overcome both complexities. The action is triggered only when the data is required.

Simple Example:

In Spark, Lazy evaluation below code writes in  Scala, who evaluates the expression as it’s declared.

With Lazy:

Scala> Val sparkList = List(1,2,3,4)

Scala> lazy val output = sparkList .map( 1 => 1*10)

Scala> println( output )


List( 10, 20, 30, 40 )

Spark Performance Tuning with pictures

Spark Execution with a simple program:

Basically, the Spark program consists of a single spark driver and process and a set of executors processes across nodes of the cluster.

Performance tuning of Spark is measured bottleneck using big data environment metrics for block time analysis. Spark is run on In-memory cache so need to avoid network and I/O are key role while performance.

For example: Take two clusters, one cluster with 25 machines and cluster size is 750 GB of data. The second cluster with 75 machines clusters with 4.5TB of raw data.  The network communication is always irrelevant for the performance of these workloads coming to network optimization is to reduce job completion by 5% for better performance. Finally serialized compressed data.

Mostly Apache Spark always supports transformations like groupByKey and reduceByKey dependencies. Spark executes a shuffle, which transfers data around the cluster. Below three operations  with different outputs:

sc.textFile(" /hdfs/path/sample.txt")
map(mapFunc) #Using Map function 
flatMap(flatMapFunc) #Flatmap is another transformer
count() # Count the data.

Above code executes a single performer, which depends on a sequence of transformations on an RDD derived from the sample.tx file.

If the code contains how many times each character appears in all the words that appear more 1000 times in a given text file. Below code in Scala:

Val  token=sc.textFile(args(0).flatMap(_.split(' ' ))
Val wc=token.map((_,1)).reduceByKey(_+_)
Val filtered = wc.filter(_._2 > =1000)
val charCount = filtered.flatMap(_._1.toCharArray).map((_, reduceByKey(_+_)

Above code breaks into mainly three stages. The reduceByKey operations result in stage boundaries




Java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver in Spark Scala

While writing Apache Spark in Scala / Python (PySpark) programming language to read data from Oracle Data Base using Scala / Python in Linux operating system/ Amazon Web Services, sometimes will get below error in

spark.driver.extraClassPath in either executor class or driver class.

Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver

at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)

at Java .lang.ClassLoader.loadClass(ClassLoader.java.424)

at Java.lang.ClassLoader.loadClass(ClassLoader.java.357)

at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:35)

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anofun$createConnectionFactory$1.api

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anofun$createConnectionFactory$1.api

at scala.Option.foreach ( Option .scla:236)

at org . apache . spark . sql . execution . datasources . jdbc.JdbcUtils $ anofun $ createConnection Factory $ (JdbcUtils.scala)

at <init> ( < console >:46)

at . <init> (<console>:52)

at. <clinit> (<console>)

at. <init> (<console>:7)

at. <clinit> (<console>)

at $print (<console>)


After getting this error will provide a simple solution in Spark Scala. Sometimes these are coming to Python (PySpark).

import related jars to both executor class and driver class. First, we need to edit the configuration file as spark defaults in spark-default.conf file.

Adding below two jar files path in spark-default.conf file.

spark.driver.extraClassPath /home/hadoop/ojdbc7.jar
spark.executor.extraClassPath /home/hadoop/ojdbc7.jar

Above two jar files path in configurations with exact version is matched to your Spark version otherwise will get the compatible issue.

Sometimes these two classpaths get an error then will add in your code either Scala or Pyspark programming –conf before Spark driver, executor jar files in bottom of the page example.

If will get the same issue again then will follow the below solution:
Step 1: Download Spark ODBC jar files from the official Maven website.

Step 2: Copy the download jar files into the below path in the share location in Spark.


For Example –  PySpark programming code snippet for more information.


pyspark --driver-classpath /home/hadoop/odbc7.jar --jars #  jar file path
from pyspark import SparkContext, Spark conf # import Spark Context and configuration

from pyspark.sql import SparkContext #Sql context

sqlContext = sqlContext (sc)

dbconnection = sqlContext . read . format ("jdbc") . options (url: "Give your jdbc connection url path").load()
#Data Base connection to read data from Spark with the help of jdbc


Hadoop and Spark Interview Questions

Nowadays IT market Hadoop and Spark interview question for experienced persons.

Round 1:

1. What is the future class in Scala programming language?

2.Difference between fold by fold Left or foldRight-in Scala?

3. How to distribute by will work in hive give some data tell me how to data will be distributed

4.dF.filter(Id == 3000) how to pass this condition in data frame on values in dynamically?

5. Have you worked on multithreading in Scala and explain?

7.On what basis you will increase the mappers in Apache Sqoop?

8. What will you mention last value while you are importing for the first time in Sqoop?

9. How do you mention date for incremental last modified in Spark?

10. Let’s say you have created the partition for Bengaluru but you loaded Hyderabad data what is the validation we have to do in this case to make sure that there won’t be any errors?

11. How many reducers will be launched in distributed by in Spark?

12. How to delete sqoop job in simple command?

13.In which location sqoop job last value will be stored?

14. What are the default input and output formats in Hive?

15. Can you explain brief idea about distributing cache in Spark with an example?

16. Did you use Kafka/Flume in your project and explain in detail?

17.Difference between Parquet and ORC file formats?

Round 2:

1. Explain your previous project?

2. How do you handle incremental data in apache sqoop?

3. Which Optimization techniques are used in sqoop?

4. What are the different parameters you pass your spark job?

5. In case one task is taking more time how will you handle?

6. What is stages and task in spark and give a real-time scenario?

7.On what basis you set mappers in Sqoop?

8. How will you export the data to Oracle without putting much load in the table?

9. What is column family in Hbase?

10. Can you create a table without mentioning column family

11.The number of column families limits for one table?

12. How to schedule Spark jobs in your previous project?

13. Explain Spark architecture with a real-time based scenario?

Resilient Distributed Datasets(RDD) in Spark


Resilient Distributed Datasets represents a collection of partitioned data elements that can be operated on in a parallel manner. RDD is the primary data abstraction mechanism in Spark and defined as an abstract class in Spark library it is similar to SCALA collection and it supports LAZY evaluation.

Characteristics of RDD:

1.Immutable :

RDD is an immutable data structure. Once created, it cannot be modified in-place. Basically, an operation that modifies RDD returns a new RDD.


In RDD Data is split into partitions. These partitions are generally distributed across a cluster of nodes. When Spark is running on a single machine all the partitions are on that machine.


RDD Operations :

Applications in Spark process data using the same methods in RDD class. It referred to as operations

RDD operations are two types:



 1.Transformations :

A transformation method of an RDD creates a new RDD by performing a computation on the source RDD.

RDD transformations are conceptually similar to SCALA collection methods.

The key difference is that the SCALA collection methods operate on data that can fit in the memory of a single machine, whereas RDD methods can operate on data distributed across a cluster of node RDD transformations are LAZY but SCALA collection methods are strict.

A) Map:

The map method is a higher order method that takes a function as input and applies it to each element in the source RDD to create a new RDD.

B) filter:

The filter method is a high order method that takes a Boolean function as input and applies it to each element in the source RDD to create a new RDD. A Boolean function takes an input and returns false or true. It returns a new RDD formed by selecting only those elements for which the input Boolean function returned true. The new RDD contains a subset of the elements in the original RDD.

c) flatMap:

This method is a higher order method that takes an input function in Spark, it returns a sequence for each input element passed to it. The flatMap method returns a new RDD formed by flattening this collection of the sequence.

D) mapPartitions :

It is a higher order method allows you to process data at a partition level. Instead of passing one element at a time to its input function, mapPartitions passes a partition in the form an iterator. The input function to the mapPartitions method takes an iterator as input and returns iterator as output.


Intersection method itakesRDD as input and returns a new RDD that contains the intersection of the element in the source RDD and the RDD passed to it as an input.


This method takes  RDD as input and returns a new RDD that contains a Union of the element in the resource RDD and the RDD passed to it as an input.


Subtract method takes RDD as input and returns a new RDD that contains elements in the source RDD but not in the input RDD.



The Prallelized collections are created by calling Spark Context’s parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed data set that can be operated on in parallel.



Distinct method of an RDD returns a new RDD containing the distinct elements in the source RDD


J)Group By:

Group By is a higher order method it groups the elements of  RDD according to user-specified criteria. It takes as input a function that generates a key for each element in the source RDD. It is applicable to all the elements in the source RDD and returns an RDD of pairs.


K)Sort By:

The sortBy method is a higher order it returns RDD with sorted elements from the source RDD. It takes two input parameters. The first input is a function that generates a key for each element in the source RDD. The second input allows specifying ascending or descending order for sort.



Coalesce method reduces the number of partitions in  RDD. It takes an integer input and returns new RDD with the specified number of partitions.



The GroupByKey method returns an RDD of pairs, where the first element in a pair is a key from the source RDD and the second element is a collection of all values that have the same key. It is the same as the groupBy method. The major difference is that groupBy is a higher order method that takes an input function that returns a key for each element in the source RDD. The groupByKey method operates in an RDD of key-value pairs.


The higher-order reduceBy key method takes an associative binary operator as input and reduces values with the same key to a single value using specified binary operators.


Actions are RDD methods that return a value to a driver program.


The collect method returns the elements in the source RDD as an array. This method should be used with caution since it moves data from all the worker to the driver program.


This method returns a count of the elements in the source RDD.

C)Count By Value :

The countByValue method returns a count of each unique element in the source RDD. It returns an instance of the Map class containing each unique element and its count as a key-value pair.


The first method returns the first element in the source RDD


The max method returns the largest element in  RDD


The min method returns the smallest element in RDD


The top method takes an integer N as input and returns an array containing the N largest elements in the source RDD.


The high order reduces method aggregates the elements of the source RDD using an associative and commutative binary operator provided to it.


The countByKey methods count the occurrences of each unique key in the source RDD. It returns a Map of key count pairs.

Spark Streaming Twitter Example

Spark Streaming Twitter Example:

//Using Scala Program

package org . apache . spark . demo . streaming

import org . apache . spark . streaming . SparkContext._

import org . apache . spark . streaming . twitter._

import org . apache . spark . streaming . {Seconds, StreamingContext}

import org . apache . spark . SparkConf

object TwitterTags{

def main(args: Array[String]){

if(args.length < 5 ){

System. err. println (“Usage : Twitter Popular Tags <consumer key> <consumer secret>”+”<access token><access token secret>[<filters]”)



StreamingExamples . setStreamingLogLevels()

val Array ( consumerKey, consumerSecret, accessToken, accessTokenSecret ) = args.take(5)

val filters = args . takeRight(args.length – 5)

//Set the system properties so that Twitter 4j library used by twitter stream

//Can we use them to generate OAuth(Open Authentication) credentials

System . setProperty (“twitter4j . oauth . consumerKey”, consumerKey)

System . setProperty( ” twitter4j . oauth . consumerSecret”, consumerSecret)

System. setProperty(“twitter4j.oauth.accessToken”,accessToken)

System. setProperty (” twitter4j . oauth. accessTokenSecret”, accessTokenSecret)

val sparkConf = new SparkConf(). setAppName(“TwitterTags”)

val scc=new StreamingContext (sparkConf, Seconds(3))

val stream = TwitterUtils.createStream ( scc, None,filters)

val hashTags =stream. flatMap (status = > status. getText. split(” “).filter(_.startsWith(“#”)))

val topCounts = hashTags. map((_, 1).reduceByKeyAndWindow(_+_, Seconds(60)).map{case (topic, count)=>(count, topic)}.transform(_.sortByKey(false))

val topCounts1 = hashTags . map((_, 1). reduceByKeyAndWindow(_+_, Seconds(30)).map{case (topic, count)=>(count, topic)}.transform(_.sortByKey(false))

//Print Popular hashtags

topCounts . foreachRDD (rdd = > {

val topList  =  rdd. take(30)

println (“\n Popular topics in last 60 seconds(%s total): “.format ( rdd . count()))

topList . foreach {case(count ,tag) => println(“%s(%s tweets)”.format(tag,count))


topCounts . foreachRDD (rdd = > {

val topList  =  rdd. take(60)

println(“\n Popular topics in last 30 seconds(%s total): “.format(rdd. count()))

topList . foreach {case(count ,tag)=>println(“%s(%s tweets)”. format(tag,count)))



scc . start()

scc . awaitTermination()



Spark Streaming Use Case

Spark Streaming Use Case with Explanation:

Using Scala streaming imports

import org. apache. spark. streaming. StreamingContext

import org. apache. spark. streaming. StreamingContext._

import org. apache. spark. streaming.dstream . DStream

import org. apache. spark. streaming.Duration

import org. apache. spark. streaming.Seconds

Spark Streaming Context :

This is also sets up underlying SparkContext that it will use to process data. It takes as input a batch interval specifying how often to process new data


We use socketTextStream() to create a DStream based on text data received on the local machine

Then we transform the DStream with filter() to get only the lines that contains error. Output operation print() to print some of the filtered lines.

Create a Streaming Context with a 1 – second batch size frin a SparkConf

val scc=new StreamingContext(conf, Seconds(1))

// Create DStream using data received after connecting to default port on the local machine

val lines = scc.socketTextStream(“localhost”, 9000)

//Filter our DStream for lines with “error”

var errorLines = lines.filter(_.contains(“error”))

//Print out the lines with errors

errorLines. print()

Above an example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in below figure Input DStreams are DStreams representing the stream of input data received from streaming process. In the above example of converting streaming of lines information words, lines was an input DStream as it represented the stream if data received from the server.

Every input DStream is associated with a Receiver object whether Java, Scala etc. Which receives the data from a source and stores it in Spark’s memory for processing. Here Spark Streaming provides two categories :

  1. Basic Sources: Sources directly available in the Streaming Context API example: file systems, socket connections
  2. Advanced Source: Sources indirectly available  like Flume, Kafka, Twitter etc. are available through extra utility classes.

Spark Streaming with Pictures

Spark Streaming:

Spark Streaming is a Spark’s module for real-time applications(Twitter tweets, statistics, page views). Lets user writes streaming applications using a very similar API to batch jobs. Spark Streaming is a distributed data stream processing framework. It makes it easy to develop distributed applications for processing live data streams in real time. It only provides a simple programming model but also enables an application to process high-velocity stream data. It also allows the combining of data streams and data for processing.

Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Flume, Kafka etc can be processed using complex algorithms expressed with high-level functions like map, reduce. Processed data can be pushed out to file systems and live dashboards.

Process Flow in Spark Streaming:

Spark Streaming receives live input data streams and divides the data into batches. Spark Engine will process the same data. Once processing is done Spark engine will generate the final stream of outputs in batches.


Streaming Context

Streaming Context , a class defined in the Spark Streaming library, is the main entry point into the Spark Streaming library. It allows a Spark Streaming application to connect to  a Spark Cluster.

Streaming Context provides methods for creating an instance of the data stream abstraction provided by Spark Streaming.

Every Spark Streaming application must create an instance of this class

import org. apache. spark._

import org. apache. spark. streaming._

val config = new Spark Conf(). setMaster (“spark : // host : port”) . setAppName (“Streaming app”)

val batch = 20

val ssc = new Streaming Context(conf, Seconds(batch)

The batch size can be as small as 500 milliseconds. The upper bound for the batch size is determined by the latency requirement of your application and the available memory in spark streaming.