Interview Questions on Spark Core

Spark Core interview questions:

1) What is Spark? Explain it briefly.

Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data. Spark provides a simple programming interface that enables an application developer to easily use memory, CPU, and storage resources across a cluster of servers to process large data sets.

2) What is an RDD? Explain its properties.

A Resilient Distributed Dataset (RDD) represents a collection of partitioned data elements that can be operated on in parallel. RDD is the primary data abstraction in Spark, defined as an abstract class in the Spark library. It is similar to a Scala collection and supports lazy evaluation.
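As a minimal illustration (assuming a SparkContext named sc is already available, as in the Spark shell), an RDD can be created from a Scala collection and then operated on in parallel, one task per partition:

// Assumes `sc` is an existing SparkContext (e.g. the one provided by the Spark shell).
val numbers = sc.parallelize(1 to 100, numSlices = 4)  // RDD split across 4 partitions

// map() is a lazy transformation; it runs in parallel, one task per partition.
val squares = numbers.map(n => n * n)

// collect() is an action; it triggers the distributed computation and returns the results.
val result = squares.collect()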

3) What is lazy evaluation, and why is Spark lazily evaluated?

Spark is a lazily evaluated system because of the way it computes RDDs. Although you can define new RDDs at any time, Spark computes them only lazily, that is, the first time they are used in an action. This approach might seem unusual at first, but it makes a lot of sense when you are working with Big Data.
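A small sketch of this behaviour in the Spark shell (assuming sc is available; "logs.txt" is a hypothetical input file): the transformations only record what to do, and nothing is read or computed until the action runs.

// Assumes `sc` is an existing SparkContext; "logs.txt" is a hypothetical input file.
val lines  = sc.textFile("logs.txt")            // lazy: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))  // lazy: only the transformation is recorded

// Only this action triggers reading the file and running the filter.
val numErrors = errors.count()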



4) What is the SparkContext?

SparkContext is a class defined in the Spark library and the main entry point into the Spark library. A SparkContext runs in a program called the driver program, which is the main program of a Spark application.
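A minimal sketch of a driver program that creates its own SparkContext (the master URL "local[2]" and the application name here are illustrative placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object MyDriver {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext, the entry point into Spark.
    val conf = new SparkConf().setMaster("local[2]").setAppName("MyDriver")
    val sc   = new SparkContext(conf)

    println(sc.parallelize(1 to 10).sum())  // a small job submitted by the driver

    sc.stop()  // release resources when the driver finishes
  }
}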

5) What are narrow and wide dependencies in an RDD?

Narrow Dependencies:

In a narrow dependency, each parent partition contributes data to a single child partition. A sequence of operations involving only narrow dependencies can be pipelined.

Wide Dependencies:

In a wide dependency, each parent partition contributes data to multiple child partitions. A wide dependency requires a shuffle, which is an expensive operation in a distributed system.
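A sketch that shows both kinds of dependencies (assuming sc exists): map() creates a narrow dependency, because each child partition reads from exactly one parent partition, while reduceByKey() creates a wide dependency, because values for the same key may live in different parent partitions and must be shuffled.

// Assumes `sc` is an existing SparkContext.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 3)

// Narrow dependency: map() can be pipelined with other narrow operations.
val doubled = pairs.map { case (k, v) => (k, v * 2) }

// Wide dependency: reduceByKey() requires a shuffle across the cluster.
val sums = doubled.reduceByKey(_ + _)

sums.collect()  // action that triggers both stages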

6) What are the components of the Spark compute engine?

The Spark compute engine is a data-parallel engine for data processing. It is divided into three components:

1. Driver
2. Cluster Manager
3. Executor


Why Spark Is Lazily Evaluated and How RDDs Are Fault Tolerant

Why is Spark lazily evaluated?

Spark is a lazily evaluated system because of the way it computes RDDs. Although you can define new RDDs at any time, Spark computes them only lazily, that is, the first time they are used in an action. This approach might seem unusual at first, but it makes a lot of sense when you are working with Big Data.

How are RDDs fault tolerant?

RDDs are designed to be fault tolerant and represent data distributed across a cluster of nodes. The probability of a node failing is proportional to the number of nodes in a cluster: the larger the cluster, the higher the probability that some node will fail on any given day. An RDD automatically handles such node failures: when a node fails and the partitions stored on that node become inaccessible, Spark reconstructs the lost RDD partitions on another node.

Spark stores lineage information for each RDD. Using this lineage information, it can recover parts of an RDD, or even an entire RDD, in the event of node failures.
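As a small illustration (assuming sc exists), the lineage Spark records for an RDD can be inspected with toDebugString; this is the information Spark replays to recompute lost partitions:

// Assumes `sc` is an existing SparkContext.
val base     = sc.parallelize(1 to 1000)
val filtered = base.filter(_ % 2 == 0)
val pairs    = filtered.map(n => (n % 10, n))

// Prints the chain of transformations (the lineage) that produced this RDD.
// If a partition is lost, Spark re-runs this chain on another node.
println(pairs.toDebugString)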



Spark’s RDDs are, by default, recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().

In the Spark shell, you can call the persist() method on an RDD after creating it, for example on an RDD named filteredName:

ex: scala> filteredName.persist()

Every Spark program and shell session will work as follows:

1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations such as filter().
3. Ask Spark to persist() any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel computation, which Spark then optimizes and executes.

In Spark, the cache() method is the same as calling persist() with the default storage level.
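A short sketch of persist() and cache() (assuming sc exists; "input.txt" is a hypothetical file). StorageLevel lets you choose where the reused RDD is kept:

import org.apache.spark.storage.StorageLevel

// Assumes `sc` is an existing SparkContext; "input.txt" is a hypothetical file.
val lines    = sc.textFile("input.txt")
val filtered = lines.filter(_.nonEmpty)

filtered.cache()                                   // equivalent to persist(StorageLevel.MEMORY_ONLY)
// filtered.persist(StorageLevel.MEMORY_AND_DISK)  // or choose an explicit storage level

// Both actions reuse the persisted partitions instead of re-reading the input.
val total = filtered.count()
val first = filtered.first()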


Word Count Use Case in Spark




First, how to initialize a SparkContext:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val conf = new SparkConf().setMaster("local").setAppName("APP")
val sc = new SparkContext(conf)

 

Note: The application name, namely APP in these examples, identifies your application on the cluster manager’s UI. The cluster URL, namely local in these examples, tells Spark how to connect to a cluster.

 

Word Count Use Case Using SparkContext in Scala

// Create a Scala SparkContext.
val conf = new SparkConf().setAppName("Word Count")
val sc = new SparkContext(conf)

// Load our input data.
val input = sc.textFile(inputFile)

// Split into words.
val words = input.flatMap(line => line.split(" "))

// Transform into pairs and count.
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }

// Save the word counts back out to a text file.
counts.saveAsTextFile(outputFile)

 



OTHER EXAMPLES IN SCALA:

// Create an RDD based on "data".
val data = 1 to 1000
val distData = sc.parallelize(data)

// Select the values less than 10.
distData.filter(_ < 10).collect()

// Base RDD
val lines = sc.textFile("localhost:54280/EmployeeLogs.txt")

// Transformed RDDs
val emp = lines.filter(_.startsWith("Emp"))
val messages = emp.map(_.split("\t")).map(r => r(1))
messages.cache()

messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("Hadoop")).count()

Spark Architecture




1. SparkContext:

SparkContext is a class defined in the Spark library and the main entry point into the Spark library. A SparkContext runs in a program called the driver program, which is the main program of a Spark application.

A Spark application must create an instance of the SparkContext class.

An application can have only one active instance of SparkContext. An instance of SparkContext can be created as shown below:

val sc = new SparkContext()

Here, SparkContext gets configuration settings such as the address of the Spark master, the application name, and other settings from system properties. These settings can also be passed explicitly through a SparkConf:

val config = new SparkConf().setMaster("localhost:port").setAppName("Spark")
val sc = new SparkContext(config)

2. Cluster Manager:

The cluster manager allocates resources across the applications running on the cluster. The SparkContext can connect to several types of cluster managers.

3. Executors:

Spark acquires executors on nodes in the cluster; these are processes that run computations and store data for the application. The cluster manager then sends your application code to the executors.

Different Cluster Managers in Spark Architecture:

In the Spark architecture, there are three types of cluster managers:

A) Standalone Mode
B) Apache Mesos
C) YARN



A) Standalone Mode:

Standalone mode is Spark’s default cluster environment and the easiest way to run your Spark applications in a clustered environment. In this mode, the Spark Master acts as the resource manager, the Spark Workers act as the workers, and Spark allocates resources based on cores.
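A minimal sketch of connecting an application to a standalone master (the host and port are illustrative placeholders; a standalone master URL has the form spark://HOST:PORT):

import org.apache.spark.{SparkConf, SparkContext}

// "spark://master-host:7077" is a placeholder standalone master URL.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("StandaloneExample")
  .set("spark.cores.max", "4")   // cap the cores this application may use in standalone mode

val sc = new SparkContext(conf)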

B)Apache Mesos:

Apache Mesos is a general-purpose cluster manager that can run both analytics workloads and long-running services on a cluster.

C) YARN:

YARN is a cluster manager introduced in Hadoop 2.x that allows diverse data processing frameworks to run on a shared resource pool; it is typically installed on the same nodes as HDFS. Running Spark on YARN in these environments is useful because it lets Spark access HDFS data quickly, on the same nodes where the data is stored.
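To use Spark on Hadoop YARN, applications are typically launched with spark-submit (for example, with --master yarn), so the code itself does not hard-code a master URL. A minimal sketch (the HDFS path is a hypothetical example):

import org.apache.spark.{SparkConf, SparkContext}

// The master is not set here; it is supplied at launch time,
// e.g. spark-submit --master yarn --deploy-mode cluster ...
val conf = new SparkConf().setAppName("YarnExample")
val sc   = new SparkContext(conf)

// The executors run in YARN containers, close to the HDFS data.
println(sc.textFile("hdfs:///data/sample.txt").count())  // hypothetical HDFS path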



Apache Spark




Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data. Spark provides a simple programming interface that enables an application developer to easily use memory, CPU, and storage resources across a cluster of servers to process large data sets.

Spark is also an open-source distributed framework for big data processing that can be used from Java, Python, Scala, and R.

Key Features of Spark:

1. Fast
2. General Purpose
3. Scalable
4. Fault Tolerant

1. Fast:

When its data fits in memory, Spark can be up to 100 times faster than MapReduce.
Spark is faster than Hadoop MapReduce for two reasons:
1. It implements an advanced execution engine.
2. It supports in-memory cluster computing.



Spark does not automatically cache input data in memory. A common misconception is that Spark cannot be used if the input data does not fit in memory; this is not true.
Spark can process terabytes of data on a cluster that has only 100 GB of total memory.

2. General Purpose:

Apache Spark supports many different types of data processing jobs. It can be used for:
1. Stream processing
2. Batch processing
3. Machine learning
4. Graph computing
5. Interactive processing

3. Scalable:

Spark is scalable because the data processing capacity of a Spark cluster can be increased simply by adding more nodes to the cluster. No code change is required when you add a node to a Spark cluster.

4. Fault Tolerant:

Apache Spark is fault tolerant because, in a cluster of a few hundred nodes, the probability of some node failing on any given day is high: a hard disk may crash or some other hardware problem may occur.
Spark automatically handles the failure of a node in a cluster.




The basic difference between Spark and Hadoop MapReduce is that Spark keeps intermediate data in memory, while MapReduce writes intermediate results to disk between stages.
