Spark Coding Test for Spark Developers

Here are two programs for Spark developers:



Question 1:

Mr. Bolt is in his 60s and loves traveling. He recently visited a country famous for its pens. He has ‘A’ grandchildren. He went to a pen shop to purchase pens for them. The shopkeeper showed him ‘a’ varieties of pens, each variety containing ‘b[i]’ pens.

He has to select ‘c’ varieties of pens as a set in such a way that all the ‘A’ grandchildren get the same number of pens. If there is more than one such set, the one with the minimum number of pens per child should be returned.

Inputs:

input 1: Value of ‘A’

input 2: Value of ‘c’

input 3: Value of ‘a’

input 4: Values in the array ‘b’

Output:

Return the minimum number of pens each grandchild should get. Return -1 if no solution is possible.

Example:

Inputs:

input 1 : 5

input 2 : 3

input 3: 5

input 4 : {1,2,3,4,5}

 

Output: 2

Explanation: He can purchase pens in two possible sets, {2,3,5} and {1,4,5}. The sum of each set is 10. Therefore, he will be able to give 2 pens to each of his grandchildren.
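A minimal sketch of one possible solution in plain Scala, following the statement's names ‘A’, ‘c’, and ‘b’ (the object and method names are just for illustration):

// Hypothetical helper: enumerate every set of 'c' varieties, keep the sets whose total
// is evenly divisible by the number of grandchildren 'A', and return the minimum
// pens per child, or -1 if no such set exists.
object PenSets {
  def minPensPerChild(A: Int, c: Int, b: Array[Int]): Int = {
    val validTotals = b.toList
      .combinations(c)                    // every set of 'c' varieties
      .map(_.sum)                         // total pens in the set
      .filter(total => total % A == 0)    // must divide evenly among the grandchildren
      .toList
    if (validTotals.isEmpty) -1 else validTotals.min / A
  }

  def main(args: Array[String]): Unit = {
    // Example from the problem: A = 5, c = 3, b = {1,2,3,4,5} => prints 2
    println(minPensPerChild(5, 3, Array(1, 2, 3, 4, 5)))
  }
}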

 

Question 2:

You just got a new job, but your new office has a different rule. They allow taking breaks in between tasks if there is no task available, but the problem is that the tasks come randomly and sometimes you may be required to do them simultaneously.

On your first day, you are given a list of tasks with their starting and ending times. Find out the total break time you will get. Assume the ending time is always greater than the starting time.

Input:

Input 1: No. of tasks

Input 2: 2-D array in the form [10,11], each pair representing the starting and ending time of a task

Output:

Your function must return an integer representing the total break time

Example:




input 1: 4

input 2: {(6,8), (1,9), (2,4), (4,7)}

Output: 0
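A minimal sketch in plain Scala, assuming "break time" means the total length of the gaps between the merged task intervals (the interpretation that matches the example; names are illustrative):

object BreakTime {
  def totalBreakTime(tasks: Array[(Int, Int)]): Int = {
    val sorted = tasks.sortBy(_._1)          // sort tasks by start time
    var coveredEnd = sorted.head._2          // right edge of the work covered so far
    var breaks = 0
    for ((start, end) <- sorted.tail) {
      if (start > coveredEnd) breaks += start - coveredEnd   // a gap between tasks is a break
      coveredEnd = math.max(coveredEnd, end)
    }
    breaks
  }

  def main(args: Array[String]): Unit = {
    // Example from the problem: tasks (6,8),(1,9),(2,4),(4,7) cover 1 to 9 with no gap => 0
    println(totalBreakTime(Array((6, 8), (1, 9), (2, 4), (4, 7))))
  }
}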

The above programs can be implemented by Spark developers in Scala, Java, Python, or R.

Spark with Kafka Scenario for Spark Developers




Problems:

This scenario is a real-time example of Spark with Kafka for Spark developers.

Problem Statement:

A reality television game show has 7 players and runs for one complete day. The winner of the game is decided by the votes cast by the audience watching the show. At the end of the day, the winner is determined by the criteria detailed below.

Rules to cast vote:

1) Each unique user (let us assume each has an ID) can cast votes for the players

2) A user can cast a maximum of one vote every 2 minutes and is free to vote for a different player each time

3) If a user casts more than one vote within the span of two minutes, the latest vote will overwrite the previous vote.

Calculation criteria for the winner:

1) For every minute of the day, find the player with the maximum votes; that player gets one reward point for the minute.

2) At the end of the day, the player with the maximum reward points is the winner

 

Tasks:

1) Create a system which simulates user voting to a Kafka topic

2) A Spark Streaming job should process the stream data based on the rules mentioned above (a sketch follows this list)

3) The reward points for the players should be stored in a persistent system

4) Provide a query to find the winner.
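As one possible starting point for these tasks, here is a hedged sketch of the Spark side using Structured Streaming's Kafka source (it assumes the spark-sql-kafka connector is on the classpath). The topic name "votes", the broker address, and the "userId,playerId,timestamp" message format are assumptions for illustration; the 2-minute overwrite rule and the per-minute winner/reward-point query are left out and would be built on top of the persisted counts.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object VoteStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("GameShowVotes").getOrCreate()
    import spark.implicits._

    // 1) Read the raw vote stream from an assumed Kafka topic named "votes"
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
      .option("subscribe", "votes")
      .load()

    // 2) Parse an assumed "userId,playerId,timestamp" CSV payload from the Kafka value
    val votes = raw.selectExpr("CAST(value AS STRING) AS csv")
      .select(split($"csv", ",").as("f"))
      .select(
        $"f".getItem(0).as("userId"),
        $"f".getItem(1).as("playerId"),
        $"f".getItem(2).cast("timestamp").as("ts"))

    // 3) Count votes per player per one-minute window; the per-minute winner and the
    //    daily reward-point totals can then be computed from the persisted counts
    val perMinute = votes
      .withWatermark("ts", "2 minutes")
      .groupBy(window($"ts", "1 minute"), $"playerId")
      .count()

    perMinute.writeStream
      .outputMode("update")
      .format("console")   // swap for a persistent sink (e.g. a database) in a real system
      .start()
      .awaitTermination()
  }
}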





Apache Kafka is a common messaging and integration layer for Spark Streaming. Kafka acts as the central hub for real-time streams of data, which are then processed by Spark Streaming.

The above scenario asks for coding in Spark with Kafka. Spark developers can implement it in Scala or Python, depending on their programming knowledge. Nowadays it is also an important scenario in the IT industry and for the CCA-175 certification.

Spark SQL with Examples





HiveContext provides a superset of the functionality provided by SQLContext. The parser that comes with HiveContext is more powerful than the SQLContext parser: it can execute both HiveQL (Hive Query Language) and SQL queries, and it can read data from Hive tables. It also allows applications to access Hive UDFs (User Defined Functions). If we want to process existing Hive tables, we add the hive-site.xml file to Spark's classpath; HiveContext reads the Hive configuration from that file.
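A brief sketch of creating a HiveContext and querying an existing Hive table, assuming a Spark 1.x shell where sc is available and hive-site.xml is on Spark's classpath (the "employees" table name is made up for illustration):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)    // picks up the Hive configuration from hive-site.xml
val employees = hiveContext.sql("SELECT * FROM employees LIMIT 10")   // hypothetical Hive table
employees.show()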

Data Frames: A DataFrame is Spark SQL's primary data abstraction. It represents a distributed collection of rows organized into named columns. It is similar to a table in a relational database.

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

Few points for Spark SQL:

1. A Data Frame is a distributed collection of data organized into named columns.

2. It is conceptually equivalent to a table in a relational database

3. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs (by converting an RDD to a DataFrame).

Spark SQL provides an implicit conversion method named toDF, which creates a DataFrame.

For an RDD of objects represented by a case class, Spark SQL infers the schema of the data set from the case class. The toDF method is not defined in the RDD class, but it is available through an implicit conversion. To convert an RDD to a DataFrame using toDF, we need to import the implicit methods.

example:

scala> val data = sc.parallelize(1 to 100)

scala> val newData = data.map(l => (l, l - 10))

scala> val resultData = newData.toDF("normal", "transformed")

scala> resultData.printSchema

scala> resultData.show

The above example is very simple for Spark beginners.
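The earlier paragraph also describes inferring the schema from a case class; a brief shell sketch of that variant is below (the Employee case class and its values are made up for illustration, and the required implicits are typically already imported in the spark-shell):

scala> case class Employee(name: String, age: Int)

scala> val empRDD = sc.parallelize(Seq(Employee("Asha", 30), Employee("Ravi", 35)))

scala> val empDF = empRDD.toDF()    // schema (name: string, age: int) is inferred from the case class

scala> empDF.printSchema

scala> empDF.show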

Interview Questions on Spark Core




Spark Core interview questions:

1)What is Spark and explain briefly?

Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data. Spark provides a simple programming interface, which enables an application developer to easily use the memory, CPU, and storage resources across a cluster of servers for processing large data sets.

2)What is RDD and explain RDD properties?

A Resilient Distributed Dataset represents a collection of partitioned data elements that can be operated on in parallel. RDD is the primary data abstraction mechanism in Spark and is defined as an abstract class in the Spark library. It is similar to a Scala collection and it supports lazy evaluation.

3)What is Lazy evaluation, Why Spark is Lazy Evaluated?

Spark is a "lazy evaluated" system because of the way it computes RDDs. Although you can define new RDDs at any time, Spark computes them only in a lazy way, that is, the first time they are used in an action. This approach might seem unusual at first, but it makes a lot of sense when you are working with big data.
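For example, in the shell nothing is computed until an action runs (the numbers below are just an illustration):

scala> val nums = sc.parallelize(1 to 1000000)    // no computation yet
scala> val evens = nums.filter(_ % 2 == 0)        // transformation: only recorded, still lazy
scala> evens.count()                              // action: this is what triggers the computation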

4)What is the Spark Context?

SparkContext is a class defined in the Spark library and is the main entry point into the Spark library. The SparkContext runs in a program called the "driver program", which is the main program in Spark.

5)What are narrow and wide dependencies in RDD?

Narrow Dependencies:

In an RDD, each parent partition contributes data to a single child partition, and a sequence of operations involving narrow dependencies can be pipelined.




Wide Dependencies:

In an RDD, each parent partition contributes data to multiple child partitions, which requires a shuffle, an expensive operation in a distributed system.
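A small illustrative shell sketch: map() is a narrow dependency (no shuffle), while reduceByKey() is a wide dependency (requires a shuffle):

scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
scala> val doubled = pairs.map { case (k, v) => (k, v * 2) }   // narrow: stays within each partition
scala> val summed = doubled.reduceByKey(_ + _)                 // wide: data is shuffled across partitions
scala> summed.collect()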

6)What are the components of the Spark Compute Engine?

The Spark compute engine is a data-parallel engine for data processing. It is divided into three components:

1. Driver

2. Cluster Manager

3. Executor

Why Spark is Lazy Evaluated and How RDDs are Fault Tolerant




Why is Spark Lazy Evaluated?

Spark is a "lazy evaluated" system because of the way it computes RDDs. Although you can define new RDDs at any time, Spark computes them only in a lazy way, that is, the first time they are used in an action. This approach might seem unusual at first, but it makes a lot of sense when you are working with big data.

How are RDDs Fault Tolerant?

An RDD is designed to be fault tolerant and represents data distributed across a cluster of nodes. The probability of a node failing is proportional to the number of nodes in a cluster: the larger a cluster, the higher the probability that some node will fail on any given day. An RDD automatically handles node failures; when a node fails and the partitions stored on that node become inaccessible, Spark reconstructs the lost RDD partitions on another node.

Spark stores lineage information for each RDD. Using this lineage information, it can recover parts of an RDD, or even an entire RDD, in the event of node failures.

RDD.persist()




Spark's RDDs are by default recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().

Go to the Spark shell and call the .persist() method on an RDD after creating it, for example on an RDD named filteredName:

ex: scala> filteredName.persist()

Every Spark program and shell session will work as follows:

Create some input RDDs from external data. Transform them to define new RDDs using transformations like filter().

Ask Spark to persist() any intermediate RDDs that will need to be reused. Then launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.

In Spark, the cache() method is the same as calling persist() with the default storage level.
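A minimal shell sketch of this pattern, with "input.txt" and the filter condition as placeholders:

scala> val lines = sc.textFile("input.txt")              // placeholder input path
scala> val errors = lines.filter(_.contains("ERROR"))    // transformation only, nothing computed yet
scala> errors.persist()                                  // ask Spark to keep the filtered RDD in memory
scala> errors.count()                                    // first action computes and caches the RDD
scala> errors.first()                                    // this action reuses the cached partitions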

Word Count Use Case in Spark





First, how to initialize the SparkContext:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val conf = new SparkConf().setMaster("local").setAppName("APP")
val sc = new SparkContext(conf)

 

Note: An application name, namely "APP" in these examples, identifies your application on the cluster manager's UI, and a cluster URL, namely "local" in these examples, tells Spark how to connect to a cluster.

 

#) Word Count Use Case Using SparkContext in Scala

//Create a Scala SparkContext
val conf = new SparkConf().setAppName("Word Count")
val sc = new SparkContext(conf)

//Load our input data
val input = sc.textFile(inputFile)

//Split into words
val words = input.flatMap(line => line.split(" "))

//Transform into pairs and count
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }

//Save the word count back out to a text file
counts.saveAsTextFile(outputFile)

Other Examples in Scala:

//create an RDD based on "data"
val data = 1 to 1000
val distData = sc.parallelize(data)

//select the values less than 10
distData.filter(_ < 10).collect()

//base RDD
val lines = sc.textFile("localhost:54280/EmployeeLogs.txt")

//transformed RDDs
val emp = lines.filter(_.startsWith("Emp"))
val messages = emp.map(_.split("\t")).map(r => r(1))

messages.cache()

messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("Hadoop")).count()

Spark Architecture

1. Spark Context:

SparkContext is a class defined in the Spark library and is the main entry point into the Spark library. The SparkContext runs in a program called the "driver program", which is the main program in Spark.




A Spark application must create an instance of the SparkContext class.

An application can have only one active instance of SparkContext. An instance of SparkContext can be created as shown below:

val sc = new SparkContext()

Here SparkContext gets configuration settings like the address of the Spark master, application name, and other settings from system properties.

val config = new SparkConf().setMaster("localhost:port").setAppName("Spark")

val sc = new SparkContext(config)

2. Cluster Manager:

It allocates resources across applications to run on a cluster; the SparkContext can connect to several types of cluster managers.

3. Executors:

Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for the application; the cluster manager then sends your application code to the executors.

Different Cluster Managers in Spark Architecture:

In Spark Architecture there are 3 types of Cluster Managers:

A) Standalone Mode

B) Apache Mesos

C) YARN

A) Standalone Mode:

This is Spark's default cluster environment and the easiest way to run your Spark applications in a clustered environment. Here the Spark Master acts as the resource manager and the Spark Worker acts as the worker in standalone mode. In this mode, Spark allocates resources based on cores.

B) Apache Mesos:

Apache Mesos is a general-purpose cluster manager that can run both analytics workloads and long-running services on a cluster.




C) YARN:

YARN is a cluster manager introduced in Hadoop 2.x that allows various data processing frameworks to run on a shared resource pool and is typically installed on the same nodes as HDFS.

Running Spark on YARN in these environments is useful because it lets Spark access HDFS data quickly, on the same nodes where the data is stored. To use Spark on Hadoop YARN, the application's master is set to YARN, as in the sketch below.
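A hedged sketch of how the choice of cluster manager appears in the master URL passed to SparkConf (host names and ports below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ClusterManagerDemo")
  // .setMaster("local[*]")                  // run locally using all cores (no cluster manager)
  // .setMaster("spark://master-host:7077")  // A) Standalone mode
  // .setMaster("mesos://mesos-host:5050")   // B) Apache Mesos
  .setMaster("yarn")                         // C) YARN; cluster details come from the Hadoop configuration
val sc = new SparkContext(conf)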

What is Apache Spark? Features of Spark and the Difference between Hadoop and Spark

Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data. Spark provides a simple programming interface, which enables an application developer to easily use the memory, CPU, and storage resources across a cluster of servers for processing large data sets.




Spark is also an open-source distributed framework for big data processing, with APIs in Java, Python, Scala, and R.

Key Features of Spark:

1. Fast
2. General Purpose
3. Scalable
4. Fault-Tolerant

1. Fast:

When the data fits in memory, Spark can be up to 100 times faster than MapReduce.
Spark is faster than Hadoop MapReduce for two reasons:
1. It implements an advanced execution engine
2. It allows in-memory cluster computing
Spark does not automatically cache input data in memory. A common misconception is that Spark cannot be used if the input data does not fit in memory; this is not true. Spark can process terabytes of data on a cluster that may have only 100 GB of total cluster memory.

2. General Purpose:

Apache Spark supports different types of data processing jobs. It can be used for:




1. Stream Processing
2. Batch Processing
3. Machine Learning
4. Graph Computing
5. Interactive Processing

3. Scalable:

Spark is scalable because the data processing capacity of a Spark cluster can be increased by just adding more nodes to the cluster. No code change is required when you add a node to a Spark cluster.

4. Fault-Tolerant:

Apache Spark is fault tolerant because, in a cluster of a few hundred nodes, the probability of a node failing on any given day is high: a hard disk may crash or some other hardware problem may occur. Spark automatically handles the failure of a node in a cluster.

Here is the basic difference between Spark and Hadoop MapReduce: