Machine Learning

What is Machine Learning?

Machine Learning is a branch of artificial intelligence (AI) that focuses on the development of computer programs that can teach to grow and change exposed to new data. It is concerned with the design and development of algorithms that can take complex input data and can make an intelligent decision based on the input data.

Types of Machine Learning:

1. Supervised Learning :

This type of machine learning is concerned with associating some undefined data document to some predefined label of the training data in order to predict the value of any valid input.  Common examples of supervised learning include classifying e-mail message as spam. Mainly all classification algorithms. Example Mahout.

2.Unsupervised Learning :

Unsupervised learning type of machine learning is concerned with making sense out of complex hard to understand data by creating some similarity or some interesting patterns. No labels are associated with it. Mainly all clustering algorithms in mahout for example.

3.Semi-Supervised Learning :

Semi-supervised learning is concerned with defining an undefined data document in the presence of both labeled and unlabeled data. It is a combination of supervised and unsupervised learning. The main aim of semi-supervised learning is to demonstrate how combining both labeled and unlabeled data can change learning behavior.

Spark Streaming Twitter Example

Spark Streaming Twitter Example:

//Using Scala Program

package org . apache . spark . demo . streaming

import org . apache . spark . streaming . SparkContext._

import org . apache . spark . streaming . twitter._

import org . apache . spark . streaming . {Seconds, StreamingContext}

import org . apache . spark . SparkConf

object TwitterTags{

def main(args: Array[String]){

if(args.length < 5 ){

System. err. println (“Usage : Twitter Popular Tags <consumer key> <consumer secret>”+”<access token><access token secret>[<filters]”)



StreamingExamples . setStreamingLogLevels()

val Array ( consumerKey, consumerSecret, accessToken, accessTokenSecret ) = args.take(5)

val filters = args . takeRight(args.length – 5)

//Set the system properties so that Twitter 4j library used by twitter stream

//Can we use them to generate OAuth(Open Authentication) credentials

System . setProperty (“twitter4j . oauth . consumerKey”, consumerKey)

System . setProperty( ” twitter4j . oauth . consumerSecret”, consumerSecret)

System. setProperty(“twitter4j.oauth.accessToken”,accessToken)

System. setProperty (” twitter4j . oauth. accessTokenSecret”, accessTokenSecret)

val sparkConf = new SparkConf(). setAppName(“TwitterTags”)

val scc=new StreamingContext (sparkConf, Seconds(3))

val stream = TwitterUtils.createStream ( scc, None,filters)

val hashTags =stream. flatMap (status = > status. getText. split(” “).filter(_.startsWith(“#”)))

val topCounts = hashTags. map((_, 1).reduceByKeyAndWindow(_+_, Seconds(60)).map{case (topic, count)=>(count, topic)}.transform(_.sortByKey(false))

val topCounts1 = hashTags . map((_, 1). reduceByKeyAndWindow(_+_, Seconds(30)).map{case (topic, count)=>(count, topic)}.transform(_.sortByKey(false))

//Print Popular hashtags

topCounts . foreachRDD (rdd = > {

val topList  =  rdd. take(30)

println (“\n Popular topics in last 60 seconds(%s total): “.format ( rdd . count()))

topList . foreach {case(count ,tag) => println(“%s(%s tweets)”.format(tag,count))


topCounts . foreachRDD (rdd = > {

val topList  =  rdd. take(60)

println(“\n Popular topics in last 30 seconds(%s total): “.format(rdd. count()))

topList . foreach {case(count ,tag)=>println(“%s(%s tweets)”. format(tag,count)))



scc . start()

scc . awaitTermination()



Spark Streaming Use Case

Spark Streaming Use Case with Explanation:

Using Scala streaming imports

import org. apache. spark. streaming. StreamingContext

import org. apache. spark. streaming. StreamingContext._

import org. apache. spark. streaming.dstream . DStream

import org. apache. spark. streaming.Duration

import org. apache. spark. streaming.Seconds

Spark Streaming Context :

This is also sets up underlying SparkContext that it will use to process data. It takes as input a batch interval specifying how often to process new data


We use socketTextStream() to create a DStream based on text data received on the local machine

Then we transform the DStream with filter() to get only the lines that contains error. Output operation print() to print some of the filtered lines.

Create a Streaming Context with a 1 – second batch size frin a SparkConf

val scc=new StreamingContext(conf, Seconds(1))

// Create DStream using data received after connecting to default port on the local machine

val lines = scc.socketTextStream(“localhost”, 9000)

//Filter our DStream for lines with “error”

var errorLines = lines.filter(_.contains(“error”))

//Print out the lines with errors

errorLines. print()

Above an example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in below figure Input DStreams are DStreams representing the stream of input data received from streaming process. In the above example of converting streaming of lines information words, lines was an input DStream as it represented the stream if data received from the server.

Every input DStream is associated with a Receiver object whether Java, Scala etc. Which receives the data from a source and stores it in Spark’s memory for processing. Here Spark Streaming provides two categories :

  1. Basic Sources: Sources directly available in the Streaming Context API example: file systems, socket connections
  2. Advanced Source: Sources indirectly available  like Flume, Kafka, Twitter etc. are available through extra utility classes.

Smart way of looping in JavaScript

Whenever there is a scenario of looping the first thing that strikes our mind is for. But actually, there are clean and better ways to loop without using for in javascript.


ECMAScript 5 provided methods forEach, map, reduce, filter and ECMAScript 6 provided find method that resides on Array prototype. Therefore depending on the scenario, we can use appropriate methods instead of for loop.

    • forEach: forEach() method calling a provided function on every array element.
      • Demo Array.forEach forEach
    • map: map() method creates a new array with the result of calling a provided function on each element. 
      • Demo map


  • reduce: reduce() method results a single value that is a result of calling a reducer function on each element.
    • Demo Array.reduce()js reducer method
  • filter: filter() method creates a new array with all elements that pass a test implemented by the provided function
    • Demo Array.filter()filter method
  • find: method returns the value of the first element in an array that passes a test implemented by the provided function
    • Demo Array.find()

Who is better at performance?

Obviously for loops are faster than Javascript methods(map, reduce, filter, find)  because these methods can have extra overhead, behind the scene even these methods could be using for loop.

Demo reducer vs forfor vs reduce

If for loops are faster then why javascript methods?

These methods are self-explanatory and using the appropriate method for the use case can help the team understand what you are doing.

Spark Streaming with Pictures

Spark Streaming:

Spark Streaming is a Spark’s module for real-time applications(Twitter tweets, statistics, page views). Lets user writes streaming applications using a very similar API to batch jobs. Spark Streaming is a distributed data stream processing framework. It makes it easy to develop distributed applications for processing live data streams in real time. It only provides a simple programming model but also enables an application to process high-velocity stream data. It also allows the combining of data streams and data for processing.

Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Flume, Kafka etc can be processed using complex algorithms expressed with high-level functions like map, reduce. Processed data can be pushed out to file systems and live dashboards.

Process Flow in Spark Streaming:

Spark Streaming receives live input data streams and divides the data into batches. Spark Engine will process the same data. Once processing is done Spark engine will generate the final stream of outputs in batches.


Streaming Context

Streaming Context , a class defined in the Spark Streaming library, is the main entry point into the Spark Streaming library. It allows a Spark Streaming application to connect to  a Spark Cluster.

Streaming Context provides methods for creating an instance of the data stream abstraction provided by Spark Streaming.

Every Spark Streaming application must create an instance of this class

import org. apache. spark._

import org. apache. spark. streaming._

val config = new Spark Conf(). setMaster (“spark : // host : port”) . setAppName (“Streaming app”)

val batch = 20

val ssc = new Streaming Context(conf, Seconds(batch)

The batch size can be as small as 500 milliseconds. The upper bound for the batch size is determined by the latency requirement of your application and the available memory in spark streaming.

Parquet File with Example


Parquet is a columnar format that is supported by many other data processing systems, Spark SQL support for both reading and writing Parquet files that automatically preserves the schema of the original data. Parquet is a popular column-oriented storage format that can store records with nested fields efficiently. Parquet often used with tools in the Hadoop ecosystem and it supports all of the data types in Spark SQL.

Spark SQL provides methods for reading data directly to and from Parquet files.

Parquet is a columnar storage format for the Hadoop ecosystem. And has gotten good adoption due to highly efficient compression and encoding schemes used that demonstrate significant performance benefits. Its ground-up design allows it to be used regardless of any data processing framework, data model, and programming language used in Hadoop ecosystem including Map Reduce, Hive, Pig and Impala provided the ability to work with Parquet data and the number of data models such as AVRO, Thrift, etc have been expanded to be used with Parquet as storage format.

Parquet is widely adopted by a number of major companies including tech giants such as Social media to Save the file as parquet file use the method. people. saveAsParquetFile(“people.parquet”)

Example on Parquet file:

Scala > val parquet File = sql Context. parquet File(“/home/ sreekanth / SparkSQLInput /users.parquet”)

parquet File: org. apache. spark. sql. Data Frame=[name: string, favorite_hero: string, Favorite_color: string]

Scala > parquet File. register Temp Table(“parquet File”)

Scala>parquet File. print Schema


| name: string( nullable : false)

| favorite: hero( nullable : true)

| favorite_numbers( nullable : false)

Scala>val selected People = sql Context. sql (“SELECT name FROM parquet File”)

Scala> selected>”Name: ” + t(0)).collect().foreach( println )


Name: Alex

Name: Bob

Scala > val selected People = sql Context. sql (“SELECT name FROM parquet File”).show







How to Save the Data in a “Parquet File” format

Scala> val sql Context = new org. apache. spark. sql. SQL Context( sc )

sql Connect: org. apache. spark. sql. SQL Context=org. apache. spark. sql. SQL Context@ hf0sf

Scala> val data frame=sql“/home/ sreekanth /Spark SQL Input/users.parquet”)

data frame: org. apache. spark. sql. Data Frame=[name: string, favorite_hero:string, favorite_color:string]

data“name”, “favorite_hero”) And Fav Hero.parquet)

Hadoop Admin Roles and Responsibilities

Hadoop Admin Roles and Responsibilities:

Hadoop Administrator career is an excellent career and lot of growth opportunities because less amount of people and Hadoop is huge demand technology.

Hadoop Administrator is responsible for Hadoop Install and monitoring Cluster Management.

Roles and Responsibilities:

  1. Capacity Planning and Hardware requirement of the nodes, Network architecture and Planning.
  2. Hadoop Software Installation and configuration whether Cloudera Distribution or Horton Works distribution etc.
  3. Configuring Name Node, Data Nodes to ensure its high availability.
  4. Tuning of Hadoop Cluster and creating new users in Hadoop, handling permissions, performance upgrades.
  5. Hadoop Backup and Recovery tasks
  6. Every day finding out which jobs are taking more time, if users say that jobs are stuck to find out the reason.
  7. Health check of Hadoop cluster Monitoring
  8. Deployment in Hadoop Cluster and maintaining it.
  9. Support and maintenace of Hadoop Storage (HDFS)
  10. Security administration during installation and basic knowledge on Kerberos, Apache Knoz and Apache Ranger etc.
  11. Data migration between clusters if needed ex: using Falcon tool.
  12. Manage Hadoop Log files and analyzing failed jobs
  13. Troubleshoot Network and applications
  14. Knowledge on Scripting Skills on Linux environment
  15. Knowledge on Oozie, Hive , HCatalog and Hadoop Eco – System


Day to Day Activities of Hadoop Admin:

  1. Monitoring Console whether Cloudera Manager or Horton works and job tracker UI.
  2. HDFS Maintenance and Support
  3. Health check of Hadoop cluster monitoring
  4. Managing Hadoop log files and find out errors
  5. Managing users, permissions etc.
  6. Troubleshoot Network errors and application errors.

Skill sets required to become a Hadoop Administrator :

  1. Strong Knowledge on Linux/Unix
  2. Knowledge on Shell Scripting/Python Scripting
  3. Hands on Experience of Cluster Monitoring tools like Ambari, Gangila etc.
  4. Networking and Memory management

Summary: Hadoop Administration is one of the best careers in terms of growth and opportunities. Nowadays the Hadoop market is on rising. If you have knowledge on Linux and Database then admin it can be an advantage.

Spark Most Typical Interview Questions List

Apache SPARK Interview Questions List

    1. Why  RDD resilient?
    2. Difference between Persist and Cache?
    3. Difference between Lineage and DAG?
    4. What are narrow and wide transformations?
    5. What are Shared variables and it uses?
    6. How to define custom accumulator?
    7. If we have 50 GB memory and 100 GB data, how spark will process it?
    8. How to create UDFs in Spark?
    9. How to use hive UDFs in Spark?
    10. What are accumulators and broadcast variables?
    11. How to decide various parameter values in Spark – Submit?
    12. Difference between Coalesce and Repartition?
    13. Difference between RDD DATA FRAME and DATA SET. When to use one?
    14. What is Data Skew and how to fix it?
    15. Why shouldn’t we use group by the transformation in Spark?
    16. How to do Map side join in Spark?

1. What Challenges are faced in the Spark Project?

2.Use of map, flat map, map partition, for each partition?

3. What is Pair RDD? When to use them?

4.Performance optimization techniques in Spark?

5.Difference between Cluster and Client mode?

6. How to capture log in client mode and Cluster mode?

7. What happens if a worker node is dead?

8. What types of file format does Spark support? Which of them are most suitable for our organization needs?

Basic Spark Developer Interview Questions:

1.Difference between reduceByKey() and groupByKey()?

2.Difference between Spark 1 and Spark 2?

3. How do you debug Spark jobs?

4.Difference between Var and Val?

5. What size of file do you use for development?

6. How long will take to run your script in production?

7. Perform joins using RDD’s?

8. How do run your job in Spark?

9. What is the difference between the Spark data frame and the data set?

10. How data sets are type safe?

11. What are sink processors?

12.Lazy evaluation in Spark and its benefits?

13. After Spark – Submit,  Whats’s process runs behind of application?

14. How to decide no.of stages in Spark job?

Above questions are related to Spark developers for experienced and beginners.

Spark Coding Test for Spark Developers

Here two programs for spark developers :

Question 1:

Mr. Bolt is in his 60’s and loves traveling. He recently visited a country famous for its pens. He has  ‘A’ grandchildren. He went to open Pen shop to purchase pens for them. The shop keeper showed him ‘a’ varieties of pens each variety containing ‘b[i ]’ pens.

He has to select city ‘c’ varieties of pens in a set in such a way that all the ‘A’  grandchildren get the same number of pens. If there are more than one such sets, the one with the minimum number of pens per child should be returned.


input 1: Value of ‘X’

input 2: Value of ‘c’

input 3: Value of ‘a’

input 4: Values in the array ‘b’


Return the minimum number of pens each grandchild should get. Return -1 if no solution possible



input 1 : 5

input 2 : 3

input 3: 5

input 4 : {1,2,3,4,5}


Output: 2

Explanation : He can purchase pens in two sets {2,3,45} and {1,4,5}. The sum of each set is 10. Therefore, he will be able to give 2 pens to each of his grandchildren.


Question 2:

You just got a new job but your new office has a different rule. They allow taking interval breaks in between tasks if there is no task available but the problem is that the tasks com randomly and sometimes it may be required to do them simultaneously.

On your first day, you are given a list of tasks with their starting and ending time. Find out the total time you will get the breaks. Assurance ending time to be greater than starting time.


Input 1: No.of tasks

Input 2: 2-d array in for [10,11] representing starting and ending time period of the task


Your function must return an integer representing the total break time


input 1: 4

input 2: { (6,8)(1,9)(2,4)(4,7)}

Output: 0

Above Programs are related to Spark using SCALA, Java, Python and R languages for Spark developers.

Spark with Kafka Scenario for Spark Developers


This scenario is related to real-time example in Spark with Kafka for Spark developers.

Problem Statement:

A Reality Television in a Game show has 7 players, the game for one complete day, the winner of the game is decided by the votes cast by the audience watching the show. At the end of the day, the winner is decided by certain criteria which are detailed below.

Rules to cast vote:

1)Each unique user(let us assume has an ID) can cast vote for the players

2)The user can cast, maximum one vote every 2 minutes he has the liberty casting different players each time

3)If a user casts more than one vote in the spam of two minutes, the latest vote will overwrite the previous vote.

Calculation criteria for the winner:

1)Find the player who has maximum votes every minute of the day, the player with maximum votes for the minute will get one reward point.

2)At the end of the day player who has maximum reward points is the winner



1)Create a system which simulates user voting to a Kafka topic

2)Spark Streaming job should process the stream data and process the data based on the rules mentioned above

3)The reward points for the users should be stored in the persistent system

4)Provide a query to find the winner.

Apache Kafka is messaging and integration for Spark Streaming. Kafka acts as the central hub for real-time streams of data and is processed.

Above scenario asked for coding in Spark with Kafka. For Spark Developers will implement in SCALA or Python depends upon your programming knowledge. Nowadays the most important scenario in the IT industry for CCA – 175 also.