Functions in SCALA:

1. Higher Order Functions in Scala:

I) foreach ()

II) map ()

III) reduce ()

I) foreach() :

scala > val technologies = List ( “Hadoop”, “Java”, “Salesforce”);

technologies: List[String] = List (Hadoop, Java, Salesforce)

scala >  technologies . foreach ((t:String) = > ( println (t) ) );





II) map() :

scala > val technologies = List ( “Hadoop”, “Java”, “Salesforce”);

technologies: List[String] = List (Hadoop, Java, Salesforce)

scala >  val tSize = technologies . map ((c) = > ( c.size));

tSize : List [Int] = List (6, 4, 10)

scala > println (tSize);

List (6,4,10)

III) reduce() :

scala > val score = List (183,186,190,191)

score: List [Int] = List (183,186,190,191)

scala > val totalScore = score.reduce( (a : Int, b: Int) = > (a+b));

totalMarks: Int = 750

scala> val totalScore =score. reduce((a,b) => (a+b));

totalMarks: Int = 750

Anonymous Functions in SCALA:

Anonymous Functions mainly the combination of Function literals and Function Values.

Function literals : Anonymous functions in source code is called function literals 

Function Values : Function literals are instantiated into objects are called function values.

Example of Anonymous Function:


object Anonymous_Funct


def main (args: Array[String])


var add = (x: Int) => x+1;

var mul = (a:Int, b:Int) => a*b;

println (“Addition value :” + add(10));

println (“Multiple value : “+ mul (9, 10) );

println ( ” New Addition Value: ” + ( (10) – 5));

println ( ” New Multiplied Value: ” + ( mul  (10,9)+50 ) ) );




Scala> scalac Anonymous_Funct.scala

Scala> Anonymous_Funct

Addition value 11

Multiple value 19

New Addition Value 5

New Multiplied Value 140

Mahout in Machine Learning


Mahout is an open source by the Apache Software Foundation to implementations of all kinds of machine learning techniques with the goal of creating scalabe algorithms that are free to under the Apache license. In order to see the algorithms currently implemented in mahout type the following command in the terminal.

I)export MAHOUT_HOME = /Path



All these can be accessed the $MAHOUT_HOME /bin/mahout command-line driver. Each of these needs certain arguments as input to generate the output.

Mahout algorithms are Divided into 4 sections:

1.Collaborative filtering



4.Mahout utilities

1. Collaborative filtering:

Collaborative filtering is a machine learning technique used for generating recommendations. It uses information’s like ratings, user preference, etc.

Collaborative filtering basically tow ways of generating recommendations.

I) User – Based : In this recommend items by finding similar users. Example if a user purchased a computer and the second user has even purchased a computer along with other products then they are a supposed to be similar users and the items purchased by the second user other than the computer recommended to the first user. It is like the dynamic nature of users.

II) Item-Base: This item based recommendations calculate the similarity between items and creates a similarity matrix from which recommendation are generated Mahout provides a set of components from our own recommendation engine.

Examples of Collaborative Filtering Algorithms:-

Distributed Item – based Collaborative Filtering

Non-Distributed recommend

Collaborative Filtering using Matrix factorization.

Mahout Utilities :

In Mahout some algorithms , it helps in preparing content into formats for Mahout and are called MAHOUT UTILITIES.

Mahout Utilities mainly 3 categories

Creating Vectors from Text – In this utilities allow to produce Mahout Vector representations. There are mainly two utilities for converting a directory of text documents into Vector.

Creating text from Vectors – In this utilities allow to produce text from vectors.

Viewing Result – These utilities particularly is used for viewing the clusters generating by clustering algorithms.

How to Install HBASE on Hadoop Eco – System

HBASE:  Hadoop dataBASE

Apache HBase runs on top of Hadoop. It is a Database which is an open source, distributed, NoSQL database related. Hadoop can perform on batch processing and data will access only in a sequential manner leading with low latency but HBase internally uses Hash tables and provides random access, and stores the data in HDFS files that are indexed by their key for faster lookups thus providing high latency compared to Hadoop HDFS.

HBASE Installation on Ubuntu/Linux

Step 1: Download HBase 1.2.7. bin. tar. gz tarball from Apache Mirrors Website

Step 2: After Downloading tarball place the Downloaded Tar Ball in “Hadoop” director(Path: /home/sreekanth/Big_Data)

Step 3: Extract the Downloaded tar ball using below command:

tar -xzvf HBase-1.2.7-bin.tar.gz

Step 4: After Tar ball extraction we will get the hbase-1.2.7 directory

Step 5:  After that Update the HBASE_HOME and HBASE PATH Variables in the bashrc file using below command

nano ~/.bashrc

Step 6: Give HBASE_HOME & PATH details below like this:

export HBASE_HOME = / home / sreekanth / Big_Data / HBase-1.2.7

export PATH = $ PATH : $ HBASE_HOME / bin

Step 7:  Check Bashrc changes are using below command


Above command is not working due to not  update within the terminal so we need for the new terminal

Step 8:  Open new terminal then checks update bashrc using below command  :


Step 9: To Install HBase in clustered mode we have to place below properties in conf/ hbase – site.xml file in between configuration tag

Give hbase properties like name and value (directory etc info) below like this

Step 10: We  have to add these properties the end of file in hbase – for region servers

export JAVA_HOME= /usr/lib/jvm/java-8-openjdk-amd64

export HBASE_REGIONSERVERS = $ {HBASE_HOME}/conf/regionservers

export HBASE_MANAGES_ZK= true

Step 11: First start all daemons then start hbase by using below command:

Step 12:  To Access HMaster in Web UI using default port 16010

http : // <<  hostname Of HMaster >> : 16010

When not use use HBASE

When we need to handle transactions and relational analytics.

When Applications need data to be rolled up aggregated or analyzed across now.

Machine Learning

What is Machine Learning?

Machine Learning is a branch of artificial intelligence (AI) that focuses on the development of computer programs that can teach to grow and change exposed to new data. It is concerned with the design and development of algorithms that can take complex input data and can make an intelligent decision based on the input data.

Types of Machine Learning:

1. Supervised Learning :

This type of machine learning is concerned with associating some undefined data document to some predefined label of the training data in order to predict the value of any valid input.  Common examples of supervised learning include classifying e-mail message as spam. Mainly all classification algorithms. Example Mahout.

2.Unsupervised Learning :

Unsupervised learning type of machine learning is concerned with making sense out of complex hard to understand data by creating some similarity or some interesting patterns. No labels are associated with it. Mainly all clustering algorithms in mahout for example.

3.Semi-Supervised Learning :

Semi-supervised learning is concerned with defining an undefined data document in the presence of both labeled and unlabeled data. It is a combination of supervised and unsupervised learning. The main aim of semi-supervised learning is to demonstrate how combining both labeled and unlabeled data can change learning behavior.

Spark Streaming Twitter Example

Spark Streaming Twitter Example:

//Using Scala Program

package org . apache . spark . demo . streaming

import org . apache . spark . streaming . SparkContext._

import org . apache . spark . streaming . twitter._

import org . apache . spark . streaming . {Seconds, StreamingContext}

import org . apache . spark . SparkConf

object TwitterTags{

def main(args: Array[String]){

if(args.length < 5 ){

System. err. println (“Usage : Twitter Popular Tags <consumer key> <consumer secret>”+”<access token><access token secret>[<filters]”)



StreamingExamples . setStreamingLogLevels()

val Array ( consumerKey, consumerSecret, accessToken, accessTokenSecret ) = args.take(5)

val filters = args . takeRight(args.length – 5)

//Set the system properties so that Twitter 4j library used by twitter stream

//Can we use them to generate OAuth(Open Authentication) credentials

System . setProperty (“twitter4j . oauth . consumerKey”, consumerKey)

System . setProperty( ” twitter4j . oauth . consumerSecret”, consumerSecret)

System. setProperty(“twitter4j.oauth.accessToken”,accessToken)

System. setProperty (” twitter4j . oauth. accessTokenSecret”, accessTokenSecret)

val sparkConf = new SparkConf(). setAppName(“TwitterTags”)

val scc=new StreamingContext (sparkConf, Seconds(3))

val stream = TwitterUtils.createStream ( scc, None,filters)

val hashTags =stream. flatMap (status = > status. getText. split(” “).filter(_.startsWith(“#”)))

val topCounts = hashTags. map((_, 1).reduceByKeyAndWindow(_+_, Seconds(60)).map{case (topic, count)=>(count, topic)}.transform(_.sortByKey(false))

val topCounts1 = hashTags . map((_, 1). reduceByKeyAndWindow(_+_, Seconds(30)).map{case (topic, count)=>(count, topic)}.transform(_.sortByKey(false))

//Print Popular hashtags

topCounts . foreachRDD (rdd = > {

val topList  =  rdd. take(30)

println (“\n Popular topics in last 60 seconds(%s total): “.format ( rdd . count()))

topList . foreach {case(count ,tag) => println(“%s(%s tweets)”.format(tag,count))


topCounts . foreachRDD (rdd = > {

val topList  =  rdd. take(60)

println(“\n Popular topics in last 30 seconds(%s total): “.format(rdd. count()))

topList . foreach {case(count ,tag)=>println(“%s(%s tweets)”. format(tag,count)))



scc . start()

scc . awaitTermination()



Spark Streaming Use Case

Spark Streaming Use Case with Explanation:

Using Scala streaming imports

import org. apache. spark. streaming. StreamingContext

import org. apache. spark. streaming. StreamingContext._

import org. apache. spark. streaming.dstream . DStream

import org. apache. spark. streaming.Duration

import org. apache. spark. streaming.Seconds

Spark Streaming Context :

This is also sets up underlying SparkContext that it will use to process data. It takes as input a batch interval specifying how often to process new data


We use socketTextStream() to create a DStream based on text data received on the local machine

Then we transform the DStream with filter() to get only the lines that contains error. Output operation print() to print some of the filtered lines.

Create a Streaming Context with a 1 – second batch size frin a SparkConf

val scc=new StreamingContext(conf, Seconds(1))

// Create DStream using data received after connecting to default port on the local machine

val lines = scc.socketTextStream(“localhost”, 9000)

//Filter our DStream for lines with “error”

var errorLines = lines.filter(_.contains(“error”))

//Print out the lines with errors

errorLines. print()

Above an example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in below figure Input DStreams are DStreams representing the stream of input data received from streaming process. In the above example of converting streaming of lines information words, lines was an input DStream as it represented the stream if data received from the server.

Every input DStream is associated with a Receiver object whether Java, Scala etc. Which receives the data from a source and stores it in Spark’s memory for processing. Here Spark Streaming provides two categories :

  1. Basic Sources: Sources directly available in the Streaming Context API example: file systems, socket connections
  2. Advanced Source: Sources indirectly available  like Flume, Kafka, Twitter etc. are available through extra utility classes.

Smart way of looping in JavaScript

Whenever there is a scenario of looping the first thing that strikes our mind is for. But actually, there are clean and better ways to loop without using for in javascript.


ECMAScript 5 provided methods forEach, map, reduce, filter and ECMAScript 6 provided find method that resides on Array prototype. Therefore depending on the scenario, we can use appropriate methods instead of for loop.

    • forEach: forEach() method calling a provided function on every array element.
      • Demo Array.forEach forEach
    • map: map() method creates a new array with the result of calling a provided function on each element. 
      • Demo map


  • reduce: reduce() method results a single value that is a result of calling a reducer function on each element.
    • Demo Array.reduce()js reducer method
  • filter: filter() method creates a new array with all elements that pass a test implemented by the provided function
    • Demo Array.filter()filter method
  • find: method returns the value of the first element in an array that passes a test implemented by the provided function
    • Demo Array.find()

Who is better at performance?

Obviously for loops are faster than Javascript methods(map, reduce, filter, find)  because these methods can have extra overhead, behind the scene even these methods could be using for loop.

Demo reducer vs forfor vs reduce

If for loops are faster then why javascript methods?

These methods are self-explanatory and using the appropriate method for the use case can help the team understand what you are doing.

Spark Streaming with Pictures

Spark Streaming:

Spark Streaming is a Spark’s module for real-time applications(Twitter tweets, statistics, page views). Lets user writes streaming applications using a very similar API to batch jobs. Spark Streaming is a distributed data stream processing framework. It makes it easy to develop distributed applications for processing live data streams in real time. It only provides a simple programming model but also enables an application to process high-velocity stream data. It also allows the combining of data streams and data for processing.

Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Flume, Kafka etc can be processed using complex algorithms expressed with high-level functions like map, reduce. Processed data can be pushed out to file systems and live dashboards.

Process Flow in Spark Streaming:

Spark Streaming receives live input data streams and divides the data into batches. Spark Engine will process the same data. Once processing is done Spark engine will generate the final stream of outputs in batches.


Streaming Context

Streaming Context , a class defined in the Spark Streaming library, is the main entry point into the Spark Streaming library. It allows a Spark Streaming application to connect to  a Spark Cluster.

Streaming Context provides methods for creating an instance of the data stream abstraction provided by Spark Streaming.

Every Spark Streaming application must create an instance of this class

import org. apache. spark._

import org. apache. spark. streaming._

val config = new Spark Conf(). setMaster (“spark : // host : port”) . setAppName (“Streaming app”)

val batch = 20

val ssc = new Streaming Context(conf, Seconds(batch)

The batch size can be as small as 500 milliseconds. The upper bound for the batch size is determined by the latency requirement of your application and the available memory in spark streaming.

Parquet File with Example


Parquet is a columnar format that is supported by many other data processing systems, Spark SQL support for both reading and writing Parquet files that automatically preserves the schema of the original data. Parquet is a popular column-oriented storage format that can store records with nested fields efficiently. Parquet often used with tools in the Hadoop ecosystem and it supports all of the data types in Spark SQL.

Spark SQL provides methods for reading data directly to and from Parquet files.

Parquet is a columnar storage format for the Hadoop ecosystem. And has gotten good adoption due to highly efficient compression and encoding schemes used that demonstrate significant performance benefits. Its ground-up design allows it to be used regardless of any data processing framework, data model, and programming language used in Hadoop ecosystem including Map Reduce, Hive, Pig and Impala provided the ability to work with Parquet data and the number of data models such as AVRO, Thrift, etc have been expanded to be used with Parquet as storage format.

Parquet is widely adopted by a number of major companies including tech giants such as Social media to Save the file as parquet file use the method. people. saveAsParquetFile(“people.parquet”)

Example on Parquet file:

Scala > val parquet File = sql Context. parquet File(“/home/ sreekanth / SparkSQLInput /users.parquet”)

parquet File: org. apache. spark. sql. Data Frame=[name: string, favorite_hero: string, Favorite_color: string]

Scala > parquet File. register Temp Table(“parquet File”)

Scala>parquet File. print Schema


| name: string( nullable : false)

| favorite: hero( nullable : true)

| favorite_numbers( nullable : false)

Scala>val selected People = sql Context. sql (“SELECT name FROM parquet File”)

Scala> selected>”Name: ” + t(0)).collect().foreach( println )


Name: Alex

Name: Bob

Scala > val selected People = sql Context. sql (“SELECT name FROM parquet File”).show







How to Save the Data in a “Parquet File” format

Scala> val sql Context = new org. apache. spark. sql. SQL Context( sc )

sql Connect: org. apache. spark. sql. SQL Context=org. apache. spark. sql. SQL Context@ hf0sf

Scala> val data frame=sql“/home/ sreekanth /Spark SQL Input/users.parquet”)

data frame: org. apache. spark. sql. Data Frame=[name: string, favorite_hero:string, favorite_color:string]

data“name”, “favorite_hero”) And Fav Hero.parquet)

Hadoop Admin Roles and Responsibilities

Hadoop Admin Roles and Responsibilities:

Hadoop Administrator career is an excellent career and lot of growth opportunities because less amount of people and Hadoop is huge demand technology.

Hadoop Administrator is responsible for Hadoop Install and monitoring Cluster Management.

Roles and Responsibilities:

  1. Capacity Planning and Hardware requirement of the nodes, Network architecture and Planning.
  2. Hadoop Software Installation and configuration whether Cloudera Distribution or Horton Works distribution etc.
  3. Configuring Name Node, Data Nodes to ensure its high availability.
  4. Tuning of Hadoop Cluster and creating new users in Hadoop, handling permissions, performance upgrades.
  5. Hadoop Backup and Recovery tasks
  6. Every day finding out which jobs are taking more time, if users say that jobs are stuck to find out the reason.
  7. Health check of Hadoop cluster Monitoring
  8. Deployment in Hadoop Cluster and maintaining it.
  9. Support and maintenace of Hadoop Storage (HDFS)
  10. Security administration during installation and basic knowledge on Kerberos, Apache Knoz and Apache Ranger etc.
  11. Data migration between clusters if needed ex: using Falcon tool.
  12. Manage Hadoop Log files and analyzing failed jobs
  13. Troubleshoot Network and applications
  14. Knowledge on Scripting Skills on Linux environment
  15. Knowledge on Oozie, Hive , HCatalog and Hadoop Eco – System


Day to Day Activities of Hadoop Admin:

  1. Monitoring Console whether Cloudera Manager or Horton works and job tracker UI.
  2. HDFS Maintenance and Support
  3. Health check of Hadoop cluster monitoring
  4. Managing Hadoop log files and find out errors
  5. Managing users, permissions etc.
  6. Troubleshoot Network errors and application errors.

Skill sets required to become a Hadoop Administrator :

  1. Strong Knowledge on Linux/Unix
  2. Knowledge on Shell Scripting/Python Scripting
  3. Hands on Experience of Cluster Monitoring tools like Ambari, Gangila etc.
  4. Networking and Memory management

Summary: Hadoop Administration is one of the best careers in terms of growth and opportunities. Nowadays the Hadoop market is on rising. If you have knowledge on Linux and Database then admin it can be an advantage.