Difference between groupByKey vs reduceByKey in Spark with examples





Spark provides both the groupByKey and reduceByKey methods for key-value pair RDDs. Here we discuss the major differences between groupByKey and reduceByKey.

What is groupByKey?

  • The groupByKey method returns an RDD of pairs, where the first element in each pair is a key from the source RDD and the second element is a collection of all the values that have that key. It basically groups your dataset based on the key alone.
    The groupByKey method is similar to groupBy, but the major difference is that groupBy is a higher-order method that takes as input a function returning a key for each element in the source RDD.
  • The groupByKey method operates on an RDD of key-value pairs, so a key-generator function is not required as input.

Example:

groupByKey

scala> val data = List("Bigdata", "Spark", "Spark", "Scala", "Spark", "data")
scala> val mapData = sc.parallelize(data).map(x => (x, 1))
scala> mapData.groupByKey().map(x => (x._1, x._2.sum)).collect.foreach(println)

Output:
(Spark,3)
(data,1)
(Scala,1)
(Bigdata,1)

What is reduceByKey?

  • The reduceByKey method is a higher-order method that takes an associative binary operator as input and reduces the values that share the same key. In other words, it merges the values of each key using the supplied function.
  • A binary operator takes two values as input and returns a single output, and an associative operator returns the same result regardless of how the operands are grouped. reduceByKey can therefore be used to calculate the sum, product, minimum, or maximum of all the values mapped to the same key, as sketched below.
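
For instance, the same operator-based pattern covers sums and maximums. Below is a minimal sketch for the Spark shell; the sample scores are made up purely for illustration:

// sample (key, value) pairs; the data is assumed only for illustration
val scores = sc.parallelize(Seq(("Spark", 10), ("Scala", 4), ("Spark", 7), ("Scala", 9)))

// sum of the values per key
scores.reduceByKey(_ + _).collect.foreach(println)                    // (Spark,17), (Scala,13)

// maximum value per key, using the same associative pattern
scores.reduceByKey((a, b) => math.max(a, b)).collect.foreach(println) // (Spark,10), (Scala,9)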

Example:

reduceByKey:

scala> val data = List("Bigdata", "Spark", "Spark", "Scala", "Spark", "data")
scala> val mapData = sc.parallelize(data).map(x => (x, 1))
scala> mapData.reduceByKey(_ + _).collect.foreach(println)

Output:
(Spark,3)
(data,1)
(Scala,1)
(Bigdata,1)

groupByKey vs reduceByKey

The two transformations above, groupByKey and reduceByKey, produce the same output. Even so, we should avoid groupByKey wherever possible, for the following reasons:



  • reduceByKey works faster on a large dataset (cluster) because Spark can combine the output for each common key on every partition before shuffling the data.
  • When we call groupByKey, all the key-value pairs are shuffled around, so a lot of unnecessary data is transferred over the network.

How to Create First Spark Application in IntelliJ IDEA with SBT

Spark is an in-memory framework for processing huge datasets in a Big Data environment. Here we create the first Spark application in IntelliJ IDEA with SBT.




Prerequisites:

  • IntelliJ IDEA
  • Scala
  • Basic knowledge of Spark

Create First Spark Application in IntelliJ IDEA with SBT

Step 1: Open IntelliJ IDEA and select “New Project”. Then choose Scala with sbt and click the “Next” button.

Step 2: Provide your project name and the location for your programs. Here we selected JDK 1.8, sbt 1.1.6, and Scala 2.11.12. I used these versions, but you can select newer ones if you prefer. Then click the “Finish” button.

Step 3: After clicking the “Finish” button, it will take some time to sync the sbt project structure.

Step 4: You will get the Spark application structure with the hierarchy below:

Spark Application -> project
                  -> .idea
                  -> src -> main -> packages -> App
                         -> test
                  -> target
                  -> build.sbt
External Libraries

Here we created a new project named “Spark Application”. By default we get main and test subfolders within the src folder. In the main subfolder, we created a package and then created an application as a “Scala Object”. Please check the snapshot.

Step 5: Next, go to the Maven repository for the Spark Core dependency. Copy the “SBT” snippet from the Maven page into the build.sbt file in the IDE, as in the screenshot below. It will take some time to download the Spark Core dependencies from the Maven repositories.

Note: In the Maven repository, make sure “SBT” is selected so that you get the sbt snippet rather than the Maven XML.
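
As a rough sketch, the copied line usually ends up in build.sbt like the snippet below. The exact Spark version here is only an assumption; pick the one shown on the Maven page that is published for your Scala version (2.11 in this setup):

// build.sbt (sketch): basic project settings plus the Spark Core dependency copied from the Maven page
name := "SparkApplication"
version := "0.1"
scalaVersion := "2.11.12"

// 2.4.0 is just an example version; use the build that matches Scala 2.11
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"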

Step 6: Next, go to the SparkApplication object and write simple code with the import, the package, and a Spark RDD for our understanding.



package com.sreekanth

import org.apache.spark.{SparkConf, SparkContext}

object SparkApplication {
  def main(args: Array[String]): Unit = {
    // configure Spark to run locally with an application name
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("First Spark Application")

    // initialize the SparkContext
    val sc = new SparkContext(conf)

    // build an RDD from an in-memory array and print its elements
    val rdd1 = sc.makeRDD(Array(100, 200, 400, 250))
    rdd1.collect().foreach(println)
  }
}

In the above Spark application we created a SparkConf and a SparkContext to initialize Spark, and then used makeRDD to build an RDD from an array.

Step 7: Right-click on SparkApplication and choose “Run SparkApplication”. We then get the output in the console, as shown below:

Summary: For data engineers, Spark is a mandatory skill for large-scale data processing. Beginners need to know how to execute their first Spark application in IntelliJ IDEA with SBT. Here we showed how to create a Scala object, import the Spark dependency from the official Maven repository, and then run the application.



How to add Scala plugin in IntelliJ IDEA on Ubuntu





Basically, IntelliJ IDEA is a Java-oriented IDE for software development, created by JetBrains. Here we discuss how to install the Scala plugin in IntelliJ IDEA on the Ubuntu operating system.

For first-time users of IntelliJ IDEA, we first check whether the Scala plugin is available or not. In this case, Scala is not available yet.

Install Scala plugin in IntelliJ IDEA:

Step 1: Go to the menu bar at the top of the IDE: “File” -> “Settings”.

Step 2: Click on “Plugins”, then search for Scala; it will show an “Install” option next to Scala. Click the Install button and then click “OK” at the bottom.

Step 3: After the Scala plugin is installed successfully, click “Restart IDE”. We then get the notification “IDE and Plugin Updates”; restart it.

Step 4: After IntelliJ IDEA restarts, click on “New Project”; it now shows the available programming languages, including Scala.

Step 5: Choose Scala with sbt and then click the “Next” button at the bottom of IntelliJ IDEA.

Step 6: Change your project name and location as needed. Check your JDK version, sbt version, and Scala version, then click the “Finish” button.

While fetching the Scala version for sbt and downloading artifacts from the Maven repositories and the Scala official website, the IntelliJ IDEA console shows [SUCCESSFUL] messages. It takes some more time to set up the Scala project structure and fetch the external library files for Java and Scala from the internet.

IntelliJ IDEA supports the following programming languages and their libraries by default:

Java
Groovy
Android
Kotlin
Maven, etc.

If you need another programming language, such as Scala or Python, just add its plugin using the steps above and it will be available in IntelliJ IDEA.



Note: IntelliJ IDEA is one of the best IDEs for developing code in Java, Scala, Python, etc.
Nowadays the tooling is updated day by day to simplify development and improve optimization,
so we need to add support for new programming languages in the IDE through plugins, as in the steps above.

Sqoop commands for Spark Certification (CCA 175)

Basically, the Spark and Hadoop Developer certification (Cloudera Certification CCA 175) is a purely hands-on test and needs real-time experience. CCA 175 is based on Sqoop export/import, data ingestion, and Spark transformations. Here we provide the main Sqoop import/export commands.



Mandatory Skills:

Data Ingest, Transform:

  • Import data from a MySQL database into HDFS, and export it back, using Sqoop
  • Ingest real-time and near-real-time streaming data into HDFS
  • Load data from HDFS and store it back to HDFS using Spark

 

Sqoop Commands:

First, we create a Sqoop directory in HDFS for data ingestion and validation.

 $ hadoop fs -mkdir /user/cloudera/sqoop_import

How to find out the available Sqoop commands on your Linux box:

$ sqoop help

Available Commands:

codegen 
create-hive-table 
eval 
help
import
import-all-tables
import-mainframe
job
list-databases
list-tables
merge
metastore
version

How to find out the list of databases in Sqoop:

sqoop list-databases \
  --connect jdbc:mysql://localhost:3306 \
  --username sqoopusername --password sqooppassword

How to find out the list of tables in Sqoop:

sqoop list-tables \
  --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword

How to evaluate SQL (MySQL) statements in Sqoop:

sqoop eval --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --query "select * from table_name"

List of Sqoop import commands:
How to import all tables as Avro data files into the Hive warehouse directory in Sqoop:

sqoop import-all-tables --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --as-avrodatafile --warehouse-dir /user/hive/warehouse/customer.db

How to import data in text file format into a target directory in Sqoop:

sqoop import --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --table customer --as-textfile --target-dir /user/Sqoop/customer

How to import data in Parquet file format into a target directory in Sqoop:



sqoop import --connect jdbc:mysql://localhost:3306/dataingest --username sqoopusername --password sqooppassword \
  --table customer --as-parquetfile --target-dir /user/Sqoop/customer

How to import a table into a target directory in Sqoop:

sqoop import --connect jdbc:mysql://localhost:3306/dataingest --username sqoopusername --password sqooppassword \
  --table customer --target-dir /user/Sqoop/customer

How to import with a free-form query and a split-by column in Sqoop:

sqoop import --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --query "select * from customer where customer_id = '123' and \$CONDITIONS" \
  --target-dir /user/Sqoop/customer \
  --split-by customer_id \
  --num-mappers 4

How to import a table and append the data to an existing directory in Sqoop:

sqoop import --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --table department \
  --target-dir /user/hive/warehouse/customer.db/departments \
  --append \
  --fields-terminated-by ',' \
  --lines-terminated-by '\n' \
  --num-mappers 4

How to import a table without a primary key using split-by:

sqoop import --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --table department \
  --target-dir /user/hive/warehouse/customer.db/departments \
  --append \
  --fields-terminated-by ',' \
  --lines-terminated-by '\n' \
  --split-by department_id \
  --num-mappers 4

How to import a table with an incremental load in Sqoop:

sqoop import --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --table department \
  --target-dir /user/hive/warehouse/customer.db/departments \
  --fields-terminated-by ',' \
  --lines-terminated-by '\n' \
  --check-column customer_id \
  --incremental append \
  --last-value 7 \
  --split-by department_id \
  --num-mappers 4

How to create a Sqoop job using the Sqoop import command:

sqoop job --create sqoop_job \
  -- import \
  --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --table department \
  --target-dir /user/hive/warehouse/customer.db/departments \
  --fields-terminated-by ',' \
  --lines-terminated-by '\n' \
  --check-column customer_id \
  --incremental append \
  --last-value 0 \
  --split-by department_id \
  --num-mappers 4

How to list the available Sqoop jobs:

sqoop job --list

How to show the Sqoop job using Sqoop command:

sqoop job --show sqoop_job

How to execute the Sqoop job using Sqoop command:

sqoop job --exec sqoop_job

How to import a table into a new Hive table with Sqoop:

sqoop import --connect jdbc:mysql://localhost:3306/dataingest \
  --username sqoopusername --password sqooppassword \
  --table department \
  --fields-terminated-by ',' \
  --lines-terminated-by '\n' \
  --hive-home /user/hive/warehouse \
  --hive-import \
  --hive-table customer_test \
  --create-hive-table \
  --num-mappers 4

How to connect MySQL and create a database for the reporting database:

-- Connect to MySQL and create a database for reporting
-- user: root, password: root
mysql -u root -p
create database customer_reporting_db;
grant all on customer_reporting_db.* to sqoopusername;
flush privileges;
use customer_reporting_db;
-- 5=7 is always false, so only the table structure is copied, not the rows
create table customers as select * from sqoopusername.customers where 5=7;
exit;




How to add Scala Plugin in STS (Eclipse IDE) on Ubuntu




Simple steps to add the Scala plugin to Eclipse IDE (STS) on Ubuntu 16.04.

Step 1: First, install Eclipse IDE (STS) on your operating system.

Step 2: Launch the Eclipse IDE with a proper workspace directory to store your Spring or Java applications on your machine.

Step 3: Go to the “Help” option at the top of your Eclipse IDE.

Step 4: Click on “Eclipse Marketplace”.

Step 5: Search for the “Scala” keyword, i.e., the plugin you want to add to the Eclipse IDE. We then get a result like the snapshot below.

Step 6: Click on the “Install” icon to complete the installation of the Scala IDE plugin version in your IDE.

Step 7: In this step it shows more options, such as Scala IDE for Eclipse; take the default option and click “Confirm”.

Step 8: It then shows the Apache License; click “I accept the terms of the license agreements” and then click the “Finish” button.

Step 9: After that, Eclipse IDE (STS) asks for a “Restart”.




Step 10: Go to the right corner of the Eclipse IDE, click “Open Perspective”, and choose “Scala”, as in the snapshot below.

After that, we can write programs in Scala (functional programming) as well as Java.

Summary: The steps above add the Scala plugin to Eclipse IDE (Integrated Development Environment) or Spring Tool Suite on Ubuntu 16.04. Basically, Scala is a functional programming language that also supports OOP concepts, with a concise coding style built around higher-order functions. Since Spark needs either Java or Scala, this kind of setup keeps both in one environment. I think Spring Tool Suite is the better choice because it is faster and easier than plain Eclipse IDE.
Otherwise, download the Scala IDE for Ubuntu from the Scala official website and install it like Eclipse IDE. If the steps above did not work for you, uninstall Eclipse and install it again.

How to install STS(Spring Tool Suite) on Ubuntu 16.04 with pictures





STS: Spring Tool Suite is a ready-to-run environment for Java, Scala, Groovy, Kotlin, and other programming languages. It is used to deploy Spring applications directly. It works like Eclipse IDE but is faster and easier for Spring applications.

Prerequisites :

Ubuntu 16.04 version operating system

JDK 1.7 or a later version for compatibility

Installation of STS on Ubuntu 16.04:

Step 1: Go to the Spring official website, https://spring.io/tools3/sts/all,
then select the Linux operating system and download the tarball from the site.

Step 2: After downloading the tarball, move it next to your Java installation path for convenience.

Step 3: Change the permissions of the Spring tarball using the command below:

chmod 777 spring-tool-suite-tarball.x86_64.tar.gz

Step 4: Extract the tarball using the command below from the command prompt:

tar -xzvf spring-tool-suite-tarball.x86_64.tar.gz

Output: We get three folders: legal, pivotal-tc-server, and sts-<version>.RELEASE.

Step 5: Go to the sts-<version>.RELEASE folder and run the “STS” application directly. You will get the “Spring Tool Suite 3 Launcher”; click the “Launch” button.

Step 6: We get the Spring Tool Suite dashboard, which makes writing and deploying Spring applications easier.

Creating a simple Spring application:

Step 7: Go to the top-left corner of the STS Eclipse IDE and click the option below:

"File" - > "New" -> "Spring Starter Project"

Step 8: Give your Spring Starter Project a name, then choose the build type, either Maven or one of the other options.




Step 9: Then provide the Group and Artifact details of your project. Click “Next”, then choose the Spring Boot version. If you want a cloud-based setup, choose AWS, GCP, etc., then simply click “Finish”.

Step 10: STS then automatically creates a simple Spring project with main and test folders, including the pom.xml file in the project.

Step 11: As the steps above show, it is very simple to create projects in STS.

Summary: This covered the installation of Spring Tool Suite on Ubuntu 16.04 (a Linux operating system) and the creation of a sample Spring project, including the project details and cloud options. Installation and project creation work the same way on the Windows operating system.

Spark Core programming in Scala with Examples

Spark Core is the base of the entire Spark platform. It can replace MapReduce to perform high-speed computation.




How to count how many times each record appears in a dataset using Spark?

Here is Spark with Scala code that counts the occurrences of each record in a dataset:

val lines = sc.textFile("Datalog.txt")
val lineCounts = lines.map(x => (x, 1)).reduceByKey(_ + _)
lineCounts.saveAsTextFile("/home/Spark/Data")

A simple explanation of the above code:

The "textFile" method loads the dataset into an RDD.
The "map" transformation iterates over every record in the dataset.
The "split(delimiter)" transformation splits every record on the given delimiter.
The "reduceByKey" transformation merges the values of each key using the function supplied for it.

Note: We use the map transformation in cases where we need a one-to-one relation from input to output, as the sketch below illustrates.
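
As a minimal sketch of that note (the sample lines are made up purely for illustration), map keeps the one-to-one relation, while flatMap is the choice when one input record can produce many output records:

val lines = sc.parallelize(Seq("big data", "spark core"))

// map: one output element per input line (one-to-one)
lines.map(line => line.length).collect()          // Array(8, 10)

// flatMap: each line can expand into several words (one-to-many)
lines.flatMap(line => line.split(" ")).collect()  // Array(big, data, spark, core)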
How to load a programmatic array into an RDD in Spark?

In Spark, the parallelize method is used to load arrays into RDDs, as in the code below:

val sparray = Array(10, 20, 30, 40)
val rdd = sc.parallelize(sparray, 3) // load the array into an RDD with 3 partitions

Parallelized collections are created by calling the SparkContext's parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

val newCode = sc.textFile("Datalog.txt")
newCode.filter(x => x.contains("error"))

The filter transformation returns a new RDD containing only the lines that satisfy the condition; contains checks whether the given string is present in a line or not.

How to find out distinct elements in the source RDD?




The distinct method of an RDD returns a new RDD containing only the distinct elements of the source RDD.

Example:

val filetech = sc.parallelize(List("Bigdata", "Hadoop", "Spark"))
val disttech = filetech.distinct
disttech.collect()

Basically, Spark with Scala offers a lot of functional operations that improve the performance of your programs and reduce the amount of code.

Spark SQL (Dataset) Joins with Scala Examples





Spark joins are used to combine datasets. We join one or more datasets with the join() function; the Spark API provides join() on Scala classes to join huge datasets.

Here are the different types of Spark join() functions in Scala:

1. join()
2. rightOuterJoin()
3. leftOuterJoin()

1. join(): In Spark, a simple join is an inner join between two RDDs.
example: rdd.join(other)

2. rightOuterJoin(): It joins two RDDs and keeps every key from the other (right-hand) RDD; the matching values from the source RDD are optional.

example: rdd.rightOuterJoin(other)

3. leftOuterJoin(): It joins two RDDs and keeps every key from the source (left-hand) RDD; the matching values from the other RDD are optional.

example: rdd.leftOuterJoin(other)
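
Here is a minimal sketch of the three variants on tiny pair RDDs (the sample pairs are made up only for illustration); note how the outer joins wrap the optional side in Option:

val rdd   = sc.parallelize(Seq((1, "a"), (2, "b")))
val other = sc.parallelize(Seq((2, "x"), (3, "y")))

// inner join: only keys present in both RDDs
rdd.join(other).collect()           // Array((2,(b,x)))

// right outer join: every key from `other`, left value is optional
rdd.rightOuterJoin(other).collect() // Array((2,(Some(b),x)), (3,(None,y)))

// left outer join: every key from `rdd`, right value is optional
rdd.leftOuterJoin(other).collect()  // Array((1,(a,None)), (2,(b,Some(x))))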

A brief Spark join programming example in Scala:

val linesdata = sc.textFile("Datalog.txt")
// join works only on pair RDDs, so key each line by its first tab-separated field
val keyedLines  = linesdata.map(line => (line.split("\t")(0), line))
val keyedLength = linesdata.map(line => (line.split("\t")(0), line.length))
keyedLines.join(keyedLength).collect()

In most cases, Spark SQL joins are used with structured, RDBMS-like data, such as employee or customer data.
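
Since the title refers to Datasets, here is a minimal sketch of the same idea expressed through the Dataset/DataFrame API; the SparkSession setup and the sample rows are assumptions made only for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinExample").master("local").getOrCreate()
import spark.implicits._

case class Employee(id: Int, name: String, sal: Float)
case class Job(id: Int, cName: String)

// sample rows (assumed for illustration)
val empDS = Seq(Employee(1, "Ravi", 50000f), Employee(2, "Sita", 60000f)).toDS()
val jobDS = Seq(Job(1, "Acme"), Job(2, "Globex")).toDS()

// inner join on the shared id column
empDS.join(jobDS, "id").show()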

For example, with the RDD API, take an employee database with the following schema:

Employee schema: emp id, emp name, emp sal
Job schema: emp id, company name

Here is partial Spark with Scala code for the join:

case class Employee(id: Int, name: String, sal: Float)
case class Job(id: Int, cName: String)
emp.join(job).collect()
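
The snippet above leaves emp and job undefined; a minimal way to complete it (with made-up sample rows, keying both RDDs by the employee id so the pair-RDD join applies) might look like this:

// sample data, assumed only for illustration
val empRDD = sc.parallelize(Seq(Employee(1, "Ravi", 50000f), Employee(2, "Sita", 60000f)))
val jobRDD = sc.parallelize(Seq(Job(1, "Acme"), Job(2, "Globex")))

// join works on pair RDDs, so key both datasets by the employee id
val emp = empRDD.map(e => (e.id, e))
val job = jobRDD.map(j => (j.id, j))

emp.join(job)
   .map { case (id, (e, j)) => (id, e.name, e.sal, j.cName) }
   .collect()
   .foreach(println)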

Jupyter : 500 Internal Server on Ubuntu 16.04




While accessing Jupyter Notebook, we got a “500: Internal Server Error” in the web browser and the following IOError in the command prompt:

 IOError: [Errno 2] No such file or directory: '/home/sreekanth/.jupyter/jupyter_notebook_config.json'

Solution 1:

Step 1: Stop the Jupyter Notebook services in the command prompt.

Step 2: Clear the cache and browser history in the web browser.

Step 3: After that, restart the Ubuntu 16.04 operating system.

Step 4: Start the Jupyter Notebook again from the command prompt using the command below:

jupyter notebook

Step 5: It will then open the browser automatically; otherwise, open localhost in your browser with the port shown in your command prompt.

localhost:8888/tree

Solution 2:

Step 1: In this solution, we update JupyterHub using the command below:

pip install --upgrade jupyterhub

Step 2: After that, update the “nbconvert” package for the current user in Jupyter Notebook; it can be the cause of the internal server error in the web UI:

pip install --upgrade  --user nbconvert

Step 3: Clear the cache and browser history of your web browser before accessing localhost.

Step 4: Start the Jupyter Notebook from the command prompt on Ubuntu 16.04 using the “jupyter notebook” command.

Step 5: Then verify in the web browser, using localhost, whether it is accessible or not. If you can access Jupyter, log out from the Jupyter Notebook in your web browser.




Step 6: Open localhost again in your web browser with the specific port number shown in your command prompt.

After completing the above steps, I opened Jupyter in the browser, and now it is working fine for me.

How to Install Jupyter on Ubuntu 16.04 with Pictures| Commands | Errors | Solution




Installation of the Jupyter Notebook on Ubuntu 16.04 with commands:

Here is the step-by-step process for installing Jupyter. You will need some prerequisites:

  • Ubuntu 16.04 version.
  • Python 2.7 or a later version

First, we will update the package lists from the repositories using the command below.

sudo apt-get update

The Jupyter Notebook needs Python and the Python development kit, so check the Python version and the Python pip version.

python --version

Check the Python pip version:

pip --version

If Python and Python pip are not available, then install Python and its development kit for the Jupyter Notebook.

sudo apt-get -y install python2.7 python-pip python-dev


After installing Python and Python pip, check the versions again. Now we will install IPython and the Jupyter Notebook. IPython is an interactive command-line interface to Python, and Jupyter provides the web interface for many languages.

sudo apt-get -y install ipython  ipython-notebook

In this step, we will install Jupyter Notebook using the below command.

sudo -H pip install jupyter

It will take some more time to install the Jupyter Notebook on the Ubuntu operating system. If the output shows a message like “You should consider upgrading via the ‘pip install --upgrade pip’ command”, then simply run that command in the command prompt.

sudo -H pip install --upgrade pip

After the successful installation of the Jupyter Notebook, check whether it is installed properly or not:

jupyter notebook

The Jupyter Notebook runs on a system with JavaScript installed. If it runs successfully, there is no problem; otherwise, you will get this type of error in the command prompt:

Error:

sreekanth@sreekanth-Inspiron-5537:~# jupyter notebook
Traceback (most recent call last):
  File "/usr/local/bin/jupyter", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/jupyter_core/command.py", line 230, in main
    command = _jupyter_abspath(subcommand)
  File "/usr/local/lib/python2.7/dist-packages/jupyter_core/command.py", line 133, in _jupyter_abspath
    'Jupyter command `{}` not found.'.format(jupyter_subcommand)
Exception: Jupyter command `jupyter-notebook` not found.

Solution:




Go to the root user and run the command below; it automatically upgrades Python pip and forcefully reinstalls the packages and libraries. After that, it will collect the Jupyter Notebook packages and libraries.

pip install --upgrade --force-reinstall --no-cache-dir jupyter

The above command takes around 5 minutes to collect the packages.

After completing the above step, try to open the Jupyter Notebook as the root user. We get the error below in the CLI, so exit from the root user using the “exit” command and switch to a normal user.

"Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret [] Running as root is not recommended. Use --allow-root to bypass."

After exiting the root user, run the jupyter notebook command in the CLI as below:

It then shows that the Jupyter Notebook is in running status and opens the browser automatically; otherwise, access the hostname below with the port number.



http://localhost:8888

We then get the following web UI in the browser: