ClassNotFoundException in Spark while submitting the jar file

First, I am writing the Spark application program in Scala, the same code I would otherwise prototype in the Scala Spark shell.



ClassNotFoundExceptions in Spark:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object sparkDemo {
  def main(args: Array[String]): Unit = {
    // Configure and create the SparkContext
    val conf = new SparkConf().setAppName("Deploy").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    // Read the input log file from the local file system
    val data = sc.textFile("file:///home/Spark/Test/Input.log")

    // Keep only the lines that contain "Population"
    val filteredData = data.filter(x => x.contains("Population"))

    println("Filtered Data")
    filteredData.foreach(println)
    filteredData.saveAsTextFile("file:///home/Spark/Output")
  }
}

While submitting the Spark application jar file with spark-submit, we get the error below:

spark-submit --class sparkDemo.sparkDemo --master local[*] file:///home/Spark/Output/sparkDemo-0.0.1-SNAPSHOT.jar

Error 1:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: 
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource

Solution:

bin/spark-submit --class sparkDemo.Main --master local[*] file:///home/Spark/Output/sparkDemo-0.0.1-SNAPSHOT.jar

After submitting the Spark jar with the above command, it worked.

Error 2:

java.lang.ClassNotFoundException: Failed to find data source:
org.apache.spark.sql.avro.AvroFileFormat. Please find packages at
http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:438)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:244)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)

Solution:

The above error is an Avro file format exception caused by version incompatibility with Spark. So first check that the Spark Core and Spark SQL versions on the classpath are compatible with each other.

After fixing the compatibility between Spark Core & Spark SQL, bring in Spark-Avro with a matching version.
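As a minimal sketch (the version number is an assumption; match it to the Spark version your cluster actually runs), the sbt dependencies can be declared so that Spark Core, Spark SQL, and Spark-Avro all share one version:

// build.sbt sketch: keep spark-core, spark-sql and spark-avro on the same Spark version
val sparkVersion = "2.4.4"  // assumption: replace with your cluster's Spark version

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-avro" % sparkVersion
)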




Summary: Here we resolved two ClassNotFoundException errors: one during Spark jar file submission and one from the Spark-Avro file format, along with the exact solutions.

Difference between map and flatMap in Spark | What are map and flatMap, with examples




  • What is map in Spark?
    The map method is a higher-order method that takes a function as input and applies it to each element of the source RDD to create a new RDD in Spark. Here we read a log file from the local file system:
scala> val file = sc.textFile("/home/sreekanth/Desktop/input.log")
scala> val fileLength = file.map(l => l.length)
scala> fileLength.collect
Output:
res1: Array[Int] = Array(10, 34, 15, 14)
  • What is flatMap in Spark?

The flatMap method is a higher-order method and transformation operation that takes an input function, which returns a sequence for each input element passed to it.

The flatMap method returns a new RDD formed by flattening this collection of sequences.

Example:

scala> val file = sc.textFile("hdfs://localhost:54310/SparkInputDirectory/gdt")
scala> val fileWords = file.flatMap(a => a.split(" "))
scala> fileWords.collect
Output: res1: Array[String] = Array(Hello, Bigdata, Spark, MongoDB)

Difference between map and flatMap:

map -> map returns exactly one output element for each input element.

Example (PySpark):

sc.parallelize([10, 20, 30]).map(lambda a: range(1, a)).collect()

flatMap -> flatMap returns zero or more output elements for each input element
(the resulting sequences are flattened).
Example (PySpark):

sc.parallelize([10, 20, 30]).flatMap(lambda a: range(1, a)).collect()
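The same contrast in Scala, as a minimal sketch (it assumes an active SparkContext named "sc", e.g. inside spark-shell):

// map: one output element per input element -> an RDD of 3 ranges
sc.parallelize(Seq(10, 20, 30)).map(a => 1 until a).collect()

// flatMap: the ranges are flattened -> one RDD containing all the numbers
sc.parallelize(Seq(10, 20, 30)).flatMap(a => 1 until a).collect()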

How to resolve an ODBC connection issue in Hive in a Hadoop Cluster | MapR | Cloudera

Nowadays, one of the most common issues is an ODBC connection failure while connecting to Hive in a Hadoop cluster (Cloudera, MapR).




The Hive ODBC connection issue occurs in Hadoop clusters on both the Cloudera and MapR distributions.

The Hive ODBC connection error most often looks like this:

Failed to establish the connection
SQLSTATE: HY000 [MapR][HiveODBC] Error from Hive: connect() failed: errno = 10061

Resolution :

Step 1: First, remove any previously created Hive ODBC connections.
Step 2: Open the ODBC Data Source Administrator and click on the Hive ODBC system data source that matches your cluster.
Step 3: Here HIVESSL_ODBC is available; click on the "Add" button as in the snapshot below.

Step 4: After that, select the MicroStrategy Hive ODBC Driver; this opens the Hive ODBC Driver DSN Setup for configuration.

Step 5: In the Hive ODBC Driver DSN Setup:

Provide the details below:

  • Data Source Name: either Sample MapR Hive DSN or Cloudera Hive DSN
  • Host: the hostname of the machine where the Hive server is installed
  • Database: default
  • Port: by default it shows 10000, 10501, etc.
  • Hive Server Type: Hive Server 2
  • Mechanism: select User Name and Password
  • Username: your username
  • Password: your password
  • Check whether the SSL option is enabled or not. If it is not enabled, enable it.

Finally, test the connection; you should get a "SUCCESS" message, then simply click "OK".

Step 6: After completing your Hive ODBC settings, restart the Hive service.




Open Beeline on your cluster's edge node:

beeline> !connect jdbc:hive2://<hiveserver>:10001/default;principal=hive/<hiveserver>@HADOOPCLUSTER.COM;transportMode=http;httpPath=cliservice

Bottom-to-Top Approach in Spark | Use Case with Example (Scenario)

    • Scenario: We have 100 crore log files of 1 TB each containing error records, and we need to find those error records.





Basically, Hadoop follows a top-to-bottom processing approach, working from source to destination across the large data files.

Hadoop's top-to-bottom processing approach:

Step 1: The storage system (HDFS or LFS) holds the 100 crore 1 TB log files.

Step 2: The log files are converted into input splits for the next stage of processing.

Step 3: After being converted into splits, the data moves to the Mapper phase.

Step 4: In this step, the sort & shuffle phase happens.

Step 5: After the sort & shuffle phase completes, the data moves to the Reducer phase.

Step 6: From the above steps we get the output, i.e. the error log records.

Here the error records are processed all the way from step 1 to step 6, which adds considerable time to finding the error log files.

This is where Spark comes into the picture. Spark uses a bottom-to-top processing approach with cache memory, finding the error log files in fewer steps and with less time complexity.

Spark's bottom-to-top processing approach:

Step 1: Spark creates a base RDD pointing at the location of the 100 crore files, whether that is HDFS, LFS, NoSQL, an RDBMS, etc.

Step 2: In this step, the records are filtered for error entries using a transformation:

val errorLines = file.filter(x => x.contains("error"))

Step 3: An action is then used to find out how many error records are in the storage location:

errorLines.count()




These three steps are enough for Spark to find the error files in the storage location.
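A minimal sketch of these three steps in Scala (it assumes an active SparkContext named "sc"; the input path is hypothetical):

// Step 1: base RDD pointing at the storage location
val logs = sc.textFile("hdfs:///data/logs/*")

// Step 2: transformation - keep only the error records (lazy, nothing runs yet)
val errors = logs.filter(x => x.contains("error"))

// Step 3: action - triggers the job and returns the count to the driver
println(errors.count())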

How Spark processes the above steps:

      • First, Spark starts from the step 3 action, which triggers the job to pick out the records containing the error keyword from the 100 crore 1 TB files.
      • Second, it evaluates the filter to identify the log records that contain an error and counts them.
      • Third, it reads the files from the underlying storage (HDFS, LFS, NoSQL, etc.).

Summary: Spark follows a bottom-to-top approach, so it is very fast compared to Hadoop's top-to-bottom approach for large data. Spark uses the cache for fast processing, so only the required data is kept in the cache.

First, Spark triggers the action, then the transformed RDDs, and only after that does it go to the storage location. This is why Spark can be up to 100 times faster than Hadoop MapReduce for large data processing.

Fault Tolerance in Spark | RDD | Transformations | Actions




What exactly is Spark?

  • Spark is an open-source, in-memory cluster computing framework for processing huge volumes of data.
  • Spark is not meant for storage; it is only a processing framework.
  • Spark is not tied to the data locality design rule, i.e. Spark accepts input data from any system: LFS (Local File System), HDFS, NoSQL, RDBMS tables, etc.

Important Spark modules or components:

1. Spark Core

2. Spark SQL - structured data

3. Spark Streaming - real-time data

4. MLlib - machine learning

5. GraphX - graph processing.

RDD:

Resilient - fault tolerant

Distributed - spans across the nodes of the cluster

Dataset - a collection of huge data.

What is RDD?

The main abstraction Apache Spark provides is the "RDD", a collection of elements partitioned across the nodes of the cluster (single-node or multi-node) that can be operated on in parallel. The chance of some node failing grows in proportion to the number of nodes in the cluster, which is why RDDs must be fault tolerant.
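A minimal sketch of this idea (assuming an active SparkContext named "sc"): the elements are partitioned across the cluster and operated on in parallel.

// Create an RDD of 1000 elements split into 8 partitions
val data = sc.parallelize(1 to 1000, numSlices = 8)

println(data.getNumPartitions)  // 8 partitions spread across the cluster
println(data.sum())             // 500500.0, computed in parallel per partition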



Fault tolerance in Spark with RDDs:

RDDs are designed to be fault tolerant and automatically handle node failures. When a node fails and the partitions stored on that node become inaccessible, Spark reconstructs the lost RDD partitions on another node.

Spark stores lineage information for each RDD (the chain of transformations that produced it). Using this lineage information, it can recover parts of an RDD, or even an entire RDD, in the event of node failures.
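You can inspect this lineage yourself; here is a minimal sketch (assuming an active SparkContext named "sc"; the file path is hypothetical):

// Build an RDD through a couple of transformations
val lines  = sc.textFile("/home/sreekanth/Desktop/input.log")
val errors = lines.filter(_.contains("error")).map(_.toUpperCase)

// Print the lineage (the chain of parent RDDs) Spark uses to rebuild lost partitions
println(errors.toDebugString)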

Major RDD operations:

To drive Spark processing, there are two kinds of operations we can apply to RDDs:

  • Transformations or Transformed RDD
  • Actions or Action RDD

Transformations:

A transformation converts a source RDD into a new RDD.

  • Source RDD ——> Transformation ——> New RDD

Below are the most-used transformations in Spark:

map
filter
flatMap
reduceByKey
groupByKey

Note: A transformation never returns a value to the driver program; instead, it only produces a new RDD in Spark processing.
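A minimal sketch of chained transformations in Scala (assuming an active SparkContext named "sc"); note that none of these lines triggers a job:

// Transformations only build new RDDs; no computation runs yet
val sentences = sc.parallelize(Seq("spark is fast", "hadoop is reliable"))
val pairs     = sentences.flatMap(_.split(" ")).map(word => (word, 1))
val counts    = pairs.reduceByKey(_ + _)   // still lazy: just another new RDD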

Actions: 

An action converts an RDD into a value that is returned to the driver program; it does not produce another RDD.

  • Source RDD —> Action —> Value returned to the driver program

Below are the most-used actions in Spark processing:

collect
count
take
top
saveAsTextFile
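Continuing the "counts" RDD from the transformations sketch above, here is a minimal sketch of actions (the output path is hypothetical); each of these lines actually triggers a job and returns a value to the driver:

counts.count()                             // number of distinct words
counts.take(2)                             // the first two (word, count) pairs
counts.collect().foreach(println)          // bring all results to the driver
counts.saveAsTextFile("/tmp/word-counts")  // write results out (hypothetical path)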

How to install MongoDB on macOS





1. Navigate to https://www.mongodb.com/download-center from your browser

2. Inside MongoDB download center you will see three sub-tabs ‘Cloud’, ‘Server’ and ‘Tools’

3. Click on the ‘Server’ tab. You will be prompted to select the MongoDB version that you would like to download, the operating system of your machine (macOS in our case), and the package.
4. On completion of the prompted fields, click on Download.

5. Your browser will download the MongoDB-macOS folder.

6. Paste this folder in your ‘Home Directory’.

7. Not able to find the home directory? Inside Finder press ‘Shift + Cmd + H’ and you will be taken to the home directory.

8. Paste the downloaded MongoDB-macOS folder inside the home directory.

9. Now we have to set an environment path so our machine can find MongoDB.

10. Open your .bash_profile file. It is generally hidden; you can show hidden files by pressing ‘Cmd + Shift + .’. Now you should be able to see .bash_profile. If you don’t find one, create a new file and save it as ‘.bash_profile’.

11. Set the MongoDB path in .bash_profile. A sample .bash_profile is attached for reference.

12. Restart your terminal




13. Now type a basic command to check that your Mongo setup succeeded: ‘mongo --version’. The command will respond with the MongoDB version that you installed.

What is Sqoop in Hadoop | Simple Commands for beginners




What is Sqoop in the Hadoop ecosystem?

Sqoop is a component built on top of the Hadoop Distributed File System (HDFS) and is meant for communication with an RDBMS.

i.e.    either importing RDBMS tabular data into Hadoop, or exporting processed data from Hadoop to an RDBMS table.

  • Sqoop is only meant for data ingestion; it is not used for processing data with business logic the way MapReduce, Pig, and Hive are.
  • The Sqoop component is not bundled in the default installation of Hadoop, so we must install Sqoop separately on top of the Hadoop boxes.
  • For both import and export, Sqoop internally uses only the MapReduce Mapper phase.

Some of the key observations with respect to Sqoop:

1. Import and export only happen through Hadoop HDFS; Sqoop does not communicate with the LFS (Local File System).

2. If you are communicating from Hadoop to any relational database using Sqoop, the target RDBMS must be a Java (JDBC) compatible database.

3. If you are communicating with an RDBMS from Hadoop using Sqoop, the RDBMS-specific connector jar file must be placed in the directory below:

$SQOOP_HOME/lib

By default, Sqoop uses only the Mapper process to import data into Hadoop (HDFS); there is no Reducer phase.

A simple Sqoop command to import an entire table from source to destination:

sqoop import \
--connect jdbc:mysql://localhost:3306/testdata \
--username sqoopuser \
--password sqooppaswd \
--table tablename

Sqoop by default uses 4 mappers; however, we can change the number of mappers using the command below:

sqoop import \
--connect jdbc:mysql://localhost:3306/testdata \
--username sqoopuser \
--password sqooppaswd \
--table tablename \
--num-mappers 10

Note: When using the import-all-tables option:

1. Either we have to use the default HDFS path, or
2. the Hive warehouse path (use the Hive warehouse path to import the data).
  • Sqoop can use Parquet, ORC, CSV, etc. file formats for tabular data.
  • Sqoop also integrates with Hive and HBase.




Basics for Cloudera (Hortonworks) Hadoop Administration




Basics for Cloudera (Hortonworks) Administration:

1. Introduction to Hadoop

Hadoop is a solution for Big Data (large volumes of data): an open-source framework for the storage and processing of large data, including structured, semi-structured, and unstructured data of different volumes. Hadoop uses HDFS (Hadoop Distributed File System) for storage, and MapReduce with the Java, Python, or R programming languages for processing.

2. Mandatory Hadoop ecosystem components:
Basically, the Hadoop ecosystem includes the components below:

A. HDFS (Hadoop Distributed File System) - Storage
B. MapReduce                              - Processing
C. Hive                                   - Data summarization
D. Sqoop                                  - Import/export data between RDBMS and Hadoop
E. ZooKeeper                              - Coordination of Hadoop tasks
F. HBase                                  - Hadoop + database
G. Oozie                                  - Scheduling Hadoop jobs
H. Kafka                                  - Distributed messaging system
I. PIG                                    - Scripting language for processing.

 

3. Hadoop Distributed File System concepts:
A. Name Node (master)

B. Data Node

C. Secondary Name Node

4. MapReduce & YARN concepts

MapReduce has a JobTracker & TaskTrackers in the Hadoop 1.0 version.

The YARN concepts evolved in the Hadoop 2.0 version: the ResourceManager, ApplicationMaster, and NodeManager.

5. Hadoop cluster capacity planning:
Capacity planning for a Hadoop cluster depends on the project and the daily data volume; only then can we proceed with the plan.

6. Hadoop Installation & Prerequisites:

Hadoop Installation & Prerequisites on Ubuntu
Hadoop Installation & Prerequisites on Windows

7. Configuring different types of schedulers, such as the Capacity & Fair schedulers

The Capacity Scheduler uses FIFO ordering within each queue in the Hadoop ecosystem.

The Fair Scheduler is like the Capacity Scheduler, but here two jobs can be processed in parallel by sharing resources.

8. Cloudera installation on a single-node cluster using Cloudera Manager

Cloudera installation is like a Hadoop single-node install, but here we use Cloudera Manager. By default, Cloudera Manager ships with a JDK and a default (embedded) database.

9. Cloudera Manager upgrade process:

The Cloudera Manager upgrade is one of the easier processes across CDH versions; the features differ from one version to another.

A. Collect the upgrade information, check the prerequisites, and upgrade the services.

B. After upgrading the services, test them for version compatibility.

10. Commissioning and Decommissioning:

A. Add the data node hosts to the slaves file and create an "include" file.

B. Refresh the nodes, then go to the $HADOOP_HOME/sbin directory and start the services. This is called the commissioning process.

C. Remove the data node from the slaves file and create an "exclude" file.

D. Refresh the nodes and then run the balancer. This is called the decommissioning process.

11. Edit logs and NameNode image file details:
Information about the edit logs and the NameNode image file (fsimage), and how these files are updated.

12. Fundamentals for High Availability

13. Configuring High Availability

14. Hadoop security – Securing Authentication with Kerberos

15. Hadoop Security – HDFS encryption

16. Cloudera Backup and Disaster Recovery

17. Monitor & Manage the Cloudera Hadoop Cluster.



How to enable Kerberos on CDH (the latest Cloudera version) using Cloudera Manager with MIT KDC




Cloudera with Kerberos:

Nowadays the most popular big data distribution is Cloudera; in the present landscape, Hortonworks and Cloudera have merged, bringing different features together. Here we explain the setup of Kerberos on the latest Cloudera (CDH) version using Cloudera Manager, step by step. Log in to the Cloudera Web UI to see the setup of the cluster and the Cloudera Management services with their data.


After successfully logging in to the cluster, we secure it with either Kerberos or SASL, as is standard in the present era. Here we choose an MIT Kerberos installation and show how to set it up simply with Cloudera Manager.

What is Cloudera?
Cloudera is one of the biggest Big Data distributions, with a simple Web UI and rich features. It provides Hadoop ecosystem services such as HDFS, Hive, Spark, Hue, and Impala for Hadoop users.

What is Kerberos?

Kerberos is a strong authentication protocol, developed by MIT, that is used to secure the Hadoop cluster. It is far more secure than Hadoop's default (simple) authentication.

MIT Kerberos installation for Cloudera Manager:

First, install MIT Kerberos on the Cloudera cluster environment with the following steps.

Step 1: Install the KDC server (the latest version) on a CentOS / RHEL / Oracle Linux operating system. Ubuntu and other operating systems use different commands; see the Cloudera documentation. On CentOS, use yum to install the krb5 server and its libraries:

yum install krb5-server krb5-libs krb5-workstation

Step 2: After the yum installation of the krb5 server and libraries, make some changes to the configuration file, such as the IP address or hostname; in particular, change the default realm for your hostname.

nano /etc/krb5.conf

Step 3: Change the configuration file to use your hostname or FQDN (Fully Qualified Domain Name) in the realm settings, as below.

Step 4: After successfully changing the krb5 configuration file, create the Kerberos database with kdb5_util. This creates the database the KDC server uses for strong authentication.



kdb5_util create -s

Step 5: After the Kerberos database is created, start the KDC server and the KDC admin server using the commands below:

service krb5kdc start
service kadmin start


Step 6: After starting the KDC server and the KDC admin server, check the service status and make sure both services come back automatically on reboot with the commands below:

chkconfig krb5kdc on 
chkconfig kadmin on

Step 7: After that, create the admin principal for Kerberos. Add the principal on the kadmin server using the command below:

kadmin.local  -q "addprinc admin/admin"

Step 8: The added admin principal gets its permissions from the KDC ACL (Access Control List) file. Open the ACL file to adjust the permissions with your specific details, such as the realm/FQDN:

nano /var/kerberos/krb5kdc/kadm5.acl

Step 9: The file above shows "*/admin@EXAMPLE.com"; change it to the alias name of your own realm, like below:

*/admin@EXAMPLE.com  ->  */admin@HADOOP.com

Step 10: After changing it to your realm and saving the kadm5.acl file, restart the kadmin process (the kadmin server) using the simple command below.

service kadmin restart


Step 11: Now connect to the kadmin server, authenticating as the principal "admin/admin@HADOOP.com" with the password we set up for kadmin on this server:

kadmin -p admin/admin@HADOOP.com

Step 12: Open the Cloudera Web UI with the admin user and admin password, then go to the top-right corner and click Administration -> Security. From the Security page, enable Kerberos simply by using Cloudera Manager.

Step 13: In this step, go to the enable options; here we choose the defaults ("select all options") to enable Kerberos for the cluster, as in the snapshot below.

Step 14: Provide your values (options) for the KDC and realm settings as in the steps below. Here we keep the default values, then click on "Continue".

 A. Select the KDC type: "MIT KDC".
 B. For the Kerberos security realm, give your alias name, e.g. "HADOOP.com".
 C. For the KDC server and the KDC admin server, give your hostname.
 D. Give the Kerberos encryption types, etc.

 

Step 15: After completing the KDC settings, click on the KRB5 configuration, select "Manage krb5.conf through Cloudera Manager", keep the default values, and click "Continue".




Step 16: Give your KDC account credentials, i.e. your KDC account username and password, as in the snapshot below.

Step 17: After clicking "Continue", the KDC Account Manager credentials are imported automatically. If this is not successful, correct the KDC configuration details and then continue with this process again.

Step 18: Once the KDC Account Manager credentials are updated, Cloudera Manager updates the Kerberos principals for the Hadoop cluster services such as HDFS, Hue, Hive, Spark, Oozie, and ZooKeeper.

Step 19: In this step, we need to configure the ports for the DataNode transceiver and the DataNode HTTP Web UI; the defaults are fine. If you want to change them, simply do so, then click "Continue".

Step 20: After setting up the ports, enable the Kerberos deployment on the cluster services and click the "Continue" option.

Step 21: Then click on the "Finish" button. Once deployment on the cluster services has completed successfully, it means "Enable Kerberos for Cluster" is done on your Cloudera installation; a congratulations message for the cluster is shown, as in the snapshot below.

Step 22: Kerberos is now successfully enabled on the Cloudera cluster for all services.

Step 23: "Restart" all services from the Cloudera Web UI. Once all services have restarted, enabling Kerberos on the cluster services is fully complete.

Step 24: If HDFS or any other service fails, stop and start that service manually from within the Cloudera Manager Web UI.




Summary: The steps above enable Kerberos on Cloudera with the help of Cloudera Manager, using an MIT KDC server to secure the entire cluster. If you need folder-level access control, use Ranger; for a gateway, use Knox. Kerberos provides strong authentication for data security; an AD or MIT KDC is the first level of authentication in any Hadoop cluster. We also saw how to use Cloudera Manager with different services such as Hive, Spark, and ZooKeeper; the Cloudera Web UI is the simplest way to configure security for your data, and Cloudera Manager is a simple Web UI with more features than the remaining Big Data distributions. If any service fails, simply restart it; most of the time the cause is a network or socket connection issue reported by Cloudera Manager for the cluster services.

How to Install Spark on Ubuntu 16.04 with Pictures




Spark Installation:

Spark is an in-memory data processing framework and engine. It can be up to 100 times faster than Hadoop MapReduce for data processing. Spark offers Scala, Java, and Python programming language APIs for Hadoop & Spark developers in Big Data and analytics.

Here are the Apache Spark installation prerequisites:

1. Update the packages on the Ubuntu/Linux operating system:

sudo apt-get update


2. Install Java 1.7 or a later version using the command below:

sudo apt-get install default-jdk

Complete installation of Spark on Ubuntu 16.04, step by step with pictures:

Step 1: Download the Spark tarball from the official Apache Spark mirror website.


A. Choose the Spark release version you want.

B. After that, choose the package type you want from the dropdown button.

C. After selecting the Spark version, click on Download Spark (choose the latest version for more features).

Step 2: Extract the latest Spark tarball, either by right-clicking on the tarball directly or by using the command below in a terminal:

tar -xzvf spark-2.4.4-bin-hadoop2.7.tar.gz

Please find below a snapshot for more information:

Step 3: After extracting the tarball, update the Spark home and path variables in the .bashrc file in your home directory.
.bashrc file -> press Ctrl+H in the home directory to reveal the hidden .bashrc file, then open it and add the lines below (or edit it from the terminal, e.g. with nano ~/.bashrc):

export SPARK_HOME=/home/sreekanth/Spark/spark_version
export PATH=$PATH:$SPARK_HOME/bin

Step 4: Check whether the Spark home and path variables are set.
A. Open a new terminal and use the command below:

echo $SPARK_HOME

Note: Don't use the previous terminal; it does not have the updated environment.

Step 5: After completing the above steps, Spark is successfully installed on Ubuntu 16.04.

Step 6: Then open the Spark shell in the terminal using the command below:

spark-shell

After the Spark shell opens, you will see output like the screenshot above, showing the Spark version along with the Java version.
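As a quick sanity check (a minimal sketch; the numbers are arbitrary), you can run a small job directly from the shell:

// spark-shell already provides the SparkContext as "sc"
val rdd = sc.parallelize(1 to 100)
println(rdd.sum())   // should print 5050.0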




Summary: In Big Data & Analytics, Hadoop is one solution for providing storage and processing, and MapReduce is one option for processing data. But since Spark came into the picture, MapReduce usage has been declining, because Spark is lightweight and up to 100 times faster than MapReduce for large data processing. The simple steps above install Spark on Ubuntu 16.04, with pictures.