java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver in Spark Scala




While writing Apache Spark programs in Scala or Python (PySpark) to read data from an Oracle database, on Linux or on Amazon Web Services, you will sometimes get the error below because the Oracle JDBC driver is missing from the classpath configured by spark.driver.extraClassPath or spark.executor.extraClassPath.



Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver

at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:35)

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala)

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala)

at scala.Option.foreach(Option.scala:236)

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala)

at <init>(<console>:46)

at <init>(<console>:52)

at <clinit>(<console>)

at <init>(<console>:7)

at <clinit>(<console>)

at $print(<console>)

Solution:

After getting this error, the simple solution below applies to Spark with Scala; the same error can also appear in Python (PySpark).

Add the Oracle JDBC jar to both the executor and the driver classpath. First, edit the Spark defaults configuration file, spark-defaults.conf.

Add the following two entries to spark-defaults.conf:

spark.driver.extraClassPath /home/hadoop/ojdbc7.jar
spark.executor.extraClassPath /home/hadoop/ojdbc7.jar

Make sure the jar referenced by these two paths is compatible with your Spark and Oracle versions; otherwise you will run into compatibility issues.

If setting these two classpaths in the configuration file still produces the error, pass the same settings on the command line (with --conf, --driver-class-path, or --jars) when launching your Scala or PySpark job, as in the example at the bottom of this section.

If you still get the same issue, follow the solution below:



Step 1: Download the Oracle JDBC jar (for example, ojdbc7.jar) from the official Maven repository.

Step 2: Copy the downloaded jar file into the shared Spark jars location below.

/usr/lib/spark/jars

For example, the following PySpark code snippet shows the full flow:

 

# Launch PySpark with the Oracle JDBC jar on the driver classpath and shipped to the executors
pyspark --driver-class-path /home/hadoop/ojdbc7.jar --jars /home/hadoop/ojdbc7.jar

from pyspark import SparkContext, SparkConf   # SparkContext and configuration
from pyspark.sql import SQLContext            # SQL context

sqlContext = SQLContext(sc)

# Database connection to read data into Spark through JDBC
dbconnection = sqlContext.read.format("jdbc").options(
    url="Give your jdbc connection url path",
    driver="oracle.jdbc.driver.OracleDriver",
    dbtable="Give your table name").load()
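For comparison, here is a minimal Scala sketch of the same JDBC read. It assumes the jar is passed the same way (for example, spark-shell --driver-class-path /home/hadoop/ojdbc7.jar --jars /home/hadoop/ojdbc7.jar), that sqlContext is available as in spark-shell, and that the URL and table placeholders are replaced with your own values.

// Read an Oracle table into a DataFrame through JDBC (Scala)
val dbConnection = sqlContext.read
  .format("jdbc")
  .option("url", "Give your jdbc connection url path")        // e.g. a jdbc:oracle:thin URL
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("dbtable", "Give your table name")
  .option("user", "Give your database user")
  .option("password", "Give your database password")
  .load()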

 

Big Data Spark Multiple Choice Questions



Spark Multiple Choice Questions and Answers:

1) Point out the incorrect statement in the context of Cassandra:

A) Cassandra is a centralized key-value store

B) Cassandra was originally designed at Facebook

C) Cassandra is designed to handle a large amount of data across many commodity servers, providing high availability with no single point of failure.

D) Cassandra uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing

Ans: A

2. Which of the following are the simplest NoSQL databases in a Big Data environment?

A) Document                                    B) Key-Value Pair

C) Wide-Column                              D) All of the above mentioned

Ans: B) Key-Value Pair

3) Which of the following is not a NoSQL database?

A) Cassandra                          B) MongoDB

C) SQL Server                           D) HBase

Ans: SQL Server


4) Which of the following is a distributed graph processing framework on top of Spark?

A) Spark Streaming                   B)MLlib

C)GraphX                                          D) All of the above

Ans: GraphX

5) Which of the following leverages Spark Core's fast scheduling capability to perform streaming analytics?

A) Spark Streaming                     B) MLlib

C)GraphX                                       D) RDDs

Ans: Spark Streaming

6) The Machine Learning API for Spark (spark.ml) is based on which of the following?

A) RDD                                 B) Dataset

C)DataFrame          D) All of the above

Ans: DataFrame

7) Spark's optimizer is based on the functional programming constructs of which language?

A) Python                         B) R

C) Java                                   D)Scala

Ans: Scala (a functional programming language)

8) Which of the following is a basic abstraction of Spark Streaming?

A)Shared variable                 B)RDD

C) DStream                                  D) All of the above

Ans: DStream

9) Which of the following cluster managers are supported by Spark?

A) MESOS                                B)YARN

C) Standalone Cluster manager   D) Pseudo Cluster manager

E) All of the above

Ans: All of the above

10) Which of the following is the reason Spark execution is faster than MapReduce?

A) It supports different programming languages like Scala, Python, R, and Java.

B)RDDs

C)DAG execution engine and in-memory computation (RAM based)

D) All of the above

Ans: DAG execution engine and in-memory computation (RAM based)


Deloitte Hadoop and Spark Interview Questions



Round 1:

1. Explain about your previous Project?

2. Write the Apache Sqoop code that you used in your previous project.

3. What is the reason for moving data from DBMS to Hadoop Environment?

4. What happens when you increase mappers in MapReduce?

5. What is the command to check the last value of Apache Sqoop job?

6. Can you explain Distributed Cache?

7. Explain the Hive optimization techniques used in your project.

8. Which Hive analytic functions did you use in the project?

9. How to update records in Hive table in a single command?

10. How to limit the records when you are consuming the data in Hive table?

11. How to change the Hive engine to Apache Spark engine?

12.Difference between Parquet and ORC file format?

13. How to handle huge data flow situation in your project?

14. Explain about Apache Kafka with architecture?

15. Which tool will create partitions in the Apache Kafka topic?

16. Which transformation and actions are used in your project?

17. Explain a brief idea about Spark Architecture?

18. How will you check whether data is present in the 6th partition of an RDD?

19. How do you debug regex issues in Spark code?

20. Give a brief idea about functional programming languages.

21. What is the difference between map and flatMap in Spark?

22. For example, in a Spark word count program, which one do you use while splitting, and what happens if you use map instead of flatMap in that program?

23. If you have knowledge of Hadoop clusters, can you explain capacity planning for a four-node cluster?


Round-2

1. Define YARN and MapReduce Architecture?

2. Explain ZooKeeper's functionality and describe the flow when a node goes down.

3. Explain Data modeling in your project?

4. Are reporting tools used in your project? If yes, explain them.

5. Give me a brief idea about Broadcast variables in Apache Spark?

6. Can you explain the Agile methodology and give its architecture?


Basic Terminology in Hadoop




Bigdata Solutions:

1. NoSQL databases (non-relational databases) – for structured and semi-structured data only

2. Hadoop – an implementation for structured, semi-structured, and unstructured data

3. The Hadoop ecosystem and its components for everything else.

Hadoop:

Hadoop is a parallel system for large-scale data storage and processing, and is one solution for Big Data.

For storage it uses HDFS (Hadoop Distributed File System).

For processing it uses MapReduce.

In Hadoop, the following terms are important to understand.

Hadoop Basic Terminology:

1. Cluster

2. Clustered Node

3. Hadoop Cluster Node

4. Hadoop Cluster

5. Hadoop Cluster Size

1.Cluster:

A cluster is a group of nodes that belong to one common network.

2.Clustered Node:

A clustered node is an individual machine that belongs to a cluster in Hadoop.

3.Hadoop Cluster Node:

A Hadoop cluster node is a node used for the basic storage and processing work of a Hadoop cluster.

For storage purpose, we are using the Hadoop Distributed File System.

For processing purpose, we are using MapReduce

4.Hadoop Cluster:

A Hadoop cluster is a collection of Hadoop cluster nodes on a common network.

5.Hadoop Cluster Size:

The Hadoop cluster size is the total number of nodes in a Hadoop cluster.


Hadoop Ecosystem:

1. Apache Pig       – Processing            – Pig scripting

2. Hive             – Processing            – HiveQL (a query language similar to SQL)

3. Sqoop            – Integration tool      – Import and export data

4. ZooKeeper        – Coordination          – Distributed coordinator

5. Apache Flume     – Streaming             – Collects log data for streaming

6. Oozie            – Scheduling            – Open-source job scheduler

7. HBase            – Random access         – Hadoop database

8. NoSQL            – Not Only SQL          – MongoDB, Cassandra

9. Apache Kafka     – Messaging             – Distributed messaging

10. YARN            – Resource manager      – Yet Another Resource Negotiator

Note: Apache Spark is not part of Hadoop, but nowadays it is commonly included alongside it. It is used for data processing and can be up to 100 times faster than Hadoop MapReduce.

Compatible Operating System for Hadoop Installation:

1. Linux

2. Mac OS

3. Sun Solaris

4. Windows

Hadoop Versions:

Hadoop 1.x

Hadoop 2.x

Hadoop 3.x

Different Distributions of Hadoop

1. Cloudera Distribution for Hadoop (CDH)

2. Hortonworks

3. MapR



Resilient Distributed Datasets(RDD) in Spark



RDD:

A Resilient Distributed Dataset represents a collection of partitioned data elements that can be operated on in parallel. RDD is the primary data abstraction in Spark, defined as an abstract class in the Spark library. It is similar to a Scala collection and it supports lazy evaluation.

Characteristics of RDD:

1. Immutable:

An RDD is an immutable data structure. Once created, it cannot be modified in place; an operation that modifies an RDD returns a new RDD.

2. Partitioned:

In an RDD, data is split into partitions, which are generally distributed across a cluster of nodes. When Spark is running on a single machine, all the partitions are on that machine.
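For example, here is a minimal sketch (assuming a SparkContext named sc, as provided by spark-shell) showing how the number of partitions can be set and inspected:

// Create an RDD from a local collection, split into 4 partitions
val numbers = sc.parallelize(1 to 100, 4)

// Inspect how many partitions the data was split into
println(numbers.getNumPartitions)   // 4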

 

RDD Operations :

Spark applications process data using the methods defined in the RDD class; these methods are referred to as operations.

RDD operations are two types:

1.Transformations

2.Actions

1. Transformations:

A transformation method of an RDD creates a new RDD by performing a computation on the source RDD.

RDD transformations are conceptually similar to Scala collection methods.

The key difference is that Scala collection methods operate on data that fits in the memory of a single machine, whereas RDD methods can operate on data distributed across a cluster of nodes. RDD transformations are lazy, while Scala collection methods are strict (eager).

A) Map:

The map method is a higher-order method that takes a function as input and applies it to each element in the source RDD to create a new RDD.
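A small sketch (again assuming a SparkContext named sc):

val words = sc.parallelize(Seq("spark", "scala", "hadoop"))
// map applies the function to every element and returns a new RDD
val lengths = words.map(word => word.length)   // 5, 5, 6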

B) filter:

The filter method is a higher-order method that takes a Boolean function as input and applies it to each element in the source RDD to create a new RDD. A Boolean function takes an input and returns true or false. filter returns a new RDD formed by selecting only those elements for which the input Boolean function returned true, so the new RDD contains a subset of the elements in the original RDD.
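For example, a small sketch with the same sc:

val numbers = sc.parallelize(1 to 10)
// Keep only the elements for which the Boolean function returns true
val evens = numbers.filter(n => n % 2 == 0)    // 2, 4, 6, 8, 10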

c) flatMap:

flatMap is a higher-order method that takes an input function which returns a sequence for each element passed to it. The flatMap method returns a new RDD formed by flattening this collection of sequences.
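For example, a small sketch with the same sc:

val lines = sc.parallelize(Seq("hello spark", "hello scala"))
// Each line produces a sequence of words; flatMap flattens them into one RDD
val words = lines.flatMap(line => line.split(" "))   // "hello", "spark", "hello", "scala"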

D) mapPartitions :

mapPartitions is a higher-order method that allows you to process data at the partition level. Instead of passing one element at a time to its input function, mapPartitions passes a whole partition in the form of an iterator. The input function to mapPartitions takes an iterator as input and returns an iterator as output.
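A small sketch that computes one partial sum per partition (same sc as above):

val numbers = sc.parallelize(1 to 10, 2)
// The input function receives an iterator over a whole partition and returns an iterator
val partialSums = numbers.mapPartitions(iter => Iterator(iter.sum))   // e.g. 15 and 40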

E)Intersection:

The intersection method takes an RDD as input and returns a new RDD that contains the intersection of the elements in the source RDD and the RDD passed to it as input.

F)Union:

This method takes an RDD as input and returns a new RDD that contains the union of the elements in the source RDD and the RDD passed to it as input.


G)Subtract:

The subtract method takes an RDD as input and returns a new RDD that contains the elements present in the source RDD but not in the input RDD.
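A combined sketch of the intersection, union, and subtract methods described above (same sc):

val a = sc.parallelize(Seq(1, 2, 3, 4, 5))
val b = sc.parallelize(Seq(4, 5, 6, 7))

val common = a.intersection(b)   // 4, 5
val all    = a.union(b)          // 1, 2, 3, 4, 5, 4, 5, 6, 7 (union keeps duplicates)
val onlyA  = a.subtract(b)       // 1, 2, 3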

 

H)Parallelize:

Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
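For example, a small sketch with the same sc:

// Turn a local Scala collection in the driver into a distributed RDD
val data = Seq("a", "b", "c", "d")
val rdd = sc.parallelize(data)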

 

I)Distinct:

The distinct method of an RDD returns a new RDD containing the distinct elements in the source RDD.
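For example, a small sketch with the same sc:

val withDuplicates = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
val unique = withDuplicates.distinct()   // 1, 2, 3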

 

J)Group By:

groupBy is a higher-order method that groups the elements of an RDD according to user-specified criteria. It takes as input a function that generates a key for each element in the source RDD, applies it to all elements, and returns an RDD of pairs.
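For example, a small sketch with the same sc:

val numbers = sc.parallelize(1 to 10)
// The key-generating function decides which group each element belongs to
val byParity = numbers.groupBy(n => if (n % 2 == 0) "even" else "odd")
// ("even", [2, 4, 6, 8, 10]) and ("odd", [1, 3, 5, 7, 9])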

 

K)Sort By:

The sortBy method is a higher-order method that returns an RDD with the elements of the source RDD sorted. It takes two input parameters: the first is a function that generates a key for each element in the source RDD, and the second specifies ascending or descending order for the sort.
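For example, a small sketch with the same sc:

val words = sc.parallelize(Seq("spark", "is", "fast"))
// First argument generates the sort key, second chooses ascending or descending order
val byLengthDesc = words.sortBy(word => word.length, ascending = false)   // "spark", "fast", "is"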

 

L)Coalesce:

The coalesce method reduces the number of partitions in an RDD. It takes an integer as input and returns a new RDD with the specified number of partitions.
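For example, a small sketch with the same sc:

val numbers = sc.parallelize(1 to 100, 8)   // 8 partitions
val fewer = numbers.coalesce(2)             // same data, now in 2 partitions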

 

M)GroupByKey:

The groupByKey method returns an RDD of pairs, where the first element in a pair is a key from the source RDD and the second element is a collection of all values that have the same key. It is similar to the groupBy method; the major difference is that groupBy takes an input function that returns a key for each element, whereas groupByKey operates on an RDD that already contains key-value pairs.
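For example, a small sketch with the same sc:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()   // ("a", [1, 3]) and ("b", [2])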


N)ReduceByKey:

The higher-order reduceByKey method takes an associative binary operator as input and reduces all values with the same key to a single value using that operator.
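For example, a small sketch with the same sc:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Values with the same key are reduced with the binary operator
val sums = pairs.reduceByKey(_ + _)   // ("a", 4) and ("b", 2)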

2.Actions:

Actions are RDD methods that return a value to a driver program.

A)Collect:

The collect method returns the elements in the source RDD as an array. This method should be used with caution since it moves data from all the workers to the driver program.

B)Count:

This method returns a count of the elements in the source RDD.

C)Count By Value :

The countByValue method returns a count of each unique element in the source RDD. It returns an instance of the Map class containing each unique element and its count as a key-value pair.

D)First:

The first method returns the first element in the source RDD

E) Max:

The max method returns the largest element in the RDD.

F) Min:

The min method returns the smallest element in the RDD.

G)Top

The top method takes an integer N as input and returns an array containing the N largest elements in the source RDD.

H)Reduce

The higher-order reduce method aggregates the elements of the source RDD using an associative and commutative binary operator provided to it.

I) CountByKey

The countByKey method counts the occurrences of each unique key in the source RDD and returns a Map of key-count pairs.
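A combined sketch of the action methods above (assuming a SparkContext named sc, as in spark-shell); each call returns a value to the driver program:

val numbers = sc.parallelize(Seq(3, 1, 2, 2, 5))

numbers.collect()        // Array(3, 1, 2, 2, 5) -- moves all data to the driver
numbers.count()          // 5
numbers.countByValue()   // Map(3 -> 1, 1 -> 1, 2 -> 2, 5 -> 1)
numbers.first()          // 3
numbers.max()            // 5
numbers.min()            // 1
numbers.top(2)           // Array(5, 3)
numbers.reduce(_ + _)    // 13

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.countByKey()       // Map(a -> 2, b -> 1)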


Latest interview questions on Hadoop and Spark




1. Which internal algorithm used for NameNode to decide where the replica of a block will be stored exactly?

2. What will happen if a block of data is corrupted?

3. In the SCALA Program how to find out the number of transformations and actions?

4. If we are executing a query, how can we know which joins are taking more time, especially in Hive and Spark queries?

5. Scenario I: In Hive, we have two tables A and B. B is the master table and A is the table that receives updates of certain information. I want to update table B using the latest updated columns based on the id.

Question: How do we achieve that and what is exact query we use?

6. If all Spark jobs fail, how do you troubleshoot without checking the log files and without the Web UI?

7. How to provide Security in Ambari without Kerberos?

8. Can you explain about High Availability Cluster in Hadoop Environment?

9. If you have a Spark job and a 25-node cluster, how many executors will be created by default?

10. How to change the column names in HIVE while importing the data into hive using Apache SQOOP?

11. How to handle the data type mismatch while importing the data from RDBMS to HIVE table?

12. How to handle when NULLS are present in the partition column? What is the internal mechanism for this simple scenario?



13. Scenario: Suppose we have a 4-node cluster with 128 GB of RAM per node, so 512 GB of memory in total, and we have to process 1000 GB of data.

Question) How does Spark process this data when it is larger than the available memory?

14. Did you use email reader in Oozie? How do you configure it?

15. In a Scala program, you have to make two RESTful API calls, say API 1 and API 2, and there is also an API 3. You have to call API 1 and API 2 concurrently, wait for both calls to finish, and then make the third call. How do you do this concurrently in Scala?

 

Toughest Big Data(Spark, Kafka,Hive) Interview Questions




1. How to handle Kafka back pressure with scripting parameters?

2.How to achieve performance tuning through executors?

3. What is the ideal size to decide on for executors, and how much RAM should be used?

4. How do you scale Kafka brokers and integrate them with Spark Streaming, without stopping the cluster, along with the script?

5. How do you delete records in Hive, and how do you delete duplicate records with scripting?

6. Can more than one replica exist in the same rack?

7. In a database with 10 tables, one table fails while importing from MySQL into HDFS using Sqoop. What is the solution?

8. If you submit a Spark job in a cluster, and the RDDs have already been created when, in the middle of the process, the cluster goes down, what happens to your RDDs and how is the data handled?

Hadoop and Spark Scenario Typed Questions



1.  Hadoop – Scenario :

If you are working on a Hadoop cluster and have already cached an RDD and stored its output in the cache, and you now want to clear that memory space and use it for caching another RDD, how do you achieve this?

2. Spark – Scenario :

I) Suppose you are running 100 SQL jobs which generally take 50 minutes to complete, but one of them took 5 hours to complete.

Q 1) In this case, how do you report this error?

Q 2) How do you debug the code and provide a proper solution for this scenario?

Rare interview questions on the Hadoop ecosystem:

1. What do you know about type safety, and which framework in the Hadoop ecosystem provides type safety?

2. What are the serialization formats in Hive? Why did you choose that serialization? Explain in detail.

3. What modules have you worked with in Scala? Name the modules and explain them briefly.

4. What packages have you worked with in Scala? Name the packages you have imported in your current project.

5. What is the difference between map and mapPartitions? Give a clear explanation with a real-time example in Scala.

6. How do you connect to your cluster using data nodes or edge nodes?

7. How do you allocate buffer memory to your datanode?

8. How much buffer space have you allocated to your map task and reduce task in your data node?

9. How do you achieve a broadcast join automatically without doing it manually? How do you set up your driver program to detect where a broadcast join would be good to use, and how do you automate the process?

 

Spark Streaming Twitter Example



Spark Streaming Twitter Example:

// Scala program
package org.apache.spark.demo.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwitterTags {

  def main(args: Array[String]): Unit = {

    if (args.length < 4) {
      System.err.println("Usage: TwitterTags <consumer key> <consumer secret> " +
        "<access token> <access token secret> [<filters>]")
      System.exit(1)
    }

    // Helper from Spark's streaming examples that sets a reasonable log level
    StreamingExamples.setStreamingLogLevels()

    val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = args.take(4)
    val filters = args.takeRight(args.length - 4)

    // Set the system properties so that the Twitter4j library used by the Twitter stream
    // can use them to generate OAuth (Open Authentication) credentials
    System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
    System.setProperty("twitter4j.oauth.accessToken", accessToken)
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

    val sparkConf = new SparkConf().setAppName("TwitterTags")
    val scc = new StreamingContext(sparkConf, Seconds(3))

    val stream = TwitterUtils.createStream(scc, None, filters)
    val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))

    // Hashtag counts over a 60-second and a 30-second sliding window, sorted by count
    val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
      .map { case (topic, count) => (count, topic) }
      .transform(_.sortByKey(false))

    val topCounts30 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(30))
      .map { case (topic, count) => (count, topic) }
      .transform(_.sortByKey(false))

    // Print popular hashtags
    topCounts60.foreachRDD(rdd => {
      val topList = rdd.take(30)
      println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
      topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
    })

    topCounts30.foreachRDD(rdd => {
      val topList = rdd.take(60)
      println("\nPopular topics in last 30 seconds (%s total):".format(rdd.count()))
      topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
    })

    scc.start()
    scc.awaitTermination()
  }
}

Spark Streaming Use Case



Spark Streaming Use Case with Explanation:

Scala streaming imports:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds

Spark Streaming Context:

A StreamingContext also sets up the underlying SparkContext that it will use to process the data. It takes as input a batch interval specifying how often to process new data.

socketTextStream()

We use socketTextStream() to create a DStream based on text data received on a port of the local machine.

Then we transform the DStream with filter() to get only the lines that contain "error", and use the output operation print() to print some of the filtered lines.

Create a StreamingContext with a 1-second batch size from a SparkConf:

val scc = new StreamingContext(conf, Seconds(1))

// Create a DStream using data received on port 9000 of the local machine
val lines = scc.socketTextStream("localhost", 9000)

// Filter our DStream for lines containing "error"
val errorLines = lines.filter(_.contains("error"))

// Print out the lines with errors
errorLines.print()



In the example of converting a stream of lines into words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. Input DStreams are DStreams representing the stream of input data received from the streaming source; in that lines-to-words example, lines is an input DStream because it represents the stream of data received from the server.
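A minimal sketch of that lines-to-words step, reusing the lines DStream created in the example above:

// flatMap is applied to each RDD in the lines DStream to produce the words DStream
val words = lines.flatMap(line => line.split(" "))
words.print()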

Every input DStream is associated with a Receiver object (whether in Java, Scala, etc.), which receives the data from a source and stores it in Spark's memory for processing. Spark Streaming provides two categories of sources:

  1. Basic sources: sources directly available in the StreamingContext API, for example file systems and socket connections.
  2. Advanced sources: sources such as Flume, Kafka, and Twitter, available indirectly through extra utility classes.