Unable to Integrate Hive with Spark and different resolutions




How to integrate (connect) Hive and Spark:

Here are to provide solutions for how to integrate (connect) Hive DB with Spark in Hadoop development.
The first time, we tried to connect the Hive and Spark then we got below error and find different types of resolutions with different modes.

caused by: org.datanucleus.exceptions. NucleusExcepiton: Attempt tp invoke 
the ONECP" plugin to create a ConnectionPool gave an error: The specified 
data driver ("co.mysql.jdbc.Driver) was not found in the CLASSPATH. Please 
change our CLASSPATH specification and the name of the driver.

Different types of solution for the above error:

Resolution 1:

1.Download MySQL connector java jar file from maven official website like below link
https://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.21
2. Paste the jar file into jars folder which is present in the Spark installed directory.

Resolution 2:

Without JDBC driver:

1. Goto hive-site.xml and give hive.metastore.uri in that hive xml file
2. Import the org.apache.spark.sql.hive.HiveContext, as it can perform SQL query over Hive tables then define the sqlContext param like below code:
Val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. Finally, verify Tables in Spark SQL

Resolution 3:





Go with the beeline for Hive and Spark connection in Hive CLI. In beeline, they provide high security and provide a remote server through directly and check with below two commands for beeline with Hive 2 server configurations.

Step 1: ./bin/beeline
Step 2:  !connect jdbc.hive2.//remote_hive:10000

Spark & Scala Interview Questions and Answers

1. What is Scala what are the importance of it and the difference between Scala and other Programming languages (Java/Python)?





Scala is the most powerful language for developing big data environment applications. Scala provides several benefits to achieve significant productivity. It helps to write robust code with fewer bugs.  Apache Spark is written in Scala, so Scala is a natural fit for the developing Spark applications.

2. What is RDD tell me in brief?

Spark RDD is a primary abstract class in Spark API. RDD is a collection of partitioned data elements that can be operated in parallel. Normally, RDD is supporting properties like Immutable, Cacheable, Type Infer, and Lazy evaluation.

Immutable: RDD’s are Immutable data structures. Once created, it cannot be modified

Partitioned: The Data in RDD’s are partitioned across the distributed cluster of nodes. However, multiple Cassandra partition can be mapped to one single RDD partition

Fault Tolerance: RDD is designed to be a fault – tolerant. Because the RDD data is stored across the large distributed cluster. So there is a chance for node failure in that cluster by this we can lose the Partitioned data in that node.

RDD automatically handles the node failure. Spark will maintain the metadata of each RDD and details about the RDD. So by using that information, we can get that data from other nodes.

Interface: RDD provides a uniform interface for processing data from a variety of data sources such as HDFS, HBase, Cassandra, MongoDB, and others. The same interface can also be used to process data stored in memory across a cluster of nodes.




InMemory: The RDD class provides the API for enabling in-memory cluster computing. Spark allows RDDs to be cached or persisted in memory

3. How to register a temporary table in Spark SQL?

When we creating the data frame by loading the data into it using SQL Context object. This is treated a temporary table. Because the scope of the data frame is to a particular session

4. How to count the number of lines in Scala?

In Scala programming language using getLines.size property we can count

Example: Val countlines = source.getLines.size

 

Spark Lazy Evaluation and Advantages with Example

Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data (Bigdata). Spark provides a simple programming model than that provided by Map Reduce. Developing a distributed data processing application with Apache Spark is a lot easier than developing the same application with Map Reduce. In Hadoop MapReduce provides only two operations for processing the data like “Map” & “Reduce”, whereas Spark comes with 80 plus data processing operations to work with big data application.




While data processing from source to destination. Spark is 100 times faster than Hadoop Map Reduce because it allows in-memory clustering computing, it implements an advanced execution engine.

What is meant by Apache Spark Lazy Evaluation?

In Apache Spark, two types of RDD operations are

I)Transformations

II) Actions.

We can define new RDDs any time, Apache Spark computes them only in a lazy evaluation. That is, the first time they are used in an action. The Lazy evaluation seems unusual at first but makes a lot of sense when you are working with large data(BigData).

Simply Lazy Evaluation in Spark means that the execution will not start until an action is triggered. In Apache Spark, the picture of lazy evaluation comes when Spark transformation occurs”.

Consider where we defined a text file and then filtered the lines that include “CISCO” client name if Apache Spark were to load and store all the lines in the file as soon as we wrote like lines = sc.text( file path ). Here Spark Context would waste a lot of o storage space, given that we then immediately filter out many lines. Instead, once Spark seems that whole chain transformation. It can compute the data needed for its result. Hence first() action, Apache Spark scans the file only until it finds the first matching line; it doesn’t even read the whole file.

Advantages Of Lazy Evaluation in Spark Transformations:

Some advantages of Lazy evaluation in Spark in below:

  • Increase Manageability: The Spark Lazy evaluation, users can divide into smaller operations. It reduces the number of passes on data by transformation grouping operation.
  • Increases Speed: By lazy evaluation in Spark to saves the trip between driver and cluster, speed up the process.
  • Reduces Complexities: There are two types of complexities of any operations are Time and Space complexity using Spark lazy evaluation we can overcome both complexities. The action is triggered only when the data is required.

Simple Example:

In Spark, Lazy evaluation below code writes in  Scala, who evaluates the expression as it’s declared.

With Lazy:

Scala> Val sparkList = List(1,2,3,4)

Scala> lazy val output = sparkList .map( 1 => 1*10)

Scala> println( output )

Output:

List( 10, 20, 30, 40 )



Spark Performance Tuning with pictures





Spark Execution with a simple program:

Basically, the Spark program consists of a single spark driver and process and a set of executors processes across nodes of the cluster.

Performance tuning of Spark is measured bottleneck using big data environment metrics for block time analysis. Spark is run on In-memory cache so need to avoid network and I/O are key role while performance.

For example: Take two clusters, one cluster with 25 machines and cluster size is 750 GB of data. The second cluster with 75 machines clusters with 4.5TB of raw data.  The network communication is always irrelevant for the performance of these workloads coming to network optimization is to reduce job completion by 5% for better performance. Finally serialized compressed data.

Mostly Apache Spark always supports transformations like groupByKey and reduceByKey dependencies. Spark executes a shuffle, which transfers data around the cluster. Below three operations  with different outputs:

sc.textFile(" /hdfs/path/sample.txt")
map(mapFunc) #Using Map function 
flatMap(flatMapFunc) #Flatmap is another transformer
filter(filterFunc)
count() # Count the data.

Above code executes a single performer, which depends on a sequence of transformations on an RDD derived from the sample.tx file.

If the code contains how many times each character appears in all the words that appear more 1000 times in a given text file. Below code in Scala:

Val  token=sc.textFile(args(0).flatMap(_.split(' ' ))
Val wc=token.map((_,1)).reduceByKey(_+_)
Val filtered = wc.filter(_._2 > =1000)
val charCount = filtered.flatMap(_._1.toCharArray).map((_, reduceByKey(_+_)
charCount.collect

Above code breaks into mainly three stages. The reduceByKey operations result in stage boundaries

 

 

 

Java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver in Spark Scala

While writing Apache Spark in Scala / Python (PySpark) programming language to read data from Oracle Data Base using Scala / Python in Linux operating system/ Amazon Web Services, sometimes will get below error in




spark.driver.extraClassPath in either executor class or driver class.

Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver

at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)

at Java .lang.ClassLoader.loadClass(ClassLoader.java.424)

at Java.lang.ClassLoader.loadClass(ClassLoader.java.357)

at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:35)

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anofun$createConnectionFactory$1.api

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anofun$createConnectionFactory$1.api

at scala.Option.foreach ( Option .scla:236)

at org . apache . spark . sql . execution . datasources . jdbc.JdbcUtils $ anofun $ createConnection Factory $ (JdbcUtils.scala)

at <init> ( < console >:46)

at . <init> (<console>:52)

at. <clinit> (<console>)

at. <init> (<console>:7)

at. <clinit> (<console>)

at $print (<console>)

Solution:

After getting this error will provide a simple solution in Spark Scala. Sometimes these are coming to Python (PySpark).

import related jars to both executor class and driver class. First, we need to edit the configuration file as spark defaults in spark-default.conf file.

Adding below two jar files path in spark-default.conf file.

spark.driver.extraClassPath /home/hadoop/ojdbc7.jar
spark.executor.extraClassPath /home/hadoop/ojdbc7.jar

Above two jar files path in configurations with exact version is matched to your Spark version otherwise will get the compatible issue.

Sometimes these two classpaths get an error then will add in your code either Scala or Pyspark programming –conf before Spark driver, executor jar files in bottom of the page example.

If will get the same issue again then will follow the below solution:
Step 1: Download Spark ODBC jar files from the official Maven website.

Step 2: Copy the download jar files into the below path in the share location in Spark.

/usr/lib/spark/jars

For Example –  PySpark programming code snippet for more information.

 

pyspark --driver-classpath /home/hadoop/odbc7.jar --jars #  jar file path
from pyspark import SparkContext, Spark conf # import Spark Context and configuration

from pyspark.sql import SparkContext #Sql context

sqlContext = sqlContext (sc)

dbconnection = sqlContext . read . format ("jdbc") . options (url: "Give your jdbc connection url path").load()
#Data Base connection to read data from Spark with the help of jdbc

 

Big Data Spark Multiple Choice Questions

Spark Multiple Choice Questions and Answers:

1)Point out the incorrect  statement in the context of Cassandra:

A) Cassandra is a centralized key -value store

B) Cassandra is originally designed at Facebook

C) Cassandra is designed to handle a large amount of data across many commodity servers, providing high availability with no single point if failure.

D) Cassandra uses a right based DHT*Distribution Hash Table) but without finger tables or routing

Ans : D

2. Which of the following are the simplest NoSQL databases in BigData environment?

A) Document                                    B) Key-Value Pair

C) Wide – Column                        D) All of the above mentioned 

Ans : ) All of the above mentioned

3) Which of the following is not a NoSQL database?

A) Cassandra                          B) MongoDB

C) SQL Server                           D) HBase

Ans: SQL Server

4) Which of the following is a distributed graph processing framework on top of Spark?

A) Spark Streaming                   B)MLlib

C)GraphX                                          D) All of the above

Ans: GraphX

5) Which of the following is leverage of Spark core fast scheduling capability to perform streaming analytics?

A) Spark Streaming                     B) MLlib

C)GraphX                                       D) RDDs

Ans: Spark Streaming

6) Which of the following Machine Learning API for Spark based on Which one:

A) RDD                                 B) Dataset

C)DataFrame          D) All of the above

Ans: DataFrame

7) Based on which functional programming language construct for Spark optimizer

A) Python                         B) R

C) Java                                   D)Scala

Ans: Scala is a functional programming language

8) Which of the following is a basic abstraction of Spark Streaming?

A)Shared variable                 B)RDD

C)Dstream                                  D)All of the above

Ans: Dstream

9) In a which cluster manager to do support of Spark?

A) MESOS                                B)YARN

C) Standalone Cluster manager   D) Pseudo Cluster manager

E) All of the above

Ans: All of the above

10) Which of the following is the reason for Spark being faster than MapReduce while execution time?

A) It supports different programming languages like Scala, Python, R, and Java.

B)RDDs

C)DAG execution engine and in-memory computation (RAM based)

D) All of the above

Ans: DAG execution engine and in-memory computation (RAM based)

Deloitte Hadoop and Spark Interview Questions

Round 1:

1. Explain about your previous Project?




2. Write the Apache Sqoop code that you are using in your previous project?

3. What is the reason for moving data from DBMS to Hadoop Environment?

4. What happens when you increase mappers in MapReduce?

5. What is the command to check the last value of Apache Sqoop job?

6. Can you explain Distributed Cache?

7. Explain about Hive optimization techniques in your project?

8. Which Hive analytic functions you used in the project?

9. How to update records in Hive table in a single command?

10. How to limit the records when you are consuming the data in Hive table?

11. How to change the Hive engine to Apache Spark engine?

12.Difference between Parquet and ORC file format?

13. How to handle huge data flow situation in your project?

14. Explain about Apache Kafka with architecture?

15. Which tool will create partitions in the Apache Kafka topic?

16. Which transformation and actions are used in your project?

17. Explain a brief idea about Spark Architecture?

18. How will check if data is there or not in the 6th partition in RDD?

19. How do you debug in Spark code in Regex?

20. Give me the idea about a functional programming language?

21.Difference between Map Vs Flat Map in Spark?

22. For example, Spark word count while splitting which one do you use? what happens if you use map instead of flatMap in that program?

23. If you have knowledge on Hadoop Cluster then will you explain about capacity planning for four node cluster?

Round-2

1. Define YARN and MapReduce Architecture?

2. Explain Zookeeper functionalities and give how the flow when the node is down?

3. Explain Data modeling in your project?

4. In your project, reporting tools are used? if you yes then explain it?




5. Give me a brief idea about Broadcast variables in Apache Spark?

6. Can you explain about Agile methodology and give me the architecture of Agile?

Basic Terminology in Hadoop

Bigdata Solutions:




1.NoSQL – database(Non relational database) – Only for structured and semi-structured

2. Hadoop – Implementation – structured,semi-structured and unstructured data

3.Hadoop eco-systems and its components for everything.

Hadoop:

Hadoop is a parallel system for large data storage and processing. It is a solution for Bigdata.

For Storage purpose HDFS -Hadoop Distributed File System

For Processing purpose MapReduce using simply.

In Hadoop, some keywords are very important for learning scope.

Hadoop Basic Terminology:

1.Cluster

2.Clustered Node

3.Hadoop Clustered Node

4.Hadoop cluster

5. Hadoop Cluster Size

1.Cluster:

A cluster is a group of all nodes belongs to one common network is called a cluster.

2.Clustered Node:

A Clustered Node is a grouping of all individual machines is called a clustered node in Hadoop

3.Hadoop Cluster Node:

A Hadoop Cluster Node is basic storage and processing purpose of a cluster is called as Hadoop Cluster Node.

For storage purpose, we are using the Hadoop Distributed File System.

For processing purpose, we are using MapReduce

4.Hadoop Cluster:

A Hadoop Cluster is a collection of “Hadoop Cluster Node” in a common network is called Hadoop Cluster

5.Hadoop Cluster Size:

A Hadoop cluster size is a total no.of node in a Hadoop cluster.

Hadoop Ecosystem:

1. Apache Pig              –  Processing           – Pig Scripting

2. Hive                             – Processing           – HiveQL (Query language like SQL)

3.SQOOP                       – Integration tool  – Import and Export data

4.Zookeeper               – Coordination      – Distribution coordinator

5.Apache Flume      – Streaming              – log data for streaming purpose

6.Oozie                        – Scheduling             – Open source scheduling jobs

7.HBase                     – Random Access   – Hadoop+dataBASE

8.NoSQL                  – NotOnlySql              – MongoDB, Cassandra

9.Apache Kafka    – Messaging               – Distributed messaging

10.YARN                  – Resource Manager – Yet Another Resource Negotiator

Note: Apache Spark is not a part of Hadoop but including nowadays. It is used for Data Processing purpose. Spark 100 times faster than Hadoop MapReduce.

Compatible Operating System for Hadoop Installation:

1. Linux

2.Mac OS

3.Sun Solaris

4.Windows.

Hadoop Versions:

Hadoop 1.x

Hadoop 2.x




Hadoop 3.x

Different Distributions of Hadoop

1. Cloudera Distribution for Hadoop (CDH)

2.Hortonworks

3.MapR

Resilient Distributed Datasets(RDD) in Spark

RDD:

Resilient Distributed Datasets represents a collection of partitioned data elements that can be operated on in a parallel manner. RDD is the primary data abstraction mechanism in Spark and defined as an abstract class in Spark library it is similar to SCALA collection and it supports LAZY evaluation.



Characteristics of RDD:

1.Immutable :

RDD is an immutable data structure. Once created, it cannot be modified in-place. Basically, an operation that modifies RDD returns a new RDD.

2.Partitioned:

In RDD Data is split into partitions. These partitions are generally distributed across a cluster of nodes. When Spark is running on a single machine all the partitions are on that machine.

 

RDD Operations :

Applications in Spark process data using the same methods in RDD class. It referred to as operations

RDD operations are two types:

1.Transformations

2.Actions

 1.Transformations :

A transformation method of an RDD creates a new RDD by performing a computation on the source RDD.

RDD transformations are conceptually similar to SCALA collection methods.

The key difference is that the SCALA collection methods operate on data that can fit in the memory of a single machine, whereas RDD methods can operate on data distributed across a cluster of node RDD transformations are LAZY but SCALA collection methods are strict.

A) Map:

The map method is a higher order method that takes a function as input and applies it to each element in the source RDD to create a new RDD.

B) filter:

The filter method is a high order method that takes a Boolean function as input and applies it to each element in the source RDD to create a new RDD. A Boolean function takes an input and returns false or true. It returns a new RDD formed by selecting only those elements for which the input Boolean function returned true. The new RDD contains a subset of the elements in the original RDD.

c) flatMap:

This method is a higher order method that takes an input function in Spark, it returns a sequence for each input element passed to it. The flatMap method returns a new RDD formed by flattening this collection of the sequence.

D) mapPartitions :

It is a higher order method allows you to process data at a partition level. Instead of passing one element at a time to its input function, mapPartitions passes a partition in the form an iterator. The input function to the mapPartitions method takes an iterator as input and returns iterator as output.

E)Intersection:

Intersection method itakesRDD as input and returns a new RDD that contains the intersection of the element in the source RDD and the RDD passed to it as an input.

F)Union:

This method takes  RDD as input and returns a new RDD that contains a Union of the element in the resource RDD and the RDD passed to it as an input.

G)Subtract:

Subtract method takes RDD as input and returns a new RDD that contains elements in the source RDD but not in the input RDD.

 

H)Parallelize:

The Prallelized collections are created by calling Spark Context’s parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed data set that can be operated on in parallel.

 

I)Distinct:

Distinct method of an RDD returns a new RDD containing the distinct elements in the source RDD




 

J)Group By:

Group By is a higher order method it groups the elements of  RDD according to user-specified criteria. It takes as input a function that generates a key for each element in the source RDD. It is applicable to all the elements in the source RDD and returns an RDD of pairs.

 

K)Sort By:

The sortBy method is a higher order it returns RDD with sorted elements from the source RDD. It takes two input parameters. The first input is a function that generates a key for each element in the source RDD. The second input allows specifying ascending or descending order for sort.

 

L)Coalesce:

Coalesce method reduces the number of partitions in  RDD. It takes an integer input and returns new RDD with the specified number of partitions.

 

M)GroupByKey:

The GroupByKey method returns an RDD of pairs, where the first element in a pair is a key from the source RDD and the second element is a collection of all values that have the same key. It is the same as the groupBy method. The major difference is that groupBy is a higher order method that takes an input function that returns a key for each element in the source RDD. The groupByKey method operates in an RDD of key-value pairs.

N)ReduceByKey:

The higher-order reduceBy key method takes an associative binary operator as input and reduces values with the same key to a single value using specified binary operators.

2.Actions:

Actions are RDD methods that return a value to a driver program.

A)Collect:

The collect method returns the elements in the source RDD as an array. This method should be used with caution since it moves data from all the worker to the driver program.

B)Count:

This method returns a count of the elements in the source RDD.

C)Count By Value :

The countByValue method returns a count of each unique element in the source RDD. It returns an instance of the Map class containing each unique element and its count as a key-value pair.

D)First:

The first method returns the first element in the source RDD

E)Max:

The max method returns the largest element in  RDD

F)Min

The min method returns the smallest element in RDD

G)Top

The top method takes an integer N as input and returns an array containing the N largest elements in the source RDD.

H)Reduce

The high order reduces method aggregates the elements of the source RDD using an associative and commutative binary operator provided to it.

G)CountByKey

The countByKey methods count the occurrences of each unique key in the source RDD. It returns a Map of key count pairs.

Latest interview questions on Hadoop and Spark





1. Which internal algorithm used for NameNode to decide where the replica of a block will be stored exactly?

2. What will happen if a block of data is corrupted?

3. In the SCALA Program how to find out the number of transformations and actions?

4. If we are executing a query, how we can know that which are the joins taking more time especially in Hive and Spark query?

5. Scenario – I: In Hive, we have two tables A and B. B is the master table and A is the table which receives the updates of certain information. So I want to update the table B using the latest updated columns based upon the id

Question: How do we achieve that and what is exact query we use?

6.If Spark jobs are all failed without checking log files without WebUI how to handle it?

7. How to provide Security in Ambari without Kerberos?

8. Can you explain about High Availability Cluster in Hadoop Environment?

9. If you have a Spark job and there are 25 node cluster. How many executors are will be created by default?

10. How to change the column names in HIVE while importing the data into hive using Apache SQOOP?

11. How to handle the data type mismatch while importing the data from RDBMS to HIVE table?

12. How to handle when NULLS are present in the partition column? What is the internal mechanism for this simple scenario?
For suppose we have 4 node cluster having 128 GB ram per node, then we have 532 GB memory, now we have to process 1000 GB of data.




Question ) How spark process this data is more than available memory?

14. Did you use email reader in Oozie? How do you configure it?

15. In a Scala programming, you have to make two restful API calls, let’s say we have API 1 and API 2 and we have API 3. Then you have concurrently call API 1and API 2 and have to wait to finish both the call and make the 3rd call. How do you thin  SCALA concurrently?