Permission Denied error in Hive while creating a Database in the Hadoop ecosystem

I installed the Hive service on top of the Hadoop ecosystem and then tried to create a database, but I got the error below. The resolution I found follows.



Permission Denied Error in Hive:

hive> set hive.auto.convert.join.noconditionaltask = false;
hive> create database myhive;
FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=hadoop, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:224)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:149)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4891)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:669)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

The above error is caused by a permission issue on the HDFS directories that Hive uses:

Resolution:

To resolve the permission issue for the hadoop user in HDFS, follow the steps below: create the warehouse directory and change its ownership and permissions using the chown and chmod commands:

Step 1: Log in as hduser, then execute the below commands one by one.
Step 2: sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
Step 3: sudo -u hdfs hadoop fs -chmod g+w /tmp
Step 4: sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
Step 5: sudo -u hdfs hadoop fs -chown -R hadoop /user/hive/warehouse
Step 6: sudo chmod 777 /var/lib/hive/metastore
Step 7: cd /var/lib/hive/metastore/metastore_db/
Step 8: sudo rm *.lck
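With the permissions corrected, the original statement should now succeed. A minimal check from the Hive shell (myhive is the database name from the failing session above):

hive> create database myhive;
hive> show databases;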

Summary: after applying the above resolution, the CREATE DATABASE statement works fine and the error no longer appears in Hive.

Unable to Integrate Hive with Spark and different resolutions




How to integrate (connect) Hive and Spark:

Here are solutions for how to integrate (connect) the Hive database with Spark in Hadoop development.
The first time we tried to connect Hive and Spark, we got the error below; the resolutions that follow cover different modes.

Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke
the "BONECP" plugin to create a ConnectionPool gave an error: The specified
datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please
check your CLASSPATH specification, and the name of the driver.

Different solutions for the above error:

Resolution 1:

1. Download the MySQL Connector/J jar file from the official Maven repository, for example:
https://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.21
2. Paste the jar file into the jars folder inside the Spark installation directory.

Resolution 2:

Without JDBC driver:

1. Go to hive-site.xml and set the hive.metastore.uris property in that file.
2. Import org.apache.spark.sql.hive.HiveContext, as it can run SQL queries over Hive tables, then define the sqlContext value like the code below:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. Finally, verify the tables in Spark SQL, as shown below.
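For example, a hedged HiveQL check to run through the HiveContext (each statement is passed as a string to sqlContext.sql(...); the table name my_table is hypothetical):

SHOW TABLES;
SELECT * FROM my_table LIMIT 10;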

Resolution 3:





Go with Beeline for the Hive and Spark connection instead of the Hive CLI. Beeline provides higher security and connects directly to a remote server; check the below two commands for Beeline with HiveServer2 configurations.

Step 1: ./bin/beeline
Step 2: !connect jdbc:hive2://remote_hive:10000
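Once connected, a quick hedged sanity check from the Beeline prompt:

show databases;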

Most Typical Hive Interview Questions and Answers




Hive Interview Questions and Answers

1. Does Hive support record-level insert, delete, or update?

Hive does not support record-level insert, delete, or update, and older versions do not provide transactions either. Users can work around this with CASE statements and built-in functions of Hive, typically via INSERT OVERWRITE, to achieve the effect of insert, update, and delete, as sketched below.
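A minimal hedged sketch of the CASE-based workaround (the table users and its columns id and name are hypothetical):

-- Simulates: UPDATE users SET name = 'new_name' WHERE id = 42
INSERT OVERWRITE TABLE users
SELECT id,
       CASE WHEN id = 42 THEN 'new_name' ELSE name END AS name
FROM users;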

2. What kind of data warehouse applications is suitable for Hive?

Basically, Hive is not a full database; it is a data summarization tool in the Hadoop ecosystem. Hive suits applications where:

I) Fast response times are not required
II) The data is not changing rapidly
III) Relatively static data is analyzed

3. How can the columns of a table in Hive be written to a File?

Using the awk command with the Hive shell, the output from HiveQL can be written to a file.

Example: hive -S -e "describe table_name" | awk -F" " '{print $1}' > ~/output

4. Difference between ORDER BY and SORT BY in Hive?

In Hive, SORT BY sorts the data within each reducer, and it can use any number of reducers for the operation.
ORDER BY, in contrast, sorts all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer and guarantees total order in the output, while SORT BY only guarantees ordering of the rows within each reducer. The sketch below illustrates the difference.
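A hedged illustration of the difference (the table emp and column salary are hypothetical):

-- SORT BY: each reducer sorts its own rows; ranges may overlap across reducers
SELECT * FROM emp SORT BY salary DESC;

-- ORDER BY: all rows pass through a single reducer; output is totally ordered
SELECT * FROM emp ORDER BY salary DESC;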




5. Whenever I run a Hive query from a different directory, it creates a new metastore_db. Please explain the reason for it?

Whenever you run Hive in embedded mode, it creates a local metastore in the current working directory. Before creating the metastore, it checks whether a metastore already exists there. This behavior is defined by the following property in the hive-site.xml configuration file:

"javax.jdo.option.ConnectionURL" with default value
"jdbc:derby:;databaseName=metastore_db;create=true"

6. Is it possible to use the same metastore by multiple users, in case of embedded Hive?

No, it is impossible to share the embedded metastore between multiple users; it supports only a single user at a time. To support multiple users, configure a standalone database such as MySQL or PostgreSQL for the metastore.

What are the different Hadoop Components and Definitions

What are the Different Hadoop Components in the Hadoop Ecosystem?





HDFS – Filesystem of Hadoop (Hadoop Distributed File System)

MapReduce – Processing of large datasets

HBase – Database (Hadoop + dataBase)

Apache Oozie – Workflow scheduler

Apache Mahout – Machine learning and data mining

Apache Hue – Hadoop user interface: browser for HDFS and HBase, query editors for Hive, etc.

Flume – To integrate other data sources

Sqoop – Export/import data from RDBMS to HDFS and HDFS to RDBMS

What is HDFS?

HDFS (Hadoop Distributed File System) is a filesystem that can store very large data sets by scaling out across a cluster of hosts.

What is MapReduce?

MapReduce is a programming model implemented for processing and generating large data sets. The user specifies a map function that processes a (key, value) pair to generate a set of intermediate (key, value) pairs, and a reduce function that merges the intermediate values associated with the same key, as illustrated below.
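As a loose illustration: when Hive runs on the MapReduce engine, the hedged HiveQL aggregation below (the table words and column word are hypothetical) compiles to exactly this pattern, with a map phase emitting intermediate (word, 1) pairs and a reduce phase summing them per key:

SELECT word, count(*) AS cnt
FROM words
GROUP BY word;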

What is Hive?




A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

What is Pig?

Pig is a platform for analyzing large data sets that consists of a high-level (scripting) language for expressing data analysis programs.

What is Flume?

Flume sits on top of Hadoop applications; we use it to get data from a source (for example, streaming log data) into HDFS.

What is Sqoop?

Apache Sqoop is a tool designed for transferring bulk data between Hadoop and structured data stores; that is, it exports/imports data from an RDBMS to HDFS and vice versa.

What is HBase?

HBase ( Hadoop + dataBase) is a column-oriented store database layered on top of HDFS.

What is NoSQL database?

NoSQL means "Not Only SQL": a class of databases that do not rely solely on the traditional relational Database Management System model.

Adding Hive Service in MapR





After a successful installation of the MapR distribution, we need to add services like Hive, Sqoop, Spark, Impala, etc. Here we add the Hive service with simple commands in MapR for a Hadoop environment.

Add Hive Service in MapR :

We should follow the below commands for the Hive services:

Step 1: Install the MapR Hive packages with yum.

[root@master1 ~]# yum install mapr-hive mapr-hiveserver2 mapr-hivemetastore mapr-hivewebhcat

In this step, yum loads plugins (fastestmirror, refresh-packagekit, security) and sets up the install process.

It installs the below packages of the MapR Hive services:
mapr-hive (noarch)
mapr-hivemetastore
mapr-hiveserver2
mapr-hivewebhcat

Step 2: Install the MySQL server as an external database to support multiple users.

[root@master1 ~]# yum install mysql-server

Download below rpm files for MySQL servers:

mysql-5.1.73-8.el6_8.x86_64.rpm
mysql-server-5.1.73-8.el6_8.x86_64.rpm
perl-DBD-MySQL-4.013-3.el6.x86_64.rpm
perl-DBI-1.609-4.el6.x86_64.rpm

Step 3: Check the MySQL status

[root@master1 ~]# service mysqld status

Step 4: Start the MySQL service by using the below command:

[root@master1 ~]# service mysqld start

After starting the MySQL service, set the password for the MySQL root user and log in:

#mysql -u root -p

Step 5: Grant all privileges (the placeholders your_user and your_password are whatever credentials you choose for the metastore; they must match hive-site.xml in Step 8).

mysql> grant all privileges on *.* to 'your_user'@'localhost' identified by 'your_password';

Step 6: Flush all privileges.

mysql>flush privileges;

Step 7: Exit from the MySQL CLI

mysql>exit

Step 8: Set up the hive-site.xml file with the full configuration:

[root@master1 ~]# vi /opt/mapr/hive/hive-2.1/conf/hive-site.xml
<configuration>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>your_user</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your_password</value>
<description>password to use against metastore database</description>
</property>

<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9089</value>
</property>

</configuration>

Step 9: Export the metastore port number.

[root@master1 ~]# export METASTORE_PORT=9089

Step 10: Initialize the MySQL DB schema:

[root@master1 ~]# /opt/mapr/hive/hive-2.1/bin/schematool -dbType mysql -initSchema

Step 11: Log in to the MySQL CLI with your credentials

[root@master1 ~]# mysql -u your_user -p
Enter password:

Step 12: Check the databases

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| test               |
+--------------------+

Step 13: Exit from the MySQL CLI

mysql> exit
Bye

Step 14: Install the MySQL connector Java file for the connection

[root@master1 ~]# yum -y install mysql-connector-java

Step 15: Start the metastore service:

[root@master1 ~]# /opt/mapr/hive/hive-2.1/bin/hive --service metastore --start

Step 16: Start Hive services:

[root@master1 ~]# hive
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
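Once the Hive prompt appears, a quick hedged smoke test against the new MySQL-backed metastore (the database name mapr_test is just an illustration):

hive> show databases;
hive> create database mapr_test;
hive> drop database mapr_test;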

Hive: SortBy Vs OrderBy Vs DistributeBy Vs ClusterBy

SortBy:

Hive uses the columns in SORT BY to sort the rows before handing them to a reducer in the Hive environment. The sort order depends on the column type: if the column is numeric, the sort order is numeric; if the column is of string type, the sort order is lexicographical. SORT BY orders data at each of the N reducers, but each reducer can have overlapping ranges of data.




Output: N or more sorted files with overlapping ranges.

Example Query for SortBy

SELECT key, value FROM source SORT BY key ASC, value DESC

Order By:

This is similar to ORDER BY in SQL. In Hive, ORDER BY guarantees total ordering of the data, but to do so everything has to pass through a single reducer, which is normally intolerable for large inputs. Therefore, in strict mode (hive.mapred.mode=strict), Hive makes it compulsory to use LIMIT with ORDER BY so that the reducer doesn't get exhausted (see the sketch after the example query).

Ordering: total order over all data.

Output: single output, i.e. fully ordered.

Example Query for OrderBy

SELECT key, value FROM source ORDER BY key ASC, value DESC
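A hedged strict-mode sketch: with hive.mapred.mode=strict, the same query must carry a LIMIT so the single reducer is not overwhelmed:

SELECT key, value FROM source ORDER BY key ASC, value DESC LIMIT 100;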

Distribute By:

Apache Hive uses the columns in DISTRIBUTE BY to distribute the rows among the reducers. All rows with the same DISTRIBUTE BY column values go to the same reducer.
DISTRIBUTE BY ensures each of the N reducers gets non-overlapping ranges of the column, but it does not sort the output of each reducer. (A query sketch follows the example below.)

Example:

Distributing by x the following 5 rows to 2 reducers:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3
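As a hedged query sketch of the same distribution (reusing the source table from the earlier examples, with a hypothetical column x):

SELECT x FROM source DISTRIBUTE BY x;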

Cluster By:

Cluster By is a combination of both DISTRIBUTE BY and SORT BY. CLUSTER BY x ensures each of the N reducers gets non-overlapping ranges of x, then sorts by those ranges at each reducer.

Ordering: Global ordering between multiple reducers.

Output: N or more sorted files with non-overlapping ranges.
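In query form, CLUSTER BY x behaves like DISTRIBUTE BY x combined with SORT BY x; a hedged sketch on the same source table:

SELECT x FROM source CLUSTER BY x;
-- equivalent to:
SELECT x FROM source DISTRIBUTE BY x SORT BY x;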

Example:

Referring to the same example as above, if we use CLUSTER BY x, the two reducers will further sort the rows on x:




Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4

Basic Terminology in Hadoop

Bigdata Solutions:




1. NoSQL – database (non-relational database) – only for structured and semi-structured data

2. Hadoop – implementation – structured, semi-structured, and unstructured data

3. Hadoop ecosystem and its components – for everything

Hadoop:

Hadoop is a parallel system for large-scale data storage and processing. It is a solution for Bigdata.

For storage purposes it uses HDFS (Hadoop Distributed File System).

For processing purposes it simply uses MapReduce.

In Hadoop, some keywords are very important for learning scope.

Hadoop Basic Terminology:

1. Cluster

2. Clustered Node

3. Hadoop Cluster Node

4. Hadoop Cluster

5. Hadoop Cluster Size

1. Cluster:

A cluster is a group of nodes that all belong to one common network.

2. Clustered Node:

A clustered node is one of the individual machines that are grouped together to form a cluster in Hadoop.

3. Hadoop Cluster Node:

A Hadoop cluster node is a node used for the basic storage and processing work of a Hadoop cluster.

For storage purpose, we are using the Hadoop Distributed File System.

For processing purpose, we are using MapReduce

4.Hadoop Cluster:

A Hadoop cluster is a collection of Hadoop cluster nodes on a common network.

5.Hadoop Cluster Size:

The Hadoop cluster size is the total number of nodes in a Hadoop cluster.

Hadoop Ecosystem:

1. Apache Pig – Processing – Pig scripting

2. Hive – Processing – HiveQL (query language like SQL)

3. Sqoop – Integration tool – import and export data

4. ZooKeeper – Coordination – distributed coordinator

5. Apache Flume – Streaming – log data streaming

6. Oozie – Scheduling – open-source job scheduler

7. HBase – Random access – Hadoop + dataBase

8. NoSQL – Not Only SQL – MongoDB, Cassandra

9. Apache Kafka – Messaging – distributed messaging

10. YARN – Resource manager – Yet Another Resource Negotiator

Note: Apache Spark is not part of Hadoop, but it is commonly included nowadays. It is used for data processing and can be up to 100 times faster than Hadoop MapReduce for in-memory workloads.

Compatible Operating Systems for Hadoop Installation:

1. Linux

2. Mac OS

3. Sun Solaris

4. Windows

Hadoop Versions:

Hadoop 1.x

Hadoop 2.x

Hadoop 3.x

Different Distributions of Hadoop

1. Cloudera Distribution for Hadoop (CDH)

2. Hortonworks

3. MapR

Connection refused error while running Hive in Hadoop

When installing Hive on a single-node cluster in the Hadoop ecosystem, you may sometimes see an error like the one below:



Connection refused error in Hive

Exception in thread "main" java.lang.RuntimeException: Call From your-domain/127.0.1.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused;

For more details see:

http://wiki.apache.org/hadoop/ConnectionRefused

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke(Native Method)
...more

Caused by: java.net.ConnectException: Call From slthupili/127.0.1.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused;

at sun.reflect.NativeConstructorAccessorImpl.newInstance0
(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance
(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance
(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
...more

Caused by: java.net.ConnectException: Connection refused

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)

...more

Solution:

First, stop all the Hadoop services using the below command:

$ stop-all.sh

This command stops all the services, like the NameNode, DataNode, Secondary NameNode, YARN, etc.

Second, back up the data, then format the NameNode with the below command (this wipes the HDFS metadata, so back up first):

$ hadoop namenode -format

After the format, restart the services and then enter the hive command:

$ hive

 

Latest interview questions on Hadoop and Spark





1. Which internal algorithm is used by the NameNode to decide where exactly the replica of a block will be stored?

2. What will happen if a block of data is corrupted?

3. In a Scala program, how do you find out the number of transformations and actions?

4. If we are executing a query, how can we know which joins are taking more time, especially in Hive and Spark queries?

5. Scenario I: In Hive, we have two tables A and B. B is the master table and A is the table that receives updates of certain information. We want to update table B using the latest updated columns based upon the id.

Question: How do we achieve that, and what is the exact query we use?

6. If all the Spark jobs failed, how do you handle it without checking the log files and without the Web UI?

7. How to provide Security in Ambari without Kerberos?

8. Can you explain about High Availability Cluster in Hadoop Environment?

9. If you have a Spark job and there is a 25-node cluster, how many executors will be created by default?

10. How do you change the column names in Hive while importing the data into Hive using Apache Sqoop?

11. How do you handle a data type mismatch while importing the data from an RDBMS to a Hive table?

12. How do you handle NULLs present in the partition column? What is the internal mechanism for this simple scenario?

13. Scenario: suppose we have a 4-node cluster with 128 GB of RAM per node, i.e. 512 GB of memory in total, and we have to process 1000 GB of data.

Question: How does Spark process data that is larger than the available memory?

14. Did you use the email action in Oozie? How do you configure it?

15. In Scala programming, you have to make two RESTful API calls; let's say we have API 1, API 2, and API 3. You have to call API 1 and API 2 concurrently, wait for both calls to finish, and then make the third call. How do you do this concurrently in Scala?

Toughest Big Data(Spark, Kafka,Hive) Interview Questions

Hard Interview Questions for Spark, Kafka, and Hive:





1. How do you handle Kafka back pressure with scripting parameters?

2. How to achieve performance tuning through executors?

3. What is the ideal size for deciding the executors, and how much RAM should be used?

4. How do you scale Kafka brokers and integrate them with Spark Streaming, without stopping the cluster, along with the script?

5. How do you delete records in Hive, and how do you delete duplicate records with scripting?

6. Can we have more than one replica exist in the same rack?

7. In a database of 10 tables, one table failed while importing from MySQL into HDFS using Sqoop. What is the solution?

8. If you submit a Spark job in a cluster and an RDD has already been created, and in the middle of the process the cluster goes down, what will happen to your RDD and how will the data be handled?

Summary: these types of scenario-based interview questions are commonly asked nowadays in Big Data environments for Spark and Hive.