Hadoop and Spark Interview Questions




Cognizant conducted this Hadoop and Spark interview for experienced candidates.

Round 1:

1. What is the Future class in the Scala programming language?

2. What is the difference between fold, foldLeft, and foldRight in Scala?

3. How does DISTRIBUTE BY work in Hive? Given some data, explain how it will be distributed.

4. dF.filter(Id == 3000): how do you pass this filter condition to a DataFrame dynamically?

5. Have you worked on multithreading in Scala? Explain.

7. On what basis will you increase the number of mappers in Apache Sqoop?

8. What last value will you mention when importing for the first time in Sqoop?

9. How do you specify the date for an incremental lastmodified import in Sqoop?

10. Say you created a partition for Bengaluru but loaded Hyderabad data into it. What validation do we have to do in this case to make sure there won't be any errors?

11. How many reducers will be launched by DISTRIBUTE BY in Spark?

12. How do you delete a Sqoop job with a simple command?

13. In which location is the Sqoop job's last value stored?

14. What are the default input and output formats in Hive?

15. Can you briefly explain the distributed cache in Spark with an example?

16. Did you use Kafka/Flume in your project? Explain in detail.


17. What is the difference between the Parquet and ORC file formats?

Round 2:

1. Explain your previous project.

2. How do you handle incremental data in Apache Sqoop?

3. Which optimization techniques are used in Sqoop?

4. What are the different parameters you pass to your Spark job?

5. If one task is taking more time than the others, how will you handle it?

6. What are stages and tasks in Spark? Give a real-time scenario.

7. On what basis do you set the number of mappers in Sqoop?

8. How will you export data to Oracle without putting much load on the table?

9. What is a column family in HBase?

10. Can you create a table without mentioning a column family?

11. What is the limit on the number of column families for one table?

12. How did you schedule Spark jobs in your previous project?

13. Explain the Spark architecture with a real-time scenario.



MapR Installation steps on AWS




MapR installation on an Amazon Web Services machine, in simple steps, for a Hadoop environment.

Step 1: Log in with your AWS credentials and then switch to the root user on the machine.

[ec2-user@ip----~]$ sudo su -

Step 2: Stop the iptables service

[root@ip---- ~]# service iptables stop

Step 3: Disable iptables at boot with chkconfig

[root@ip----- ~]# chkconfig iptables off

Step 4: Edit the SELinux configuration

[root@ip----~]# vim /etc/selinux/config

Step 5: In the config file, replace enforcing with disabled (save and exit):

SELINUX=disabled

Step 6: Go to the yum repos directory using the command below

[root@ip----~]# cd /etc/yum.repos.d/

Step 7: Edit the MapR ecosystem repo file.

[root@ip----yum.repos.d]# vi mapr_ecosystem.repo

Put the following lines into the above file

[MapR_Ecosystem]
name = MapR Ecosystem Components
baseurl = http://package.mapr.com/releases/MEP/MEP-3.0.4/redhat
gpgcheck = 0
enabled = 1
protected = 1

Step 8: Edit the MapR installer repo file.

[root@ip----yum.repos.d]# vi mapr_installer.repo

Step 9: Edit the MapR core repo file.

[root@ip----yum.repos.d]# vi mapr_core.repo

Put the following lines into the above file

[MapR_Core]
name = MapR Core Components
baseurl = http://archive.mapr.com/releases/v5.0.0/redhat/
gpgcheck = 1
enabled = 1
protected = 1

Step 10: Refresh the yum repo list

[root@ip----- yum.repos.d]# yum repolist

(here you will see all the enabled repositories)



Step 11: Search for MapR packages.

[root@ip------ yum.repos.d]# yum list all | grep mapr

(this displays all packages related to mapr)

Step 12: Import the MapR GPG key for the RPM packages

[root@ip----- yum.repos.d]# rpm --import http://package.mapr.com/releases/pub/maprgpg.key

Step 13: Install the MapR CLDB, FileServer, WebServer, ResourceManager, and NodeManager packages

[root@ip------ yum.repos.d]# yum install mapr-cldb mapr-fileserver mapr-webserver mapr-resourcemanager mapr-nodemanager

Step 14: Install mapr Zookeeper

[root@ip------ yum.repos.d]# yum install mapr-zookeeper

Step 15: List the installed MapR roles

[root@ip----- yum.repos.d]# ls -l /opt/mapr/roles/

Step 16: Search for installed MapR RPMs using the grep command.

[root@ip------ yum.repos.d]# rpm -qa | grep mapr

(displays installed packages related to mapr)

Step 17: Add a group for the mapr user

[root@ip------ yum.repos.d]# groupadd -g 5000 mapr

Step 18: Add a mapr user in that group

[root@ip------ yum.repos.d]# useradd -g 5000 -u 5000 mapr

Step 19: Set the password for the mapr user

[root@ip------ yum.repos.d]#passwd mapr

(here you give the password for the mapr user; it can be any password you choose)

Step 20: Verify the mapr user and group IDs

[root@ip------ yum.repos.d]# id mapr

Step 21: Check the Fully Qualified Domain Name using the command below

[root@ip------ yum.repos.d]# hostname -f

Step 22: check disk availability

[root@ip------ yum.repos.d]# fdisk -l

(here you can see the available disks on the machine; select the second disk for MapR)

Step 23: Create a disk list file with the second disk for the MapR filesystem.

[root@ip----- yum.repos.d]# vi /root/maprdisk.txt

(put the second disk's device path here, then save and exit)
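
For illustration only: assuming the second disk appeared as /dev/xvdb in the fdisk -l output (the device name will vary per machine), the file would contain just that one device path:

/dev/xvdb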

Step 24: Run configure.sh with the cluster name, CLDB node, and ZooKeeper node.

[root@ip----- yum.repos.d]# /opt/mapr/server/configure.sh -N training -C ip--------.ap-southeast-1.compute.internal -Z ip------.ap-southeast-1.compute.internal:5181

Step 25: Verify the contents of the disk list file

[root@ip------ yum.repos.d]# cat /root/maprdisk.txt

Step 26: Download the EPEL RPM file

[root@ip------ ~]# wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

Step 27: Install the Extra Packages for Enterprise Linux (EPEL) RPM

[root@ip------ ~]# rpm -Uvh epel-release-6*.rpm

Step 28: Start Zookeeper services

[root@ip------ ~]# service mapr-zookeeper start

Step 29: Start the Warden service

[root@ip-1----- ~]# service mapr-warden start
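
Optionally, before moving on, you can verify that the services came up. A quick check, assuming the standard MapR init scripts, is:

[root@ip----- ~]# service mapr-zookeeper qstatus
[root@ip----- ~]# service mapr-warden status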

Step 30: Check the CLDB master node with the maprcli command

[root@ip----- ~]# maprcli node cldbmaster

Finally, open the MapR Control System (MCS) in a web browser using your machine's IP address on port 8443, as shown below.
example: https://192.168.0.0:8443


Adding Hive Service in MapR




After a successful installation of the MapR distribution, we need to add services like Hive, Sqoop, Spark, Impala, etc. Here we add the Hive service with simple commands in MapR for the Hadoop environment.

Add Hive Service in MapR:

Follow the commands below to add the Hive services:

Step 1: Install the MapR Hive packages with yum.

[root@master1 ~]# yum install mapr-hive mapr-hiveserver2 mapr-hivemetastore mapr-hivewebhcat

Yum loads its plugins (fastestmirror, refresh-packagekit, security) and sets up the install process in this step.

The following MapR Hive service packages are installed:
mapr-hive
mapr-hivemetastore
mapr-hiveserver2
mapr-hivewebhcat

Step 2: Install the MySQL server as an external metastore database that supports multiple users.

[root@master1 ~]# yum install mysql-server

The following RPM files are downloaded and installed for the MySQL server:

mysql-5.1.73-8.el6_8.x86_64.rpm
mysql-server-5.1.73-8.el6_8.x86_64.rpm
perl-DBD-MySQL-4.013-3.el6.x86_64.rpm
perl-DBI-1.609-4.el6.x86_64.rpm

Step 3: Check the MySQL service status

[root@master1 ~]# service mysqld status

Step 4: Start the MySQL service using the command below:

[root@master1 ~]# service mysqld start

After starting the MySQL service, log in as root to set up the user and password:

#mysql -u root -p

Step 5: Grant all privileges.

mysql> grant all privileges on *.* to 'your_name'@'localhost' identified by 'your_password';

Step 6: Flush all privileges.

mysql>flush privileges;

Step 7: Exit from MySQL cli

mysql>exit

Step 8: Edit the hive-site.xml file with the full configuration

[root@master1 ~] # vi /opt/mapr/hive/hive-2.1/conf/hive-site.xml
<configuration>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>siva</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your_password</value>
<description>password to use against metastore database</description>
</property>

<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9089</value>
</property>

</configuration>

Step 9: Export the metastore port number.

[root @ master1 ~]# export METASTORE_PORT=9089

Step 10: Initialize the metastore schema in MySQL

[root @ master1 ~]# /opt/mapr/hive/hive-2.1/bin/schematool -dbType mysql -initSchema
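
Optionally, you can confirm that the schema was created. Assuming this schematool version supports the -info option, it prints the metastore schema version from MySQL:

[root @ master1 ~]# /opt/mapr/hive/hive-2.1/bin/schematool -dbType mysql -info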

Step 11: Log in to the MySQL CLI with your credentials

[root @ master 1 ~]# mysql -u name -p
Enter password:

Step 12: Check the databases

mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| mysql | 
| test |
+--------------------+



Step 13: Exit from MySQL CLI

mysql> exit
Bye

Step 14: Install the MySQL Connector/J package for the JDBC connection

[root@master1 ~]# yum -y install mysql-connector-java

Step 15: Start the metastore service

[root@master1 ~]# /opt/mapr/hive/hive-2.1/bin/hive --service metastore --start

Step 16: Start the Hive shell:

[root@master1 ~]# hive
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
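
As a quick smoke test (assuming the hive binary is on the PATH, as in Step 16), you can also list databases non-interactively from the shell:

[root@master1 ~]# hive -e "show databases;"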



Deloitte Hadoop and Spark Interview Questions



Round 1:

1. Explain your previous project.

2. Write the Apache Sqoop command that you used in your previous project.

3. What is the reason for moving data from DBMS to Hadoop Environment?

4. What happens when you increase mappers in MapReduce?

5. What is the command to check the last value of Apache Sqoop job?

6. Can you explain Distributed Cache?

7. Explain the Hive optimization techniques used in your project.

8. Which Hive analytic functions did you use in the project?

9. How do you update records in a Hive table with a single command?

10. How do you limit the records when consuming data from a Hive table?

11. How do you change the Hive execution engine to Apache Spark?

12. What is the difference between the Parquet and ORC file formats?

13. How did you handle a huge data flow situation in your project?

14. Explain Apache Kafka and its architecture.

15. Which tool will create partitions in the Apache Kafka topic?

16. Which transformations and actions are used in your project?

17. Give a brief explanation of the Spark architecture.

18. How will you check whether data is present in the 6th partition of an RDD?

19. How do you debug Spark code involving regex?

20. Give an idea of what a functional programming language is.

21. What is the difference between map and flatMap in Spark?

22. In a Spark word count, which one do you use for splitting, and what happens if you use map instead of flatMap in that program?

23. If you have knowledge of Hadoop clusters, can you explain capacity planning for a four-node cluster?


Round 2:

1. Define the YARN and MapReduce architectures.

2. Explain ZooKeeper's functionality and describe the flow when a node goes down.

3. Explain Data modeling in your project?

4. Are reporting tools used in your project? If yes, explain them.

5. Give me a brief idea about Broadcast variables in Apache Spark?

6. Can you explain the Agile methodology and describe how Agile is structured?


How to generate PPK from PEM and open AWS console




Many people find it hard to generate a PPK file from a PEM file; here it is explained in a few easy steps.

Login AWS account as per your credentials and click on  Instance ( Step 7: Review Instance Launch) then window showing like below image.

Then choose your option whether it existing or creating a key pair.

First, download the PEM file from AWS account whether to create a new key pair or existing key pair.

Here choose an existing key pair then give a name for that key pair and acknowledge it.

After that Launch instance machine as per requirement.

Download Putty Key Generator from Putty official website then load the PEM file like below snapshot.

First load the PEM file, then click the Generate button.

Note: While generating, create some randomness by moving the mouse over the blank area; otherwise, the PPK file will not be generated.

Then save the generated PPK file using Save private key (or Save public key if you also need the public part).

After generating the PPK file, switch to PuTTY itself.

Note: PuTTYgen is only used to generate and convert key files.
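
If you prefer the command line over the PuTTYgen GUI, the puttygen tool shipped with most Linux distributions can do the same conversion. A minimal example, assuming the downloaded key is named mykey.pem, is:

puttygen mykey.pem -O private -o mykey.ppk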

Open PuTTY and enter the IP address and port number as per the machine details.

Here, give the IPv4 address or the complete hostname (check with the "hostname" command on the Linux machine). Don't give the IPv6 address.



Next, in the Category pane, click SSH -> Auth -> Browse and select the PPK file for authentication, as per the snapshot below in PuTTY.

Here SSH means Secure Shell, a key-based management system used for authentication in network services.

After selecting the SSH option, go to the Auth option; you will get a Browse option, so simply browse to the PPK file and then click the Open button.

Finally, the terminal console opens; give the username, and when the Yes or No security prompt appears, click Yes.


Launch AWS Instance




Here is the free tier version of Amazon Web Services instances, in simple steps for beginners, including how to connect to the machines and generate PEM files.

Step 1: Log in to the AWS account, click on AWS Management Console, and then give your credentials.

Step 2: Click on Launch Virtual Machine EC2 (Amazon Elastic Compute Cloud).

Step 3: In the Choose AMI step, go with the free tier version, click on Amazon Linux 2 AMI (HVM), SSD Volume Type, and then click the Select button.




Step 4: In Choose Instance Type, here we selected General purpose, t2.micro (1 GB memory and 1 CPU core available), and then went directly to the Review and Launch button.

Note: If you don't need to configure the instance, go directly to Review and Launch the machine from the dashboard.

Step 5: Click on Configure Instance, choose whether you need one or more instances, and click on Next: Add Storage.

Step 6: Click on Add Storage. It acts like the hard disk of the computer, so choose the size for the machine.

Step 7: Click on Add Tags; otherwise, there is no need to configure anything here.

Step 8: Next, go to Configure Security Group to secure the machine. It provides strong security; choose rule types like SSH or any other you need.

Step 9: After clicking the Review and Launch button, the AWS free tier machine will launch directly.

Step 10: Select an existing key pair or create a new key pair; when creating one, give a specific name for that PEM file.

Step 11: Start the AWS Instance.

Step 12: After successfully launching the machine, go and check its status and click on the Instance ID.

Step 13: After completing the above steps, start the Amazon Web Services instance and connect to it with the PEM file. Here we must convert the PEM file into a PPK file with PuTTYgen, then connect with PuTTY; it is simple to use.
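
From a Linux or macOS terminal you can also connect directly with the PEM file, without converting it. A minimal example, assuming the key file is mykey.pem and the default Amazon Linux user ec2-user:

chmod 400 mykey.pem
ssh -i mykey.pem ec2-user@<public-ip-of-instance>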


Prerequisites for MapR Installation on CentOS




In the Hadoop ecosystem, we mostly prefer three Big Data distributions:

1. Cloudera Distribution for Hadoop

2. Hortonworks Data Platform

3. MapR Distribution Platform

The Cloudera Distribution Platform has a free Express edition and an Enterprise edition with up to a 60-day trial.

Hortonworks Data Platform is a completely open-source platform for production, development, and testing environments.

Finally, the MapR distribution platform is a complete enterprise edition, but MapR 3 is available as a free version with fewer features compared to MapR 5 and MapR 7.

How to install the MapR free version on a pseudo cluster:

Before installing MapR, configure the prerequisites as below:

——-Prerequisites——–

1. Configure the hostname as an FQDN by using the setup command (e.g. mapr.hadoop.com); after that, check your hostname using hostname -f

2. vi /etc/hosts

3. hostname <your Fully Qualified Domain Name>

4. vim /etc/selinux/config ===> SELINUX=disabled

——-Disable Firewalls and IPTables——-

If the firewall and iptables are enabled, they block some required ports, so we must disable them.

1.service iptables save

2.service iptables stop

3.chkconfig iptables off

4.service ip6tables save

5.service ip6tables stop

6.chkconfig ip6tables off

—– Enable NTP service for machines —–

NTP (Network Time Protocol) is a networking protocol for clock synchronization between computer systems over packet-switched networks.

1.yum -y install ntp ntpdate ntp-doc

2.chkconfig ntpd on

3.vi /etc/ntp.conf

4.server 0.rhel.pool.ntp.org

5.server 1.rhel.pool.ntp.org

6.server 2.rhel.pool.ntp.org

7.ntpq -p

8.date (all machines must have the same date and time; otherwise it will show errors)
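
If the ntpd service is not already running, it usually has to be started before ntpq -p will list any peers. Assuming the standard CentOS 6 init scripts, that is simply:

service ntpd start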


—— Install some additional packages in Linux OS —-

Here we will install Java 1.8 and Python

1.yum -y install java-1.8.0-openjdk-devel

2.yum -y install python perl expect expectk

—- Set up passwordless SSH on all nodes from the master node ——

For passwordless authentication between the master and slave nodes

1.ssh-keygen -t rsa

2.cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3.ssh-copy-id root@<FQDN> (repeat for each node)
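
To verify that passwordless SSH works, run a remote command against one of the nodes (hypothetical FQDN shown) and confirm that no password prompt appears:

ssh root@<FQDN1> hostname -f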

—– Additional Linux configuration: Transparent Huge Pages (THP) —-

1. echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

2.echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

3.sysctl vm.swappiness=10
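
Note that a value set with sysctl in this way does not survive a reboot. Assuming the usual CentOS layout, it can be persisted by appending the setting to /etc/sysctl.conf:

echo "vm.swappiness = 10" >> /etc/sysctl.conf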

—– Set up the EPEL repository for installing additional packages on the system —–

Here the EPEL repository is used for installing additional packages on the CentOS machine.

1.wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

2.rpm -Uvh epel-release-6-8.noarch.rpm



HBase Table (Single & Multiple) Data Migration from One Cluster to Another Cluster



HBase single table migration from one cluster to another cluster:

Here we show HBase single table data migration from an existing cluster to a new cluster in simple steps:

Step 1: First, export the HBase table data into an HDFS path (Hadoop Distributed File System) on the source cluster.

Step 2: After that, copy the HBase table data from the source cluster to the destination cluster using the distcp command (distcp is a distributed copy command for moving data from one cluster to another).

Step 3: Then create the HBase table in the destination (target) cluster.

Step 4: After that, import the copied data from HDFS into the HBase table in the destination cluster.

Source Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Driver export <hbase_table_name> <source_hdfs_path>

2. hadoop distcp hdfs://<source_cluster_ipaddress>:8020/<source_hdfs_path> hdfs://<destination_cluster_ipaddress>:8020/<destination_hdfs_path>

Destination Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Driver import <hbase_table_name> <destination_hdfs_path>
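
As a purely illustrative example, assuming a table named emp, an export directory /backup/emp, and cluster IPs 10.0.0.1 (source) and 10.0.0.2 (destination), the three commands would look like this (the first two run on the source cluster, the import on the destination cluster):

hbase org.apache.hadoop.hbase.mapreduce.Driver export emp /backup/emp
hadoop distcp hdfs://10.0.0.1:8020/backup/emp hdfs://10.0.0.2:8020/backup/emp
hbase org.apache.hadoop.hbase.mapreduce.Driver import emp /backup/emp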

HBase multiple table migration from one cluster to another cluster:

Now that we know how to migrate a single HBase table, multiple table migration from one cluster to another can be done in a simple manner with the steps below.

Using a couple of script files, migrating multiple HBase tables simply comes down to the steps below (a minimal sketch of such an export script is shown after the steps):


Step 1: First, place hbase-export.sh and hbase-table.txt on the source cluster.

Step 2: After that, place hbase-import.sh and hbase-table.txt on the destination cluster.

Step 3: List all the tables in the hbase-table.txt file.

Step 4: Create all the HBase tables on the destination cluster.

Step 5: Execute the hbase-export-generic.sh in the source cluster

Step 6: Execute the hbase-import.sh in the destination cluster.
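
The contents of hbase-export.sh are not shown here, but a minimal sketch of what such a script might look like, assuming hbase-table.txt holds one table name per line and /backup is the HDFS export directory, is:

#!/bin/bash
# Illustrative sketch of hbase-export.sh: export every table listed in hbase-table.txt
while read -r table; do
  hbase org.apache.hadoop.hbase.mapreduce.Driver export "$table" "/backup/$table"
done < hbase-table.txt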



Summary: I tried HBase data migration from one cluster to another in a Cloudera Distribution Hadoop environment. Both single table and multiple table data migration are very simple for Hadoop administrators as well as Hadoop developers. It works the same way on the Hortonworks distribution as well.

Replication Factor in Hadoop




How the replication factor comes into the picture:

The backup mechanism in a traditional distributed system:

The backup mechanism in the traditional distributed system did not provide high availability. Such a system followed a shared architecture.

The file request first goes to the master node, and the file is divided into blocks. This is a continuous process, but if node 1 (slave 1) fails, the data must be available on another node (slave 2).

Replication Factor:

The replication factor controls the duplication of data on different slave machines to achieve high availability of processing.

Replication is a backup mechanism, a failover mechanism, and a fault tolerance mechanism.

In Hadoop, the default replication factor is 3. There is no need to configure it.

Hadoop 1.x :
Replication Factor is 3
Hadoop 2.x:
Replication Factor is also 3.

In Hadoop, the minimum replication factor is 1. This is possible on a single node Hadoop cluster.

In Hadoop, the maximum replication factor is 512.

If the replication factor is 3, then a minimum of 3 slave nodes is required.

If the replication factor is 10, then 10 slave nodes are required.



Here is the simple rule for the replication factor:

'N' Replication Factor = 'N' Slave Nodes

Note: If the configured replication factor is 3 but only 2 slave machines are used, then the actual replication factor is only 2.

How to configure Replication in Hadoop?

It is configured in the  hdfs-site.xml file.

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>5</value>
</property>
</configuration>
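
Besides this cluster-wide setting, the replication factor of an existing HDFS path can be checked and changed from the command line. A small example, using a hypothetical file /data/sample.txt:

hadoop fs -setrep -w 3 /data/sample.txt
hdfs fsck /data/sample.txt -files -blocks -locations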

Design Rules Of Replication In Hadoop:

1. In Hadoop, replication applies only to the Hadoop Distributed File System (HDFS) data, not to the metadata.

2. Keep one replica per slave node, as per the design.

3. Replication happens only on the Hadoop slave nodes, not on the Hadoop master node (because the master node only manages metadata; it does not hold the data itself).

In Hadoop, only storage is duplicated, not processing, because processing is always unique.

Summary: In Hadoop, the replication factor plays a major role as the data backup mechanism. The default replication factor is always 3, except in a single node cluster environment.


Blocksize in Hadoop



How data is stored on HDFS:

BLOCK:

A block is the individual storage unit in the Hadoop Distributed File System.

In Hadoop 1.X default block size is 64MB

In Hadoop 2.X default block size is 128MB

If a file request comes to the Hadoop cluster, these are the steps:

Step 1: Only the Hadoop master node receives the file request.

Step 2: Based on the block size configuration at that time, the data is divided into a number of blocks.

How to configure “Blocksize” in Hadoop?

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.block.size</name>
<value>14323883</value>
</property>
</configuration>
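
To confirm which block size a running cluster actually uses, the configured value (in bytes) can be read back. A quick check, assuming a Hadoop 2.x installation where the property is named dfs.blocksize:

hdfs getconf -confKey dfs.blocksize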

How to store data in HDFS:

Assume that we have A.log, B.log, and C.log files:

Scenario 1:

A.log -> 200 MB -> 200/64 -> 64 MB + 64 MB + 64 MB + 8 MB (remaining)

Scenario 2:

B.log -> 192 MB -> 192/64 -> 64 MB + 64 MB + 64 MB
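
The resulting block split of a stored file can be verified with fsck. For example, assuming A.log was loaded to a hypothetical path /data/A.log on a cluster with a 64 MB block size, the following command lists its four blocks (3 x 64 MB plus one 8 MB block):

hdfs fsck /data/A.log -files -blocks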

Design Rules of Blocksize:

1. Irrespective of the file size: every file gets its own dedicated set of blocks in Hadoop.

2. Except for the last block: all the remaining blocks of a file hold an equal volume of data.

The Hadoop master node only looks at the block size at the time of blocking (dividing) the data, not at the time of reading it, because at read time only the metadata matters.