Latest interview questions on Hadoop and Spark

1. Which internal algorithm does the NameNode use to decide where the replicas of a block will be stored?

2. What will happen if a block of data is corrupted?

3. In a Scala Spark program, how do you find out the number of transformations and actions?

4. While executing a query, how can we identify which joins are taking the most time, especially in Hive and Spark queries?

5. Scenario – I: In Hive, we have two tables, A and B. B is the master table and A is the table that receives updates to certain information. We want to update table B with the latest updated columns, based on the id.

Question: How do we achieve that, and what is the exact query we use?

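A common answer, assuming both tables are transactional (ORC with ACID enabled, Hive 2.2+) and using illustrative column names, is Hive's MERGE statement:

```sql
-- Upsert master table B from updates table A, keyed on id.
-- Table and column names are illustrative.
MERGE INTO B
USING A
ON B.id = A.id
WHEN MATCHED THEN
  UPDATE SET name = A.name, updated_at = A.updated_at
WHEN NOT MATCHED THEN
  INSERT VALUES (A.id, A.name, A.updated_at);
```

On older Hive versions without MERGE, the same effect is usually achieved by rebuilding B with INSERT OVERWRITE over a join of B and A, preferring A's rows where the ids match.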
6. If all Spark jobs have failed, how do you handle it without checking the log files or the Web UI?

7. How to provide Security in Ambari without Kerberos?

8. Can you explain High Availability clusters in a Hadoop environment?

9. If you have a Spark job on a 25-node cluster, how many executors will be created by default?

10. How do you change the column names while importing data into Hive using Apache Sqoop?

11. How do you handle data type mismatches while importing data from an RDBMS into a Hive table?

12. How do you handle NULLs present in the partition column? What is the internal mechanism for this scenario?

13. Scenario: Suppose we have a 4-node cluster with 128 GB of RAM per node, i.e. 512 GB of memory in total, and we have to process 1000 GB of data.

Question: How does Spark process this data when it is larger than the available memory?
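Spark does not need the whole dataset in memory at once: it splits the input into partitions, processes a few partitions per executor core at a time, and spills intermediate data to disk when it does not fit. A minimal pure-Python sketch of the same out-of-core idea (the stream and chunk sizes are illustrative):

```python
# Out-of-core processing sketch: aggregate a "dataset" far larger than
# any buffer we keep in memory, by streaming it in fixed-size chunks.
# This mirrors how Spark processes one partition at a time rather than
# loading 1000 GB into 512 GB of RAM.

def record_stream(n_records):
    """Simulate a large data source; yields records lazily."""
    for i in range(n_records):
        yield i

def process_in_chunks(stream, chunk_size):
    """Aggregate the stream while holding at most chunk_size records."""
    total = 0
    chunk = []
    for record in stream:
        chunk.append(record)
        if len(chunk) == chunk_size:
            total += sum(chunk)   # process one "partition"
            chunk = []            # free the buffer before the next one
    total += sum(chunk)           # leftover partial chunk
    return total

result = process_in_chunks(record_stream(1_000_000), chunk_size=10_000)
print(result)
```

At no point does the loop hold more than one chunk; in Spark the analogue is that only the partitions currently being computed (plus any cached data) occupy executor memory.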

14. Did you use the email action in Oozie? How do you configure it?

15. In Scala, you have to make three REST API calls: API 1, API 2, and API 3. You have to call API 1 and API 2 concurrently, wait for both calls to finish, and then make the third call. How do you achieve this concurrency in Scala?
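In Scala this is typically done with `Future`s started before a for-comprehension (so API 1 and API 2 run concurrently) and the third call made after both complete. The same fork-join pattern, sketched here in Python with `concurrent.futures` and stand-in functions for the real endpoints:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the real REST calls (hypothetical endpoints).
def call_api_1():
    return {"api": 1}

def call_api_2():
    return {"api": 2}

def call_api_3(r1, r2):
    # The third call depends on the first two results.
    return {"combined": (r1["api"], r2["api"])}

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(call_api_1)       # start API 1 and API 2 concurrently
    f2 = pool.submit(call_api_2)
    r1, r2 = f1.result(), f2.result()  # wait for both to finish

result = call_api_3(r1, r2)            # only then make the third call
print(result)
```

The Scala equivalent submits both `Future`s eagerly, then combines them (e.g. `for { a <- f1; b <- f2 } yield ...`) before invoking the third call.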


Toughest Big Data (Spark, Kafka, Hive) Interview Questions

1. How do you handle Kafka back pressure, and which configuration parameters control it?

2. How do you achieve performance tuning through executors?

3. What is the ideal number and size of executors, and how much RAM should each use?

4. How do you scale Kafka brokers and integrate them with Spark Streaming without stopping the cluster, and with what script?

5. How do you delete records in Hive, and how do you delete duplicate records with a script?
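For plain deletes, Hive needs an ACID (transactional) table to support `DELETE FROM t WHERE ...`. For duplicates, a common pattern on any table is to rebuild it, keeping one row per key with `row_number()` (table and column names here are illustrative):

```sql
-- Keep one row per business key, preferring the latest updated_at.
INSERT OVERWRITE TABLE events
SELECT id, name, updated_at
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM events
) t
WHERE rn = 1;
```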

6. Can more than one replica of a block exist in the same rack?

7. While importing a database of 10 tables from MySQL into HDFS using Sqoop, one table fails. What is the solution?

8. If you submit a Spark job to a cluster and the cluster goes down in the middle of the process, after an RDD has already been created, what happens to your RDD and how is the data handled?

Hadoop and Spark Scenario-Based Questions

1. Hadoop – Scenario:

If you are working on a Hadoop cluster and have already cached an RDD and stored its output in the cache, and you now want to clear that memory and use the space to cache another RDD, how do you achieve this?

2. Spark – Scenario:

I) Suppose you are running 100 SQL jobs that generally take 50 minutes to complete, but one run took 5 hours.

Q 1) In this case, how do you report these errors?

Q 2) How do you debug the code and provide a proper solution for this scenario?

Rare interview questions on the Hadoop ecosystem:

1. What do you know about type safety, and which framework in the Hadoop ecosystem has type safety?

2. What serializations (SerDes) are available in Hive? Why did you choose that serialization? Explain in detail.

3. Which modules have you worked with in Scala? Name a module and explain it briefly.

4. Which packages have you worked with in Scala, and which packages have you imported in your current project?

5. What is the difference between map and mapPartitions? Give a clear explanation with a real-time example in Scala.
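The short answer: `map` invokes a function once per element, while `mapPartitions` invokes a function once per partition (receiving an iterator of elements), so per-partition setup costs, such as opening a database connection, are paid once per partition instead of once per record. A pure-Python simulation of the difference (the partition layout and the "setup" counter are illustrative, not Spark API):

```python
setup_count = 0  # counts how many times expensive setup runs

def per_element(x):
    """map-style: setup cost paid for every element."""
    global setup_count
    setup_count += 1          # e.g. open a DB connection per record
    return x * 2

def per_partition(iterator):
    """mapPartitions-style: setup cost paid once per partition."""
    global setup_count
    setup_count += 1          # one connection for the whole partition
    return [x * 2 for x in iterator]

partitions = [[1, 2, 3], [4, 5, 6]]   # an "RDD" with 2 partitions

# map: 6 elements -> setup runs 6 times
setup_count = 0
map_result = [per_element(x) for part in partitions for x in part]
map_setups = setup_count

# mapPartitions: 2 partitions -> setup runs only 2 times
setup_count = 0
mp_result = [y for part in partitions for y in per_partition(part)]
mp_setups = setup_count

print(map_result, map_setups)
print(mp_result, mp_setups)
```

Both produce the same output data; only the number of setup calls differs, which is exactly why `mapPartitions` is preferred for connection-heavy work in Spark.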

6. How do you connect to your cluster: through data nodes or edge nodes?

7. How do you allocate buffer memory to your DataNode?

8. How much buffer space have you allocated to your map tasks and reduce tasks on your DataNode?

9. How do you achieve a broadcast join automatically, without doing it manually? How do you set up your driver program to detect where a broadcast join would be beneficial, and how do you automate the process?
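Spark SQL already automates this: when table statistics show that one side of a join is smaller than `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), the planner chooses a broadcast hash join on its own. You tune it rather than code it, for example in `spark-defaults.conf` (the 100 MB value below is just an example):

```properties
# Broadcast any join side whose estimated size is under 100 MB
spark.sql.autoBroadcastJoinThreshold  104857600
```

Keeping table statistics fresh (e.g. `ANALYZE TABLE ... COMPUTE STATISTICS`) helps the planner recognize broadcast candidates; setting the threshold to -1 disables automatic broadcast joins entirely.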


Most frequently asked interview questions for experienced candidates

Interviewers typically ask candidates with 2-8 years of experience these kinds of questions on panels covering Big Data and analytics, especially the Hadoop ecosystem.
They focus mostly on hands-on Hadoop experience and project-related details.
1. What properties did you change in the Hadoop configuration files for your project?
Explain this in the context of your project.
2. Do you know the NameNode and DataNode directory paths?
3. How do you handle incremental loads in your project?
By using Sqoop incremental imports.
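A typical incremental append import looks like this (the connection string, table, and column names are placeholders):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders \
  --incremental append \
  --check-column order_id \
  --last-value 100000
```

With `--incremental lastmodified` and a timestamp column, updated rows are re-imported as well; saved Sqoop jobs (`sqoop job --create ...`) remember the `--last-value` between runs.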

4. Can you do dynamic Hive partitions through Sqoop?
Yes, Sqoop can import into dynamically partitioned Hive tables.
5. In which scenarios would you use Parquet, and in which Avro?
It depends on the client's requirements; be prepared to elaborate.
6. How do you handle authentication and authorization in your project?
Explain whether you use Kerberos and AD/LDAP; it depends entirely on your project.
7. How do you handle it when all Spark jobs have failed?

Top 10 Hadoop Interview Questions

1. What exactly is Hadoop?

Hadoop is a framework for storing and processing huge amounts of data. It is an open-source software framework built around a distributed file system.

2. Why do we need Hadoop in IT?



C. Data Quality

D. High Availability

E. Commodity Hardware

3. Difference between Hadoop 2.x and Hadoop 3.x?

Hadoop 2 supports only a single active NameNode to manage all namespaces.

Hadoop 3 supports multiple NameNodes for multiple namespaces.

Hadoop 2 has much more storage overhead than Hadoop 3 (3x replication vs. erasure coding).

Hadoop 2 does not support GPUs, but Hadoop 3 does.

4. Define Data Locality in Hadoop?

Data locality means moving the computation logic close to where the data resides in HDFS, instead of moving the data to the computation.

5. How is Security achieved in Hadoop?

Hadoop achieves security primarily by using Kerberos authentication.

6. What are the different modes in which Hadoop runs?

A. Standalone mode

B. Pseudo-distributed mode

C. Fully distributed mode

7. Explain Safemode in Hadoop?

Safemode is a maintenance state of the NameNode, during which the NameNode doesn't allow any modifications to the file system. In Safemode, the HDFS cluster is read-only and doesn't replicate or delete blocks.


hadoop dfsadmin -safemode get

hadoop dfsadmin -safemode enter

hadoop dfsadmin -safemode leave

8. What are the main components of the Hadoop ecosystem?

A) HDFS – Hadoop Distributed File System

B) MapReduce – programming paradigm, based on Java

C) Pig – to process and analyse structured and semi-structured data

D) Hive – to process and analyse structured data

E) HBase – NoSQL database

F) Sqoop – import/export of structured data

G) Oozie – workflow scheduler

H) Zookeeper – coordination and configuration service

9. Explain the difference between the NameNode, Checkpoint Node, and Backup Node in the Hadoop ecosystem?

NameNode – the HDFS master that manages the file system metadata.

Checkpoint Node – keeps the same directory structure as the NameNode and periodically creates checkpoints of the namespace.

Backup Node – maintains the current namespace state in memory and only needs to save it to an image file to create a new checkpoint.

10. What are the benefits of Hadoop?

A) Ability to handle big data

B) Runs on commodity hardware and is open-source

C) Ability to handle multiple data types