Big Data and Spark Multiple Choice Questions – I

1. In Spark, a ______ is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

A) Resilient Distributed Dataset (RDD)
B) Spark Streaming
C) Driver
D) Flat Map

Ans: A) Resilient Distributed Dataset (RDD)

2. Consider the following statements in the context of Apache Spark:

Statement 1: Spark allows you to choose whether you want to persist Resilient Distributed Dataset (RDD) onto the disk or not.

Statement 2: Spark also gives you control over how you can partition your Resilient Distributed Datasets (RDDs).

A) Only statement 1 is true
B) Only statement 2 is true
C) Both statements are true
D) Both statements are false

Ans: C) Both statements are true

3. Given the following definition of the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

The join operation combines two datasets. When it is called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs containing all pairs of elements for each key that appears in both datasets.

What does joinrdd.collect() return after the following code is run?

val rdd1 = sc.parallelize(Seq(("m", 55), ("m", 56), ("e", 57), ("e", 58), ("s", 59), ("s", 54)))
val rdd2 = sc.parallelize(Seq(("m", 60), ("m", 65), ("s", 61), ("s", 62), ("h", 63), ("h", 64)))
val joinrdd = rdd1.join(rdd2)
A) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))
B) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))
C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))
D) None of the mentioned.

Ans: C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))
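
To see why the keys e and h drop out of the answer, the inner-join semantics can be modeled with plain Scala collections. This is only a minimal sketch that does not use Spark itself: `JoinSketch` and `innerJoin` are hypothetical names introduced here for illustration, and a real RDD join does not guarantee the order of the collected results.

```scala
// Plain-Scala model of an inner join on key (no Spark dependency).
object JoinSketch {
  // Hypothetical helper: pairs every left value with every right value
  // that shares the same key. Keys present on only one side yield nothing.
  def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    left.flatMap { case (k, v) =>
      right.collect { case (k2, w) if k2 == k => (k, (v, w)) }
    }

  // Same data as in the question, held in ordinary sequences.
  val data1 = Seq(("m", 55), ("m", 56), ("e", 57), ("e", 58), ("s", 59), ("s", 54))
  val data2 = Seq(("m", 60), ("m", 65), ("s", 61), ("s", 62), ("h", 63), ("h", 64))

  // 2 x 2 pairs for "m" plus 2 x 2 pairs for "s"; "e" and "h" have no match.
  val joined: Seq[(String, (Int, Int))] = innerJoin(data1, data2)
}
```

Each key that appears on both sides contributes the cross product of its values, which is why m and s each produce four pairs while e and h produce none.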

4. Consider the following statements:

Statement 1: Scale up means incrementally growing your cluster capacity by adding more COTS (commercial off-the-shelf) machines.

Statement 2: Scale out means growing your cluster capacity by replacing machines with more powerful ones.

A) Only statement 1 is true
B) Only statement 2 is true
C) Both statements are true
D) Both statements are false

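Ans: D) Both statements are false. The two definitions are swapped: scaling out adds more commodity machines, while scaling up replaces machines with more powerful ones.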

Hadoop and Spark Interview Questions

Cognization asked the following Hadoop and Spark interview questions to experienced candidates.

Round 1:

1. What is the Future class in the Scala programming language?

2. What is the difference between fold, foldLeft, and foldRight in Scala?

3. How does DISTRIBUTE BY work in Hive? Given some sample data, explain how the rows will be distributed.

4. Given dF.filter(Id == 3000), how do you pass the value in this DataFrame condition dynamically?

5. Have you worked with multithreading in Scala? Explain.

7. On what basis do you increase the number of mappers in Apache Sqoop?

8. What do you specify as the last value when importing for the first time in Sqoop?

9. How do you specify the date for an incremental last-modified load in Spark?

10. Suppose you created a partition for Bengaluru but loaded Hyderabad data into it. What validation is needed to make sure there are no errors?

11. How many reducers are launched for DISTRIBUTE BY in Spark?

12. What is the command to delete a Sqoop job?

13. Where is the last value of a Sqoop job stored?

14. What are the default input and output formats in Hive?

15. Can you briefly explain the distributed cache in Spark, with an example?

16. Did you use Kafka or Flume in your project? Explain in detail.

17. What is the difference between the Parquet and ORC file formats?

Round 2:

1. Explain your previous project.

2. How do you handle incremental data in Apache Sqoop?

3. Which optimization techniques are used in Sqoop?

4. What parameters do you pass to your Spark job?

5. If one task is taking much longer than the others, how do you handle it?

6. What are stages and tasks in Spark? Give a real-time scenario.

7. On what basis do you set the number of mappers in Sqoop?

8. How do you export data to Oracle without putting much load on the table?

9. What is a column family in HBase?

10. Can you create a table without specifying a column family?

11. Is there a limit on the number of column families in one table?

12. How did you schedule Spark jobs in your previous project?

13. Explain the Spark architecture with a real-time scenario.

Deloitte Hadoop and Spark Interview Questions

Round 1:

1. Explain your previous project.

2. Write the Apache Sqoop code that you used in your previous project.

3. What is the reason for moving data from a DBMS to the Hadoop environment?

4. What happens when you increase the number of mappers in MapReduce?

5. What is the command to check the last value of an Apache Sqoop job?

6. Can you explain Distributed Cache?

7. Explain the Hive optimization techniques used in your project.

8. Which Hive analytic functions did you use in the project?

9. How do you update records in a Hive table with a single command?

10. How do you limit the number of records when consuming data from a Hive table?

11. How do you change the Hive execution engine to Apache Spark?

12. What is the difference between the Parquet and ORC file formats?

13. How do you handle a huge data flow in your project?

14. Explain the Apache Kafka architecture.

15. Which tool creates partitions in an Apache Kafka topic?

16. Which transformations and actions did you use in your project?

17. Briefly explain the Spark architecture.

18. How do you check whether there is data in the 6th partition of an RDD?

19. How do you debug regular expressions in Spark code?

20. Give me an idea of what a functional programming language is.

21. What is the difference between map and flatMap in Spark?

22. In a Spark word count, for example, which one do you use when splitting lines? What happens if you use map instead of flatMap in that program?

23. If you have knowledge of Hadoop clusters, can you explain capacity planning for a four-node cluster?

Round 2:


1. Describe the YARN and MapReduce architecture.

2. Explain ZooKeeper's functionality and describe the flow when a node goes down.

3. Explain the data modeling in your project.

4. Did your project use reporting tools? If yes, explain them.

5. Give me a brief idea about broadcast variables in Apache Spark.

6. Can you explain the Agile methodology and give me the architecture of Agile?

Latest interview questions on Hadoop and Spark

1. Which internal algorithm does the NameNode use to decide exactly where the replicas of a block will be stored?

2. What happens if a block of data is corrupted?

3. In a Scala program, how do you find the number of transformations and actions?

4. When executing a query, how can we find out which joins are taking the most time, especially in Hive and Spark queries?

5. Scenario – I: In Hive, we have two tables, A and B. B is the master table, and A receives updates for certain information. We want to update table B with the latest column values from A, matched on id.

Question: How do we achieve that, and what exact query do we use?

6. If all Spark jobs have failed, how do you handle it without checking the log files or the web UI?

7. How do you provide security in Ambari without Kerberos?

8. Can you explain a High Availability cluster in the Hadoop environment?

9. Suppose you run a Spark job on a 25-node cluster. How many executors are created by default?

10. How do you change the column names in Hive while importing data into Hive using Apache Sqoop?

11. How do you handle a data type mismatch while importing data from an RDBMS into a Hive table?

12. How do you handle NULLs in the partition column? What is the internal mechanism for this simple scenario?

13. Suppose we have a 4-node cluster with 128 GB of RAM per node, so 512 GB of memory in total, and we have to process 1000 GB of data.

Question: How does Spark process data that is larger than the available memory?

14. Did you use an email reader in Oozie? How do you configure it?

15. In a Scala program, you have to make three RESTful API calls: API 1, API 2, and API 3. You must call API 1 and API 2 concurrently, wait for both calls to finish, and then make the third call. How do you do this concurrently in Scala?
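
One common way to approach the scenario in question 15 is with Scala's standard `Future` API. The sketch below is illustrative only: `ApiCalls`, `callApi1`, `callApi2`, and `callApi3` are hypothetical stand-ins that return canned strings instead of making real HTTP requests, and the global execution context is assumed to be acceptable for the workload.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ApiCalls {
  // Hypothetical stand-ins for the three REST calls; real code would use an HTTP client.
  def callApi1(): Future[String] = Future { "result1" }
  def callApi2(): Future[String] = Future { "result2" }
  def callApi3(a: String, b: String): Future[String] = Future { s"$a+$b" }

  def run(): Future[String] = {
    // Create both futures first so that API 1 and API 2 start concurrently.
    val f1 = callApi1()
    val f2 = callApi2()
    // zip completes once both results are available; only then is API 3 called.
    f1.zip(f2).flatMap { case (a, b) => callApi3(a, b) }
  }

  // Blocking is used here only to make the sketch easy to try out.
  def runBlocking(): String = Await.result(run(), 5.seconds)
}
```

The important detail is that both futures are created before they are combined. If the calls were written directly inside a for-comprehension, API 2 would not start until API 1 had finished.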

Toughest Big Data (Spark, Kafka, Hive) Interview Questions

1. How do you handle Kafka back pressure with scripting parameters?

2. How do you achieve performance tuning through executors?

3. What is the ideal way to size executors, and how much RAM should be used?

4. How do you scale Kafka brokers and integrate them with Spark Streaming without stopping the cluster, and with what script?

5. How do you delete records in Hive, and how do you delete duplicate records with a script?

6. Can more than one replica exist in the same rack?

7. Out of 10 tables in a database, one table fails while importing from MySQL into HDFS using Sqoop. What is the solution?

8. If you submit a Spark job to a cluster and the cluster goes down in the middle of processing, after most of the RDDs have already been created, what happens to your RDDs and how is the data recovered?

Hadoop and Spark Scenario-Type Questions

1. Hadoop – Scenario:

Suppose you are working on a Hadoop cluster, you have already cached an RDD, and its output is stored in the cache. Now you want to clear that memory and use the space to cache another RDD. How do you achieve this?

2. Spark – Scenario:

I) Suppose you are running 100 SQL jobs that generally take 50 minutes to complete, but one run took 5 hours.

Q 1) In this case, how do you report this error?

Q 2) How do you debug the code and provide a proper solution for this scenario?

Rare interview questions on the Hadoop ecosystem:

1. What do you know about type safety, and which framework in Hadoop has type safety?

2. What serializations are available in Hive? Why do you choose a particular serialization? Explain in detail.

3. Which modules have you worked with in Scala? Name them and explain briefly.

4. Which packages have you worked with in Scala? Name the packages you imported in your current project.

5. What is the difference between map and mapPartitions? Explain clearly with a real-time example in Scala.

6. How do you connect to your cluster: through data nodes or edge nodes?

7. How do you allocate buffer memory to your DataNode?

8. How much buffer space have you allocated to the map tasks and reduce tasks on your DataNode?

9. How do you get a broadcast join applied automatically instead of doing it manually? How do you set up your driver program to detect where a broadcast join would be beneficial, and how do you automate the process?

Most frequently asked Interview questions for experienced

Candidates with 2 to 8 years of experience are typically asked questions like these in interview panels on big data and analytics, especially on the Hadoop ecosystem.
Most of them concern hands-on experience with Hadoop and the candidate's project.
1. What properties did you change in the Hadoop configuration files for your project?
Explain this based on your own project.
2. Where do you find the NameNode and DataNode directory paths?
3. How do you handle incremental loads in your project?
By using Sqoop incremental imports.
4. Can you load Hive dynamic partitions through Sqoop?
Yes, Hive dynamic partitions can be loaded through Sqoop.
5. In which scenarios do we use Parquet and Avro?
It depends on the client's requirements; be prepared to elaborate on it.
6. How do you handle authentication and authorization in your project?
Explain whether you use Kerberos and AD/LDAP. It depends entirely on your project.
7. How do you handle it if all Spark jobs fail?

Top 10 Hadoop Interview Questions

1. What exactly is Hadoop?

Hadoop is an open-source software framework for storing and processing huge amounts of data. It stores the data in a distributed file system and processes it across a cluster of machines.

2. Why do we need Hadoop in IT?



C. Data Quality

D. High Availability

E. Commodity Hardware

3. What is the difference between Hadoop 2.x and Hadoop 3.x?

Hadoop 2 high availability supports a single active NameNode with one standby NameNode.

Hadoop 3 supports more than one standby NameNode.

Hadoop 2 has much more storage overhead than Hadoop 3, because it relies on 3x replication while Hadoop 3 adds erasure coding.

Hadoop 2 does not support GPUs, but Hadoop 3 does.

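4. Define data locality in Hadoop.

Data locality means moving the computation (the logic) to the node where the data resides in HDFS, rather than moving the data to the computation.

5. How is security achieved in Hadoop?

Hadoop uses Kerberos for authentication to secure the cluster.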

6. What are the different modes in which Hadoop runs?

A. Standalone mode

B. Pseudo-distributed mode

C. Fully distributed mode

7. Explain Safemode in Hadoop.

Safemode in Hadoop is a maintenance state of the NameNode, during which the NameNode does not allow any modifications to the file system. During Safemode, the HDFS cluster is read-only and does not replicate or delete blocks.


hdfs dfsadmin -safemode get

hdfs dfsadmin -safemode enter

hdfs dfsadmin -safemode leave

8. What are the main components of the Hadoop ecosystem?

A) HDFS - Hadoop Distributed File System

B) MapReduce - Programming paradigm, based on Java

C) Pig - To process and analyse structured and semi-structured data

D) Hive - To process and analyse structured data

E) HBase - NoSQL database

F) Sqoop - Import/export of structured data

G) Oozie - Scheduler

H) ZooKeeper - Coordination and configuration management

9. Explain the difference between the NameNode, Checkpoint Node, and Backup Node in the Hadoop ecosystem.

NameNode - The HDFS master that manages the file system metadata.

Checkpoint Node - Has the same directory structure as the NameNode and periodically creates checkpoints of the namespace.

Backup Node - Keeps an up-to-date copy of the namespace in memory, so it only needs to save its current in-memory state to an image file to create a new checkpoint.

10. What are the benefits of Hadoop?

A) Ability to handle big data

B) Runs on commodity hardware and is open source

C) Ability to handle multiple data types