Most Typical Hive Interview Questions and Answers

Hive Interview Questions and Answers

1. Does Hive support record-level insert, delete, or update?

Classic Hive does not support record-level insert, delete, or update, and it does not provide transactions either. (Newer Hive releases with ACID enabled on ORC tables do support these operations.) Instead, users can combine CASE statements and Hive's built-in functions with INSERT OVERWRITE to emulate insert, update, and delete behavior, as in the sketch below.
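As an illustration, here is a minimal, hedged sketch of that workaround, run through Spark SQL in Scala: it emulates "UPDATE users SET status = 'inactive' WHERE id = 42" by rewriting the table. The users table and its columns are hypothetical, and spark is a SparkSession built with Hive support.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-update-workaround")
  .enableHiveSupport() // talk to the Hive metastore
  .getOrCreate()

// Rewrite the whole table, changing only the rows that match the "update" condition.
spark.sql("""
  INSERT OVERWRITE TABLE users
  SELECT id,
         name,
         CASE WHEN id = 42 THEN 'inactive' ELSE status END AS status
  FROM users
""")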

2. What kind of data warehouse applications is suitable for Hive?

Basically, Hive is not a full database; it is a data summarization and query tool in the Hadoop ecosystem. Hive suits applications where:

I) Fast response times are not required
II) The data is not changing rapidly
III) Relatively static data is analyzed

3. How can the columns of a table in Hive be written to a file?

From the shell, by piping the output of a HiveQL statement through the awk command, the output can be written to a file:

Example: hive -S -e "describe table_name" | awk -F" " '{print $1}' > ~/output

4. Difference between ORDER BY and SORT BY in Hive?

In Hive, SORT BY sorts the data within each reducer, and any number of reducers can be used for a SORT BY operation. ORDER BY, in contrast, sorts all of the data together, which has to pass through a single reducer. Thus, ORDER BY in Hive uses a single reducer and guarantees total order in the output, while SORT BY only guarantees ordering of the rows within each reducer.
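For illustration, both clauses can be tried from Spark SQL, which supports the same syntax; the sales table here is hypothetical and spark is an existing SparkSession with Hive support:

spark.sql("SELECT * FROM sales ORDER BY amount").show() // total order: all data through a single reducer
spark.sql("SELECT * FROM sales SORT BY amount").show()  // ordered only within each reducer/partition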

5. Whenever I run a Hive query from a different directory, it creates a new metastore_db. Please explain the reason for this?

Whenever you run Hive in embedded mode, it creates a local metastore in the current working directory. Before creating the metastore, it checks whether a metastore already exists there. This behavior is defined by the following property in the hive-site.xml configuration file:

"javax.jdo.option.ConnectionURL" with the default value
"jdbc:derby:;databaseName=metastore_db;create=true"

6. Is it possible for multiple users to share the same metastore, in the case of embedded Hive?

No, it is not possible to use the embedded (Derby) metastore with multiple users; it supports only a single user at a time. To share the metastore among multiple users, configure a standalone database such as MySQL or PostgreSQL as the metastore.

Kafka Interview Questions and Answers

1. What is Kafka?

Kafka is an open-source message broker project written in Scala and Java. Kafka was originally developed at LinkedIn and was open-sourced in early 2011.

2. Which are the components of Kafka?

The major components of Kafka are:

Topic: a group of messages belonging to the same type

Producer: producers publish messages to a topic

Consumer: consumers pull data from the brokers

Brokers: the servers where the published messages are stored
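To make these roles concrete, here is a minimal Scala sketch of a producer publishing to a hypothetical topic "my-topic" through a broker at localhost:9092, using the standard kafka-clients API; a consumer would then pull these records from the brokers by subscribing to the same topic.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // broker to connect to
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("my-topic", "key1", "hello")) // publish a message to the topic
producer.close()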

3. What role does Zookeeper play in a Kafka cluster?

Kafka is an open-source distributed system, and it is built to use Zookeeper. The basic responsibility of Zookeeper is to coordinate the different nodes in a cluster. Zookeeper periodically commits offsets, so that if any node fails, consumption can recover from the previously committed offset. Zookeeper is also responsible for configuration management, leader detection, detecting when a node leaves or joins the cluster, and synchronization.

4. Distinguish between Kafka and Flume?

Flume's major use case is integration with Hadoop: its monitoring systems, file formats, file systems, and utilities. Flume is the best option when you have non-relational data sources and need Hadoop integration. Kafka, by contrast, is a distributed publish-subscribe messaging system. Kafka was not developed specifically for Hadoop, so using Kafka to read and write data to Hadoop is considerably more involved than with Flume. Kafka is a highly reliable and scalable enterprise messaging system for connecting multiple systems.

5. Is it possible to use Kafka without Zookeeper?

No, it is not possible to use Kafka without Zookeeper, because we cannot bypass Zookeeper and connect directly to the Kafka server. If Zookeeper is down, we will not be able to serve any client requests.

6. How to start a Kafka server?

Since Kafka uses Zookeeper, we first have to start the Zookeeper server. One can use the convenience script packaged with Kafka to get a single-node Zookeeper instance:

> bin/zookeeper-server-start.sh config/zookeeper.properties

Now the Kafka server can be started:

> bin/kafka-server-start.sh config/server.properties

Spark & Scala Interview Questions and Answers

1. What is Scala, what is its importance, and what is the difference between Scala and other programming languages (Java/Python)?

Scala is a powerful language for developing big data applications. Scala provides several benefits that enable significant productivity, and it helps developers write robust code with fewer bugs. Apache Spark is written in Scala, so Scala is a natural fit for developing Spark applications.

2. What is an RDD? Explain in brief.

Spark RDD (Resilient Distributed Dataset) is the primary abstraction in the Spark API. An RDD is a collection of partitioned data elements that can be operated on in parallel. RDDs have properties such as immutability, cacheability, type inference, and lazy evaluation.

Immutable: RDDs are immutable data structures. Once created, an RDD cannot be modified.

Partitioned: The data in an RDD is partitioned across the distributed cluster of nodes. However, multiple Cassandra partitions can be mapped to one single RDD partition.

Fault tolerance: RDDs are designed to be fault-tolerant. Because RDD data is stored across a large distributed cluster, there is always a chance of node failure and, with it, loss of the partition data held on that node.

An RDD handles node failure automatically: Spark maintains metadata about each RDD, including how it was derived, so using that lineage information the lost data can be recomputed from the other nodes.

Interface: RDD provides a uniform interface for processing data from a variety of data sources such as HDFS, HBase, Cassandra, MongoDB, and others. The same interface can also be used to process data stored in memory across a cluster of nodes.

In-memory: The RDD class provides the API for in-memory cluster computing. Spark allows RDDs to be cached or persisted in memory.
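A minimal Scala sketch illustrating these properties, assuming an existing SparkSession named spark:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4), numSlices = 2) // partitioned: data split across 2 partitions
val doubled = rdd.map(_ * 2) // immutable: the transformation returns a new RDD; rdd itself is unchanged
doubled.cache()              // cacheable: keep the computed partitions in memory
println(doubled.count())     // lazy evaluation: nothing runs until this action is called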

3. How to register a temporary table in Spark SQL?

When we create a DataFrame by loading data into it using the SQLContext object, we can register it as a temporary table. It is treated as temporary because the scope of the registration is limited to the particular session.
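A short sketch, assuming a hypothetical input file people.json and an existing SparkSession spark (Spark 1.x used registerTempTable; in Spark 2.x the call is createOrReplaceTempView):

val df = spark.read.json("people.json")     // load data into a DataFrame
df.createOrReplaceTempView("people")        // register a session-scoped temporary view
spark.sql("SELECT name FROM people").show() // query it like a table within this session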

4. How to count the number of lines in Scala?

In the Scala programming language, we can count lines using the getLines.size property:

Example: val countLines = source.getLines.size
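A self-contained sketch, assuming a hypothetical text file data.txt; note that getLines returns an iterator, so taking its size consumes it:

import scala.io.Source

val source = Source.fromFile("data.txt") // hypothetical input file
val countLines = source.getLines().size  // counts (and consumes) the lines
source.close()
println(countLines)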

 

Latest: Hadoop Admin Interview Questions for 3 to 15 years Experience

Nowadays, Hadoop administration is one of the emerging skills. Below are mid-level interview questions:

1. Explain your projects as per your resume, and the different types of distributions you have used?

2. Explain High Availability for the NameNode?

3. Explain Kerberos, Ranger, and Knox with scenarios?

4. Do you know any scripting language, like Python or shell scripting?

5. Difference between the NameNode and the CLDB (Container Location Database in MapR)?

6. How many Zookeepers are used in your project? Why is it always an odd number? Can you please explain?

7. How do you resolve a heartbeat issue, and what is the process to resolve it?

8. Describe an issue you recently resolved in the cluster (for example, with Hive or the HBase Master), and how you resolved it?

9. Difference between Cloudera, MapR, and Hortonworks with examples?

10. Why does the Secondary NameNode concept come into the picture in Hadoop? Explain?

11. Explain the step-by-step process of a Hortonworks installation? No need to explain the prerequisites.

Hadoop and Spark Interview Questions

Nowadays the IT market asks these Hadoop and Spark interview questions of experienced candidates.

Round 1:

1. What is the Future class in the Scala programming language?

2. Difference between fold, foldLeft, and foldRight in Scala?

3. How does DISTRIBUTE BY work in Hive? Given some sample data, explain how the data will be distributed.

4. dF.filter(Id == 3000): how do you pass this filter condition to a DataFrame dynamically, based on runtime values? (See the sketch after this list.)

5. Have you worked on multithreading in Scala? Explain.

7. On what basis will you increase the number of mappers in Apache Sqoop?

8. What last value will you mention while importing for the first time in Sqoop?

9. How do you mention the date for an incremental lastmodified import in Sqoop?

10. Let's say you have created the partition for Bengaluru but you loaded Hyderabad data. What validation do we have to do in this case to make sure that there won't be any errors?

11. How many reducers will be launched for a DISTRIBUTE BY in Spark?

12. How to delete a Sqoop job with a simple command?

13. In which location will the Sqoop job's last value be stored?

14. What are the default input and output formats in Hive?

15. Can you explain briefly the idea of the distributed cache in Spark with an example?

16. Did you use Kafka/Flume in your project? Explain in detail.

17. Difference between Parquet and ORC file formats?
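For question 4 above, a minimal sketch of passing the filter value dynamically, assuming an existing DataFrame df with an Id column:

import org.apache.spark.sql.functions.col

val targetId: Int = 3000 // stand-in for a value read from job arguments or config at runtime
val filtered = df.filter(col("Id") === targetId) // build the condition from the runtime value
filtered.show()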

Round 2:

1. Explain your previous project?

2. How do you handle incremental data in Apache Sqoop?

3. Which optimization techniques are used in Sqoop?

4. What are the different parameters you pass to your Spark job?

5. In case one task is taking more time than the others, how will you handle it?

6. What are stages and tasks in Spark? Give a real-time scenario.

7. On what basis do you set the number of mappers in Sqoop?

8. How will you export the data to Oracle without putting much load on the table?

9. What is a column family in HBase?

10. Can you create a table without mentioning a column family?

11. What is the limit on the number of column families for one table?

12. How did you schedule Spark jobs in your previous project?

13. Explain Spark architecture with a real-time scenario?

Deloitte Hadoop and Spark Interview Questions

Round 1:

1. Explain your previous project?

2. Write the Apache Sqoop code that you used in your previous project?

3. What is the reason for moving data from DBMS to Hadoop Environment?

4. What happens when you increase mappers in MapReduce?

5. What is the command to check the last value of an Apache Sqoop job?

6. Can you explain Distributed Cache?

7. Explain the Hive optimization techniques used in your project?

8. Which Hive analytic functions did you use in the project?

9. How to update records in a Hive table with a single command?

10. How to limit the records when you are consuming the data into a Hive table?

11. How to change the Hive engine to the Apache Spark engine?

12. Difference between Parquet and ORC file formats?

13. How to handle a huge data flow situation in your project?

14. Explain Apache Kafka and its architecture?

15. Which tool will create partitions in the Apache Kafka topic?

16. Which transformations and actions were used in your project?

17. Give a brief idea of the Spark architecture?

18. How will you check whether data is present or not in the 6th partition of an RDD?

19. How do you debug regex logic in Spark code?

20. Give me an idea about functional programming languages?

21. Difference between map vs flatMap in Spark?

22. For example, in Spark word count, which one do you use for splitting? What happens if you use map instead of flatMap in that program? (See the sketch after this list.)

23. If you have knowledge of Hadoop clusters, can you explain capacity planning for a four-node cluster?
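For question 22, a minimal word-count sketch contrasting the two, assuming an existing RDD named lines of type RDD[String]:

val counts = lines
  .flatMap(_.split(" "))  // one output element per word: RDD[String]
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// With map instead of flatMap, the split step would produce
// RDD[Array[String]] (one array per line), not a flat RDD of words,
// so the word-level pairing and counting would no longer apply.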

Round-2

1. Define YARN and MapReduce Architecture?

2. Explain Zookeeper's functionalities, and describe the flow when a node goes down?

3. Explain Data modeling in your project?

4. Were reporting tools used in your project? If yes, then explain?

5. Give me a brief idea about broadcast variables in Apache Spark?

6. Can you explain the Agile methodology and describe the Agile architecture?

Latest interview questions on Hadoop and Spark

1. Which internal algorithm does the NameNode use to decide where the replica of a block will be stored?

2. What will happen if a block of data is corrupted?

3. In a Scala program, how do you find out the number of transformations and actions?

4. If we are executing a query, how can we know which joins are taking more time, especially in Hive and Spark queries?

5. Scenario I: In Hive, we have two tables A and B. B is the master table and A is the table which receives updates of certain information. We want to update table B with the latest updated columns from A, based on the id.

Question: How do we achieve that, and what is the exact query we use? (A sketch follows this list.)

6. If Spark jobs have all failed, how do you handle it without checking log files and without the Web UI?

7. How to provide Security in Ambari without Kerberos?

8. Can you explain about High Availability Cluster in Hadoop Environment?

9. If you have a Spark job and a 25-node cluster, how many executors will be created by default?

10. How to change the column names in Hive while importing the data into Hive using Apache Sqoop?

11. How to handle data type mismatches while importing the data from an RDBMS to a Hive table?

12. How are NULLs handled when they are present in the partition column? What is the internal mechanism for this simple scenario?

13. Scenario: suppose we have a 4-node cluster with 128 GB of RAM per node, i.e. 512 GB of memory in total, and we have to process 1000 GB of data.

Question: How does Spark process data that is larger than the available memory? (A sketch follows this list.)

14. Did you use the email action in Oozie? How do you configure it?

15. In Scala programming, you have to make two RESTful API calls, say API 1 and API 2, and then an API 3. You have to call API 1 and API 2 concurrently, wait for both calls to finish, and then make the third call. How do you do this concurrently in Scala?
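For the Hive scenario in question 5, one hedged sketch: classic Hive has no record-level UPDATE, so a common pattern is to rebuild the master table with INSERT OVERWRITE plus a LEFT OUTER JOIN, taking the updated column from A wherever a matching id exists. Table and column names here are hypothetical, and the query is run through a SparkSession spark with Hive support:

spark.sql("""
  INSERT OVERWRITE TABLE B
  SELECT b.id,
         COALESCE(a.info, b.info) AS info -- prefer the updated value from A when present
  FROM B b
  LEFT OUTER JOIN A a ON b.id = a.id
""")

For question 13, the short answer is that Spark does not need to hold the whole dataset in memory at once: it processes data partition by partition and can spill to disk. A minimal sketch of making that explicit with a storage level that allows spilling:

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK) // partitions that do not fit in memory are written to disk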

Toughest Big Data (Spark, Kafka, Hive) Interview Questions

Hard Interview Questions for Spark, Kafka, and Hive:

1. How to handle Kafka back pressure with scripting parameters?

2. How to achieve performance tuning through executors?

3. What is the ideal size for deciding the executors, and how much RAM should be used?

4. How do you scale Kafka brokers and integrate with Spark Streaming without stopping the cluster, along with the script?

5. How to delete records in Hive, and how to delete duplicate records with scripting?

6. Can more than one replica exist in the same rack?

7. In a database with 10 tables, one table failed while importing from MySQL into HDFS using Sqoop. What is the solution?

8. If you submit a Spark job in a cluster and an RDD has already been created, and in the middle of the process the cluster goes down, what will happen to your RDD and how will the data be handled?

Summary: Nowadays these types of scenario-based interview questions are asked in Big Data environments for Spark and Hive.

Hadoop and Spark Scenario Typed Questions

1. Hadoop – Scenario:

If you are working on a Hadoop cluster and have already cached an RDD, with its output stored in the cache, and you now want to clear that memory space and use it for caching another RDD, how do you achieve this?
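A minimal sketch of one way to achieve this, assuming cachedRdd is the RDD already held in the cache and nextRdd is the one to cache next:

cachedRdd.unpersist() // remove this RDD's blocks from the cache, freeing the memory
nextRdd.cache()       // the freed space can now be used for the next RDD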

2. Spark – Scenario:

I) Suppose you are running 100 SQL jobs which generally take 50 minutes to complete, but one time it took 5 hours to complete.

Q 1) In this case, how do you report these errors?

Q 2) How do you debug the code and provide a proper solution for this scenario?

Rare interview questions on the Hadoop ecosystem:

1. What do you know about type safety, and which framework has type safety in Hadoop?

2. What are the serializations (SerDes) in Hive? Why do you choose a particular serialization? Explain in detail.

3. What modules have you worked with in Scala? Name the modules and explain briefly.

4. What packages have you worked with in Scala, and which packages have you imported in your current project?

5. What is the difference between map and mapPartitions? Give a clear explanation with a real-time example in Scala.

6. How do you connect to your cluster: using data nodes or edge nodes?

7. How do you allocate buffer memory to your DataNode?

8. How much buffer space have you allocated to your map task and reduce task on your DataNode?

9. How do you achieve a broadcast join automatically, without doing it manually? How do you set up your driver program to detect where a broadcast join would be good to use, and how do you automate the process? (A sketch follows.)
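A hedged sketch for question 9: Spark SQL chooses a broadcast join automatically when one side of the join is smaller than a configurable size threshold, so automating it is largely a matter of setting that threshold. The DataFrames here are hypothetical and spark is an existing SparkSession:

// Let the optimizer broadcast any table smaller than ~50 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

val joined = largeDf.join(smallDf, Seq("id")) // a broadcast hash join is picked automatically if smallDf fits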

Spark Most Typical Interview Questions List

    1. Why is an RDD resilient?
    2. Difference between persist and cache?
    3. Difference between lineage and DAG?
    4. What are narrow and wide transformations?
    5. What are shared variables and what are their uses?
    6. How to define a custom accumulator?
    7. If we have 50 GB of memory and 100 GB of data, how will Spark process it?
    8. How to create UDFs in Spark? (See the sketch after this list.)
    9. How to use Hive UDFs in Spark?
    10. What are accumulators and broadcast variables?
    11. How to decide the various parameter values in spark-submit?
    12. Difference between coalesce and repartition?
    13. Difference between RDD, DataFrame, and Dataset. When to use which one?
    14. What is data skew and how do you fix it?
    15. Why shouldn't we use the groupBy transformation in Spark?
    16. How to do a map-side join in Spark?
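For question 8 in the list above, a minimal sketch of registering a UDF; the function, table, and column names are hypothetical, and spark is an existing SparkSession:

// Register a plain Scala function as a SQL UDF.
val withTax = (price: Double) => price * 1.18 // hypothetical tax rate
spark.udf.register("withTax", withTax)
spark.sql("SELECT withTax(amount) FROM sales").show() // usable in SQL once registered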

1. What challenges did you face in the Spark project?

2. Use of map, flatMap, mapPartitions, and foreachPartition?

3. What is a Pair RDD? When to use them?

4. Performance optimization techniques in Spark?

5. Difference between cluster and client mode?

6. How to capture logs in client mode and cluster mode?

7. What happens if a worker node dies?

8. What types of file formats does Spark support? Which of them are most suitable for our organization's needs?

Basic Spark Developer Interview Questions:

1. Difference between reduceByKey() and groupByKey()?

2. Difference between Spark 1 and Spark 2?

3. How do you debug Spark jobs?

4. Difference between var and val?

5. What size of file do you use for development?

6. How long will it take to run your script in production?

7. How do you perform joins using RDDs?

8. How do you run your job in Spark?

9. What is the difference between the Spark DataFrame and the Dataset?

10. How are Datasets type-safe?

11. What are sink processors?

12. Lazy evaluation in Spark and its benefits?

13. After spark-submit, what process runs behind the application?

14. How to decide the number of stages in a Spark job?

The above questions are relevant for experienced Spark developers and beginners alike.