Latest 100 Hadoop and Spark Interview Questions and Answers
Interviewers today ask Data Engineers, Hadoop Developers, and Hadoop Admins the Spark interview questions below. They range from basic to intermediate.
1. What is the major difference between Spark and Hadoop?
2. What are the differences between functional and imperative languages, and why is functional programming important?
3. What is a resilient distributed dataset (RDD)? Explain with diagrams.
4. Explain transformations and actions in the context of RDDs?
5. What are the Spark streaming use cases?
6. What is lazy evaluation and why is it useful?
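A rough analogy for lazy evaluation, sketched in plain Python rather than Spark: generator pipelines, like RDD transformations, only describe work; nothing executes until a terminal call (the analogue of an action) forces evaluation. The `trace` helper is purely illustrative.

```python
# Pure-Python analogy for Spark's lazy evaluation (not Spark itself):
# generator expressions, like RDD transformations, build a plan and do
# no work until an "action"-like terminal call such as list() or sum().

calls = []

def trace(x):
    calls.append(x)   # record when work actually happens
    return x * 2

data = range(5)
doubled = (trace(x) for x in data)        # "transformation": nothing runs yet
evens = (x for x in doubled if x % 4 == 0)

assert calls == []                        # no computation has happened so far
result = list(evens)                      # "action": triggers the whole chain
assert calls == [0, 1, 2, 3, 4]
assert result == [0, 4, 8]
```

This mirrors why Spark can pipeline and optimize a whole chain of transformations before running any of them.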
7. What is Parallel Collection RDD?
8. Explain how ReduceByKey and GroupByKey work?
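The semantic difference can be sketched in plain Python (illustration only; real Spark does this across partitions with a shuffle): groupByKey materializes every value per key before aggregating, while reduceByKey combines values per key as they arrive, so only one partial result per key would need to cross the network.

```python
# Pure-Python sketch of the difference between groupByKey and reduceByKey.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# groupByKey: collect ALL values per key first (every value would be
# shuffled), then aggregate afterwards.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
group_sums = {k: sum(vs) for k, vs in grouped.items()}

# reduceByKey: combine values per key eagerly (map-side combine), so only
# one partial result per key would cross the network.
reduced = {}
for k, v in pairs:
    reduced[k] = reduced.get(k, 0) + v

assert group_sums == reduced == {"a": 4, "b": 6}
```

Same answer either way, which is why reduceByKey is usually preferred: it moves far less data for the identical result.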
9. What is the common workflow of a Spark program?
10. Explain the Directed Acyclic Graph (DAG). What is the difference between a DAG and a lineage graph?
11. What are the transformations and actions that you have used in Spark in your project?
12. How can you minimize data transfers when working with Spark?
13. What is a lineage graph?
14. Describe the major libraries that constitute the Spark ecosystem.
15. What are the pair RDDs?
16. What are the different file formats that can be used in Hadoop and Spark?
17. Which storage level should you choose in your project?
18. What is the difference between cache() and persist()? Explain it with an example?
19. What are the various levels of persistence in Spark?
20. What are the advantages and drawbacks of RDDs? Explain.
21. Why is a Dataset preferred over RDDs?
22. How can you share data from a Spark RDD between two applications?
23. How does Apache Spark provide checkpointing?
24. Explain Apache Spark memory caching with an example.
25. What is the function of the Block Manager in Spark?
26. Why does Spark SQL consider the support of indexes unimportant?
27. How do you convert existing Hive UDTFs to Scala functions and use them from Spark SQL? Explain with an example.
28. Why use DataFrames and Datasets when we have RDDs?
29. What is a Catalyst and how does it work?
30. What are the top challenges developers face while writing Spark applications?
31. Explain the difference in implementation between DataFrames and DataSet?
32. How is memory handled in DataSets?
33. What are the limitations of the dataset?
34. What contentions can arise with memory?
35. Show the command to run Spark in YARN client mode.
36. Show the command to run Spark in YARN cluster mode.
37. What are Standalone and YARN modes?
38. Explain client mode and cluster mode in Spark?
39. Which cluster managers are supported by Spark?
40. What is Executor memory?
41. What is DStream and what is the difference between batch and DStream in Spark Streaming?
42. How does Spark Streaming work?
43. What is the difference between map() and flatMap()?
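The difference can be sketched in plain Python, with `split` standing in for a function passed to map()/flatMap(): map yields exactly one output element per input, while flatMap yields zero or more and flattens the results.

```python
# Pure-Python sketch of map() vs flatMap() semantics.
lines = ["hello world", "spark streaming"]

# map-like: one output element (here, a list of words) per input line
mapped = [line.split() for line in lines]

# flatMap-like: each line expands to several words, flattened into one list
flat_mapped = [word for line in lines for word in line.split()]

assert mapped == [["hello", "world"], ["spark", "streaming"]]
assert flat_mapped == ["hello", "world", "spark", "streaming"]
```

This is why word count uses flatMap: the per-line word lists need to be flattened into a single stream of words before counting.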
44. What is the reduce() action? Is there any difference between reduce() and reduceByKey()?
45. What is the disadvantage of the reduce() action and how can we overcome this limitation?
46. What are Accumulators and when are accumulators truly reliable?
View Answer
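A toy, pure-Python model of why accumulator updates are only guaranteed inside actions: if a task that updates an accumulator during a transformation is re-executed (failure retry or speculative execution), the update is applied again. The `Accumulator` class and `task` function here are invented stand-ins, not Spark APIs.

```python
# Toy model of why Spark accumulators are truly reliable only in actions:
# a re-executed task re-applies side-effect updates, inflating the count.
class Accumulator:
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

acc = Accumulator()

def task(partition):
    for _ in partition:
        acc.add(1)          # side-effect update inside the task
    return sum(partition)

partition = [1, 2, 3]
task(partition)             # first attempt
task(partition)             # simulated retry of the same partition

assert acc.value == 6       # counted twice: 3 elements x 2 attempts
```

Spark guarantees exactly-once accumulator updates only for actions (e.g. foreach), because it can deduplicate updates per completed task there; transformations offer no such guarantee.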
47. What are the Broadcast Variables and what advantages do they provide?
48. What is a driver?
49. What is piping? Demonstrate an example of a data pipeline.
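A minimal data-pipeline sketch in plain Python, with stages chained as generators; the stage names (`read`, `parse`, `keep_passing`) and the sample records are invented for illustration, mirroring how records flow through chained operators in a Spark stage.

```python
# Minimal pipeline sketch: each stage consumes the previous one lazily.
def read(source):
    for line in source:
        yield line.strip()

def parse(rows):
    for row in rows:
        name, score = row.split(",")
        yield name, int(score)

def keep_passing(records, cutoff=50):
    for name, score in records:
        if score >= cutoff:
            yield name

raw = ["alice,72", "bob,41", "carol,90"]
passing = list(keep_passing(parse(read(raw))))
assert passing == ["alice", "carol"]
```

Each stage is independent and reusable, which is the same composability argument that applies to chaining RDD or DataFrame transformations.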
50. What does a Spark Engine do?
51. What are the steps that occur when you run a Spark application on the cluster?
52. What is a schema RDD/Dataframe?
53. What are the Row objects?
54. How does Spark achieve fault tolerance?
55. What parameter is set if cores need to be defined across executors?
56. Name a few Spark master system properties?
57. Define partitions in reference to the Spark implementation.
58. What is the difference between how Spark and MapReduce manage cluster resources under YARN?
59. What is GraphX and what is PageRank?
View Answer
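A minimal pure-Python power-iteration PageRank, the same idea GraphX's PageRank implements; the three-page link graph, damping factor (0.85), and iteration count are illustrative textbook choices, not values taken from Spark.

```python
# Minimal power-iteration PageRank on a tiny hand-made link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
damping, iters = 0.85, 20
ranks = {page: 1.0 for page in links}

for _ in range(iters):
    # each page splits its current rank evenly among its outlinks
    contribs = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)
        for dest in outlinks:
            contribs[dest] += share
    # standard update rule: base rank plus damped incoming contributions
    ranks = {p: (1 - damping) + damping * c for p, c in contribs.items()}

# "c" is linked to by both "a" and "b", so it outranks "b"
assert ranks["c"] > ranks["b"]
```

In GraphX the same iteration runs over a distributed graph, with each superstep exchanging rank contributions along edges.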
60. What does MLlib do?
61. What is a Parquet file? Explain.
62. What is schema evolution and what are its disadvantages? Explain schema merging in reference to Parquet files.
63. How will Spark replace MapReduce?
64. Why are Parquet and Avro files used for Spark SQL?
65. Explain Spark executors with a diagram.
66. Name the different types of cluster managers in Spark?
67. How many ways are there to create RDDs? Explain with examples.
68. How do you flatten rows in Spark? Explain with an example.
69. What is Hive on Spark?
70. Briefly explain the Spark Streaming architecture.
71. What are the types of Transformations on DStreams?
72. What is a Receiver in Spark Streaming, and can you build custom receivers?
73. Explain the process of storing DStream data from a live stream into a database.
74. How is Spark streaming fault-tolerant?
75. Explain the transform() method used in DStream?
76. What file systems does Spark support in your project?
77. How is data security achieved in Spark in your current Hadoop cluster?
78. What is security? Explain Kerberos security.
79. Name the various types of distribution that Spark supports.
80. Give some examples of queries using the Scala DataFrame API.
81. What are the most important factors you want to consider when you start the machine learning project?
82. Under what conditions can the Spark driver parallelize datasets as RDDs?
83. Can the repartition() operation decrease the number of partitions?
84. What are the drawbacks of the repartition() and coalesce() operations?
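A pure-Python sketch, under simplifying assumptions, of why coalesce() avoids a full shuffle but cannot increase the partition count: it only folds existing partitions together. The `i % n` merge rule here is an invented stand-in for Spark's actual grouping, which also tries to respect data locality.

```python
# Toy coalesce: merge existing partitions into fewer, without a shuffle.
def coalesce(partitions, n):
    if n >= len(partitions):
        return partitions            # coalesce never adds partitions
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)   # fold old partitions into new ones
    return merged

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
assert coalesce(parts, 2) == [[1, 2, 5, 6], [3, 4, 7, 8]]
assert coalesce(parts, 8) == parts   # cannot grow without a shuffle
```

The sketch also hints at coalesce's drawback: merged partitions can end up uneven in size, whereas repartition() pays for a full shuffle to rebalance data evenly.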
85. Consider the following code in Spark. What is the final value of the fVal variable?
86. For Scala pattern matching, show the various ways the code can be written.
87. In a join operation, for example val joinVal = rddA.join(rddB), how many partitions will be generated?
88. If we want to display just the schema of a DataFrame/Dataset, what method is called?
89. Show various implementations for the following query in Spark?
90. What are the most important factors you want to consider when you start the machine learning project?
91. As a data scientist, which algorithm would you suggest if legal aspects and ease of explanation to non-technical people are the main criteria?
92. For the supervised learning algorithm, what percentage of data is split between training and test dataset?
93. Compare the performance of Parquet and Avro file formats and their usage in the context of Spark?
94. Spark master exposes a set of REST APIs to submit and monitor applications. Which data format is used for these web services?
95. When should you not use Spark?
96. Can you use Spark to access and analyze data stored in Cassandra databases?
97. With which mathematical properties can you achieve parallelism?
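A small pure-Python demonstration of the property this question is after: parallel reduction works when the combining operation is associative (and commutative, if partial results may be merged in any order), because partitions can then be reduced independently and their partial results combined later.

```python
# Associativity/commutativity let a reduction be split across partitions.
from functools import reduce

data = [3, 1, 4, 1, 5, 9, 2, 6]
op = lambda a, b: a + b              # addition: associative and commutative

# sequential reduction over the whole dataset
sequential = reduce(op, data)

# "parallel" reduction: reduce each partition independently, then merge
parts = [data[:4], data[4:]]
partials = [reduce(op, p) for p in parts]
parallel = reduce(op, reversed(partials))   # merge order doesn't matter

assert sequential == parallel == 31
```

A non-associative operation like subtraction would give different answers depending on how the data is partitioned, which is why Spark's reduce() requires an associative, commutative function.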
98. What are the various types of partitioning in Apache Spark?
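Pure-Python sketches of the ideas behind Spark's two built-in partitioners (HashPartitioner and RangePartitioner); the functions and sample keys here are illustrative, not Spark APIs.

```python
# Hash partitioning: a key's partition is hash(key) mod the partition count.
def hash_partition(key, n):
    return hash(key) % n

# Range partitioning: a key's partition is the sorted key range it falls in.
def range_partition(key, bounds):
    # bounds are the upper edges of each range, e.g. ["g", "p"]
    for i, upper in enumerate(bounds):
        if key <= upper:
            return i
    return len(bounds)

keys = ["apple", "kiwi", "zebra"]
assert [range_partition(k, ["g", "p"]) for k in keys] == [0, 1, 2]
assert 0 <= hash_partition("apple", 4) < 4
```

Hash partitioning spreads keys evenly and suits joins and aggregations; range partitioning keeps keys ordered across partitions, which suits sorted output.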
99. How do you set partitioning for data in Apache Spark?
100. Explain your project architecture. How is Spark involved in data processing?