Latest 100 Hadoop and Spark Interview Questions and Answers in Big Data





Interviewers now commonly ask the Spark questions below when hiring Data Engineers, Hadoop Developers, and Hadoop Admins. The list covers basic and intermediate Spark interview questions.

Latest 100 Hadoop and Spark Interview Questions and Answers

1. What is the major difference between Spark and Hadoop?

2. What are the differences between functional and imperative languages, and why is functional programming important?

3. What is a resilient distributed dataset (RDD)? Explain with diagrams.

4. Explain transformations and actions in the context of RDDs.

5. What are the use cases for Spark Streaming?

6. What is lazy evaluation, and why is it useful?
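
As a rough illustration of the idea behind this question, the sketch below uses plain Python generators rather than Spark itself. The names `traced_double` and `log` are hypothetical. The point is only that building the pipeline does no work until a consuming step, loosely analogous to a Spark action, asks for results.

```python
# Plain-Python analogy for lazy evaluation; this is not Spark code.
log = []  # records each value at the moment it is actually processed


def traced_double(x):
    """Hypothetical helper: doubles x and records that it ran."""
    log.append(x)
    return 2 * x


numbers = [1, 2, 3, 4, 5]

# Building the pipeline runs nothing yet (like chaining transformations).
pipeline = (traced_double(n) for n in numbers if n % 2 == 1)
work_before_action = list(log)

# Consuming the pipeline triggers the work (like calling an action).
result = list(pipeline)
work_after_action = list(log)

print(work_before_action, result, work_after_action)
```

Because the filter and the doubling run together only when the result is requested, the even numbers are never passed to `traced_double` at all.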

7. What is a Parallel Collection RDD?

8. Explain how reduceByKey() and groupByKey() work.
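
One way to picture the contrast is the plain-Python emulation below, which does not use Spark. It assumes two made-up "partitions" of key/value pairs and counts how many records would cross the shuffle in each approach. The function names `group_by_key` and `reduce_by_key` are hypothetical stand-ins, not Spark's API.

```python
from collections import defaultdict

# Plain-Python emulation; the two "partitions" are made-up sample data.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]


def group_by_key(parts):
    """Every (key, value) pair crosses the shuffle, then values are grouped."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def reduce_by_key(parts, func):
    """Values are combined inside each partition first, then merged."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


grouped, grouped_shuffle_size = group_by_key(partitions)
reduced, reduced_shuffle_size = reduce_by_key(partitions, lambda x, y: x + y)

print(grouped, grouped_shuffle_size)
print(reduced, reduced_shuffle_size)
```

In this toy data the grouped version moves all six records, while the reduced version moves only one partial result per key per partition.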

9. What is the common workflow of a Spark program?

10. Explain the Directed Acyclic Graph (DAG). What is the difference between a DAG and a lineage graph?




Spark interview questions

11. Which transformations and actions have you used in Spark in your project?

12. How can you minimize data transfers when working with Spark?

13. What is a lineage graph?

14. Describe the major libraries that constitute the Spark ecosystem.

15. What are pair RDDs?

16. What are the different file formats that can be used in Hadoop and Spark?

17. Which storage level would you choose in your project?

18. What is the difference between cache() and persist()? Explain with an example.

19. What are the various levels of persistence in Spark?

20. What are the advantages and drawbacks of RDDs? Explain.

21. Why are Datasets preferred over RDDs?

22. How can data from a Spark RDD be shared between two applications?

23. Does Apache Spark provide checkpointing? Explain.

24. Explain in-memory caching in Apache Spark with an example.

25. What is the function of the Block Manager in Spark?

26. Why does Spark SQL consider the support of indexes unimportant?

27. How do you convert existing Hive UDTFs to Scala functions and use them from Spark SQL? Explain with an example.

28. Why use DataFrames and Datasets when we have RDDs?

29. What is Catalyst, and how does it work?

30. What are the top challenges developers face while writing Spark applications?

31. Explain the difference in implementation between DataFrames and Datasets.

32. How is memory handled in Datasets?

33. What are the limitations of Datasets?

34. What are the sources of memory contention in Spark?

35. Show the command to run Spark in YARN client mode.

36. Show the command to run Spark in YARN cluster mode.

37. What are Standalone mode and YARN mode?

38. Explain client mode and cluster mode in Spark.

39. Which cluster managers are supported by Spark?

40. What is Executor memory?

41. What is a DStream, and what is the difference between a batch and a DStream in Spark Streaming?

42. How does Spark Streaming work?

43. What is the difference between map() and flatMap()?
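
The distinction can be sketched with ordinary Python lists, without Spark. The helpers `rdd_map` and `rdd_flat_map` below are hypothetical stand-ins that only mimic the shape of the results, and the sample lines are made up.

```python
from itertools import chain

# Plain-Python emulation of the result shapes; this is not Spark code.
lines = ["hello world", "spark"]


def rdd_map(func, data):
    """Produce exactly one output element per input element."""
    return [func(item) for item in data]


def rdd_flat_map(func, data):
    """func returns a sequence per element; the sequences are flattened."""
    return list(chain.from_iterable(func(item) for item in data))


mapped = rdd_map(str.split, lines)
flat_mapped = rdd_flat_map(str.split, lines)

print(mapped)
print(flat_mapped)
```

In this emulation `mapped` keeps one list of words per input line, while `flat_mapped` is a single flat list of words.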




44. What is the reduce() action? Is there any difference between reduce() and reduceByKey()?

45. What is the disadvantage of the reduce() action, and how can we overcome this limitation?

46. What are Accumulators and when are accumulators truly reliable?

47. What are broadcast variables, and what advantages do they provide?

48. What is a driver?

49. What is piping? Demonstrate with an example of a data pipeline.

50. What does the Spark engine do?

51. What are the steps that occur when you run a Spark application on the cluster?

52. What is a SchemaRDD/DataFrame?

53. What are Row objects?

54. How does Spark achieve fault tolerance?

55. What parameter is set if cores need to be defined across executors?

56. Name a few Spark master system properties.

57. Define partitions with reference to Spark's implementation.

58. What is the difference between how Spark and MapReduce manage cluster resources under YARN?

59. What is GraphX and what is PageRank?

60. What does MLlib do?

61. What is a Parquet file? Explain.

62. What is schema evolution, and what is its disadvantage? Explain schema merging with reference to Parquet files.

63. How will Spark replace MapReduce?




64. Why are Parquet and Avro files used with Spark SQL?

65. Explain Spark executors with a diagram.

66. Name the different types of cluster managers in Spark.

67. In how many ways can RDDs be created? Give examples.

68. How do you flatten rows in Spark? Explain with an example.

69. What is Hive on Spark?

70. Briefly explain the Spark Streaming architecture.

71. What are the types of Transformations on DStreams?

72. What is a Receiver in Spark Streaming, and can you build custom receivers?

73. Explain the process of storing DStream data from a live stream in a database.

74. How is Spark Streaming fault-tolerant?

75. Explain the transform() method used with DStreams.

76. Which file systems does Spark support in your project?

77. How is data security achieved in Spark in your current Hadoop cluster?

78. What is security? Explain Kerberos security.

79. Name the various types of distribution that Spark supports.




80. Give some examples of queries using the Scala DataFrame API.

81. What are the most important factors to consider when you start a machine learning project?

82. Under what conditions can the Spark driver parallelize datasets as RDDs?

83. Can the repartition() operation decrease the number of partitions?

84. What are the drawbacks of the repartition() and coalesce() operations?

85. Consider the following code in Spark. What are the final values in the fVal variable?

86. Show the various ways Scala pattern-matching code can be written.

87. In a join operation, for example val joinVal = rddA.join(rddB), will partitions be generated?

88. If we want to display just the schema of a DataFrame/Dataset, which method is called?

89. Show various implementations for the following query in Spark.

90. What are the most important factors to consider when you start a machine learning project?

91. As a data scientist, which algorithm would you suggest if legal aspects and ease of explanation to non-technical people are the main criteria?

92. For a supervised learning algorithm, what percentage of the data is split between the training and test datasets?

93. Compare the performance of Parquet and Avro file formats and their usage in the context of Spark?

94. Spark master exposes a set of REST APIs to submit and monitor applications. Which data format is used for these web services?




95. When should you not use Spark?

96. Can you use Spark to access and analyze data stored in Cassandra databases?

97. With which mathematical properties can you achieve parallelism?
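
A small plain-Python sketch, not Spark code, can make this concrete. It assumes a hypothetical helper `parallel_reduce` that reduces each partition on its own and then combines the partial results. Addition is associative and commutative, so the answer does not depend on how the data is split; subtraction is neither, so it does.

```python
from functools import reduce
import operator


def parallel_reduce(func, partitions):
    """Reduce each partition independently, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)


data = [1, 2, 3, 4, 5, 6]
split_a = [[1, 2, 3], [4, 5, 6]]      # same data, two partitions
split_b = [[1, 2], [3, 4], [5, 6]]    # same data, three partitions

# Addition: every split agrees with the sequential result.
sum_sequential = reduce(operator.add, data)
sum_a = parallel_reduce(operator.add, split_a)
sum_b = parallel_reduce(operator.add, split_b)

# Subtraction: the result changes with the partitioning.
sub_sequential = reduce(operator.sub, data)
sub_a = parallel_reduce(operator.sub, split_a)
sub_b = parallel_reduce(operator.sub, split_b)

print(sum_sequential, sum_a, sum_b)
print(sub_sequential, sub_a, sub_b)
```

This sketch keeps the partitions in order, so it mainly exercises associativity. Commutativity matters as well when partial results can be combined in any order.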

98. What are the various types of partitioning in Apache Spark?
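
Spark ships a HashPartitioner and a RangePartitioner, and the general idea of each can be sketched in plain Python without Spark. The functions `hash_partition` and `range_partition` below are hypothetical, the keys are made up, and the range bounds are picked by hand here as an assumption rather than derived from the data.

```python
from bisect import bisect_left


def hash_partition(key, num_partitions):
    """Partition index from the key's hash, modulo the partition count."""
    return hash(key) % num_partitions


def range_partition(key, upper_bounds):
    """Partition index from sorted upper bounds; keys above all bounds go last."""
    return bisect_left(upper_bounds, key)


keys = [3, 17, 42, 8, 25]

# Four hash partitions; small Python ints hash to themselves.
hash_assignment = {k: hash_partition(k, 4) for k in keys}

# Four range partitions: <=10, 11-20, 21-30, and >30 (hand-picked bounds).
range_assignment = {k: range_partition(k, [10, 20, 30]) for k in keys}

print(hash_assignment)
print(range_assignment)
```

Hash partitioning scatters neighbouring keys across partitions, while range partitioning keeps keys in sorted order across partitions, which is useful for sorted output.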

99. How do you set partitioning for data in Apache Spark?

100. Explain your project architecture. How is Spark involved in data processing?
