Most Frequently Asked Apache Storm Interview Questions and Answers




Top 5 Apache Storm Interview Questions:


1. What is the difference between Apache Kafka and Apache Storm?

Apache Kafka is a distributed, robust messaging system that can handle large volumes of data and allows messages to be passed from one endpoint to another. Data streams are partitioned and spread over a cluster of machines, allowing streams larger than any single machine could handle.

Apache Storm, by contrast, is a real-time message processing system that lets you transform or manipulate data as it arrives. A common pattern is to pull data from Kafka and apply whatever manipulation is required, making streaming data easy to process in real time.
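As a concrete illustration of that pattern, here is a minimal sketch that wires a Kafka topic into a Storm topology through the storm-kafka-client module; the broker address, topic name, class names, and the trivial print bolt are all illustrative assumptions, not part of any particular deployment:

    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    public class KafkaToStorm {
        // Trivial "manipulation" step: just print each tuple pulled from Kafka.
        public static class PrintBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println(tuple);
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) { }
        }

        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            // Spout that reads from an (assumed) "events" topic on a local broker.
            builder.setSpout("kafka-spout", new KafkaSpout<>(
                    KafkaSpoutConfig.builder("localhost:9092", "events").build()));
            builder.setBolt("print", new PrintBolt()).shuffleGrouping("kafka-spout");
            // builder.createTopology() would then be submitted via StormSubmitter.
        }
    }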

2. What are the key benefits of using Storm for Real-Time Processing?

Really fast: Apache Storm has been benchmarked at around one million messages per second per node.

Fault-tolerant: Apache Storm detects failures automatically and restarts the affected workers.

Easy to operate: deploying and operating Apache Storm is straightforward.

3. Does Apache Storm act as a proxy server?

No. This question usually confuses Apache Storm with the Apache HTTP Server, whose mod_proxy module implements a proxy, gateway, or cache. Apache Storm itself is a stream-processing system and provides no proxy functionality.




4. How can you kill a topology in Apache Storm?

Simply run: storm kill {stormname}

Give storm kill the same name you used when submitting the topology.
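For example (the topology name is an illustrative assumption; the optional -w flag is part of the standard storm CLI and overrides how long Storm waits between deactivating the spouts and destroying the workers):

    storm kill word-topology          # kill the topology named at submit time
    storm kill word-topology -w 30    # deactivate spouts, wait 30 s, then destroy workers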

5. What are the common configurations in Apache Storm?

Apache Storm supports a variety of configurations that can be set per topology. Here are some common ones (a short code sketch follows the list):

  1. Config.TOPOLOGY_WORKERS: sets the number of worker processes to use to execute the topology.
  2. Config.TOPOLOGY_ACKER_EXECUTORS: sets the number of executors that track tuple trees and detect when a spout tuple has been fully processed. Leaving this unset (null) makes Storm default to one acker executor per worker.
  3. Config.TOPOLOGY_MAX_SPOUT_PENDING: sets the maximum number of spout tuples that can be pending on a single spout task at once (pending means emitted but not yet acked or failed).
  4. Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS: the maximum amount of time a spout tuple has to be fully processed before it is considered failed.
  5. Config.TOPOLOGY_SERIALIZATIONS: lets you register more serializers with Storm so that you can use custom types within tuples.
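For illustration, here is a minimal sketch of setting these options through the helper methods on org.apache.storm.Config; the values, the class name, and the commented-out custom type are arbitrary assumptions:

    import org.apache.storm.Config;

    public class TopologyConfigExample {
        public static void main(String[] args) {
            Config conf = new Config();
            conf.setNumWorkers(4);           // Config.TOPOLOGY_WORKERS
            conf.setNumAckers(4);            // Config.TOPOLOGY_ACKER_EXECUTORS
            conf.setMaxSpoutPending(1000);   // Config.TOPOLOGY_MAX_SPOUT_PENDING
            conf.setMessageTimeoutSecs(30);  // Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
            // Custom serializers (Config.TOPOLOGY_SERIALIZATIONS); MyCustomType is hypothetical:
            // conf.registerSerialization(MyCustomType.class);
            // The finished Config would be passed to StormSubmitter.submitTopology(...)
            // along with a topology name and the topology itself.
            System.out.println(conf);
        }
    }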

 

Difference between Hadoop Cluster and Storm Cluster





Hadoop Cluster:

Hadoop is an open-source framework for distributed storage and processing of large data sets on clusters of commodity hardware.

1. In Hadoop, we run MapReduce jobs.

2. A MapReduce job starts its processes and eventually ends.

3. The master node runs a daemon called JobTracker.

4. A Hadoop MapReduce job runs two kinds of tasks: mapper tasks and reducer tasks. This is restrictive: there are only these two task types, and they must be mappers and reducers.

Storm Cluster:

Apache Storm is an open-source distributed real-time computation system. It works on a continuous stream of data rather than on data held in a persistent storage system, and it provides a framework for interacting with a running application.

1. In Storm, we run topologies.

2. A topology, once started, is intended to keep processing live data forever, data that it continuously receives from sources such as ZeroMQ, Kafka, etc.

3. The master node runs a daemon called Nimbus.

4. Storm, in contrast, runs two kinds of tasks in a topology: spouts and bolts. A spout receives data from external sources and creates the streams that feed the bolts, which do the actual processing. Bolts can be chained serially or in parallel, depending on the kind of processing required.

Components of Storm:

Topology

A topology is a network of spouts and bolts. It is analogous to a MapReduce job in Hadoop: a graph of computation consisting of spouts and bolts.

Spouts act as data-stream source tasks and bolts as the actual processing tasks. Each node in the graph contains some processing logic, and the links between nodes indicate how data flows between them.
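To make the spout/bolt graph concrete, here is a minimal runnable sketch (assuming the Storm 2.x API; the word spout, the print bolt, and all names and parallelism values are illustrative assumptions):

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class WordTopology {
        // Spout: the data-stream source; emits a random word every 100 ms.
        public static class WordSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final String[] words = {"storm", "kafka", "hadoop"};

            @Override
            public void open(Map<String, Object> conf, TopologyContext ctx,
                             SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                Utils.sleep(100);
                collector.emit(new Values(words[(int) (Math.random() * words.length)]));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        // Bolt: the processing node; here it just prints each word.
        public static class PrintBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println(tuple.getStringByField("word"));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) { }
        }

        public static void main(String[] args) throws Exception {
            // The graph: one spout feeding one bolt over a shuffle grouping.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("word-spout", new WordSpout(), 1);
            builder.setBolt("print-bolt", new PrintBolt(), 2).shuffleGrouping("word-spout");

            // Run in-process for ten seconds, then shut down.
            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("word-topology", new Config(), builder.createTopology());
                Utils.sleep(10_000);
            }
        }
    }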




When a topology is submitted to a Storm cluster, the Nimbus service on the master node consults the supervisor services on the various worker nodes and distributes the topology among them. Each supervisor creates one or more worker processes, each running in its own separate JVM.
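On a production cluster, the submission itself is done with the storm jar command; the jar name, main class, and topology name below are illustrative assumptions (for a main class that submits via StormSubmitter rather than running a LocalCluster):

    storm jar word-topology.jar com.example.WordTopology word-topology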

Apache Storm Architecture

What is Storm?





Apache Storm is a distributed framework for real-time processing of Big Data, just as Hadoop is a distributed framework for batch processing.

Advantages of Storm:

Fault tolerance – if worker processes die or a node goes down, the workers are automatically restarted.

Scalability – throughput of even one million 100-byte messages per second per node can be achieved, along with ease of deploying and operating the system.

Architecture of Storm:

Apache Storm does not have its own state-management capabilities. Instead, it uses Apache ZooKeeper to manage cluster state: all coordination between Nimbus and the Supervisors, such as message acknowledgments and processing status, is done through a ZooKeeper cluster. The Nimbus and Supervisor daemons are stateless; all state is kept in ZooKeeper or on the local disk.

Storm originally used the ZeroMQ library for inter-process communication between worker processes, but after it was adopted as an Apache project, the Storm developers replaced ZeroMQ with Netty.

Explanation of the Components:

Nimbus:

Nimbus is the master node of a Storm cluster. All other nodes in the cluster are called worker nodes. The master node is responsible for distributing data among the worker nodes, assigning tasks to them, and monitoring failures.



Supervisor:

The nodes that follow the instructions given by Nimbus are called supervisors. A supervisor has multiple worker processes and governs them to complete the tasks assigned by Nimbus.

Worker Process:

A worker process executes tasks related to a specific topology. A worker process does not run a task by itself; instead, it creates executors and asks them to perform particular tasks. A worker process can have multiple executors.

Executor:

An executor is nothing but a single thread spawned by a worker process; it runs one or more tasks, but only for a specific spout or bolt.

Task:

A task performs the actual data processing: each task is an instance of a spout or a bolt, and an executor may run one or more of them (see the sketch below for how workers, executors, and tasks combine).
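To tie workers, executors, and tasks together, here is a hedged sketch of the arithmetic, reusing the hypothetical WordTopology.WordSpout and WordTopology.PrintBolt from the sketch above; all numbers are arbitrary:

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismExample {
        public static void main(String[] args) {
            Config conf = new Config();
            conf.setNumWorkers(2);  // 2 worker processes, i.e. 2 JVMs

            TopologyBuilder builder = new TopologyBuilder();
            // 2 executors for the spout; by default each executor runs 1 task.
            builder.setSpout("words", new WordTopology.WordSpout(), 2);
            // 4 executors for the bolt, running 8 tasks (2 tasks per executor).
            builder.setBolt("print", new WordTopology.PrintBolt(), 4)
                   .setNumTasks(8)
                   .shuffleGrouping("words");
            // Net effect: 2 JVMs, 6 executors (about 3 per worker), and
            // 10 tasks (2 spout tasks + 8 bolt tasks), plus any acker executors.
        }
    }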