Spark Streaming is Spark's module for real-time applications (e.g. processing Twitter tweets, statistics, page views). It lets users write streaming applications using an API very similar to that of batch jobs. Spark Streaming is a distributed data stream processing framework that makes it easy to develop distributed applications for processing live data streams in real time. It not only provides a simple programming model but also enables an application to process high-velocity stream data. It also allows combining data streams with other data for processing.
Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant processing of live data streams. Data can be ingested from many sources such as Flume and Kafka, and can be processed using complex algorithms expressed with high-level functions like map and reduce. Processed data can be pushed out to file systems and live dashboards.
Process Flow in Spark Streaming:
Spark Streaming receives live input data streams and divides the data into batches. The Spark engine then processes each batch and, once processing is done, generates the final stream of results, also in batches.
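The micro-batch model above can be illustrated with a small sketch in plain Scala (no Spark dependency): a stream of records is split into batches, and each batch is processed independently. Note that real Spark Streaming splits by time interval, not by record count; the object and method names here are illustrative only.

```scala
// Conceptual sketch of Spark Streaming's micro-batch model:
// a continuous stream is divided into batches, and each batch
// is handed to the engine for processing.
object MicroBatchSketch {
  // Divide the stream into fixed-size batches (Spark Streaming
  // actually splits by time interval, e.g. every 20 seconds).
  def toBatches(stream: Seq[String], batchSize: Int): Seq[Seq[String]] =
    stream.grouped(batchSize).toSeq

  // "Process" one batch, here by counting occurrences per record.
  def processBatch(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (k, v) => (k, v.size) }

  def main(args: Array[String]): Unit = {
    val liveStream = Seq("a", "b", "a", "c", "b", "a") // pretend input
    toBatches(liveStream, 2).map(processBatch).foreach(println)
  }
}
```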
StreamingContext, a class defined in the Spark Streaming library, is the main entry point into Spark Streaming. It allows a Spark Streaming application to connect to a Spark cluster.
StreamingContext provides methods for creating instances of the data stream abstraction (DStream) provided by Spark Streaming.
Every Spark Streaming application must create an instance of this class:
import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setMaster("spark://host:port").setAppName("Streaming app")
val batchInterval = 20
val ssc = new StreamingContext(conf, Seconds(batchInterval))
The batch size can be as small as 500 milliseconds. The upper bound for the batch size is determined by the latency requirements of your application and the memory available to Spark Streaming.
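Putting these pieces together, the following sketch shows a hypothetical streaming word-count application. The master URL, host name, and port are placeholders (a text source such as `nc -lk 9999` is assumed); substitute values for your own cluster before running.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical streaming word count; master URL, host, and port
// below are placeholders, not a real deployment.
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://host:port")   // placeholder master URL
      .setAppName("Streaming word count")
    val ssc = new StreamingContext(conf, Seconds(20)) // 20-second batches

    // Each batch of input lines becomes a batch of (word, count) pairs.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print() // push each batch's results to the console

    ssc.start()             // begin receiving and processing data
    ssc.awaitTermination()  // run until stopped or failed
  }
}
```

Note that `counts.print()` is only one possible output action; processed DStreams can equally be written to file systems with `saveAsTextFiles`.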