What is Spark Streaming? Process flow in Spark Streaming

Spark Streaming

Spark Streaming is an extension of the core Spark API that offers scalable, high-throughput, fault-tolerant processing of live data streams. It is the Spark module for real-time applications, for example tracking statistics about page views in real time.

Data can be ingested from many sources, for example Kafka, Flume, and TCP sockets.

Spark Streaming's key programming abstraction is the DStream, which represents a continuous stream of data divided into small batches.

DStreams integrate with other Apache Spark components, for example Spark MLlib and Spark SQL, as the sketch below illustrates.
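As a minimal sketch of the Spark SQL integration, each micro-batch of a DStream can be queried with SQL via foreachRDD. The socket source on localhost:9999 and the ten-second batch interval are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("SqlOnStream")
val ssc = new StreamingContext(conf, Seconds(10))

// Assumed source: words arriving one per line on a local TCP socket.
val words = ssc.socketTextStream("localhost", 9999)

// Run a Spark SQL query over each micro-batch of the DStream.
words.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  rdd.toDF("word").createOrReplaceTempView("words")
  spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
}

ssc.start()
ssc.awaitTermination()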

Process flow in Spark Streaming

Spark Streaming works in the following fashion (see the sketch after this list):

  • Spark Streaming receives live input data streams and divides the data into batches.
  • The Spark engine processes each batch of data.
  • Once processing is done, the Spark engine generates the final stream of results, also in batches.
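A minimal word-count sketch of this flow, assuming text lines arrive on a local TCP socket (the host, port, and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")

// Incoming data is divided into batches every 10 seconds.
val ssc = new StreamingContext(conf, Seconds(10))

// Receive a live stream of text lines from a TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)

// The Spark engine processes each batch: split lines into words and count them.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// The final stream of results is emitted batch by batch.
counts.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // wait for the computation to stop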

What is Spark Streaming Context (SSC)?

StreamingContext, a class defined in the Spark Streaming library, is the main entry point for any Spark Streaming application. All streaming functionality begins by creating one:

import org.apache.spark._
import org.apache.spark.streaming._

// The master URL and application name below are placeholders.
val conf = new SparkConf().setMaster("spark://host:port").setAppName("First Streaming Apps")
val batchInterval = 10

// Create the StreamingContext with a 10-second batch interval.
val ssc = new StreamingContext(conf, Seconds(batchInterval))

Note: The batch interval can be as small as 500 milliseconds. The upper bound on the batch interval is determined by the latency requirements of your application and the available memory.
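If sub-second batches are needed, the interval can be expressed with Milliseconds instead of Seconds. A sketch reusing conf from the snippet above, in place of the 10-second context:

import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// A 500 ms batch interval, the practical lower bound noted above.
val fastSsc = new StreamingContext(conf, Milliseconds(500))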

Spark Streaming Context methods:

A Spark Streaming computation is controlled through two StreamingContext methods:

1. ssc.start()

2. ssc.awaitTermination()

  • ssc.start()

After this method is invoked, the Spark Streaming application starts receiving data and processing it.

  • ssc.awaitTermination()

This method blocks and waits for the streaming computation to terminate; the computation stops only when it is explicitly stopped (with ssc.stop()) or an error occurs.
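The typical lifecycle is sketched below; the 60-second timeout and the graceful-stop flags are illustrative assumptions, not prescriptions:

// Start the computation: receivers begin pulling in data.
ssc.start()

// Block until the computation stops or 60 seconds elapse;
// returns true if the computation stopped within the timeout.
val stopped = ssc.awaitTerminationOrTimeout(60 * 1000)

if (!stopped) {
  // Stop the streaming context (and the underlying SparkContext),
  // letting in-flight batches finish first.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}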

DStreams or discretized streams:

A DStream is the main programming abstraction in Spark Streaming. It represents a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs, one for each timestamp.

DStreams support two types of operations:

1. Transformations, which yield a new DStream

2. Output operations, which write data to an external system (see the sketch below)
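A brief sketch contrasting the two, assuming a lines DStream like the one created in the word-count example above; the output path prefix is a placeholder:

// Transformations yield a new DStream; nothing executes until an output operation runs.
val words  = lines.flatMap(_.split(" "))        // DStream[String]
val pairs  = words.map(word => (word, 1))       // DStream[(String, Int)]
val counts = pairs.reduceByKey(_ + _)           // DStream[(String, Int)]

// Output operations write data to an external system and trigger execution.
counts.print()                                  // print each batch to the console
counts.saveAsTextFiles("out/wordcounts")        // save each batch under this path prefix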