What is Apache Spark?
Spark is a fast, easy-to-use, and flexible data processing and in-memory compute framework. It can run on top of the Hadoop ecosystem and in the cloud, accessing diverse data sources including HDFS, HBase, and other services.
Key features of Spark:
2. General purpose
What is the Spark Engine?
The Spark engine is responsible for scheduling, distributing, and monitoring large data applications.
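As a rough illustration of those three responsibilities, here is a minimal sketch that uses a local thread pool as a stand-in for a cluster. It is an analogy only, not Spark's scheduler or API: `run_job`, its parameters, and the round-robin partitioning are all hypothetical names and choices made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(data, num_partitions, task, max_workers=2):
    """Toy job runner (hypothetical): split data, run `task` per partition, log progress."""
    # Distribute: split the data round-robin into partitions.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    results = {}
    events = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Schedule: submit one task per partition to the worker pool.
        futures = {pool.submit(task, part): idx for idx, part in enumerate(partitions)}
        # Monitor: record each task as it completes (completion order may vary).
        for fut in as_completed(futures):
            idx = futures[fut]
            results[idx] = fut.result()
            events.append(f"task {idx} finished")
    # Return results in partition order, plus the monitoring log.
    return [results[i] for i in range(num_partitions)], events


partial_sums, events = run_job(list(range(10)), num_partitions=3, task=sum)
total = sum(partial_sums)
```

Here each partition is summed independently and the partial results are combined at the end, which is the general shape of a distributed job; a real engine adds cluster resource management, retries, and data locality on top.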
What is RDD?
RDD stands for Resilient Distributed Dataset. An RDD is designed to be fault-tolerant and represents data distributed across the cluster. Fault tolerance matters because the chance of a node failing grows with the number of nodes in a cluster.
RDD supports two operations:
1. Transformations
2. Actions
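The two kinds of RDD operations are transformations, which lazily describe a new dataset, and actions, which trigger the computation and return a result. The sketch below models that split in plain Python. `ToyRDD` is a hypothetical teaching class, not Spark's RDD API; it only borrows the method names `map`, `filter`, `collect`, and `count`, and it assumes each inner list stands for one partition.

```python
class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus a lineage of pending steps."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # list of lists, one per "node" (assumption)
        self._lineage = lineage        # per-partition functions, not yet applied

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._partitions, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._partitions, self._lineage + (step,))

    def compute_partition(self, index):
        # Replay the recorded lineage on one source partition.
        # The same replay is how a lost partition could be rebuilt.
        part = self._partitions[index]
        for step in self._lineage:
            part = step(part)
        return part

    # Actions: trigger the computation and return a value.
    def collect(self):
        out = []
        for i in range(len(self._partitions)):
            out.extend(self.compute_partition(i))
        return out

    def count(self):
        return len(self.collect())


calls = []  # records every element the mapped function actually sees


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = rdd.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)              # 0: transformations ran nothing
result = evens_squared.collect()              # [4, 16, 36]
rebuilt = evens_squared.compute_partition(1)  # [16, 36], recomputed from lineage
```

Keeping the lineage instead of the intermediate data is what makes the toy "resilient": any partition can be recomputed from its source and the recorded steps.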
What is Hive on Spark?
Hive supports Apache Spark as an execution engine. Hive execution is configured to use Spark with the settings below:
hive> set spark.home=/location/to/Spark_Home;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
Spark ecosystem components:
1. Spark SQL – For developing
2. Spark Streaming – For live data streaming
3. GraphX – For computing graphs
4. MLlib – For machine learning
5. SparkR – For R programming on the Spark engine
What is Spark SQL?
Spark SQL, which grew out of the earlier Shark project, is a Spark module for processing structured data. It lets Spark execute relational SQL queries on that data. At its core, Spark SQL is built on top of RDDs.
What is Spark Streaming?
Spark Streaming supports live data processing. It is an extension to the Spark API that allows stream processing of continuous live data streams. For example, data from sources such as HDFS and Flume is streamed, processed, and finally written out to file systems.
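Spark Streaming handles a continuous stream by cutting it into small batches and processing each batch in turn. The sketch below imitates that idea in plain Python. It is not the Spark Streaming API: `micro_batches` and `word_counts_per_batch` are hypothetical helpers, and batching by record count (rather than by a time interval, as Spark does) is a simplifying assumption.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to `batch_size` records from any iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def word_counts_per_batch(lines, batch_size):
    """Run a small word count on each micro-batch independently."""
    for batch in micro_batches(lines, batch_size):
        counts = Counter()
        for line in batch:
            counts.update(line.split())
        yield dict(counts)


# Stand-in for a live source; in practice records would keep arriving.
incoming = ["spark streaming", "spark sql", "graphx", "spark"]
outputs = list(word_counts_per_batch(incoming, batch_size=2))
```

Each element of `outputs` corresponds to one batch, so downstream code can write results out per batch instead of waiting for the stream to end.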
What is Spark GraphX?
GraphX is Spark's component for graph processing, used to build and transform graphs. It enables programmers to reason about structured data at scale.
What is Spark MLlib?
Spark MLlib is the scalable machine learning library provided by Apache Spark. It offers easy-to-use algorithms for use cases such as clustering, filtering, etc.