Spark SQL with Examples

HiveContext provides a superset of the functionality provided by SQLContext. The parser that comes with HiveContext is more powerful than the SQLContext parser: it can execute both HiveQL (Hive Query Language) and SQL queries, and it can read data from Hive tables. It also allows applications to access Hive UDFs (User Defined Functions). If we want to process existing Hive tables, we add the hive-site.xml file to Spark's classpath; HiveContext reads the Hive configuration from that file.
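For example, a minimal sketch in the spark-shell (Spark 1.x APIs) might look like the following; the table name employees is only an assumption and must already exist in the Hive metastore:

scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc)    // sc is the SparkContext created by the spark-shell
scala> val employees = hiveContext.sql("SELECT * FROM employees LIMIT 10")    // HiveQL query on an existing Hive table
scala> employees.show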

DataFrames: A DataFrame is Spark SQL's primary data abstraction. It represents a distributed collection of rows organized into named columns, similar to a table in a relational database.

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
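As a rough sketch of the query-engine side (the file people.json and its columns are only assumptions for illustration), a DataFrame can be registered as a temporary table and queried with plain SQL:

scala> val peopleDf = sqlContext.read.json("people.json")    // assumed sample file with name and age fields
scala> peopleDf.registerTempTable("people")    // expose the DataFrame to SQL queries
scala> sqlContext.sql("SELECT name FROM people WHERE age >= 18").show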

A few points about Spark SQL DataFrames:

1. A DataFrame is a distributed collection of data organized into named columns.

2. It is conceptually equivalent to a table in a relational database.

3. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs (by converting an RDD to a DataFrame), as sketched below.
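For instance, a rough sketch of loading DataFrames from a few different sources could look like this (the file path, table name, and JDBC connection details are all assumptions for illustration only):

scala> val eventsDf = sqlContext.read.parquet("events.parquet")    // structured data file
scala> val logsDf = sqlContext.table("web_logs")    // existing Hive table (requires Hive support)
scala> val ordersDf = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://dbhost/shop").option("dbtable", "orders").load()    // external database (JDBC driver must be on the classpath)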

Spark SQL provides an implicit conversion method named toDF, which creates a DataFrame in a simple way.

For an RDD of objects represented by a case class, Spark SQL infers the schema of the dataset when this technique is used. The toDF method is not defined in the RDD class, but it is available through an implicit conversion. To convert an RDD to a DataFrame using toDF, we first need to import the implicit methods.
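A minimal sketch of this schema-inference approach might look like the following (the Person case class and its sample values are made up for illustration):

scala> case class Person(name: String, age: Int)
scala> import sqlContext.implicits._    // brings the toDF implicit conversion into scope
scala> val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31)))
scala> val peopleDf = people.toDF()    // schema (name: string, age: int) inferred from the case class fields
scala> peopleDf.printSchema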

Example:

scala> import sqlContext.implicits._    // needed so toDF is available on the RDD

scala> val data = sc.parallelize(1 to 100)

scala> val newData = data.map(l => (l, l - 10))    // pair each value with a transformed value

scala> val resultData = newData.toDF("normal", "transformed")

scala> resultData.printSchema

scala> resultData.show

The above example is very simple and a good starting point for Spark beginners.