Spark SQL with Examples




HiveContext provides a superset of the functionality provided by SQLContext. The parser that comes with HiveContext is more powerful than the SQLContext parser: it can execute both HiveQL (Hive Query Language) and SQL queries, and it can read data from Hive tables. It also allows applications to access Hive UDFs (User Defined Functions). To process existing Hive tables, add the hive-site.xml file to Spark's classpath; HiveContext reads the Hive configuration from that file.
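As a minimal sketch (assuming the Spark 1.x shell, where sc is the SparkContext, and a hypothetical Hive table named employees):

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> // run a HiveQL query against an existing Hive table ("employees" is hypothetical)
scala> val result = hiveContext.sql("SELECT name, salary FROM employees WHERE salary > 50000")
scala> result.show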

DataFrames: A DataFrame is Spark SQL's primary data abstraction. It represents a distributed collection of rows organized into named columns, similar to a table in a relational database.

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
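For example, here is a small sketch of the SQL query engine side in the Spark 1.x shell; the table name people and the sample rows are assumptions for illustration:

scala> import sqlContext.implicits._
scala> val df = sc.parallelize(Seq(("Alice", 29), ("Bob", 35))).toDF("name", "age")
scala> // register the DataFrame as a temporary table and query it with plain SQL
scala> df.registerTempTable("people")
scala> sqlContext.sql("SELECT name FROM people WHERE age > 30").show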

A few points about DataFrames:

1. A DataFrame is a distributed collection of data organized into named columns.

2. It is conceptually equivalent to a table in a relational database.

3. DataFrames can be constructed from a wide array of sources, such as structured data files (see the sketch after this list), tables in Hive, external databases, or existing RDDs (converting an RDD to a DataFrame).
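As a sketch of constructing a DataFrame from a structured data file (the path /tmp/people.json is hypothetical; sqlContext.read requires Spark 1.4 or later):

scala> // infer the schema from a JSON file and load it as a DataFrame
scala> val df = sqlContext.read.json("/tmp/people.json")
scala> df.printSchema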

Spark SQL provides an implicit conversion method named toDF, which creates a DataFrame from an RDD.

When this technique is used with an RDD of objects represented by a case class, Spark SQL infers the schema of the data set. The toDF method is not defined in the RDD class, but it is available through an implicit conversion. To convert an RDD to a DataFrame using toDF, we first need to import the implicit methods.



Example:

scala> import sqlContext.implicits._
scala> val data = sc.parallelize(1 to 100)
scala> val newData = data.map(i => (i, i - 10))
scala> val resultData = newData.toDF("normal", "transformed")
scala> resultData.printSchema
scala> resultData.show

The above example is very simple and a good starting point for Spark beginners.
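To illustrate the case-class technique mentioned earlier, here is a small sketch where Spark SQL infers the column names and types from a hypothetical Person case class:

scala> import sqlContext.implicits._
scala> case class Person(name: String, age: Int)
scala> val peopleDF = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35))).toDF()
scala> // the schema ("name" and "age") is inferred from the case class fields
scala> peopleDF.printSchema
scala> peopleDF.show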

