DataFrame vs Dataset | Definition | Examples in Spark





In Apache Spark, many people confuse the DataFrame and the Dataset when writing Scala programs. This article gives a brief explanation of both with examples: how to create a DataFrame in Scala using a case class, and the major differences between the two abstractions.

What is a DATA FRAME (formerly SchemaRDD):

A DataFrame is an abstraction that provides a schema view of data. That is, it presents the data as columns with names and types, so we can think of the data in a DataFrame as a table in a database.
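To make the "columns with names and types" idea concrete, here is a minimal sketch that builds a tiny DataFrame and reads its schema back. The object name SchemaViewSketch, the sample rows, and the local master setting are assumptions made for illustration only; in spark-shell the session already exists as spark.

```scala
import org.apache.spark.sql.SparkSession

object SchemaViewSketch {

  // Builds a two-column DataFrame and returns its (column name, column type) pairs.
  def columnTypes(spark: SparkSession): Seq[(String, String)] = {
    import spark.implicits._ // enables toDF on local Scala collections

    // Illustrative rows; the column names are supplied explicitly.
    val df = Seq(("Sumanth", 23), ("Kishore", 25)).toDF("name", "age")

    df.printSchema() // prints the schema as a tree of column names and types
    df.dtypes.toSeq  // the same information as plain Scala values
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("schema-view-sketch")
      .getOrCreate()
    try println(columnTypes(spark)) finally spark.stop()
  }
}
```

The schema travels with the DataFrame, which is what lets Spark treat it like a database table rather than an opaque collection of objects.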

DATA FRAME using CASE CLASS:

scala> case class Person(name: String, age: Int, address: String)

defined class Person

scala> val df = List(Person("Sumanth", 23, "BNG"), Person("Kishore", 25, "HYD"), Person("Venkat", 29, "MUM"), Person("Jagapathi", 29, "LONDON")).toDF

df: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]

scala> df.collect().mkString("\n")

res0: String =
[Sumanth,23,BNG]
[Kishore,25,HYD]
[Venkat,29,MUM]
[Jagapathi,29,LONDON]

What is a DATA SET (DS):

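A Dataset is an extension of the DataFrame API. It is a newer abstraction that tries to give the best of both RDDs and DataFrames: the typed, object-oriented programming style of RDDs together with the schema-aware, optimized execution of DataFrames. Since Spark 2.0, a DataFrame in Scala is simply a Dataset[Row].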




CONVERT “DATA FRAME (DF)” TO “DATA SET (DS)”

Note: A DataFrame can be converted into a Dataset at any point by calling the as method on it, for example df.as[YourClass].

In other words, all we need to supply is a case class whose fields match the DataFrame's columns.

scala> val ds = df.as[Person]

ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int ... 1 more field]

scala> ds.show

Important Note: The Dataset API provides compile-time type safety, which is not available with DataFrames.

What is Compile-time safety?




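Compile-time safety means that certain mistakes, such as referring to a field that does not exist or using a value with the wrong type, are reported by the compiler before the program runs instead of surfacing at runtime. A program that fails to compile is never executed, so the error cannot show up halfway through a job.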
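The difference can be sketched with a deliberately misspelled field. This is an illustrative example, not a definitive test: the object name CompileTimeSafetySketch, the misspelling agee, and the local master setting are assumptions, and Person mirrors the case class used earlier in this article.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Same shape as the Person case class used earlier in this article.
case class Person(name: String, age: Int, address: String)

object CompileTimeSafetySketch {

  // Returns (ages incremented through the typed API, whether the misspelled column failed).
  def run(spark: SparkSession): (Seq[Int], Boolean) = {
    import spark.implicits._

    val ds = Seq(Person("Sumanth", 23, "BNG"), Person("Kishore", 25, "HYD")).toDS()
    val df = ds.toDF()

    // Dataset: p is a Person, so the compiler checks the field name and its type.
    val agesNextYear = ds.map(p => p.age + 1).collect().toSeq
    // ds.map(p => p.agee + 1)  // does not compile: agee is not a member of Person

    // DataFrame: the column name is only a string, so the typo compiles
    // and is rejected when Spark analyzes the query at runtime.
    val typoFailed = Try(df.select("agee")).isFailure

    (agesNextYear, typoFailed)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("compile-time-safety-sketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Uncommenting the ds.map line with agee stops the build, while the equivalent DataFrame mistake only appears once the program is already running.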

A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not support the Dataset API.
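As a rough sketch of those functional transformations, the following builds a Dataset from plain Person objects and applies filter, map and flatMap with ordinary Scala lambdas. The object name DatasetTransformSketch, the helper names namesOlderThan and nameAndAddress, and the age threshold are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same shape as the Person case class used earlier in this article.
case class Person(name: String, age: Int, address: String)

object DatasetTransformSketch {

  // filter and map take lambdas over Person, not column-name strings.
  def namesOlderThan(spark: SparkSession, people: Dataset[Person], minAge: Int): Dataset[String] = {
    import spark.implicits._
    people.filter(p => p.age > minAge).map(p => p.name)
  }

  // flatMap can emit several rows per input object: here the name, then the address.
  def nameAndAddress(spark: SparkSession, people: Dataset[Person]): Dataset[String] = {
    import spark.implicits._
    people.flatMap(p => Seq(p.name, p.address))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("dataset-transform-sketch")
      .getOrCreate()
    import spark.implicits._

    // A Dataset built directly from JVM objects.
    val people = Seq(Person("Sumanth", 23, "BNG"), Person("Venkat", 29, "MUM")).toDS()

    try {
      namesOlderThan(spark, people, 24).show()
      nameAndAddress(spark, people).show()
    } finally spark.stop()
  }
}
```

Because each lambda receives a Person, a wrong field name or type in any of these steps is caught by the compiler.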

Converting “DATA SET [DS] to DATA FRAME [DF]”

We can call the toDF method directly to convert a Dataset back into a DataFrame; no case class is needed here.

scala> val newdf = ds.toDF




Summary: This article explained what a DataFrame and a Dataset are in Apache Spark, with examples; the major difference between them in Scala programming; and how to convert a DataFrame into a Dataset and a Dataset back into a DataFrame.