Parquet File with Example

Parquet:

Parquet is a columnar format supported by many data processing systems. Spark SQL provides support for both reading and writing Parquet files, and it automatically preserves the schema of the original data. Parquet is a popular column-oriented storage format that can store records with nested fields efficiently; it is often used with tools in the Hadoop ecosystem, and it supports all of the data types in Spark SQL.

Spark SQL provides methods for reading data from and writing data to Parquet files directly.
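As a minimal sketch of this round trip (the people.json input path and people.parquet output path are illustrative, and the sqlContext.read/write API assumes Spark 1.4 or later):

scala> val people = sqlContext.read.json("people.json")

scala> people.write.parquet("people.parquet")   // schema is stored inside the Parquet file

scala> val restored = sqlContext.read.parquet("people.parquet")

scala> restored.printSchema()   // same schema as the original DataFrame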

Parquet is a columnar storage format for the Hadoop ecosystem. It has seen good adoption due to its highly efficient compression and encoding schemes, which demonstrate significant performance benefits. Its ground-up design allows it to be used regardless of the data processing framework, data model, or programming language in use. Frameworks across the Hadoop ecosystem, including MapReduce, Hive, Pig, and Impala, provide the ability to work with Parquet data, and a number of data models such as Avro and Thrift have been extended to use Parquet as a storage format.

Parquet is widely adopted by a number of major companies, including social media tech giants. To save a DataFrame as a Parquet file, use the saveAsParquetFile method: people.saveAsParquetFile("people.parquet")
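A minimal sketch of this call in the Spark shell, assuming a Spark 1.x release where saveAsParquetFile is still available (the Person case class and the people.txt input file are illustrative):

scala> import sqlContext.implicits._

scala> case class Person(name: String, age: Int)

scala> val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()

scala> people.saveAsParquetFile("people.parquet")   // deprecated in favor of people.write.parquet(...) in newer releases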

Example of working with a Parquet file:

scala> val parquetFile = sqlContext.parquetFile("/home/sreekanth/SparkSQLInput/users.parquet")

parquetFile: org.apache.spark.sql.DataFrame = [name: string, favorite_hero: string, favorite_color: string]

scala> parquetFile.registerTempTable("parquetFile")

scala> parquetFile.printSchema()

root
 |-- name: string (nullable = false)
 |-- favorite_hero: string (nullable = true)
 |-- favorite_color: string (nullable = false)

scala> val selectedPeople = sqlContext.sql("SELECT name FROM parquetFile")

scala> selectedPeople.map(t => "Name: " + t(0)).collect().foreach(println)

OUTPUT:

Name: Alex

Name: Bob

scala> sqlContext.sql("SELECT name FROM parquetFile").show()

+----+
|name|
+----+
|Alex|
| Bob|
+----+

How to Save the Data in Parquet File Format

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@hf0sf

scala> val dataFrame = sqlContext.read.load("/home/sreekanth/SparkSQLInput/users.parquet")

dataFrame: org.apache.spark.sql.DataFrame = [name: string, favorite_hero: string, favorite_color: string]

scala> dataFrame.select("name", "favorite_hero").write.save("nameAndFavHero.parquet")
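Because write.save uses Parquet as the default data source, the saved output can be loaded straight back. A minimal check, continuing the same shell session (the nameAndFavHero.parquet path comes from the step above):

scala> val saved = sqlContext.read.load("nameAndFavHero.parquet")

scala> saved.show()   // prints the name and favorite_hero columns that were written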