The Hadoop and Spark ecosystems support several file formats for loading and saving large datasets. Here we cover the main file formats in Spark with examples.
File formats in Hadoop and Spark:
What is the file format?
A file format is a standard way of encoding information so it can be stored on, and later read back from, a computer.
1. What is the Avro file format?
Avro is one of the most useful file formats for data serialization in the Spark ecosystem because it is language-neutral. Many developers use Avro because it serves as a multi-purpose storage format within Spark and Avro files can be processed from different languages. Avro stores its schema as metadata alongside the data, so files can be read within the Spark ecosystem without an independent schema definition. It is also splittable and supports block compression, unlike the CSV file format.
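As a minimal sketch, reading and writing Avro from Spark might look like the following. The helper names are hypothetical, and this assumes you already have a `SparkSession` and a DataFrame, with the spark-avro package on the classpath (e.g. `--packages org.apache.spark:spark-avro_2.12:<spark-version>`):

```python
# Hypothetical helpers; `df` is assumed to be a Spark DataFrame and
# `spark` a SparkSession with the spark-avro package available.

def save_as_avro(df, path):
    """Write a DataFrame as Avro; the schema travels with the data."""
    df.write.format("avro").mode("overwrite").save(path)

def load_avro(spark, path):
    """Read Avro back; Spark recovers the schema from the file metadata."""
    return spark.read.format("avro").load(path)
```

Because the schema is embedded in the file, `load_avro` needs no schema argument; the returned DataFrame carries the original column names and types.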
2. What is the Parquet file format?
Basically, Parquet is a columnar file format supported by many data processing systems. Spark supports both reading and writing Parquet files and can automatically preserve the schema of the original data.
Here is the full article on the Parquet file format.
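A minimal sketch of Parquet I/O in Spark, assuming a `SparkSession` named `spark` and an existing DataFrame `df` (the helper names are hypothetical):

```python
def save_as_parquet(df, path):
    # Parquet is Spark's default data source, so no format() call is needed
    df.write.mode("overwrite").parquet(path)

def load_parquet(spark, path):
    # Spark reads the schema back from the Parquet file footer automatically
    return spark.read.parquet(path)
```

The round trip preserves column names and types because Parquet stores the schema in its file footer, which is what "automatically maintain the schema" refers to above.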
3. What is the JSON file format?
JSON records are self-describing (field names are stored with every record), which lets the format tolerate schema evolution. However, the JSON file format doesn't support block compression.
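A small standard-library sketch of why JSON tolerates schema evolution: each line is a self-contained record (the "JSON Lines" layout Spark's JSON reader expects by default), so a new field can simply appear in later records. The sample records here are made up for illustration:

```python
import json

# Each line is one self-describing JSON record (JSON Lines layout).
records = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob", "age": 30},  # a field added later in time
]
lines = "\n".join(json.dumps(r) for r in records)

# Old records parse fine without the new field; new records carry it.
parsed = [json.loads(line) for line in lines.splitlines()]
```

Spark would expose the union of the fields as the DataFrame schema, with `null` for records that lack the newer column.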
4. What is the Text/CSV file format?
The text file format is the normal/default storage format. It is the most human-readable format and is simple to parse, so it is widely used to exchange data with other client applications.
CSV (Comma-Separated Values) is likewise used to exchange data with other client applications and is very easy to parse. However, a CSV file stored in HDFS (Hadoop Distributed File System) carries no metadata, and the format doesn't support block compression.
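The "no metadata" point is easy to see with the standard library alone: a CSV file has nowhere to record column types, so everything comes back as a string. The sample data here is made up for illustration:

```python
import csv
import io

# A tiny CSV payload: only a header row and values, no type information.
text = "id,name\n1,alice\n2,bob\n"
rows = list(csv.DictReader(io.StringIO(text)))

# Note the "id" values are plain strings; the reader cannot know they
# were integers, because CSV stores no schema or type metadata.
```

In Spark you would either pay for a full extra pass with `inferSchema=true` or supply the schema yourself when reading CSV, precisely because the file itself cannot describe its columns.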
5. What is the ORC file format?
ORC stands for Optimized Row Columnar. It is used mainly for the compression (optimization) of large files, at the cost of poor write performance, and ORC doesn't support schema evolution.
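A minimal sketch of ORC I/O in Spark, again assuming a `SparkSession` named `spark` and a DataFrame `df` (helper names are hypothetical; the compression codec shown is one of the values Spark accepts for ORC):

```python
def save_as_orc(df, path):
    # ORC applies block-level compression; "zlib" is the default codec
    df.write.format("orc").option("compression", "zlib").mode("overwrite").save(path)

def load_orc(spark, path):
    return spark.read.orc(path)
```

The heavy compression work happens at write time, which is one reason ORC writes are slower than its reads.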