Parquet File with Example




Parquet:

Parquet is a columnar format supported by many data processing systems. Spark SQL supports both reading and writing Parquet files, and it automatically preserves the schema of the original data. Parquet is a popular column-oriented storage format that can store records with nested fields efficiently. It is often used with tools in the Hadoop ecosystem, and it supports all of the data types in Spark SQL.

Spark SQL provides methods for reading data from and writing data to Parquet files.

Parquet is a columnar storage format for the Hadoop ecosystem. It has seen wide adoption because its highly efficient compression and encoding schemes deliver significant performance benefits. Its ground-up design allows it to be used regardless of the data processing framework, data model, or programming language: Hadoop-ecosystem tools such as MapReduce, Hive, Pig, and Impala can work with Parquet data, and data models such as Avro, Thrift, etc. have been extended to use Parquet as a storage format.

Parquet is widely adopted by a number of major companies, including social-media tech giants. To save a DataFrame as a Parquet file, use the saveAsParquetFile method, for example: people.saveAsParquetFile("people.parquet")

Example with a Parquet file:





scala> val parquetFile = sqlContext.parquetFile("/home/sreekanth/SparkSQLInput/users.parquet")

parquetFile: org.apache.spark.sql.DataFrame = [name: string, favorite_hero: string, favorite_color: string]

scala> parquetFile.registerTempTable("parquetFile")

scala> parquetFile.printSchema

root
 |-- name: string (nullable = false)
 |-- favorite_hero: string (nullable = true)
 |-- favorite_color: string (nullable = true)

scala> val selectedPeople = sqlContext.sql("SELECT name FROM parquetFile")

scala> selectedPeople.map(t => "Name: " + t(0)).collect().foreach(println)

OUTPUT:

Name: Alex

Name: Bob

scala> sqlContext.sql("SELECT name FROM parquetFile").show()

+----+
|name|
+----+
|Alex|
| Bob|
+----+

How to Save Data in Parquet File Format

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@hf0sf

scala> val dataFrame = sqlContext.read.load("/home/sreekanth/SparkSQLInput/users.parquet")

dataFrame: org.apache.spark.sql.DataFrame = [name: string, favorite_hero: string, favorite_color: string]

scala> dataFrame.select("name", "favorite_hero").write.save("nameAndFavHero.parquet")
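The write.save call above produces Parquet because it is Spark SQL's default data source. As a minimal sketch, assuming the same Spark 1.x SQLContext and the hypothetical output path used above, the format can also be stated explicitly and the result read back to verify it:

scala> dataFrame.select("name", "favorite_hero").write.parquet("nameAndFavHero.parquet")   // write explicitly as Parquet

scala> sqlContext.read.parquet("nameAndFavHero.parquet").show()   // read the saved data back and display it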

Hadoop Admin Roles and Responsibilities


Hadoop administration is an excellent career with plenty of growth opportunities, because relatively few people have the skills while Hadoop remains a technology in huge demand.

A Hadoop Administrator is responsible for Hadoop installation, monitoring, and cluster management.

Roles and Responsibilities:




  1. Capacity planning, hardware requirements of the nodes, and network architecture planning.
  2. Hadoop software installation and configuration, whether Cloudera distribution, Hortonworks distribution, etc.
  3. Configuring the NameNode and DataNodes to ensure high availability.
  4. Tuning the Hadoop cluster, creating new users in Hadoop, handling permissions, and performing upgrades (see the example sketch after this list).
  5. Hadoop backup and recovery tasks.
  6. Finding out every day which jobs are taking more time, and investigating the reason when users report that jobs are stuck.
  7. Health checks and monitoring of the Hadoop cluster.
  8. Deployments on the Hadoop cluster and ongoing maintenance.
  9. Support and maintenance of Hadoop storage (HDFS).
  10. Security administration during installation and basic knowledge of Kerberos, Apache Knox, Apache Ranger, etc.
  11. Data migration between clusters if needed, e.g. using the Falcon tool.
  12. Managing Hadoop log files and analyzing failed jobs.
  13. Troubleshooting network and application issues.
  14. Scripting skills in a Linux environment.
  15. Knowledge of Oozie, Hive, HCatalog, and the Hadoop ecosystem.
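As a small illustration of the user-management part of the role (item 4 above), here is a minimal sketch; it assumes a new Linux user named analyst1 already exists on the cluster nodes, and the 500g quota is only an example value:

sudo -u hdfs hadoop fs -mkdir /user/analyst1                      # create the user's HDFS home directory
sudo -u hdfs hadoop fs -chown analyst1:analyst1 /user/analyst1    # hand ownership to the new user
sudo -u hdfs hadoop fs -chmod 750 /user/analyst1                  # restrict access for other users
sudo -u hdfs hadoop dfsadmin -setSpaceQuota 500g /user/analyst1   # optional: cap the space the user can consume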

 

Day-to-Day Activities of a Hadoop Admin:

  1. Monitoring the console, whether Cloudera Manager or the Hortonworks console, and the JobTracker UI.
  2. HDFS maintenance and support.
  3. Health checks and monitoring of the Hadoop cluster.
  4. Managing Hadoop log files and finding errors.
  5. Managing users, permissions, etc.
  6. Troubleshooting network errors and application errors.

Skill sets required to become a Hadoop Administrator:

  1. Strong knowledge of Linux/Unix
  2. Knowledge of shell scripting/Python scripting
  3. Hands-on experience with cluster monitoring tools like Ambari, Ganglia, etc.
  4. Networking and memory management




Summary: Hadoop administration is one of the best careers in terms of growth and opportunities. The Hadoop market is on the rise nowadays. If you have knowledge of Linux and databases, it can be an advantage for an admin.

Hadoop Distributed File System – HDFS in Hadoop

HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. When HDFS takes in data, it breaks the information into smaller parts called blocks, allowing for parallel processing.




HDFS is built to support applications with large data sets, including very large individual files.

It uses a master/slave architecture.

 

Features of HDFS:

A) Fault Tolerance

Fault tolerance in HDFS refers to the working strength of the system in unfavorable conditions and how the system handles such situations. HDFS is highly fault tolerant: data is divided into blocks, and multiple copies of each block are created on different machines in the cluster.

B) High Availability

HDFS is a highly available file system. Data gets replicated among the nodes in the HDFS cluster by creating replicas of the blocks on the other slaves present in the cluster. When a node fails, users can still access their data from other nodes, because duplicate copies of the blocks containing the user data exist on the other nodes in the HDFS cluster.

C) Replication

Data replication is one of the most important and unique features of HDFS. Replication is done to prevent data loss in unfavorable conditions such as a node crash or hardware failure. Data is replicated across a number of machines in the cluster by creating copies of the blocks, and the replication factor is configurable, as shown in the sketch below.
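As a minimal sketch, the replication factor and block size are set cluster-wide in hdfs-site.xml; the values below are only the common defaults (3 replicas, 128 MB blocks), shown here as an assumption for illustration:

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default number of replicas kept for each block</description>
</property>
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>Default block size in bytes (128 MB)</description>
</property>

The replication factor of an existing file can also be changed with hadoop fs -setrep, as shown in the admin commands section below.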




D) Reliability

HDFS is a distributed file system that provides reliable data storage; it can store data in the range of hundreds of petabytes. HDFS divides the data into blocks, and these blocks are stored on the nodes present in the HDFS cluster. It stores data reliably by creating a replica of each block on the nodes in the cluster, and hence provides fault tolerance.

E) Distributed Storage

In HDFS, all of these features are achieved via distributed storage and replication. Data is stored in a distributed manner across the nodes in the HDFS cluster.

Hadoop Admin Commands with Examples

Apache Hadoop is an open-source framework for storage and processing. HDFS (Hadoop Distributed File System) is used for the storage side and MapReduce for the processing side. Here are some Hadoop admin commands for beginners.



Hadoop Admin Commands:

hadoop namenode -format:

This command formats the HDFS file system from the NameNode in a cluster.

hadoop dfsadmin -report:

This command shows a report on the overall HDFS file system. It is very useful for finding out how much disk space is available, NameNode information, how many DataNodes are running, and whether there are corrupted blocks in the cluster (a small health-check sketch follows).
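A minimal routine health-check sketch; the path / is just an example, and fsck on a very large namespace can take a while, so a narrower path may be preferable:

hadoop dfsadmin -report            # capacity, live/dead DataNodes, under-replicated blocks
hadoop fsck / -blocks -locations   # block-level health report for the whole namespace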

hadoop dfsadmin -refreshNodes:

This command is used to commission or decommission nodes.

Safe mode commands in Hadoop:

Safe mode is a state of the NameNode that does not allow changes to the file system; it is a read-only mode for the Hadoop cluster. Three commands are related to safe mode.

hadoop dfsadmin -safemode get:

safemode get returns the status of safe mode (maintenance mode).

hadoop dfsadmin -safemode enter:

This command turns safe mode ON.

hadoop dfsadmin -safemode leave:

The leave command turns safe mode OFF.

start-all.sh:

Starts all the daemons, such as the NameNode, Secondary NameNode, DataNodes, ResourceManager (YARN), and NodeManagers.

stop-all.sh:

Stops all the daemons, such as the NameNode, DataNodes, Secondary NameNode, ResourceManager, and NodeManagers.

hadoop fs -copyFromLocal <file 1> <file 2>:

In a Hadoop environment, use this command when you need to copy files from the local file system to HDFS.

hadoop fs -copyToLocal <file 1> <file 2>:

Copies files from HDFS to the local file system in a Hadoop cluster.

hadoop fs -put <file 1> <file 2>:

This command works like copyFromLocal; the small difference is that put is the more general form (it can, for example, also read from stdin) when copying into HDFS.

hadoop fs -get <file 1> <file 2>:

This command works like copyToLocal; the small difference is that get is the more general form for copying from HDFS to a local destination.

hadoop fs -setrep -w 5 <file>:

Use this command to set the replication factor of a file manually (here to 5); the -w flag waits until the replication is complete.

hadoop distcp hdfs://<ip1>/input hdfs://<ip2>/output:

Use this command to copy files from one cluster to another.

hadoop job -status <job-id>:

Use this command to check the status of a Hadoop job.

hadoop job -submit <job-file>:

Use this command to submit a Hadoop job file.

hadoop job -list all:

If you want to see all Hadoop jobs, simply use this command.

hadoop job -kill-task <task-id>:

Sometimes a task needs to be killed urgently while a job is running; this command is useful at that time.




Difference between Managed and External Tables with Syntax in HIVE


Apache Hive is mainly used for data summarization through a query language. Hive's SQL (HiveQL) is much like standard SQL, but differs a little in how data is summarized and processed through the query language.




Here we explain the difference between the two kinds of Hive tables, with syntax.

A Hive table is logically made up of the data being stored plus metadata describing the layout of the data in the table.

There are two types of Hive table:

1. Managed (Default/Internal) Table

2. External Table

1. Managed Table:

When you create a table in Hive, by default Hive manages the data, which means that Hive moves the data into its warehouse directory.

The default warehouse location in HDFS:

"/user/hive/warehouse/<table name>/<table data>"

 

Syntax to create a managed table in Hive:

CREATE TABLE <TableName>
ROW FORMAT DELIMITED
FIELDS TERMINATED BY <DELIMITER>
LINES TERMINATED BY <DELIMITER>
STORED AS <FILE FORMAT>;
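For instance, a concrete managed table might be created and loaded as follows; the table name, columns, delimiters, and local file path are assumptions used only for illustration:

CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- for a managed table, the loaded file ends up under /user/hive/warehouse/employees
LOAD DATA LOCAL INPATH '/home/sreekanth/employees.csv' INTO TABLE employees;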





2. External Table:

External tables live outside the Hive warehouse path. The data of an external table is stored at an HDFS location that we specify when creating the table schema.

Data location in HDFS: the path given in the LOCATION clause of the CREATE statement (not the warehouse directory).

 

CREATE EXTERNAL TABLE <TableName>
ROW FORMAT DELIMITED
FIELDS TERMINATED BY <DELIMITER>
LINES TERMINATED BY <DELIMITER>
STORED AS <FILE FORMAT>
LOCATION '<HDFS PATH>';
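A concrete external table might look like the following; the table name, columns, and HDFS path are assumptions used only for illustration:

CREATE EXTERNAL TABLE web_logs (
  ip STRING,
  url STRING,
  ts STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';

-- DROP TABLE web_logs; would delete only the table metadata; the files under /data/raw/web_logs remain in HDFS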

Difference between Managed table and External Table:

Managed (default) tables are local to the Hive warehouse, i.e. their data lives under the warehouse directory, while external tables live outside the Hive warehouse at an HDFS path you specify. For managed tables no extra keyword is needed at table-creation time, but for external tables we need the EXTERNAL keyword. Most importantly, if a managed table is dropped, the entire data is lost, whereas dropping an external table removes only the metadata.

See the accompanying video for how to create managed tables and load data into them in Hive, and likewise for external tables.

Metastore in Apache Hive

In the Hadoop ecosystem, the Hive metastore holds all of the structural information about the various tables and partitions in the warehouse, including column and column-type information, the details necessary to read and write data, and the corresponding HDFS files where the data is stored. It is the central repository of Hive metadata.

The metastore is the internal database of Hive which is responsible for managing this metadata.




The metadata stored by the metastore contains things like:

1. IDs of databases

2. IDs of tables and indexes

3. Time of creation of indexes

4. Time of creation of tables

5. Input format used for tables

6. Output format used for tables

Hive mainly has three modes of metastore:

1. Embedded Metastore

2. Local Metastore

3. Remote Metastore

1. Embedded Metastore:

The embedded metastore runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk; this is why it is called the embedded metastore.

Default configuration for the Embedded Metastore

Below are the metastore configuration details in "hive-site.xml":

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
  <description>JDBC connect string for an embedded metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for an embedded metastore</description>
</property>

 

2. Local Metastore:

The local metastore supports multiple users by using a standalone database. The metastore service still runs in the same JVM as Hive, but it connects to a database running in a separate process.

If we use a MySQL database as the local metastore, then we need the following configuration in "hive-site.xml":

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://host/dbname?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

 

3. Remote Metastore:

In the remote metastore, one or more metastore servers run in processes separate from the Hive service. This metastore mode gives better manageability and security.
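As a minimal sketch, a Hive client is pointed at a remote metastore by setting hive.metastore.uris in "hive-site.xml"; the host name below is an assumption, and 9083 is the conventional default metastore port:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
  <description>Thrift URI(s) of the remote metastore server(s)</description>
</property>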

 

What is Hive and Architecture of Hive

What is HIVE?




Apache Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable data summarization, ad-hoc querying, and analysis of large volumes of data. At the same time, Hive's SQL gives users multiple places to integrate their own functionality for custom analysis, such as UDFs.

Architecture of HIVE

Here CLI stands for Command Line Interface, JDBC for Java Database Connectivity, and Web GUI for Graphical User Interface. When the user comes in through the CLI, it connects directly to the Hive driver; when the user comes in through JDBC, it connects to the Hive driver by using an API. When the Hive driver receives query tasks from the user, it sends them to the Hadoop architecture, which then uses the NameNode, DataNodes, JobTracker, and TaskTrackers to read and process the data.

How to Install Flume on Ubuntu/Linux in Hadoop

Flume Installation on Ubuntu/Linux:





Step 1: Download Apache flume tarball from Apache Mirrors

Step 2: Extract the downloaded tarball using the below command:

tar -xzvf apache-flume-1.7.0-bin.tar.gz

Step 3: Update the FLUME_HOME & PATH variables in the .bashrc file

Step 4: To check the .bashrc changes, open a new terminal and type the 'echo $FLUME_HOME' command

Step 5: Check the Flume version (a combined sketch for steps 3 through 5 follows below)
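A minimal combined sketch for steps 3 through 5; the extraction path under /home/sreekanth/Big_Data is an assumption, so adjust it to wherever you extracted the tarball:

# Step 3: add these lines to ~/.bashrc (the path is an assumption)
export FLUME_HOME=/home/sreekanth/Big_Data/apache-flume-1.7.0-bin
export PATH=$PATH:$FLUME_HOME/bin

# Step 4: verify from a new terminal
echo $FLUME_HOME

# Step 5: check the Flume version
flume-ng version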

How to Install MongoDB on Ubuntu/Linux in Hadoop

MongoDB is a NoSQL database that is popular among many enterprises. It is a purely open-source document DB. Mongo stores data as documents in a format called BSON. BSON is a binary data format similar to JSON in JavaScript, which makes applications simpler and faster.

Why MongoDB?

Full index support: you can use an index just as you would in an RDBMS.

Replication & high availability: MongoDB supports replication of data between servers for failover.

Querying: if you have knowledge of a query language, querying will be easy for you.

Step 1: Download the tarball from the MongoDB website. Select the version matching your Ubuntu release and download the tarball.

 

Step 2: Extract the tarball using the below command:

tar -xzvf <your tarball file name>

Step 3: After that, update the MONGODB_HOME & PATH variables in the .bashrc file using the below command:

nano ~/.bashrc
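The entries to add might look like the following; the extraction directory name is an assumption and depends on the MongoDB version and platform you downloaded:

# add to ~/.bashrc (directory name is an assumption)
export MONGODB_HOME=/home/sreekanth/Big_Data/mongodb-linux-x86_64-3.4.7
export PATH=$PATH:$MONGODB_HOME/bin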

Step 4: To check the .bashrc changes, open a new terminal and type the 'echo $MONGODB_HOME' command.

After that, check the exact version of MongoDB.

After installation and configuration of MongoDB, we will start its services.

Step 5: Before starting the MongoDB service for the first time, we need to create the data directory (see the sketch below).
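A minimal sketch, assuming the default data path /data/db and that your login user will run mongod:

sudo mkdir -p /data/db              # create the default MongoDB data directory
sudo chown -R $USER:$USER /data/db  # give your user ownership so mongod can write to it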

 

 

Step 6: To start the MongoDB service, use the below command:

mongod

After the services have started, open the mongo shell to write and read queries. Non-relational databases do not offer all of the features of an RDBMS, which means they usually cannot provide full ACID properties. So MongoDB will not replace RDBMSs in the near future because of this weakness in business consistency.

How to Install HIVE with MySQL on Ubuntu/Linux in Hadoop

Apache Hive is a data warehouse system mostly used for data summarization of structured data. Hive is one of the components of Hadoop, built on top of HDFS, and works as a data warehouse kind of system in Hadoop. It is used for data in tabular form (structured data), not for flat files.

Step 1: Download the hive-1.2.2 tarball from the official Apache mirrors website:

http://apache.mirrors.tds.net/hive/hive-1.2.2

Step 2: Extract the tarball file in your path using the below command:




tar -xzvf apache-hive-1.2.2-bin.tar.gz

Step 3: Update the HIVE_HOME & PATH variables in the .bashrc file:

export HIVE_HOME=/home/sreekanth/Big_Data/apache-hive-1.2.2-bin

export PATH=$PATH:$HIVE_HOME/bin

After updating the .bashrc file, go to the next step.

Step 5: To check the .bashrc changes, open a new terminal and type the command

echo $HIVE_HOME

Step 6: Remove the jline-0.9.94.jar file from the below path to avoid incompatibility issues between this Hive version and hadoop-2.6.0 (see the sketch below).
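One common way to do this is sketched below; the jline-0.9.94.jar location under $HADOOP_HOME/share/hadoop/yarn/lib is an assumption based on a stock hadoop-2.6.0 layout, so verify the path on your machine first:

# move the old jline jar aside so it does not conflict with the jline-2.12 shipped with Hive 1.2.x
mv $HADOOP_HOME/share/hadoop/yarn/lib/jline-0.9.94.jar ~/jline-0.9.94.jar.bak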

Step 7: There are 2 types of metastores we can configure in Hive to store metadata.



Internally, Hive uses Derby; it supports only one user at a time.

Externally, MySQL is used to support multiple users. In case your conf directory does not contain a hive-site.xml file, then

create the hive-site.xml file.

Step 8: Configure the hive-site.xml file with the MySQL configuration and add the below content (a sketch is given below):
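As a sketch, a typical MySQL-backed metastore configuration looks like the following; the database name metastore_db and the hive user name and password are assumptions you should replace with your own:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for the MySQL metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for the MySQL metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>MySQL user for the metastore (assumption)</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
  <description>Password for the MySQL user (assumption)</description>
</property>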

Step 9: For the external metastore 'MySQL', we need the MySQL connector jar file.

Step 10: Copy the MySQL connector jar file into the $HIVE_HOME/lib path (see the sketch below).
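As a sketch, assuming the downloaded connector jar is named mysql-connector-java-5.1.38-bin.jar (the exact file name depends on the version you downloaded):

cp mysql-connector-java-5.1.38-bin.jar $HIVE_HOME/lib/   # make the MySQL JDBC driver visible to Hive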

Step 11: Run the hive command in the terminal, but it will show a connection refused error.

This happens because the Hadoop daemons are not running, so it is necessary to start all daemons first, otherwise Hive will not work.

Step 12: First start all the daemons using the start-all.sh command.

Step 13: Now Hive runs successfully on your machine.


Step 14: Check the Hive version using the below command:

hive --version

Why do we use HIVE?

Because of data summarization and querying of tabular data in the Hadoop system. The default Hive database, Derby, is only for one user; MySQL is mostly used for larger data sets and multiple users.