Most Frequently Asked Apache Pig Interview Questions and Answers [Updated]

1. Are there any problems that can only be solved by MapReduce and cannot be solved by Apache Pig? In what scenarios are MapReduce jobs more useful than Pig?




Take a scenario where we want to count the population of two cities. We have a data set that lists many different cities, and we want to count the population of two of them using MapReduce. Let us assume one is Hyderabad and the other is Bangalore. To bring the population data of these two cities to one reducer, the key for Hyderabad must behave like the key for Bangalore. The idea is to instruct the MapReduce program: whenever you find a city named "Hyderabad" or a city named "Bangalore", assign a common alias for these two cities so that they share a common key and get passed to the same reducer. For this, we have to write a custom partitioner.

In MapReduce, you take 'city' as the key. Whenever the framework comes across a different city, it treats it as a different key, so to send two specific cities to the same reducer you need a customized partitioner: if city = Hyderabad or Bangalore, then return the same hash code (partition). A custom partitioner like this cannot be created in Pig; Pig is a high-level framework and does not let us direct the execution engine to customize the partitioner. In this type of scenario, MapReduce works better than Apache Pig.
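For illustration, below is a minimal sketch of such a custom partitioner, assuming the Hadoop mapreduce API with Text city names as keys and IntWritable counts as values (the class name is hypothetical):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// Routes records for "Hyderabad" and "Bangalore" to the same reducer
// by giving both city keys the same partition number.
class CityPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    val city = key.toString
    if (city == "Hyderabad" || city == "Bangalore")
      0                                                // shared partition for both cities
    else
      (city.hashCode & Int.MaxValue) % numPartitions   // default hash partitioning
  }
}

The job would then register it with job.setPartitionerClass(classOf[CityPartitioner]).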

2. What is the difference between MapReduce and Apache PIG?

In the Hadoop eco-system, MapReduce requires you to write the entire logic for operations like join, group, filter, etc.
Pig has built-in functions for these operations, unlike MapReduce.
Roughly 20 lines of Pig Latin can be equivalent to about 400 lines of Java code.
Pig gives higher productivity compared to MapReduce programming.
MapReduce needs more effort while writing code.

3. Why should we use the 'distinct' keyword in Pig scripts?




The distinct keyword in Pig scripts is very simple: it removes duplicate records. Distinct works only on entire records, not on individual fields, as in the example below:
users = load 'daily' as (email, name);
uniq = distinct users;   -- de-duplicates whole (email, name) records

4. What is the difference between Pig and SQL?
There are a lot of differences between Apache Pig and SQL; here are the main ones:

Pig is procedural            SQL is declarative
Pig is used for OLAP         SQL works for OLAP + OLTP
Schema is optional in Pig    Schema is mandatory in SQL

What are the different Hadoop Components and Definitions

What are the Different Hadoop Components in Hadoop Eco-System





HDFS – the file system of Hadoop (Hadoop Distributed File System)

MapReduce – processing of large data sets

HBase – database (Hadoop + dataBase)

Apache Oozie – workflow scheduler

Apache Mahout – machine learning and data mining

Apache Hue – Hadoop user interface: a browser for HDFS and HBase, query editors for Hive, etc.

Flume – to integrate data from other data sources

Sqoop – export/import data from RDBMS to HDFS and from HDFS to RDBMS

What is HDFS?

HDFS (Hadoop Distributed File System) is a filesystem that can store very large data sets by scaling out across a cluster of hosts.

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating large data sets. The user specifies a map function that processes a (key, value) pair to generate a set of intermediate (key, value) pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
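As an illustration of that (key, value) flow, here is a minimal word-count sketch using plain Scala collections rather than the actual Hadoop API (the sample lines are made up):

// Map phase: turn each input line into intermediate (word, 1) pairs.
val lines = Seq("to be or not to be", "to see or not to see")
val intermediate = lines.flatMap(_.split(" ").map(word => (word, 1)))

// Shuffle phase: group the intermediate pairs by key (the word).
val grouped = intermediate.groupBy(_._1)

// Reduce phase: sum the values for each key to get the final counts.
val counts = grouped.map { case (word, ones) => (word, ones.map(_._2).sum) }

counts.foreach(println)  // e.g. (to,4), (be,2), (or,2), (not,2), (see,2)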

What is Hive?




Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis.

What is Pig?

Pig is a platform for analyzing large data sets that consists of a high-level (scripting) language for expressing data analysis programs.

What is Flume?

Flume sits on top of Hadoop; it is used when we need to get data from a source into HDFS for Hadoop applications.

What is Sqoop?

Apache Sqoop is a tool designed for transferring bulk data between Hadoop and structured data stores; in other words, it exports/imports data from an RDBMS to HDFS and vice versa.

What is HBase?

HBase (Hadoop + dataBase) is a column-oriented database layered on top of HDFS.

What is NoSQL database?

NoSQL means "Not Only SQL": a class of databases that go beyond the traditional relational Database Management System (RDBMS).

Most frequently Asked Hadoop Admin interview Questions for Experienced

Hadoop Admin interview Questions for Experienced

1. What is the difference between missing and corrupt blocks in Hadoop 2.0, and how do you handle them?

Missing block: a missing block means there is a block with no replicas available anywhere in the cluster.

Corrupt block: HDFS cannot find any healthy replica of the block; all of its replicas are corrupted.




How to handle: use the commands below to find out which file is corrupted and then remove it.

A) hdfs fsck /
B) hdfs fsck / | grep -v '^\.+$' | grep -v eplica
C) hdfs fsck /path/to/corrupt/file -files -blocks -locations
D) hdfs dfs -rm /path/to/file

2. What is the reason behind keeping an odd number of ZooKeeper nodes?

ZooKeeper elects its leader based on the opinion of more than half of the nodes in the cluster (a majority quorum). Adding an even node does not improve fault tolerance: for example, 5 nodes need a quorum of 3 and tolerate 2 failures, while 6 nodes need a quorum of 4 and still tolerate only 2 failures, so the ZooKeeper count should be an odd number.
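A quick sketch of this quorum arithmetic (illustrative only, not ZooKeeper code):

// Majority quorum needed for n ZooKeeper nodes, and how many failures it tolerates.
def quorum(n: Int): Int    = n / 2 + 1
def tolerated(n: Int): Int = n - quorum(n)

Seq(3, 4, 5, 6).foreach { n =>
  println(s"$n nodes -> quorum ${quorum(n)}, tolerates ${tolerated(n)} failure(s)")
}
// 3 nodes -> quorum 2, tolerates 1 failure(s)
// 4 nodes -> quorum 3, tolerates 1 failure(s)
// 5 nodes -> quorum 3, tolerates 2 failure(s)
// 6 nodes -> quorum 4, tolerates 2 failure(s)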

3. Why is ZooKeeper required for Kafka?

Apache Kafka uses ZooKeeper, so you need to start the ZooKeeper server first. ZooKeeper is used to elect the Kafka controller and to keep topic configuration.

4. What is the retention period of Kafka logs?

When a message is sent to a Kafka cluster, it is appended to the end of a log. The message remains on the topic for a configurable period of time; this is called the retention period of the Kafka log.
It is defined by the broker property log.retention.hours (by default 168 hours, i.e. 7 days).

5. What is the block size in your cluster, and why is a block size like 54 MB not recommended?

It depends on your cluster configuration; the Hadoop standard is 64 MB (128 MB in Hadoop 2.x). An odd, smaller block size such as 54 MB creates more blocks per file and therefore more NameNode metadata and overhead.

6. Suppose a file is 270 MB and the block size on your cluster is 128 MB. How many blocks are created (128 MB + 128 MB + 14 MB), and is the 14 MB third block wasted, or can other data be appended? The file is split into 3 blocks; the last block stores only 14 MB on disk, so no space is wasted, but data from other files is not appended into that block.
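The block arithmetic for this example can be worked out as below (a quick sketch; sizes are in MB):

// 270 MB file on a cluster with a 128 MB block size.
val fileSizeMb  = 270
val blockSizeMb = 128

val numBlocks   = math.ceil(fileSizeMb.toDouble / blockSizeMb).toInt  // 3 blocks
val lastBlockMb = fileSizeMb - (numBlocks - 1) * blockSizeMb          // 14 MB

println(s"$numBlocks blocks; the last block stores only $lastBlockMb MB on disk")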

7. What are the FsImage and edit logs?

FsImage: in a Hadoop cluster, the entire file system namespace, the file system properties, and the block mapping of files are stored in one image called the FsImage (File System image). Edit logs: the edit logs record every change made to the file system namespace since the last FsImage was created.

8. What is your action plan if PostgreSQL or MySQL goes down on your cluster?

First, check the log file, find out what the error is, and then work out the solution.

For example: if PostgreSQL reports a "connection bad" error.
Solution: first check the status of the PostgreSQL service

sudo systemctl status postgresql

Then stop the PostgreSQL service

sudo systemctl stop postgresql

Then enable and start the PostgreSQL service again with the right user and permissions

sudo systemctl enable postgresql
sudo systemctl start postgresql

9. If both NameNodes are in standby state, do running jobs keep running or do they fail? With no active NameNode, HDFS cannot serve read/write requests, so running jobs will fail (or hang) until one NameNode becomes active.

10. What is the Ambari port number?

By default, the Ambari port number is 8080, used to access the Ambari web UI and the REST API.




11. Does your Kerberized cluster use LDAP or Active Directory?

It depends on your project: explain whether the integration is with LDAP or with Active Directory.

Spark Lazy Evaluation and Advantages with Example

Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data (big data). Spark provides a simpler programming model than MapReduce: developing a distributed data processing application with Apache Spark is a lot easier than developing the same application with MapReduce. Hadoop MapReduce provides only two operations for processing data, "map" and "reduce", whereas Spark comes with 80-plus data processing operations to work with big data applications.




While processing data from source to destination, Spark can be up to 100 times faster than Hadoop MapReduce because it allows in-memory cluster computing and implements an advanced execution engine.

What is meant by Apache Spark Lazy Evaluation?

In Apache Spark, there are two types of RDD operations:

I)Transformations

II) Actions.

We can define new RDDs at any time, but Apache Spark computes them only lazily, that is, the first time they are used in an action. Lazy evaluation seems unusual at first but makes a lot of sense when you are working with large data (big data).

Simply put, lazy evaluation in Spark means that execution does not start until an action is triggered; lazy evaluation comes into the picture when a Spark transformation occurs.

Consider an example where we define a text file RDD and then filter the lines that include the client name "CISCO". If Apache Spark were to load and store all the lines in the file as soon as we wrote lines = sc.textFile(filePath), the Spark context would waste a lot of storage space, given that we then immediately filter out many lines. Instead, once Spark sees the whole chain of transformations, it computes only the data needed for the result. Hence, for a first() action, Apache Spark scans the file only until it finds the first matching line; it doesn't even read the whole file.
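A minimal sketch of that example, assuming an existing SparkContext named sc and a hypothetical file path:

val lines = sc.textFile("/hdfs/path/client_logs.txt")  // nothing is read yet (transformation)
val ciscoLines = lines.filter(_.contains("CISCO"))     // still nothing is read (transformation)

// Only this action triggers execution, and Spark stops as soon as
// it finds the first matching line instead of reading the whole file.
val firstMatch = ciscoLines.first()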

Advantages of Lazy Evaluation in Spark Transformations:

Some advantages of lazy evaluation in Spark are below:

  • Increased manageability: with lazy evaluation, users can divide the program into smaller operations, and Spark reduces the number of passes over the data by grouping transformations.
  • Increased speed: lazy evaluation saves round trips between the driver and the cluster, which speeds up the process.
  • Reduced complexity: the two complexities of any operation are time and space complexity; with Spark lazy evaluation we can reduce both, because the action is triggered only when the data is required.

Simple Example:

The code below is written in Scala; with the lazy keyword, the expression is evaluated only when the value is first used, not when it is declared.

With Lazy:

scala> val sparkList = List(1, 2, 3, 4)

scala> lazy val output = sparkList.map(x => x * 10)

scala> println(output)

Output:

List( 10, 20, 30, 40 )



Spark Performance Tuning with pictures





Spark Execution with a simple program:

Basically, a Spark program consists of a single driver process and a set of executor processes spread across the nodes of the cluster.

Spark performance tuning starts by measuring bottlenecks, using cluster metrics and blocked-time analysis in a big data environment. Because Spark runs on an in-memory cache, avoiding unnecessary network and disk I/O plays a key role in performance.

For example, take two clusters: one with 25 machines and 750 GB of data, and a second with 75 machines and 4.5 TB of raw data. Network communication turned out to be largely irrelevant to the performance of these workloads; optimizing the network reduced job completion time by only about 5%. Finally, serialize and compress the data.

Apache Spark transformations such as groupByKey and reduceByKey involve shuffle dependencies: Spark executes a shuffle, which transfers data around the cluster. The snippet below chains several transformations and ends with an action:

sc.textFile(" /hdfs/path/sample.txt")
map(mapFunc) #Using Map function 
flatMap(flatMapFunc) #Flatmap is another transformer
filter(filterFunc)
count() # Count the data.

The code above executes a single action, which depends on a sequence of transformations on an RDD derived from the sample.txt file.

Suppose the code counts how many times each character appears in all the words that occur 1,000 or more times in a given text file. The Scala code is below:

val tokens = sc.textFile(args(0)).flatMap(_.split(' '))
val wordCounts = tokens.map((_, 1)).reduceByKey(_ + _)
val filtered = wordCounts.filter(_._2 >= 1000)
val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
charCounts.collect()

The code above breaks down into three stages; the reduceByKey operations create the stage boundaries.

 

 

 

Hadoop Architecture vs MapR Architecture





Basically, in a big data environment Hadoop plays a major role in storage and processing, while MapR is a distribution that provides services for the ecosystem. The Hadoop architecture and the MapR architecture have some differences at the storage level and in naming conventions.

For example, in Hadoop the single storage unit is called a block, but in MapR it is called a container.

Hadoop VS MapR

Architecture-wise, the differences between the two are as follows:
The Hadoop architecture is based on the master node (NameNode) and slave node (DataNode) concept, using HDFS for storage and MapReduce for processing.




The MapR architecture takes a native approach, meaning it can use SAN, NAS, or direct-attached disks to store data and metadata; it accesses the SAN directly, with no need for a JVM layer. If a hypervisor or virtual machine crashes, data is pushed directly to the hard disk, and if a server goes down the rest of the cluster re-syncs that data node's data. MapR has its own file system, the MapR File System (MapR-FS), for storage, and uses MapReduce in the background for processing.

There is no NameNode concept in the MapR architecture; it relies completely on the CLDB (Container Location DataBase). The CLDB contains a lot of information about the cluster and is installed on one or more nodes for high availability.

This is very useful for the failover mechanism: recovery time is just a few seconds.

In the Hadoop architecture, cluster sizing is specified for the master and slave machine nodes, while in MapR the CLDB default size is 32 GB in a cluster.




 

In Hadoop Architecture:

NameNode
Blocksize
Replication

 

In MapR Architecture:

Container Location DataBase
Containers
Mirrors

Summary: The MapR architecture is built on the same core architecture as Apache Hadoop, including all the core components of the distribution. The big data environment has different distributions such as Cloudera and Hortonworks, but MapR is an enterprise edition; it is a stable distribution compared with the rest and provides default security for all services.

Data Engineer Vs Data Science Vs Data Analyst





Nowadays the world runs completely on data. Data engineers are like builders in construction: they make data usable by data analysts and data scientists through APIs and data applications.

A data scientist is a researcher: they use data for advanced analysis, algorithms, data structures, and machine learning.

A data analyst translates data into business insights, using data visualization tools such as QlikView, Tableau, etc.

Data Engineer:

  • Data engineering is the process of extracting raw data and making it ready for analysis by transforming and moving it from source to destination.
  • A data engineer needs strong knowledge and the ability to create and integrate APIs, plus an understanding of data-related queries and performance optimization.
  • A data engineer must have skills in data infrastructure, data warehouse management, extract-transform-load (ETL), reporting tools, etc.
  • Technical skills: Python, SQL, Java, ETL tools, Hadoop, Spark, big data environments, Tableau.

Data Science:

  • A data scientist analyses and interprets complex data and must have skills in statistical modeling, machine learning, identifying actionable insights, maths, and data mining.
  • Technical skills: in-depth programming knowledge of Python, R, or SAS, and big data analytics.
  • Responsible for developing operational models and advanced data analytics.

Data Analytics:

Data analytics is the collection, statistical analysis, and processing of data, mostly numeric data, used to help make better decisions. Some of the tasks are presenting insights as non-technical, actionable results, plus data modeling and reporting techniques along with strong statistical analysis of data.

Simply put, it means getting business value from data through insights (translating data into business value).

DataAnalytics = Data Engineering + Data Science

Pay Scale:

According to Glassdoor, the average pay is:

Data Engineer: $123,070/year

Data Scientist: $115,815/year

Data Analyst: $71,589/year

Summary: In the present market, data has grown enormously compared to previous years, so we need to skill up as a Data Engineer, Data Scientist, or Data Analyst for growth in knowledge and pay scale in the future.




The above three roles are emerging, sustainable roles in huge demand in the IT sector.

Talend Installation on Windows with Pictures

First, we need the prerequisites for installing Talend on the Windows operating system.

Prerequisites for Talend Installation:

Talend's memory usage depends heavily on the size and nature of your Talend projects. If your jobs include many transformation components, you should increase the total amount of memory allocated to your servers.




The recommendations for a Talend installation are as follows:

Product: Studio
Client/Server: Client
Memory: 4 GB recommended
Disk space for installation: 6 GB

Before installing Talend, we need to configure the JAVA_HOME environment variable so that it points to the JDK directory.
Example: C:\Program Files\Java\jdk1.x.x_x\bin

Simple Steps to Talend Installation:

After completing the above steps, download the Talend Studio files for Data Integration, Big Data, or the cloud trial version:

Talend Studio Download

Step 1: Go to the official Talend website; the Talend Studio zip file contains binaries for all platforms (Windows/Mac).

Once the download is complete, extract the zip file on your Windows operating system.

Step 2: In the extracted Talend folder, double-click the TOS_DI_win_x86_64.exe executable file to launch Talend Studio, as in the image below.

Make sure the recommended memory and disk space are available, and avoid spaces in the target installation directory path.

Step 3: After clicking the Talend icon, accept the user license agreement.

Step 4: After accepting the agreement, start your journey with Talend Studio. For a first-time user, set up a new project (or import a demo project) and click the Finish button.

 

After successful completion of the Talend Studio Data Integration installation on Windows, a welcome window opens and you can simply launch the studio.




Talend Studio requires different types of libraries, such as Java (.jar) files and database drivers, to be installed in order to connect a source to a target.
These libraries are also known as external modules in Talend Studio. If you want the cloud trial version or the Big Data version for more features, simply install that edition.

What is Data Mart? Types and Implementation Steps:




What is Data Mart?

In a data warehouse system, the data mart plays a major role. A data mart contains a subset of an organization's data; in other words, a data mart contains only the data that is specific to a particular group. For example, a marketing data mart may contain only data related to items, customers, and sales.

  • Data marts improve end-user response time by allowing users to access the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users.
  • Data marts are confined to specific subjects.

Three Types of Data Mart:

Based on the data source, data marts are divided into three types:

1. Dependent: a dependent data mart is built by drawing data from an existing central data warehouse, which keeps the organization's data united in one centralized data warehouse system.

2. Independent: an independent data mart is a standalone system created without the use of a central data warehouse, built by drawing data directly from internal or external sources (or both), typically for smaller groups within an organization.

3. Hybrid: a hybrid data mart can draw data from operational systems or data warehouses; it is a combination of the dependent and independent approaches.

Implementing Datamart with simple steps:

Implementing a data mart is a somewhat complex procedure. Here are the detailed steps for implementing a data mart:

Designing:

Designing is the first phase of data mart implementation, in which tasks are assigned, information about the requirements is gathered, and the physical and logical design of the data mart is created.

Constructing:

Constructing is the second phase of implementation. It involves creating the physical database and the logical structures in the data warehouse system, covering storage management, fast data access, data protection, and security for the database structure.

Populating: 

Populating is the third phase of implementation. It involves mapping data from source to destination, extracting the source data, and loading the data into the data mart.

Accessing: 

Accessing is the fourth phase of implementation. It involves querying the data and creating reports, charts, etc.

Managing:

Managing is the final phase of data mart implementation. It involves access management, tuning the data mart for the required performance, and managing the loading of fresh data into the data mart.

What is Data Warehouse? Types with Examples

First, we need basic knowledge of databases; then we will move on to the data warehouse and the different types of data warehouse systems.



What is a Database?

A database is a collection of related data, and data is a collection of facts and figures that can be processed to produce information. Mostly, data represents recordable facts; it aids in producing information, which is based on facts. For example, if we have data about the salaries obtained by all students, we can then draw conclusions about the highest salary, etc.

What is a Data Warehouse?

A data warehouse, also called an enterprise data warehouse, is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making.

Data is populated into the data warehouse through the processes of extraction, transformation, and loading (ETL), and it contributes to future decision making. For example, data from different sources is extracted into a single staging area, transformed according to the requirements, and then loaded into the storage system.
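As a hedged illustration of that extract-transform-load flow, here is a small Spark sketch (assuming an existing SparkSession named spark; the paths and column names are hypothetical):

// Extract: read raw data from a source system into a staging DataFrame.
val raw = spark.read.option("header", "true").csv("/staging/sales_raw.csv")

// Transform: clean and reshape the data according to the warehouse requirements.
val transformed = raw
  .filter(raw("amount").isNotNull)
  .withColumnRenamed("cust_id", "customer_id")

// Load: write the transformed data into the warehouse storage layer.
transformed.write.mode("overwrite").parquet("/warehouse/sales")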

Subject-oriented: a data warehouse is used to analyze a particular subject area.

Integrated: a data warehouse integrates data from multiple data sources.

Time-variant: historical data is kept in the data warehouse.

Non-volatile: once data is in the data warehouse, it will not change.

Types of Data Warehouse?

A data warehouse system majorly supports three types of processing:

1. Information processing: a data warehouse allows the data stored in it to be processed by means of querying, basic statistical analysis, and reporting using charts or graphs.

2. Analytical processing: a data warehouse supports analytical processing of the information stored in it, basically OLAP operations including drill-up, drill-down, and pivoting.

3. Data mining: in a data warehouse system, data mining supports knowledge discovery by finding hidden patterns and associations and constructing analytical models, with the results presented using visualization tools.