1. Are there any problems which can only be solved by MapReduce and cannot be solved by Apache Pig? In what scenarios are MapReduce jobs more useful than Pig?
Take a scenario where we want to count the combined population of two cities. We have a data set that lists population records for many different cities, and we want to use MapReduce to count the population of two of them, say Hyderabad and Bangalore. To do that, the records for Hyderabad must be treated like the records for Bangalore, so that the population data of both cities reaches one reducer. In other words, we have to instruct the MapReduce program: whenever you find a city named “Hyderabad” or a city named “Bangalore”, use an alias that serves as a common name for both, so that the two cities share a common key and are passed to the same reducer. For this, we have to write a custom partitioner.
In MapReduce, the natural ‘key’ for this data is the city. Whenever the MapReduce framework comes across a different city, it treats it as a different key, which is why a customized partitioner is needed: if the city is Hyderabad or Bangalore, route it through the same hash code. A custom partitioner like this cannot be written in Pig Latin itself. Pig is a high-level layer on top of the execution engine, not the engine, so a Pig script gives far less direct control over how data is partitioned. In scenarios of this type, MapReduce works better than Apache Pig.
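The partitioning rule described above can be sketched in plain Java. This is only an illustration under stated assumptions, not a complete job: the class name CityPartitionSketch and the alias "HYD_BLR" are hypothetical, and the code deliberately avoids the Hadoop libraries so it runs on its own. In a real job the same logic would sit inside getPartition(...) of a class extending org.apache.hadoop.mapreduce.Partitioner and be registered with job.setPartitionerClass(...).

```java
import java.util.Set;

// Plain-Java sketch of the partitioning rule; no Hadoop dependency.
// Names below (CityPartitionSketch, ALIAS) are hypothetical.
class CityPartitionSketch {

    // Hypothetical alias shared by the two cities we want to co-locate.
    static final String ALIAS = "HYD_BLR";
    static final Set<String> MERGED = Set.of("Hyderabad", "Bangalore");

    // Map a city name to the key that is used for partitioning.
    static String partitionKey(String city) {
        return MERGED.contains(city) ? ALIAS : city;
    }

    // Same formula as Hadoop's default HashPartitioner,
    // but applied to the alias instead of the raw city name.
    static int getPartition(String city, int numPartitions) {
        return (partitionKey(city).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String city : new String[] {"Hyderabad", "Bangalore", "Chennai"}) {
            System.out.println(city + " -> reducer " + getPartition(city, reducers));
        }
    }
}
```

Note that a partitioner only decides which reducer receives a record. Hyderabad and Bangalore still arrive as two separate keys on that reducer, so if one combined total is wanted, the reducer (or the mapper, by emitting the alias as the key) has to add them together.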
2. What is the difference between MapReduce and Apache PIG?
In the Hadoop ecosystem, processing data with MapReduce means writing the entire logic for operations like join, group, and filter yourself.
Pig provides built-in operators and functions for these operations.
Pig Latin is far more compact: roughly 20 lines of Pig Latin do the work of about 400 lines of Java.
Productivity is therefore higher in Pig than in MapReduce programming.
MapReduce needs more effort when writing the code.
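To give a feel for the hand-written logic, here is a small plain-Java imitation of a group-and-sum over "city,population" lines. It is a sketch only: it does not use the Hadoop API, the class name GroupBySketch and the input format are assumptions made for illustration, and the map, shuffle, and reduce steps are ordinary methods standing in for what the framework and your own Mapper and Reducer classes would do. In Pig the same result typically comes from a GROUP followed by a FOREACH ... GENERATE with SUM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of map -> shuffle -> reduce; not the Hadoop API.
class GroupBySketch {

    // "Map" step: parse one "city,population" line into a (key, value) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // "Shuffle" step: collect all values that share the same key.
    static Map<String, List<Integer>> shuffle(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Run all three steps and return one total per city.
    static Map<String, Integer> totals(List<String> lines) {
        Map<String, Integer> out = new TreeMap<>();
        shuffle(lines).forEach((city, values) -> out.put(city, reduce(values)));
        return out;
    }
}
```

Calling GroupBySketch.totals(...) on a list of such lines returns one summed value per city. Even this toy version needs parsing, grouping, and aggregation to be spelled out by hand, which is the extra effort the comparison refers to.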
3. Why should we use the ‘distinct’ keyword in Pig scripts?
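In Pig scripts, the distinct keyword is very simple: it removes duplicate records. Distinct works only on entire records, not on individual fields, so to get the unique values of one field you first project that field and then apply distinct to the result, as in the example below:
daily = load 'daily' as (email, name);
emails = foreach daily generate email;
uniq = distinct emails;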
4. What is the difference between Pig and SQL?
There are several differences between Apache Pig and SQL. The main ones are:
Pig is procedural; SQL is declarative.
Pig is suited to OLAP workloads; SQL is used for both OLAP and OLTP workloads.
In Pig a schema is optional; in SQL a schema is mandatory.