Hadoop Architecture vs MapR Architecture




Basically, in a Big Data environment, Hadoop plays a major role in storage and processing, while MapR is a distribution that provides services to the ecosystem. The Hadoop architecture and the MapR architecture differ at the storage level and in their naming conventions.

For example, in Hadoop the single storage unit is called a block, but in MapR it is called a container.
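To make the Hadoop side concrete, here is a minimal sketch (assuming the Hadoop client libraries are on the classpath) that reads the configured HDFS block size; the property name dfs.blocksize and its 128 MB default are standard in Hadoop 2.x and later:

import org.apache.hadoop.conf.Configuration

object BlockSizeCheck {
  def main(args: Array[String]): Unit = {
    // Loads core-site.xml / hdfs-site.xml from the classpath, if present
    val conf = new Configuration()
    // dfs.blocksize defaults to 128 MB (134217728 bytes) in Hadoop 2.x+
    val blockSize = conf.getLongBytes("dfs.blocksize", 134217728L)
    println(s"HDFS block size: $blockSize bytes (${blockSize / (1024 * 1024)} MB)")
  }
}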

Hadoop vs MapR

Architecture-wise, the differences between the two are as follows:
The Hadoop architecture is based on the master node (NameNode) and slave (DataNode) concept. It uses HDFS for storage and MapReduce for processing.

The MapR architecture takes a native approach, meaning it can use SAN, NAS, or HDFS approaches to store the metadata. It approaches the SAN directly, with no need for a JVM. If a hypervisor or virtual machine crashes, data is pushed directly to the hard disk, which means that if a server goes down, the entire cluster re-syncs that data node's data. MapR has its own filesystem, the MapR File System (MapR-FS), for storage, and uses MapReduce for processing in the background.

There is no NameNode concept in the MapR architecture. It relies completely on the CLDB (Container Location Database). The CLDB contains a lot of information about the cluster and is installed on one or more nodes for high availability.

This is very useful for the failover mechanism: recovery time is just a few seconds.



In the Hadoop architecture, cluster size is specified in terms of master and slave machine nodes, but in MapR the CLDB default size is 32 GB in a cluster.

 

In Hadoop Architecture:

NameNode
Blocksize
Replication

 

In MapR Architecture:

Container Location DataBase
Containers
Mirrors

 



Summary: The MapR architecture is built entirely on the same architecture as Apache Hadoop, including all the core components of the distribution. The Big Data environment has different types of distributions, like Cloudera and Hortonworks, but MapR is an enterprise edition. MapR is a stable distribution compared to all the rest, and it provides default security for all services.

Data Engineer vs Data Scientist vs Data Analyst




Nowadays the world runs completely on data. Data engineers are like builders in construction: they make data usable by data analysts and data scientists through APIs and data applications.

A data scientist is a researcher: they use data for advanced analysis, algorithms, data structures, and machine learning.

A data analyst translates data into business insights, using data visualization tools like QlikView, Tableau, etc.

Data Engineer:

  • Data engineering is the process of extracting raw data and making it ready for analysis by transforming data from source to destination.
  • Strong knowledge of APIs, with the ability to create and integrate them, plus an understanding of data-related queries and performance optimization.
  • A data engineer must have a skill set covering data infrastructure, data warehouse management, extract-transform-load (ETL), reporting tools, etc.
  • Technical skills: Python, SQL, Java, ETL tools, Hadoop, Spark, Big Data environments, Tableau.



Data Science:

  • A data scientist analyzes and interprets complex data and must have skills in statistical modeling, machine learning, identifying actionable insights, maths, and data mining.
  • Technical skills: in-depth programming knowledge of Python, R, or SAS, and Big Data analytics.
  • Responsible for developing operational models and data analytics.

Data Analytics:

Data analytics is the collection, statistical analysis, and processing of data. It takes numeric data and uses it to help make better decisions. Some of the tasks are presenting insights as non-technical, actionable results, and applying data modeling and reporting techniques along with strong statistical analysis of data.

Simply put, it gets business value from data through insights (translating data into business value).

Data Analytics = Data Engineering + Data Science

Pay Scale :

According to Glassdoor Average Pay:

Data Engineer: $123,070/year

Data Scientist: $115,815/year

Data Analyst: $71,589/year



Summary: In the present market, data has grown enormously compared to previous years. So we need to skill up as Data Engineers, Data Scientists, and Data Analysts for growth in knowledge and pay scale, and for future enhancement.

The above three roles are emerging, sustainable roles in huge demand in the IT sector.

Talend Installation on Windows with Pictures




First, we need the prerequisites for installing Talend on the Windows operating system.

Prerequisites for Talend Installation:

Talend's memory usage depends heavily on the size and nature of your Talend projects. If your jobs include many transformation components, you should increase the total amount of memory allocated to your servers.

The following recommendations apply for a Talend installation:

Product - Studio
Client/Server - Client
Memory - 4 GB recommended
Disk space for installation - 6 GB

Before installing Talend, we need to configure the JAVA_HOME environment variable so that it points to the JDK directory.



Example: C:\Program Files\Java\jdk1.x.x_x\bin
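As a quick sanity check, here is a minimal sketch (assuming a Scala REPL or scala-cli is available; any JVM language would do) that confirms the variable is visible to JVM processes such as Talend Studio:

// Prints the JDK path that JVM processes will see, or a warning if unset
sys.env.get("JAVA_HOME") match {
  case Some(path) => println(s"JAVA_HOME = $path")
  case None       => println("JAVA_HOME is not set; configure it before installing Talend")
}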

Simple Steps to Talend Installation:

After completing the above steps, download the Talend Studio files, such as Data Integration, Big Data, and the cloud trial version:

Talend Studio Download

Step 1: Go to the official Talend website and download the Talend Studio zip file, which contains binaries for all platforms (Windows/Mac).

Once the download is complete, extract the zip file on your Windows operating system.

Step 2: In the extracted Talend folder, double-click TOS_DI_win_x86_64.exe, the executable file, to launch Talend Studio, as in the image below.

Make sure enough memory and disk space are available, and avoid spaces in the target installation directory path.

Step 3: After clicking the Talend icon, accept the user license agreement.

Step 4: After accepting the agreement, start your journey with Talend Studio. As a first-time user, you need to set up a new project, or you can import a demo project, and then click the Finish button.

 

After successful completion of the Talend Studio Data Integration installation on Windows, a welcome window opens and the studio launches.

Talend Studio requires different types of libraries, such as Java (.jar) files and database drivers, to be installed in order to connect a source to a target.



These libraries are also known as external modules in Talend Studio. If you want the cloud trial version or the Big Data version for more features, simply install it.

What is Data Mart? Types and Implementation Steps:



What is Data Mart?

The data mart plays a major role in a data warehouse system. A data mart contains a subset of an organization's data; in other words, a data mart contains only the data that is specific to a particular group. For example, a marketing data mart may contain only data related to items, customers, and sales (see the sketch after the list below).

  • Data marts improve end-user response time by allowing users to access the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users.
  • Data Marts are confined to subjects.
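As an illustration, here is a minimal Spark SQL sketch of carving a marketing data mart out of a warehouse table (the paths, table, and column names are hypothetical):

import org.apache.spark.sql.SparkSession

object MarketingMart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MarketingDataMart")
      .master("local[*]") // local run, for illustration only
      .getOrCreate()

    // Hypothetical warehouse table covering the whole organization
    spark.read.parquet("/warehouse/sales").createOrReplaceTempView("sales")

    // The data mart keeps only the subset the marketing group needs
    val marketingMart = spark.sql(
      """SELECT item_id, customer_id, sale_amount, sale_date
        |FROM sales
        |WHERE department = 'marketing'""".stripMargin)

    marketingMart.write.mode("overwrite").parquet("/marts/marketing")
    spark.stop()
  }
}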

Three Types of Data Mart:

Based on the data source, data marts are divided into three types:



1.Dependent: A dependent data mart is built by drawing data from a central data warehouse that has already been created. This allows you to unite the organization's data in one data warehouse system, with centralization.

2.Independent: An independent data mart is a standalone system created without the use of a central data warehouse, built by drawing data directly from internal or external sources of data, or both, for smaller groups within an organization.

3.Hybrid: Hybrid data marts can draw data from operational systems or data warehouses. A hybrid is a combination of the dependent and independent approaches.

Implementing Datamart with simple steps:

Implementing a data mart is a bit of a complex procedure. Here are the detailed steps for implementing a data mart:

Designing:

This is the first phase of data mart implementation, in which tasks are assigned. During this phase, information about the requirements is gathered, and then the physical and logical design of the data mart is created.

Constructing:

Constructing is the second phase of implementation. It involves creating the physical database and the logical structures in the data warehouse system. This covers storage management, fast data access, data protection, and security for the database structure.

Populating: 

Populating is the third phase of implementation. It involves mapping data from source to destination, extracting the source data, and loading the data into the data mart.
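For instance, a minimal Spark sketch of this phase (the source path, column names, and mart location are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PopulateMart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PopulateMart").master("local[*]").getOrCreate()

    // Extract: read the source data (a hypothetical CSV export)
    val source = spark.read.option("header", "true").csv("/staging/orders.csv")

    // Map: rename and select only the columns the mart schema needs
    val mapped = source
      .withColumnRenamed("ord_id", "order_id")
      .select(col("order_id"), col("customer_id"), col("amount").cast("double"))

    // Load: write into the data mart's storage
    mapped.write.mode("append").parquet("/marts/sales")
    spark.stop()
  }
}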

Accessing: 

Accessing is the fourth phase of implementation. It involves querying data, creating reports and charts, etc.

Managing:

Managing is the final phase of data mart implementation. It involves managing access, tuning the database as required, and managing the loading of fresh data into the data mart.


What is Data Warehouse? Types with Examples




First, we need basic knowledge of databases; then we will go on to the data warehouse and the different types of data warehouse systems.

What is a Database?

A database is a collection of related data, and data is a collection of characteristics and figures that can be processed to produce information. Mostly, data represents recordable facts. Data aids in producing information, which is based on facts. If we have data about the salaries obtained by all students, we can then draw conclusions about the highest salary, etc.

What is a Data Warehouse?

A data warehouse, also called an enterprise data warehouse, is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making.

Data is populated into the data warehouse through the processes of extraction, transformation, and loading (ETL), and it contributes to future decision making. For example, data taken from different sources is extracted into a single staging area, transformed according to the requirements, and then loaded into the storage system.
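For instance, here is a minimal Spark sketch of that flow (the file paths and schemas are hypothetical): extract from two sources, transform both into a common shape, and load the result into warehouse storage:

import org.apache.spark.sql.SparkSession

object WarehouseEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WarehouseETL").master("local[*]").getOrCreate()

    // Extract: pull data from two different (hypothetical) sources
    val crm  = spark.read.option("header", "true").csv("/sources/crm_customers.csv")
    val shop = spark.read.json("/sources/shop_customers.json")

    // Transform: align both sources to one schema
    val common  = Seq("customer_id", "name", "country")
    val unified = crm.selectExpr(common: _*).unionByName(shop.selectExpr(common: _*))

    // Load: write into the warehouse storage layer
    unified.write.mode("overwrite").parquet("/warehouse/customers")
    spark.stop()
  }
}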



Subject-Oriented: A data warehouse is used to analyze a particular subject area.

Integrated: A data warehouse integrates data from multiple data sources.

Time-Variant: Historical data is kept in a data warehouse.

Non-Volatile: Once data is in the data warehouse, it will not change.

Types of Data Warehouse

A data warehouse system is used in three major ways:

1.Information Processing: A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using charts or graphs.

2.Analytical Processing: A data warehouse supports analytical processing of the information stored in it. Basically, these are OLAP operations, including drill-up, drill-down, and pivoting (see the sketch after this list).

3.Data Mining: In a data warehouse system, data mining supports knowledge discovery by finding hidden patterns and associations and by constructing analytical models. The results are presented using visualization tools.
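To illustrate, Spark's DataFrame API offers rollup and cube aggregations that mirror the drill-up/drill-down behavior described above (a sketch for spark-shell, where spark.implicits._ is pre-imported; the sales data is hypothetical):

import org.apache.spark.sql.functions.sum

// Hypothetical sales data: (year, month, amount)
val sales = Seq((2023, 1, 100.0), (2023, 2, 150.0), (2024, 1, 200.0))
  .toDF("year", "month", "amount")

// rollup produces subtotals at the (year, month), (year), and grand-total
// levels, i.e. the drill-up / drill-down hierarchy of OLAP
sales.rollup("year", "month")
  .agg(sum("amount").alias("total"))
  .orderBy("year", "month")
  .show()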


Sample Talend Questions – MCQs



1. Which of the following in the design workspace indicates an error with a component in Talend?

A) A red ‘X’
B) A red exclamation point
C) A green ‘I’
D) A yellow exclamation point

2. Which of the following components can be used to implement a lookup in Talend?

A) tJoin
B) tLookUp
C) tMap
D) tUnite

3. tMap offers the following match modes for a lookup in Talend Studio:

A) Unique match
B) Only unique match
C) First match
D) All matches

4. tMap offers the following join models in Talend:

A) Left Outer Join
B) Right Outer Join
C) Inner Join
D) Full Outer Join

5. Which of the following components is used to execute a job an infinite number of times in Talend?

A) tInfiniteLoop
B) tFileWatcher
C) tForEach
D) tRunJob



6. How do you access parameters in the Global Map in the Talend ETL tool?

A) globalMap.put("Key", Object)
B) globalMap.get("Key", Object)
C) globalMap.put("key")
D) globalMap.get("key")

7. How do you reference the value of the context variable FileName in configuration while programming in Talend?

A) Context.FileName
B) context.FileName
C) FileName.value
D) $context.FileName

8. While installing your Talend solutions, which of the following variables is it mandatory to set?

A) JAVA_HOME
B) TALEND_HOME
C) TIS_HOME
D) JRE_HOME

9. What is the use of the tReplicate component? Choose the best answer.

A) To duplicate the configuration of an existing component
B) To copy the input row to an output row without processing it
C) To duplicate a subjob
D) To send duplicates of an output row to multiple target components

10. How do you see the configuration of an error message for a component in Talend Studio?

A) Right-click the component and then click Show Problem
B) From the Errors view
C) Place the mouse pointer over the error symbol in the design workspace
D) From the Problems view

11. How do you create a row between two components in Talend?

A) Drag the target component onto the source component
B) Right-click the source component, click Row followed by the row type, and then click the target component
C) Drag the source component onto the target component
D) Right-click the source component and then click the target component

12. How do you ensure that a subjob completes before a second subjob runs in Talend?

A) Using a RunIf trigger
B) Using the Main connection
C) Using an OnComponentOk or OnComponentError trigger
D) Using an OnSubjobOk or OnSubjobError trigger

13. Which of the following components would be used to load a JSON file into a MySQL database in Talend?

A) tMySQLInput
B) tFileInputJSON
C) tMySQLOutput
D) tMap

14. How do you run a job in Talend Studio?

A) Click the Run button in the Run view
B) Click the Run button in the Job view
C) Click the Run button in the File menu
D) Click the Start button in the Run view



15. What is the best practice for arranging components on the design workspace in Talend Studio?

A) Bottom to top
B) Right to left
C) Top to bottom
D) Matching the flow of data

16. From which tab in the Component view would you specify the component label in Talend?

A) View
B) Advanced settings
C) Basic settings
D) Documentation

17. How do you place a component in a job in Talend Studio?

A) Click it in the Edit menu
B) Click it in the Repository and then click in the design workspace
C) Drag it from the Repository to the design workspace
D) Click it in the Palette and then click in the design workspace


Talend Questions for Certification

Sample Questions for Talend Certification




1. Which of the following transformations/operations are possible using tMap?

A) Lookup
B) Join
C) Sorting
D) Filtering

2. What is a Job in Talend?

A) A visual set of components graphically connected using different connections
B) A visual set of metadata graphically connected using different components
C) A collection of components and metadata
D) a & c

3. Which of the following components is used to generate sample data?

A) tFixedFlowInput
B) tGenerateDate
C) tRowGenerator
D) tSampleData

4. Which layouts can be exported from a component in Talend?

A) Excel format
B) Text file
C) XML file
D) CSV file

5. Which of the following is the correct way to parse a String column to a Date?

A) TalendDate.parseDate("MM/dd/yyyy", row2.date)
B) Date.parseDate("MM/dd/yyyy", row2.date)
C) TalendDate.getDate("MM/dd/yyyy", row2.date)
D) TalendDate.formatDate("MM/dd/yyyy", row2.date)

6. Which of the following components is used to store log and statistical information about your job?
A) tStatsCatcher
B) tLogCatcher
C) tFlowMeterCatcher
D) None of the above

7. In order to filter all files with a name containing the string "AMZ_AMZ001" using the tFileList component in Talend:

A) The Directory property should be set to "AMZ_AMZ001"
B) Set the FileMask property to "AMZ_AMZ001"
C) Set the FileMask property to "*AMZ_AMZ001*"
D) Set the FileMask property to "*AMZ_AMZ001"

8. In which user interface element do you find Business Models, Job designs & Metadata?

A) The Job view
B) The Repository
C) The design workspace
D) The Palette

9. What is indicated by an asterisk next to the job name in the design workspace?

A) That this is an active job
B) That the job contains an error
C) That the job contains unsaved changes
D) That the job is currently running

10. When you first start Talend Open Studio, what are the advantages of creating a Talend account? Choose all that apply.

A) You can visit MyTalend.com
B) You are required to create an account
C) You can post questions/answers to the Talend forum
D) You can download components from Talend Exchange

11. From which view in Talend Open Studio would you clear the statistics from the design workspace? Choose one answer.
A) The Component view
B) The Context view
C) The Problems view
D) The Run view
E) The Job view


Hadoop Admin vs Hadoop Developer




Basically, in the Hadoop environment, Hadoop Admin and Hadoop Developer are the major roles. According to present IT market surveys, admins have more responsibilities and higher salaries compared to Hadoop developers. We can differentiate the roles with the points below:

Hadoop Developer:

  1. In the Big Data environment Hadoop plays a major role, especially for Hadoop developers. A developer is primarily responsible for coding; in Hadoop that means developing with things like:

A) Apache Spark – Scala, Python, Java, etc.

B) MapReduce – Java

C) Apache Hive – HiveQL (a SQL-like query language)

D) Apache Pig – Pig scripting language, etc.

2. Familiarity with ETL backgrounds and with data loading and ingestion tools like:

A)Flume

B)Sqoop

3. A bit of knowledge of the Hadoop admin side as well, such as the Linux environment and some of the basic commands used while developing and executing.

4. Nowadays, Spark and Hive developers with high-level experience are the most preferred and earn huge packages.

Hadoop Administration:

1. Hadoop administration is a good and respectable job in the IT industry. The admin is responsible for performing the operational tasks that keep the infrastructure and jobs running.

2. Strong knowledge of the Linux environment, setting up clusters and security authentication like Kerberos, and testing the HDFS environment.

3. Providing new users with access to Hive, Spark, etc., and cluster maintenance like adding (commissioning) and removing (decommissioning) nodes, as well as resolving errors like memory issues, user access issues, etc.

4. Must-have knowledge of Big Data platforms like:

A) Cloudera Manager

B) Hortonworks Data Platform

C) MapR



D) Pseudo-distributed and single-node cluster setup, etc.

5. Reviewing and managing log files and setting up XML configuration files.

6. As of now, a trending job with good career growth.

7. Compared to Hadoop developers, Hadoop admins are getting higher salary packages in the present market.



Summary: In the Big Data environment, Hadoop offers valuable and trending jobs, and provides huge packages for both Hadoop developers and Hadoop administrators. Which to prefer depends on the skill set we need for future growth.

Big Data Spark Multiple Choice Questions



Spark Multiple Choice Questions and Answers:

1) Point out the incorrect statement in the context of Cassandra:

A) Cassandra is a decentralized key-value store

B) Cassandra was originally designed at Facebook

C) Cassandra is designed to handle a large amount of data across many commodity servers, providing high availability with no single point of failure

D) Cassandra uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing

Ans: D

2) Which of the following are the simplest NoSQL databases in a Big Data environment?

A) Document
B) Key-Value Pair
C) Wide-Column
D) All of the above

Ans: D) All of the above

3) Which of the following is not a NoSQL database?

A) Cassandra
B) MongoDB
C) SQL Server
D) HBase

Ans: SQL Server


4) Which of the following is a distributed graph processing framework on top of Spark?

A) Spark Streaming
B) MLlib
C) GraphX
D) All of the above

Ans: GraphX

5) Which of the following leverages Spark Core's fast scheduling capability to perform streaming analytics?

A) Spark Streaming
B) MLlib
C) GraphX
D) RDDs

Ans: Spark Streaming

6) The machine learning API for Spark is based on which of the following?

A) RDD
B) Dataset
C) DataFrame
D) All of the above

Ans: DataFrame
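The DataFrame-based API is the spark.ml package; here is a minimal sketch (for spark-shell, where spark.implicits._ is pre-imported; the feature columns are hypothetical):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Hypothetical training data as a DataFrame, the input type spark.ml is built on
val training = Seq((1.0, 2.0, 5.0), (2.0, 3.0, 8.0), (3.0, 4.0, 11.0))
  .toDF("x1", "x2", "label")

// Assemble raw columns into the single vector column spark.ml expects
val assembler = new VectorAssembler().setInputCols(Array("x1", "x2")).setOutputCol("features")
val model = new LinearRegression().fit(assembler.transform(training))
println(s"coefficients: ${model.coefficients}")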

7) Spark's optimizer is based on the constructs of which functional programming language?

A) Python
B) R
C) Java
D) Scala

Ans: Scala, a functional programming language

8) Which of the following is the basic abstraction of Spark Streaming?

A) Shared variable
B) RDD
C) DStream
D) All of the above

Ans: DStream
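To make the abstraction concrete, here is a minimal DStream sketch (assuming a text source on localhost:9999, e.g. started with netcat):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamDemo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // A DStream is a continuous series of RDDs, one per 5-second batch
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}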

9) Which cluster managers does Spark support?

A) Mesos
B) YARN
C) Standalone cluster manager
D) Pseudo cluster manager
E) All of the above

Ans: All of the above

10) Which of the following is the reason Spark is faster than MapReduce in execution time?

A) It supports different programming languages like Scala, Python, R, and Java
B) RDDs
C) DAG execution engine and in-memory computation (RAM-based)
D) All of the above

Ans: DAG execution engine and in-memory computation (RAM-based)
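The difference is easy to see once an RDD is cached: later actions reuse the in-memory copy instead of re-reading from disk, as MapReduce would between jobs (a sketch for spark-shell; the file path is hypothetical):

// Expensive base dataset, kept in memory after the first action
val errors = sc.textFile("/data/server.log").filter(_.contains("ERROR")).cache()

// The first action materializes and caches the RDD
println(errors.count())

// Later actions reuse the in-memory copy instead of re-reading the file
println(errors.filter(_.contains("timeout")).count())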


BigData and Spark Multiple Choice Questions – I




1. In Spark, a ______ is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

A) Resilient Distributed Dataset (RDD)
B) Spark Streaming
C) Driver
D) Flat Map

Ans: Resilient Distributed Dataset (RDD)

2. Consider the following statements in the context of Apache Spark:

Statement 1: Spark allows you to choose whether you want to persist Resilient Distributed Dataset (RDD) onto the disk or not.

Statement 2: Spark also gives you control over how you can partition your Resilient Distributed Datasets (RDDs).

A) Only statement 1 is true
B) Only statement 2 is true
C) Both statements are true
D) Both statements are false

Ans: Both statements are true
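Both statements can be demonstrated in a few lines in spark-shell: StorageLevel controls the disk/memory persistence choice from statement 1, and HashPartitioner controls the partitioning from statement 2:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Statement 1: choose whether the RDD is persisted to disk
val onDisk = pairs.persist(StorageLevel.DISK_ONLY)

// Statement 2: control how the RDD is partitioned
val repartitioned = pairs.partitionBy(new HashPartitioner(4))
println(repartitioned.partitions.length) // 4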




3) Given the following definition about the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

The join operation is used for joining two datasets. When it is called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

What is the output of joinrdd when the following code is run?

val rdd1 = sc.parallelize(Seq(("m", 55), ("m", 56), ("e", 57), ("e", 58), ("s", 59), ("s", 54)))
val rdd2 = sc.parallelize(Seq(("m", 60), ("m", 65), ("s", 61), ("s", 62), ("h", 63), ("h", 64)))
val joinrdd = rdd1.join(rdd2)
joinrdd.collect
A) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))
B) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))
C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))
D) None of the mentioned

Ans: Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

4) Consider the following statements:

Statement 1: Scale up means growing your cluster capacity by replacing existing machines with more powerful machines.

Statement 2: Scale out means incrementally growing your cluster capacity by adding more COTS machines (Components Off the Shelf).

A) Only statement 1 is true
B) Only statement 2 is true
C) Both statements are true
D) Both statements are false

Ans: Both statements are true