Spark Lazy Evaluation and Advantages with Example

Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data (Big Data). Spark provides a simpler programming model than MapReduce, so developing a distributed data processing application with Apache Spark is much easier than developing the same application with MapReduce. Hadoop MapReduce offers only two operations for processing data, "Map" and "Reduce", whereas Spark comes with more than 80 data processing operations for building big data applications.




When processing data from source to destination, Spark can be up to 100 times faster than Hadoop MapReduce because it supports in-memory cluster computing and implements an advanced execution engine.

What is meant by Apache Spark Lazy Evaluation?

In Apache Spark, there are two types of RDD operations:

I) Transformations

II) Actions

We can define new RDDs at any time, but Apache Spark computes them lazily, that is, only the first time they are used in an action. Lazy evaluation seems unusual at first, but it makes a lot of sense when you are working with large data sets (Big Data).

Simply put, lazy evaluation in Spark means that execution does not start until an action is triggered; the laziness comes into the picture whenever a Spark transformation occurs.

Consider a case where we define a text-file RDD and then filter the lines that contain the client name "CISCO". If Spark were to load and store all the lines in the file as soon as we wrote lines = sc.textFile(filePath), it would waste a lot of storage space, given that we then immediately filter out many of those lines. Instead, once Spark sees the whole chain of transformations, it can compute only the data needed for the result. For a first() action, Spark scans the file just until it finds the first matching line; it does not even read the whole file.
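A minimal sketch of this scenario in the Spark shell; the file path and the client name used here are only illustrative assumptions:

scala> val lines = sc.textFile("/data/clients.txt")                     // hypothetical path; nothing is read from disk yet

scala> val ciscoLines = lines.filter(line => line.contains("CISCO"))    // still lazy: only the lineage is recorded

scala> ciscoLines.first()                                               // the action triggers execution; the scan stops at the first match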

Advantages Of Lazy Evaluation in Spark Transformations:

Some advantages of lazy evaluation in Spark are listed below:

  • Increases manageability: With Spark lazy evaluation, users can divide their program into many smaller operations, and Spark reduces the number of passes over the data by grouping the transformations together (see the sketch after this list).
  • Increases speed: Lazy evaluation saves round trips between the driver and the cluster, which speeds up the process.
  • Reduces complexity: The two kinds of complexity for any operation are time and space complexity, and Spark lazy evaluation helps reduce both, since an action is triggered only when the data is actually required.
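As a rough illustration of these points (the numbers are arbitrary), the following Spark-shell sketch chains two transformations and triggers them with a single action, so the data is traversed only once:

scala> val nums = sc.parallelize(1 to 1000000)

scala> val result = nums.map(_ * 2).filter(_ % 3 == 0)   // nothing is computed yet

scala> result.count()                                    // map and filter are grouped and run in one pass over the data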

Simple Example:

The lazy-evaluation example below is written in Scala, which normally evaluates an expression as soon as it is declared; the lazy keyword defers the evaluation until the value is first used.

With Lazy:

scala> val sparkList = List(1, 2, 3, 4)

scala> lazy val output = sparkList.map(i => i * 10)

scala> println(output)

Output:

List(10, 20, 30, 40)
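Note that the lazy keyword above demonstrates Scala's own deferred evaluation: output is only computed when println first references it. Spark's RDD and DataFrame transformations are lazy by design, so they behave this way even without the lazy keyword.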



Hadoop Architecture vs MapR Architecture





Basically, in a Big Data environment Hadoop plays a major role in storage and processing, while MapR is a distribution that provides services around the Hadoop ecosystem. The Hadoop and MapR architectures differ in a few ways at the storage level and in naming conventions.

For example, in Hadoop the basic storage unit is called a block, whereas in MapR it is called a container.

Hadoop VS MapR

Architecture-wise, the main differences are as follows:
The Hadoop architecture is based on the master node (NameNode) and slave node (DataNode) concept, using HDFS for storage and MapReduce for processing.




The MapR architecture takes a native approach: rather than relying on SAN, NAS, or HDFS layers to store data and metadata, it accesses the disks directly, with no JVM in the storage path. Because data is written straight to disk, it survives even if a hypervisor or virtual machine crashes, and if a server goes down the rest of the cluster re-syncs that node's data. MapR has its own filesystem, the MapR File System, for storage, and it uses MapReduce for processing in the background.

There is no NameNode concept in the MapR architecture; it relies entirely on the CLDB (Container Location Database). The CLDB holds a great deal of information about the cluster and is installed on one or more nodes for high availability.

This is very useful for failover, bringing recovery time down to just a few seconds.

In the Hadoop architecture, cluster sizing is described in terms of master and slave nodes, while in MapR the default container size in a cluster is 32 GB.




 

In Hadoop Architecture:

NameNode
Blocksize
Replication

 

In MapR Architecture:

Container Location DataBase
Containers
Mirrors

Summary: The MapR architecture is built on the same core architecture as Apache Hadoop, including all of the core components of the distribution. The Big Data environment has several distributions, such as Cloudera and Hortonworks, but MapR is positioned as an enterprise edition. MapR is a stable distribution compared with the others and provides security by default for all services.

Talend Installation on Windows with Pictures

First, let's look at the prerequisites for installing Talend on the Windows operating system.

Prerequisites for Talend Installation:

Talend's memory usage depends heavily on the size and nature of your Talend projects. If your jobs include a large number of transformation components, you should increase the total amount of memory allocated to your servers.




The following are the recommended minimums for a Talend installation:

Product - Studio
Client/Server - Client
Memory - 4 GB recommended
Disk space for installation - 6 GB

Before installing Talend, we need to configure the JAVA_HOME environment variable so that it points to the JDK directory.
Example: C:\Program Files\Java\jdk1.x.x_x\bin

Simple Steps to Talend Installation:

After completing the steps above, download the Talend Studio files, such as Data Integration, Big Data, or the cloud trial version:

Talend Studio Download

Step 1: Go to the official Talend website; the Talend Studio zip file contains binaries for all platforms (Windows/Mac).

Once the download is complete, extract the zip file on your Windows operating system.

Step 2: In the extracted Talend folder, double-click TOS_DI_win_x86_64.exe; this executable launches your Talend Studio.

It is recommended to provide enough memory and disk space, and to avoid spaces in the target installation directory path.

Step 3: After clicking the Talend icon, accept the user license agreement.

Step 4: After accepting the agreement, you can start your journey with Talend Studio. A first-time user needs to set up a new project, or you can import a demo project, and then click the Finish button.

 

After Talend Studio Data Integration has been installed successfully on Windows, a welcome window opens and you can simply launch the studio.




Talend Studio requires different types of libraries, such as Java (.jar) files and database drivers, to be installed in order to connect a source to a target.
These libraries are also known as external modules in Talend Studio. If you want the cloud trial version or the Big Data version for more features, simply install it the same way.

What is Data Mart? Types and Implementation Steps:




What is Data Mart?

Data Marts play a major role in a Data Warehouse system. A Data Mart contains a subset of an organization's data; in other words, a data mart contains only the data that is specific to a particular group. For example, a marketing data mart may contain only data related to items, customers, and sales.

  • Data Marts improve end-user response time by allowing users to access the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users.
  • Data Marts are confined to subjects.

Three Types of Data Mart:

Based on the data source, Data Marts are divided into three types:
1. Dependent: A dependent data mart is built by drawing data from an existing central data warehouse. This allows you to unite the organization's data in one centralized data warehouse system.

2. Independent: An independent data mart is a standalone system created without the use of a central data warehouse, by drawing data directly from internal or external sources (or both), typically for smaller groups within an organization.

3. Hybrid: A hybrid data mart can draw data from operational systems or data warehouses; it is a combination of the dependent and independent approaches.

Implementing a Data Mart in simple steps:

Implementing a Data Mart is a somewhat complex procedure. Here are the detailed steps:

Designing:

This is the first phase of Data Mart implementation, in which tasks are assigned and information about the requirements is gathered. Then the physical and logical designs of the data mart are created.

Constructing:

Constructing is the second phase of implementation. It involves creating the physical database and the logical structures in the data warehouse system. This phase covers storage management, fast data access, data protection, and security for the database structure.

Populating: 

Populating is the third phase of implementation. It involves mapping data from source to destination, extracting the source data, and loading the data into the Data Mart.

Accessing: 

Accessing is the fourth phase of implementation. It involves querying the data and creating reports, charts, and so on.

Managing:

Managing is the final phase of Data Mart implementation. It involves managing access, tuning the database as required, and managing the loading of fresh data into the data mart.

What is Data Warehouse? Types with Examples

First, we need basic knowledge of databases; then we will look at what a Data Warehouse is and at the different types of Data Warehouse systems.



What is a Database?

A database is a collection of related data, and data is a collection of characteristics and figures that can be processed to produce information. Mostly, data represents recordable facts. Data aids in producing information, which is based on facts. For example, if we have data about the salaries obtained by all students, we can then draw conclusions about the highest salary, and so on.

What is a Data Warehouse?

A Data Warehouse, also called an enterprise data warehouse, is a subject-oriented, integrated, time-variant, and non-volatile collection of data that supports management's decision making.

Data is populated into the Data Warehouse through the processes of extraction, transformation, and loading (ETL), and it contributes to future decision making. For example, data taken from different sources is extracted into a single staging area, transformed according to the requirements, and then loaded into the storage system.
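As a rough illustration only (Spark is just one of many possible ETL tools, and the paths and column names below are hypothetical), the three ETL stages might look like this in a Spark shell:

scala> val raw = sqlContext.read.json("/staging/sales.json")                                         // Extract from a source system

scala> val cleaned = raw.filter(raw("amount").isNotNull).select("order_id", "amount", "sale_date")   // Transform: clean and reshape

scala> cleaned.write.mode("overwrite").parquet("/warehouse/sales")                                   // Load into warehouse storage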

Subject Oriented: A data warehouse is used to analyze a particular subject area.

Integrated: A data warehouse integrates data from multiple data sources.

Time-Variant: Historical data is kept in a data warehouse.

Non-Volatile: Once data is in the data warehouse, it does not change.

Types of Data Warehouse?

There are three major types of data warehouse usage:

1. Information Processing: A data warehouse allows the data stored in it to be processed by means of querying, basic statistical analysis, and reporting using charts or graphs.

2. Analytical Processing: A data warehouse supports analytical processing of the information stored in it, basically OLAP operations including drill-up, drill-down, and pivoting.

3. Data Mining: In a data warehouse system, data mining supports knowledge discovery by finding hidden patterns and associations and constructing analytical models, with the results presented using visualization tools.

Sample Talend Questions – MCQs





1. Which of the following in the design workspace indicates an error with a component in Talend

A) A red ‘X’                                                                      B) A red exclamation point

C)A green ‘ I’                                                                    D)A yellow  exclamation point

2. Which of the following components can be used to implement lookup in Talend

A)tJoin                                                                         B) tLookUp

C)tMap                                                                        D)tUnite

3. tMap offers the following match modes for a lookup in Talend Studio

A)Unique match                                             B)Only unique match

C)First match                                                   D)All matches

4. tMap offers the following join models in the Talend tool

A)Left Outer Join                                     B)Right Outer Join

C)Inner Join                                                D)Full Outer Join

5. Which of the following components is used to execute a job an infinite number of times in Talend

A)tInfiniteLoop                                          B) tFileWatcher

C) tForEach                                                  D)tRunJob

6. How do you access parameters in the Global Map in the Talend ETL tool

A)globalMap.put(“Key”, Object)

B)globalMap.get(“Key”, Object)

C)globalMap.put(“key”)

D)globalMap.get(“key”)

7. How do you reference the value of the context variable FileName in a component configuration while programming in Talend

A)Context.FileName                                B)context.FileName

C)FileName.value                                     D)$context.FileName

8. While installing your Talend solutions, which of the following variables is mandatory to set?

A)JAVA_HOME                           B)TALEND_HOME

C)TIS_HOME                                 D)JRE_HOME

9. What is the use of the tReplicate component? Choose the best answer.

A)To duplicate the configuration of an existing component

B)To copy the input row to an output row without processing it

C)To duplicate a sub job

D)To send duplicates of an output row to multiple target components

10. How do you view the error message for a component in Talend Studio?

A)Right-click the component and then click show problem

B)From the errors view

C)Place the mouse pointer over error symbol in the design workspace

D)From the problems view

11. How do you create a row connection between two components in Talend

A) Drag the target component to source component

B)Right-click the source component click Row followed by the row type and then click the target component

C)Drag the source component onto target components

D)Right-click the source component and then click the target component

12. How do you ensure that a subjob completes before a second subjob runs in Talend?

A)Using RunIf trigger

B)Using the main connection

C)Using onComponentOk or OnComponentError trigger

D)Using onSubJobOk or onSubJobError trigger

13. Which of the following components would be used to load a JSON file into a MySQL database in Talend?

A)tMySQLInput

B)tFileInputJSON

C)tMySQLOutput

D)tMap

14. How do you run a job in Talend Studio?

A)Click the Run button in the Run view

B)Click the Run button in the Job view

C)Click the Run button in the File Menu

D)Click the Start button in the Run view

15. What is the best practice for arranging components on the design workspace in Talend Studio?

A)Bottom to Top

B)Right to Left

C)Top to Bottom

D)Matching the flow of data

16. From which tab in the Component view would you specify the component label in Talend

A)View

B)Advanced settings

C)Basic settings

D)Documentation

17. How do you place a component in a job in Talend Studio?

A) Click it on Edit Menu

B) Click it in the Repository and then click in the design workspace

C) Click it from Repository to the design workspace

D)Click it in the Palette and then click in the design workspace

MapR

What is MapR?

MapR is one of the Big Data distributions. It is a complete enterprise distribution for Apache Hadoop, designed to improve Hadoop's reliability, performance, and ease of use.

Why MapR?

1. High Availability:




MapR provides high availability features such as self-healing, meaning there is no NameNode architecture.

It also offers JobTracker high availability and HA NFS. MapR achieves this by distributing its file system metadata across the cluster.

2. Disaster Recovery:

MapR provides a mirroring facility that allows users to define policies and mirror data automatically within a multi-node or single-node cluster, or between on-premise and cloud infrastructure.

3.Record Performance:

MapR holds a world performance record, at a cost of only $9 compared with the earlier cost of $5M, at a speed of 54 seconds, and it can handle large clusters of around 2,200 nodes.

4.Consistent Snapshots:

MapR is the only big data distribution that provides consistent, point-in-time recovery, thanks to its unique read-write storage architecture.

5. Complete Data Protection:

MapR has its own security system for data protection at the cluster level.

6.Compression:

MapR provides automatic, behind-the-scenes compression of data; compression is applied automatically to files in the cluster.

7.Unbiased Open Source:

MapR is a completely unbiased open source distribution.

8. Real Multi-tenancy, Including YARN

9.Enterprise-grade NoSQL

10. Read and Write file system:

MapR provides a fully read-write file system, unlike HDFS, which is write-once/append-only.

MapR Ecosystem Packs (MEP):

The "MapR Ecosystem" is the set of open source projects included in the MapR Platform, and a "Pack" is a bundled set of MapR Ecosystem projects with specific versions.

MapR Ecosystem Packs are generally released every quarter, with yearly releases as well.

A single version of MapR may support multiple MEPs, but only one at a time.




In addition to familiar Hadoop ecosystem and open source components such as Spark and Hive, the MapR Ecosystem Packs include tools such as the following:

Collectd
Elasticsearch
Grafana
Fluentd
Kibana
Open TSDB

Toughest Big Data (Spark, Kafka, Hive) Interview Questions

Hard Interview Questions for Spark, Kafka, and Hive:





1. How do you handle Kafka backpressure with scripting parameters?

2. How do you achieve performance tuning through executors?

3. What is the ideal size when deciding the number of executors, and how much RAM should be used?

4. How do you scale Kafka brokers and integrate them with Spark Streaming, along with the script, without stopping the cluster?

5. How do you delete records in Hive, and how do you delete duplicate records using a script?

6. Can more than one replica exist in the same rack?

7. Out of 10 tables in a database, one table fails while importing from MySQL into HDFS using Sqoop. What is the solution?

8. If you submit a Spark job to a cluster and, after most of the RDDs have already been created, the cluster goes down in the middle of processing, what happens to your RDDs and how is the data handled?

Summary: Nowadays, these kinds of scenario-based interview questions are asked in Big Data interviews for Spark and Hive.

Top 7 Technology Trends for 2019




1. Artificial Intelligence:

Artificial Intelligence is one of the most emerging technologies in the industry and will continue to be talked about for the next 30 years. At present, Google has built algorithms for optimized results; for example, Google Brain developed algorithms that created new encryption methods, policies, and neural networks. Facebook has also created its own languages through AI. It is, however, completely different from human intelligence.

2. Blockchain:

Blockchain is mostly used in the financial services industry and in business. It is continuously evolving and will be a key technology of the next decade. In the digital market, digital currencies such as Bitcoin are expected to take a huge share. Blockchain offers full transparency and security. According to Gartner, it will generate $3.1 trillion in business value by 2030.

3.Machine Learning:


Nowadays, Machine Learning is a rapidly emerging technology because it enables machines to learn from data. We are approaching a time when machines reach a higher level of intelligence than humans at specific tasks. Complex tasks like image recognition no longer take a long time, and most applications will include machine learning. Eventually, computers will get really good at talking like humans.

4. IoT:

The Internet of Things (IoT) will play a major role in the future, and the market is expected to grow strongly in the coming years. Many organizations have already deployed IoT-based solutions; for example, in a car booking system, the GPS in the car used for tracking location and security is also a form of IoT.

5.Big Data  and Analytics:

At present, the IT market is seeing the emergence and growth of Big Data and analytics, which examine large amounts of data. Most companies use Big Data technologies such as Hadoop and cloud-based analytics, especially from the data science, business, and real-time perspectives. In the real world they are used in banking, healthcare, manufacturing, and so on.

6. Cloud Computing & DevOps:

In recent years, most people have been talking about Cloud Computing and DevOps, especially cloud platforms such as AWS, Microsoft Azure, and Google Cloud. The digital world is connected to the cloud, which provides the digital infrastructure: remotely hosted servers on the Internet that store and process data. Many cloud-based products are already available in the IT market.

DevOps, in turn, means that development and operations work together to complete software development. It reduces failures, supports continuous integration, and increases efficiency.

7. RPA:

Robotic Process Automation (RPA) is an emerging form of automation technology based on Artificial Intelligence. Many people think that a digital enterprise involves coding and testing a lot of applications, so we need bots that can handle repetitive and predictable tasks.



Parquet File with Example




Parquet:

Parquet is a columnar format that is supported by many data processing systems. Spark SQL supports both reading and writing Parquet files and automatically preserves the schema of the original data. Parquet is a popular column-oriented storage format that can store records with nested fields efficiently. It is often used with tools in the Hadoop ecosystem, and it supports all of the data types in Spark SQL.

Spark SQL provides methods for reading data from and writing data to Parquet files.

Parquet is a columnar storage format for the Hadoop ecosystem. It has gained good adoption due to its highly efficient compression and encoding schemes, which deliver significant performance benefits. Its ground-up design allows it to be used regardless of the data processing framework, data model, or programming language in the Hadoop ecosystem: MapReduce, Hive, Pig, and Impala all provide the ability to work with Parquet data, and a number of data models such as Avro and Thrift have been extended to use Parquet as their storage format.

Parquet is widely adopted by a number of major companies, including tech giants in social media. To save data as a Parquet file, you can use a method such as people.saveAsParquetFile("people.parquet").

Example on Parquet file:





scala> val parquetFile = sqlContext.parquetFile("/home/sreekanth/SparkSQLInput/users.parquet")

parquetFile: org.apache.spark.sql.DataFrame = [name: string, favorite_hero: string, favorite_color: string]

scala> parquetFile.registerTempTable("parquetFile")

scala> parquetFile.printSchema

root

|-- name: string (nullable = false)

|-- favorite_hero: string (nullable = true)

|-- favorite_color: string (nullable = true)

scala> val selectedPeople = sqlContext.sql("SELECT name FROM parquetFile")

scala> selectedPeople.map(t => "Name: " + t(0)).collect().foreach(println)

OUTPUT:

Name: Alex

Name: Bob

scala> sqlContext.sql("SELECT name FROM parquetFile").show

+----+
|name|
+----+
|Alex|
| Bob|
+----+

How to Save the Data in a “Parquet File” format

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@hf0sf

scala> val dataFrame = sqlContext.read.load("/home/sreekanth/SparkSQLInput/users.parquet")

dataFrame: org.apache.spark.sql.DataFrame = [name: string, favorite_hero: string, favorite_color: string]

scala> dataFrame.select("name", "favorite_hero").write.save("nameAndFavHero.parquet")
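Note that read.load and write.save use Parquet as the default data source here, and the save call produces a directory named nameAndFavHero.parquet containing one or more part files plus metadata, rather than a single flat file.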