Spark Performance Tuning with pictures





Spark Execution with a simple program:

A Spark application consists of a single driver process and a set of executor processes spread across the nodes of the cluster.

Spark performance bottlenecks are measured with blocked time analysis, using metrics from the big data environment. Spark works largely from an in-memory cache, so avoiding unnecessary network and disk I/O plays a key role in performance.

For example, take two clusters: one with 25 machines and 750 GB of data, and a second with 75 machines and 4.5 TB of raw data. Network communication is largely irrelevant to the performance of these workloads: network optimization reduces job completion time by only about 5%. Finally, keep the data serialized and compressed.

Apache Spark supports transformations with wide dependencies, such as groupByKey and reduceByKey. For these, Spark executes a shuffle, which transfers data around the cluster. The code below, by contrast, chains three operations whose outputs do not depend on data from other partitions:

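sc.textFile("/hdfs/path/sample.txt")
  .map(mapFunc)          // map: apply mapFunc to every line
  .flatMap(flatMapFunc)  // flatMap: another transformation, zero or more outputs per input
  .filter(filterFunc)    // filter: keep only the elements that pass filterFunc
  .count()               // count: the action that triggers execution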

The code above executes a single action, which depends on a sequence of transformations on an RDD derived from the sample.txt file. Because none of these transformations needs data from other partitions, the whole job runs in a single stage.

In contrast, the next example finds how many times each character appears in all the words that appear at least 1000 times in a given text file. The code in Scala:

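// Split every line of the input file into words
val token = sc.textFile(args(0)).flatMap(_.split(' '))
// Count the occurrences of each word
val wc = token.map((_, 1)).reduceByKey(_ + _)
// Keep only the words that appear at least 1000 times
val filtered = wc.filter(_._2 >= 1000)
// Count the characters of the remaining words
val charCount = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
charCount.collect()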

The code above breaks down into three stages. The reduceByKey operations result in stage boundaries, because computing their outputs requires repartitioning the data by key.
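To make the three stages concrete, here is a minimal plain-Python sketch of the same computation on an in-memory string, with no Spark involved. The function name and the min_count parameter are illustrative and not part of any Spark API; each Counter stands in for one reduceByKey, which is where Spark would shuffle.

```python
from collections import Counter

def char_counts_of_frequent_words(text, min_count=1000):
    # Stage 1: tokenize the text into words (flatMap + map in the Scala code).
    words = text.split(' ')
    # Stage 2: count each word (first reduceByKey), keep the frequent ones,
    # and explode them into characters. Each word contributes once.
    word_counts = Counter(words)
    frequent = [w for w, n in word_counts.items() if n >= min_count]
    chars = [ch for w in frequent for ch in w]
    # Stage 3: count each character (second reduceByKey) and return the
    # result to the caller (the collect step).
    return dict(Counter(chars))

print(char_counts_of_frequent_words("ab ab cd", min_count=2))  # {'a': 1, 'b': 1}
```

In Spark the two counting steps are the expensive ones, because all records with the same key must end up on the same executor.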

 

 

 

GCP VS AWS

A few years ago, few people knew about cloud computing, but now the world is moving to the cloud, and it plays a major role in the IT industry.

Cloud computing is a general term for the on-demand delivery of compute power, databases, applications, storage, and other IT services over the internet. But how do you decide which cloud service provider is best? Which provider is the cheapest, and which the most expensive? Which offers the widest range of services?

Today we compare Amazon Web Services (AWS) and Google Cloud Platform (GCP). The three most widely preferred cloud services today are Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

Here we will discuss the following services on Google Cloud Platform and Amazon Web Services:

The Major Difference between AWS and Google Cloud Platform

AWS:

Amazon Web Services is the most widely preferred cloud platform. AWS is the leader in cloud computing services, having offered IaaS (Infrastructure as a Service) since 2006. AWS has already built a powerful global network to provide virtual hosting for the IT industry.

Its data centers are linked by fiber and spread across this global network.

Amazon Web Services focuses mainly on security, automation, programmability, and so on. AWS CodeBuild is an extensible, fully managed build service that provides continuous integration and continuous delivery (CI/CD). It scales automatically and grows on demand, with customization for your needs. The details depend on the version you use.

AWS CodeDeploy: CodeDeploy delivers the working package to every instance outlined in pre-configured parameters, including EC2 instances and on-premises servers.




AWS Compute: Amazon EC2 (Elastic Compute Cloud), containers, AWS Batch, Auto Scaling, AWS Lambda, Amazon VPC.

AWS Storage: Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS), AWS Storage Gateway.

Database: Amazon DynamoDB, Amazon ElastiCache, Amazon Redshift.

Migration: AWS Migration Hub, AWS Database Migration Service, AWS Snowball.

Networking & Content Delivery: Amazon VPC (Virtual Private Cloud), Amazon CloudFront, Amazon Route 53.

Developer Tools: AWS CodeStar, AWS CodeBuild, AWS CodeDeploy, AWS X-Ray, AWS Tools & SDKs.

Management Tools: Amazon CloudWatch, AWS CloudTrail, AWS Config, AWS Managed Services, AWS Management Console.

Security, Identity & Compliance: AWS Identity and Access Management (IAM), Amazon Cloud Directory, Amazon Inspector, AWS Key Management Service.

Big Data, Machine Learning & Artificial Intelligence products:

AWS is widely used for building big data systems because of its integration with DevOps tools such as Kubernetes and Docker. AWS supports Hadoop distributions such as Hortonworks and Cloudera in big data environments, along with AWS Lambda, which suits big data analysis tasks. Its artificial intelligence services let developers add intelligence to applications through API calls to pre-trained services. Amazon Lex, built on the technology used for Amazon Alexa, provides advanced deep learning functionality for automatic speech recognition (ASR) and natural language understanding for building applications.

Google Cloud Platform:

Google Cloud Platform offers a wide variety of services and solutions built on the same software and hardware infrastructure that Google uses for YouTube and Gmail. GCP runs on one of the largest and most advanced computer networks and provides storage for applications, along with services such as Monitoring, Stackdriver Debugger, Stackdriver Logging, and Security Scanner.

Some management tools for the Google Cloud Platform environment are listed below:

Google Compute Engine: Compute Engine lets users launch virtual machines in the cloud. Its VMs boot quickly and come with persistent disk storage and consistent performance. Virtual servers are available in many configurations, including a range of machine sizes.

Google Cloud Deployment Manager: Deployment Manager lets you specify all the resources needed for your application. It is a tool for DevOps teams: it deploys many resources at one time, in parallel, and the deployment can be viewed in the Google Cloud Console in a hierarchical view.

GCP Cloud Console: The Cloud Console gives a detailed view of everything in your cloud platform: web applications, virtual machines, data analysis, datastore, networking, and developer services. It is scalable and helps diagnose production issues in applications.

Google Compute Engine instances go up to 96 vCPUs and 634 GB of RAM. Google Cloud storage disks support volume sizes up to 64 TB.




On the network side, each core on GCP is subject to a cap of 2 Gbit/s (Gbps) for peak performance. Additional cores raise the network cap, up to 16 Gbps per virtual machine. Google Cloud Platform provides a strong network in the cloud.

GCP secures applications with strong authentication factors. After reports that the NSA had infiltrated Google's data center connections, Google encrypts stored data as well as the traffic between data centers. On AWS, the Relational Database Service provides data encryption as an option across multiple availability zones.

  • Predictions and Facts:

Predictions put Infrastructure as a Service (IaaS) growth at 24% currently. It is clear that different parts of the cloud computing market are growing at different rates.

  • Market Share (AWS vs Google Cloud):

By current market share, the top competitors in cloud computing are AWS, Google Cloud Platform, and Microsoft Azure. AWS leads, with a high share compared to all other cloud services.

  • Service Comparison:

AWS and GCP offer comparable services in each category:

IaaS: Google Compute Engine in GCP; Amazon Elastic Compute Cloud (EC2) in AWS.

PaaS: Google App Engine in GCP; AWS Elastic Beanstalk in AWS.

Containers: Google Kubernetes Engine in GCP; Amazon Elastic Container Service (ECS) in AWS.

Serverless functions: Google Cloud Functions in GCP; AWS Lambda in AWS.

  • Storage Services:

File storage: Amazon Elastic File System (EFS) in AWS; ZFS/Avere in GCP.

Object storage: Amazon Simple Storage Service (S3) in AWS; Google Cloud Storage in GCP.

  • Management Services:

Monitoring: Amazon CloudWatch in AWS; Stackdriver Monitoring in GCP.

Deployment: AWS CloudFormation in AWS; Google Cloud Deployment Manager in GCP.

Pricing Comparison:

Google Cloud Platform is the clear winner on cost of services. First, GCP provides $300 of free-tier credit, valid for 12 months. AWS also offers free-tier services; beyond that, the cost depends on how much you spend on machines. For example, a GCP instance with 2 CPU cores and 8 GB of RAM is priced at $50 per month, while an AWS instance with the same configuration is priced at $69 per month.
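The arithmetic behind this comparison can be sketched in a few lines of Python. The sketch uses only the two monthly prices and the $300 credit quoted above; the function names are illustrative, and real bills depend on region, usage, and discounts.

```python
def saving_percent(cheaper, pricier):
    # Saving of the cheaper price relative to the pricier one, in percent.
    return (pricier - cheaper) / pricier * 100

def months_covered(credit, monthly_price):
    # Whole months a fixed credit covers at a flat monthly price.
    return credit // monthly_price

gcp_price, aws_price, gcp_credit = 50, 69, 300
print(round(saving_percent(gcp_price, aws_price), 1))  # 27.5
print(months_covered(gcp_credit, gcp_price))           # 6
```

At these prices the GCP instance is roughly a quarter cheaper, and the free credit would cover about six months of that instance.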

Summary: For support, Amazon Web Services offers AWS Documentation and AWS Forums, while Google Cloud Platform offers Google Cloud Documentation and Cloud Forums. For billing and pricing, use the AWS Simple Monthly Calculator and the Google Cloud Platform Pricing Calculator. Both are very good cloud computing services in the present market.

How to setup Google Cloud Free Tier Account with Pictures




How to Create Google Cloud Free Tier Account

Step 1: Open the Google Cloud console page to register. Log in with your Gmail account, then click the “TRY FOR FREE” button to get $300 of free credit for 12 months.

Step 2: There are two steps to verify the account. First, select the country where you are located. Check “I have read and agree to the Google Cloud Platform Free Trial Terms of Service”, then click “AGREE AND CONTINUE”.

Step 3: Next, fill in your account and contact information. Enter your permanent address and your payment details, using either a debit card or a credit card. Then click to activate the account.

Step 4: After activation, a welcome window appears. It thanks you for signing up for the 12-month free trial and states that Google has given you $300 in free trial credit to spend over up to 1 year. Click “GOT IT”.

Step 5: Click Compute Engine in the top-left corner of the Google Cloud page. Under Compute Engine you will see components such as VM instances, Instance groups, Disks, and Snapshots.

Step 6: Under VM instances, click the Create option. If you have existing VM details, click the Import option instead. The quick start option gives you more information about VM instances.

Enter a name for your instance and select a Region and Zone.

Select the number of CPU cores and choose the operating system you want.

Step 7: If you need firewall settings, allow HTTP or HTTPS traffic. Then click the Create button to create the instance.

Step 8: Check your VM instance details, including the Internal IP, the External IP, and the SSH option for connecting. To stop or delete the VM instance, use the options at the top of the VM page.






Summary: The Google Cloud free tier provides $300 of free credit for up to 12 months. To get it, you must first provide debit or credit card details, including the CVV. After that you can create Google Cloud VM instances for storage and processing. They are similar to AWS VMs, though some of their properties differ.

Hadoop Architecture vs MapR Architecture





In a big data environment, Hadoop plays a major role in storage and processing. MapR is a distribution that provides services for the Hadoop ecosystem. The Hadoop and MapR architectures differ somewhat at the storage level and in their naming conventions.

For example, in Hadoop a single storage unit is called a block, but in MapR it is called a container.
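As a rough sketch of what block-based storage means in practice, the following Python snippet estimates how many blocks a file occupies in HDFS and how much raw disk it needs once replicated. The 128 MB block size and the replication factor of 3 are the usual HDFS defaults in Hadoop 2 and later; they are assumptions here, not values taken from the text above, and the function name is illustrative.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    # Number of blocks: the last block may be only partially filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Raw storage: each replica stores only the actual bytes, so a
    # partial last block does not consume a full block on disk.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

print(hdfs_footprint(1024))  # (8, 3072)
print(hdfs_footprint(300))   # (3, 900)
```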

Hadoop VS MapR

Architecture-wise, the two differ as follows:

The Hadoop architecture is based on the master node (NameNode) and slave node (DataNode) concept. It uses HDFS for storage and MapReduce for processing.




The MapR architecture takes a native approach: it can use SAN, NAS, or HDFS-style approaches to store the metadata, and it accesses the SAN directly, with no need for a JVM. If a hypervisor or virtual machine crashes, data is pushed directly to the hard disk, and if a server goes down, the entire cluster re-syncs that data node's data. MapR has its own filesystem, the MapR File System, for storage. For processing it uses MapReduce in the background.

There is no NameNode concept in the MapR architecture. It relies entirely on the CLDB (Container Location Database), which holds a lot of information about the cluster. The CLDB is installed on one or more nodes for high availability.

This is very useful for failover, bringing recovery time down to just a few seconds.

In the Hadoop architecture, cluster size is specified for the master and slave nodes, whereas in MapR the CLDB default size is 32 GB in a cluster.




 

In Hadoop Architecture:

NameNode
Blocksize
Replication

 

In MapR Architecture:

Container Location Database (CLDB)
Containers
Mirrors

Summary: The MapR architecture is built on the same architecture as Apache Hadoop, including all the core components of the distribution. The big data environment has several distributions, such as Cloudera and Hortonworks, but MapR is an enterprise edition. MapR is a stable distribution compared to the others, and it provides default security for all services.

java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver in Spark Scala

When writing Apache Spark code in Scala or Python (PySpark) to read data from an Oracle database on Linux or Amazon Web Services, you will sometimes get the error below. It is related to spark.driver.extraClassPath and can occur in either the executor or the driver class.

Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:35)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anonfun$createConnectionFactory$1.api
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anonfun$createConnectionFactory$1.api
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anonfun$createConnectionFactory$(JdbcUtils.scala)
at <init>(<console>:46)
at .<init>(<console>:52)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)

Solution:

Here is a simple solution for this error in Spark Scala. The same error sometimes occurs in Python (PySpark) as well.

Add the required jars to both the executor and driver classpaths. First, edit the Spark defaults configuration file, spark-defaults.conf.

Add the two jar file paths below to spark-defaults.conf:

spark.driver.extraClassPath /home/hadoop/ojdbc7.jar
spark.executor.extraClassPath /home/hadoop/ojdbc7.jar

Make sure the jar version in the two paths above is compatible with your Spark version; otherwise you will get compatibility issues.

If these two classpath settings still produce an error, pass them with --conf when launching your Scala or PySpark program, specifying the driver and executor jar files as in the example at the bottom of the page.

If you get the same issue again, follow the solution below:

Step 1: Download the Oracle JDBC (ojdbc) jar file from the official Maven website.

Step 2: Copy the downloaded jar file into the shared Spark jars location below:

/usr/lib/spark/jars

For example, here is a PySpark code snippet:

 

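# Launch the PySpark shell with the Oracle JDBC jar on the driver classpath
pyspark --driver-class-path /home/hadoop/ojdbc7.jar --jars <jar file path>

from pyspark import SparkContext, SparkConf  # Spark context and configuration
from pyspark.sql import SQLContext  # SQL context

sqlContext = SQLContext(sc)  # sc is created by the pyspark shell

# Database connection to read data into Spark with the help of JDBC
dbconnection = sqlContext.read.format("jdbc").options(
    url="Give your jdbc connection url path",
    dbtable="Give your table name",  # placeholder: a JDBC read needs a table or query
    driver="oracle.jdbc.driver.OracleDriver").load()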
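The connection options can also be assembled in one place, which makes it easier to see that the driver class named in the error is the one being requested. The sketch below is plain Python and needs no Spark installation. The helper name, host, port, service name, table, and credentials are all hypothetical placeholders; the URL follows the Oracle thin driver's `jdbc:oracle:thin:@//host:port/service` form.

```python
def oracle_jdbc_options(host, port, service, table, user, password):
    # Build the options dict for a Spark JDBC read against Oracle.
    return {
        "url": f"jdbc:oracle:thin:@//{host}:{port}/{service}",
        "driver": "oracle.jdbc.driver.OracleDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }

# All values below are made-up examples, not real endpoints.
opts = oracle_jdbc_options("dbhost.example.com", 1521, "ORCL",
                           "HR.EMPLOYEES", "scott", "secret")
print(opts["url"])  # jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL

# With a live SparkSession named spark (not created here), the read would be:
# df = spark.read.format("jdbc").options(**opts).load()
```

If the jar is missing from the classpath, the read fails with the ClassNotFoundException shown earlier, whatever the options contain.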

 

Data Engineer Vs Data Science Vs Data Analyst





Nowadays the world runs on data. Data engineers are like builders on a construction site: they make data usable by data analysts and data scientists through APIs and data applications.

A data scientist is a researcher who uses data for advanced analysis, algorithms, data structures, and machine learning.

A data analyst translates data into business insights, using data visualization tools such as QlikView and Tableau.

Data Engineer:

  • Data engineering is the process of extracting raw data and preparing it for analysis by transforming it on its way from source to destination.
  • It requires strong knowledge of creating and integrating APIs, and an understanding of data-related queries and performance optimization.
  • A data engineer must have skills in data infrastructure, data warehouse management, extract-transform-load (ETL), reporting tools, etc.
  • Technical skills: Python, SQL, Java, ETL tools, Hadoop, Spark, big data environments, Tableau.

Data Science:

  • A data scientist analyzes and interprets complex data and must have skills in statistical modeling, machine learning, identifying actionable insights, mathematics, and data mining.
  • Technical skills: in-depth programming knowledge of Python, R, or SAS, and big data analytics.
  • Responsible for developing operational models and data analytics.

Data Analytics:

Data analytics involves collecting data, processing it, and performing statistical analysis on it. A data analyst takes numeric data and uses it to support better decisions. One of the tasks is to present insights as non-technical, actionable results. This calls for data modeling and reporting techniques, along with strong statistical analysis skills.

Put simply, it means getting business value from data through insights (translating data into business value).

Data Analytics = Data Engineering + Data Science

Pay Scale :

Average pay according to Glassdoor:

Data Engineer: $123,070/year

Data Scientist: $115,815/year

Data Analyst: $71,589/year

Summary: Data volumes have grown sharply compared to previous years, so it pays to build skills as a data engineer, data scientist, or data analyst, for growth in both knowledge and pay scale.




These three roles are emerging, sustainable roles in huge demand in the IT sector.

What is the difference between Instagram and Facebook




Instagram Vs Facebook

Instagram and Facebook are two of the most popular social networking applications. Each has its own pros and cons.

Instagram:

Instagram is a completely visual platform built around quality pictures and videos.

The medium is interactive and makes for easy scrolling.

For marketing, companies should advertise and promote their business on Instagram.

Instagram now has 1 billion monthly active users, and 60% of users log in daily, making it the second most engaged social networking application after Facebook.

Instagram is mainly used as a video and photo sharing service. It was acquired by Facebook in 2012.

On Instagram, #hashtags work best for connecting with a related community of users.

Instagram uses HTML5 and Python (Django framework) for development, and Java or Kotlin for its Android application.

Facebook:

Facebook can be considered a larger-scale social networking application than Instagram.

The basic functions on Facebook are more developed, including Facebook pages, text posts, events, groups, location updates, and extensive sharing of images and videos.

When it comes to marketing, companies should also advertise on Facebook.

Facebook is older than Instagram. Around 80% of students worldwide have Facebook accounts.

Facebook now has 2.37 billion monthly active users, and 65% of users log in daily, making it the most engaged social networking application.

Facebook is used to promote events and content to users, and it is more about textual content than Instagram.

On Facebook we can save posts, but on Instagram we can't.

Facebook uses the PHP, Java, and C++ programming languages for developing the application.




Summary: Facebook and Instagram are both owned by Mark Zuckerberg's company, headquartered in Menlo Park, California, United States. Both are among the most popular social networking applications in the world. Instagram is like one of the features of Facebook, while on Facebook you can post status updates such as videos and photos.

How to Become a Blockchain Developer : Skills | Roles & Responsibilities





Become a Blockchain Developer?

Blockchain is one of today's fastest-emerging technologies and a revolutionary one in the present market. The information in it is publicly available to everyone, and each block of data is highly secured by being chained to the others.

The basic uses of blockchain include creating digital identities, tracking assets, and monitoring supply chains. According to the social networking site LinkedIn, blockchain development was one of the top emerging jobs of 2018.




Blockchain developers fall into two basic types:

1. Core Blockchain developers – design the architecture of the blockchain

2. Blockchain developers – use that architecture to create applications

Both types of developer need some basic knowledge to become blockchain developers.

Below are the skills you should learn to become a blockchain developer, explained with simple concepts:

1. Data Structures:

This is a very basic concept for learning blockchain. You need a solid understanding of data structures and algorithms, because blockchain is quite complex to understand and develop. Blocks are themselves data structures, secured and strengthened by cryptographic techniques, so data structure concepts are essential.

The most relevant data structures are linked lists, binary trees, maps, and graphs. You should also build skill in a good programming language such as Java, Python, C, or C++.

2. Cryptography:

After data structures, move on to cryptography: public-key encryption and decryption, and digital signatures. You need basic knowledge of RSA and ECDSA, along with a solid knowledge of mathematics.

3. Networking:

Blockchain developers need to understand networking concepts such as peer-to-peer networks, routing, configurations, and topologies for the chain mechanism. To understand how nodes communicate and exchange information, learning the OSI model and its protocols is enough for blockchain development.

4. Distributed Systems:

A distributed system is a set of autonomous computers connected in a distributed environment to share resources and data within a single network. It provides reliability and transparency in the blockchain mechanism.

5. Smart Contracts:

Last but not least, smart contracts play a major role in blockchain development. A smart contract is a program that runs on the blockchain once a transaction is done. Smart contracts are enforced without bias and extend the blockchain's capabilities.
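To show how the data structure and cryptography skills above fit together, here is a minimal Python sketch of a hash-linked chain of blocks. It is a toy under stated assumptions: it uses SHA-256 from the standard library, has no networking, signatures, or consensus, and the function names and sample transactions are made up for illustration.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block

def block_hash(index, data, prev_hash):
    # Hash the block contents together with the previous block's hash.
    payload = json.dumps({"index": index, "data": data, "prev": prev_hash},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def build_chain(items):
    # Each block stores the hash of its predecessor, like a linked list.
    chain, prev = [], GENESIS_PREV
    for i, data in enumerate(items):
        h = block_hash(i, data, prev)
        chain.append({"index": i, "data": data, "prev": prev, "hash": h})
        prev = h
    return chain

def is_valid(chain):
    # Recompute every hash and check each link to the previous block.
    prev = GENESIS_PREV
    for block in chain:
        if block["prev"] != prev:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev"]):
            return False
        prev = block["hash"]
    return True

chain = build_chain(["alice pays bob 5", "bob pays carol 2"])
print(is_valid(chain))  # True
chain[0]["data"] = "alice pays bob 500"  # tamper with an old block
print(is_valid(chain))  # False
```

Changing any earlier block changes its hash, which breaks the link stored in every later block. That is why the chain structure makes tampering easy to detect.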




Summary: A blockchain developer's roles and responsibilities include coding in C/C++ or another programming language, web development, and applying cryptography, with strong mathematical knowledge for tracing chains.

WhatsApp New Features in 2019: Latest





WhatsApp, the Facebook-owned instant messenger app, has already launched specific features for both Android and iPhone users.

WhatsApp Messenger is introducing new features in 2019, including the following:

1. WhatsApp Dark Mode feature:

WhatsApp is working on Dark Mode. It is currently disabled by default on Android but has been spotted in various beta versions, which show how the dark color scheme will look in the “Profile Section”, Settings, and Status bar. The feature will be enabled once development is complete.

2. WhatsApp Fingerprint Feature for Android:

The WhatsApp Android beta has released the fingerprint feature, alongside two-factor authentication. Facebook and WhatsApp have been working on bringing this feature to users. It is currently disabled by default.
It can be enabled in Settings > Account > Privacy > Use Fingerprint to unlock.

3. Group Video Call:

Another amazing WhatsApp feature is the group video call with all the members of a group. This feature is already rolling out to some users in WhatsApp groups.

4. WhatsApp Privacy Settings for Group Invites:

WhatsApp is working on new privacy settings for group invites. This feature helps prevent fake news and spam group invites.

It can be enabled in WhatsApp > Settings > Account > Privacy > Groups.

5. WhatsApp Forwarding Info:

WhatsApp is working on a new forwarding info feature for the WhatsApp beta version. It shows how many times a message has been forwarded.

6. Consecutive Voice Messages:

This is a new feature of the WhatsApp Android beta. Consecutive Voice Messages automatically plays voice messages that were sent one after the other.

7. 3D Touch actions for WhatsApp status:

This is one of the best features of WhatsApp: 3D Touch actions for a contact's status. This feature may hide the fact that you have viewed a contact's status.

Summary: WhatsApp is a secure instant messenger with end-to-end encryption. The features above are the latest updates and will reach WhatsApp users soon.

MongoDB Error: The Program can’t start because MSVCP140.dll is missing from your computer.

Error:





“The program can’t start because MSVCP140.dll is missing from your computer. Try reinstalling the program to fix this problem.” This error appears while installing MongoDB on the Windows operating system.

Resolutions:

Solution 1:

Step 1: Uninstall MongoDB from your Windows machine.

Step 2: Clean junk files from Windows (using CCleaner, etc.).

Step 3: Remove all remaining MongoDB files from your system.

Step 4: Download the latest version of MongoDB from the official MongoDB website, along with Robo 3T if you need it.

Step 5: Install the .exe file using Run as administrator. After the MongoDB installation completes, restart the Windows machine.

If the error persists after these steps, try the second solution.

Solution 2:

This solution applies if DLL (Dynamic Link Library) files are missing from your Windows machine. Some applications depend on DLL files because they load external libraries from them. MSVCP140.dll is part of the Microsoft Visual C++ Redistributable, so installing that package from Microsoft is the safest way to restore it.

Step 1: Download the missing DLL file from the internet, using a trusted source, and copy it into the system location (C:\Windows\System32).

Step 2: After installing the missing DLL file on your local machine, try installing MongoDB or the other application again.

If it still does not work, go to the solution below.

Solution 3:

Step 1: Run the built-in System File Checker tool to find corrupted or missing files in the Windows operating system.

Step 2: Try to repair or reinstall MongoDB or the other application involved, such as Visual Studio.

Step 3: Copy the DLL file from another Windows system, restore it on your computer, and then re-register the DLL file.
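Before and after trying these solutions, it helps to confirm whether the DLL is actually present. The following is a small Python sketch that looks for a DLL by name in a list of directories. The function name is illustrative, and the two system directories are the usual Windows locations, assumed here rather than read from the system.

```python
import os

def find_dll(dll_name, search_dirs):
    # Return the directories from search_dirs that contain dll_name.
    # Windows file names are case-insensitive, so compare in lower case.
    found = []
    for directory in search_dirs:
        if not os.path.isdir(directory):
            continue
        names = {name.lower() for name in os.listdir(directory)}
        if dll_name.lower() in names:
            found.append(directory)
    return found

# Usual Windows system directories; on other systems they simply do not exist.
system_dirs = [r"C:\Windows\System32", r"C:\Windows\SysWOW64"]
print(find_dll("MSVCP140.dll", system_dirs))
```

An empty list means the file was not found in those directories, which is consistent with the error message above.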




Summary: On Windows, most applications are not self-contained and rely on separate files such as DLLs to run. If the Windows OS or the software cannot find a required DLL file, because it is missing or corrupted, you will receive this type of error: The program can’t start because MSVCP140.dll is missing from your computer.