Python: Introduction to NumPy and Creating Several Types of Arrays

What is NumPy?





NumPy is a Python module that contains classes, functions, variables, etc. for scientific calculations. NumPy is useful for creating and also processing single-dimensional and multi-dimensional arrays.

An array is an object that stores a group of elements of the same data type. For numbers, this means we can store only integers or only floats, but not a mix of an integer and a float. To work with NumPy, we need to import the numpy module.

Example 1:

import numpy
arr = numpy.array([1, 2, 3, 4, 5, 6, 7])
print(arr)
output:
[1 2 3 4 5 6 7]

Example 2:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr)
output:
[1 2 3 4 5 6 7]

Example 3:

from numpy import *
arr = array([1, 2, 3, 4, 5, 6, 7])
print(arr)

output:
[1 2 3 4 5 6 7]

Creating an array can be done in several ways:

1. Using the array() function

2. Using the arange() function

3. Using the zeros() and ones() functions

1. Using the array() function:

>>> arr = array([1, 2, 3, 4, 5, 6, 7], int)

>>> type(arr)

<class 'numpy.ndarray'>

>>> arr

array([1, 2, 3, 4, 5, 6, 7])

>>> arr = array([1.5, 2.4, 3.7, 4.3, 5.8, 6.9, 7.1], int)

>>> arr

array([1, 2, 3, 4, 5, 6, 7])

>>> arr = array([1.5, 2.4, 3.7, 4.3, 5.8, 6.9, 7.1], float)

>>> arr

array([1.5, 2.4, 3.7, 4.3, 5.8, 6.9, 7.1])

For strings, there is no need to specify the data type:

>>> arr = array(['a', 'b', 'c'])

>>> print(arr)

['a' 'b' 'c']
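
The array() function can also build multi-dimensional arrays when given nested lists. A minimal sketch with illustrative values:

from numpy import *

matrix = array([[1, 2, 3],
                [4, 5, 6]], int)   # 2 rows, 3 columns
print(matrix)        # [[1 2 3]
                     #  [4 5 6]]
print(matrix.shape)  # (2, 3)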

2. Using the arange() function:

arange(start, stop, step)
>>> from numpy import *
>>> a = arange(2, 11, 2)
>>> a
array([ 2,  4,  6,  8, 10])

3. Creating an array using zeros() and ones() functions:

zeros(n, datatype) --> creates an array with all zeros
ones(n, datatype)  --> creates an array with all 1's
>>> from numpy import *
>>> a = zeros(5, int)
>>> a
array([0, 0, 0, 0, 0])
>>> b = ones(5, int)
>>> b
array([1, 1, 1, 1, 1])

Mathematical Operations on arrays:

>>> arr = array([1, 2, 3, 4, 5])
>>> arr1 = arr + 5
>>> arr1
array([ 6,  7,  8,  9, 10])
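
These operations are applied element-wise, and the same holds when combining two arrays of equal length. A short sketch with illustrative values:

from numpy import *

a = array([1, 2, 3, 4, 5])
b = array([10, 20, 30, 40, 50])
print(a + b)   # [11 22 33 44 55]        element-wise addition
print(a * b)   # [ 10  40  90 160 250]   element-wise multiplication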




How to Find a Wi-Fi Password on Windows 10, with Pictures

In the Windows 10 operating system, finding the Wi-Fi password is a bit difficult because there are many options to go through before you reach it.




Here are the simple steps to find the password, explained step by step:

Step 1: First, go to the desktop of your PC, then look at the right corner of the bottom panel, where the Wi-Fi symbol is shown. Right-click on the Wi-Fi symbol and choose “Open Network & Internet Settings”.

Step 2: After clicking Open Network & Internet Settings, the Network & Internet options are shown. Click the “Status” option and then select “Network and Sharing Center”.

Step 3: In the Network and Sharing Center, click on “Wi-Fi (Internet)”. If nothing is shown there, your Wi-Fi is not connected and nothing will appear; please connect to your Wi-Fi network first.

Step 4: Select “Wireless Properties” to see the wireless information, such as connection and security details. Don’t open “Properties”, because that tab shows only IPv4 and IPv6 information.

Step 5: After clicking Wireless Properties, go to the “Security” tab. It shows the security type and the “Network security key”; below it there is a “Show characters” checkbox. Click that checkbox and your Wi-Fi password will be shown.

After that, click “OK” and close all the windows.
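
As a quicker alternative, the saved password can also be read from an elevated (administrator) Command Prompt; the profile name below is a placeholder for your own network's SSID:

netsh wlan show profile name="YourNetworkName" key=clear

The password is shown in the “Key Content” field under the security settings.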




For any password, choose a strong one that includes special characters, numbers, and capital letters for security purposes. In the Windows 7 operating system it is easy to find the Wi-Fi password, but in Windows 10 it is a little more difficult to pick the right option without knowing where to look.

What is Heartbeat in Hadoop? How to resolve Heartbeat lost in Cloudera and Hortonworks

Heartbeat in Hadoop:





In the Hadoop ecosystem, the heartbeat is the communication between the Namenode and the Datanodes. It is a signal sent by each Datanode to the Namenode at a regular interval. If a Datanode in HDFS does not send a heartbeat to the Namenode for around 10 minutes (the default), the Namenode considers that Datanode unavailable.

The default heartbeat interval is 3 seconds; it is set with the dfs.heartbeat.interval property in the hdfs-site.xml file in the Hadoop installation directory.
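
As a sketch, the property goes into hdfs-site.xml like this (3 seconds is the default value):

<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>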

What is Heartbeat lost:

In the Hadoop ecosystem, when a Datanode does not send a heartbeat to the Namenode for around 10 minutes (the default), the Namenode considers that Datanode unavailable; this is known as “Heartbeat lost”.

How to resolve Heartbeat lost:

In a Big Data distribution environment, let us take Hortonworks (HDP) first.

In Hortonworks:
1. In HDP, check the Ambari agent status, i.e. whether it is running or not, by using “ambari-agent status”.
2. If it is not running, check the log files for the Ambari server and the Ambari agent in the /var/log/ambari-server and /var/log/ambari-agent directories.

3. Follow the below steps:

A) Stop the Ambari server: ambari-server stop
B) Stop the ambari-agent service on all nodes: ambari-agent stop
C) Start the ambari-agent service on all nodes: ambari-agent start
D) Start the Ambari server: ambari-server start

Cloudera:

1. First, check the Cloudera SCM agent status, i.e. whether it is running or not, by using “sudo service cloudera-scm-agent status”.





2. Check the agent log files in the /var/log/cloudera-scm-agent/ directory.

3. Then run the below commands as the root user:

sudo service cloudera-scm-agent status
sudo service cloudera-scm-agent stop
sudo service cloudera-scm-agent start

Summary: Hadoop follows a master/slave architecture: the master node stores the metadata and the slave nodes store the actual data. The regular signal sent from the Datanode to the Namenode is called a “Heartbeat”; if it fails, this is simply called “Heartbeat lost”, which means the Datanode is unavailable. The steps above show, step by step, how to resolve this issue for Big Data distributions like Hortonworks (HDP) and Cloudera (CDH).

Hadoop job (YARN Staging) error while executing simple job

In a Hadoop ecosystem, a number of jobs execute in a short span of time. I was trying to execute a Hive job for data validation on the Hive server in production. While executing the Hive job from the Hive command line, I got the following error:



at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
22:33:33 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging//.staging/job_1562044010976_0003
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hadoop-yarn/staging//.staging/job_1562044010976_0003/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

The above error is a connection error to the Datanode raised while executing the job; at that time the Datanode was not running properly. The resolution for this issue is below:

Stop and then start all services:

stop-all.sh
start-all.sh

This restarts all services, including the Namenode, Secondary Namenode, and Datanodes, as well as the remaining services like Hive, Spark, etc.

If it still shows this type of error, then start the distributed file system:

start-dfs.sh

Check all the Hadoop daemons, like the Namenode, Secondary Namenode, Datanode, Resource Manager, Node Manager, etc., by using the below command:

jps

Then check the information for all nodes by using “hadoop dfsadmin -report” to see whether the Datanodes are running fine or not.

The above steps apply to the local, pseudo-distributed, and standalone modes of the Hadoop ecosystem only.

For the Cloudera, Hortonworks, and MapR distributions, simply “Restart” the DataNodes and services like Hive, Spark, etc.




Summary: In a Big Data environment we execute many jobs (Hadoop/Spark/Hive) to get results, but sometimes they fail with the above error. It can leave you stuck, but the simple solution is described above.

Most Frequently Asked Hadoop Admin Interview Questions for Experienced Candidates

Hadoop Admin interview Questions for Experienced

1. What is the difference between missing and corrupt blocks in Hadoop 2.0, and how do you handle them?

Missing block: A missing block means that there are no replicas of the block anywhere in the cluster.

Corrupt block: A corrupt block means that HDFS cannot find any replica with intact data; all of the block's replicas are corrupted.




How to handle it:
Use the below commands.

To find out which file is corrupted and to remove the file:

A) hdfs fsck /
B) hdfs fsck / | egrep -v '^\.+$' | grep -v eplica
C) hdfs fsck /path/to/corrupt/file -files -blocks -locations
D) hdfs dfs -rm /path/to/file

2. What is the reason behind keeping the ZooKeeper count an odd number?

Because ZooKeeper elects a leader based on the agreement of more than half of the nodes in the cluster. With an even number of ZooKeepers it is no easier to form that majority, so the ZooKeeper count should be an odd number; for example, 5 nodes tolerate 2 failures, while 6 nodes still tolerate only 2.

3. Why is ZooKeeper required for Kafka?

Apache Kafka uses ZooKeeper, so the ZooKeeper server needs to be started first. ZooKeeper is used to elect the Kafka controller and to store topic configuration.

4. What is the retention period of Kafka logs?

When a message is sent to a Kafka cluster, it is appended to the end of a log. The message remains on the topic for a configurable period of time; during this period Kafka keeps the log segment, and this is called the retention period of the Kafka log.
It is defined by log.retention.hours (168 hours, i.e. 7 days, by default).

5. What is the block size in your cluster, and why is a 54 MB block not recommended?

It depends on your cluster; the Hadoop standard block size is 64 MB (128 MB from Hadoop 2 onwards), and a non-standard size such as 54 MB only increases the number of blocks and the metadata the Namenode must track.

6. Suppose a file is 270 MB and the block size on your cluster is 128 MB. How many blocks are created? If there are 3 blocks of 128 MB + 128 MB + 14 MB, is the remaining space in the third (14 MB) block wasted, or can other data be appended there?

7. What are the FS image and Edit logs?

FS image: In a Hadoop cluster, the entire file system namespace, the file system properties, and the block map of all files are stored in one image called the FSImage (File System image). The Edit logs record every change (transaction) made to the file system since the last FSImage was written.

8. What is your action plan if PostgreSQL or MySQL goes down on your cluster?

First, check the log file to see what the error is, and then find the solution.

For example: if PostgreSQL reports a “connection bad” error.
Solution: First check the status of the PostgreSQL service:

sudo systemctl status postgresql

Then stop the PostgreSQL service:

sudo systemctl stop postgresql

Then enable and start the PostgreSQL service again, making sure pg_ctlcluster runs with the right user and permissions:

sudo systemctl enable postgresql
sudo systemctl start postgresql

9. If both Namenodes are in standby mode, do running jobs keep running or do they fail?

10. What is the Ambari port number?

By default, the Ambari port number is 8080, which is used to access the Ambari web UI and the REST API.




11. Is your Kerberized cluster using LDAP or Active Directory?

It depends on your project; explain whether it uses LDAP integration or Active Directory.

Spark Lazy Evaluation and Advantages with Example

Apache Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data (Big Data). Spark provides a simpler programming model than MapReduce: developing a distributed data processing application with Apache Spark is a lot easier than developing the same application with MapReduce. Hadoop MapReduce provides only two operations for processing data, “Map” and “Reduce”, whereas Spark comes with more than 80 data processing operations for big data applications.




While processing data from source to destination, Spark can be up to 100 times faster than Hadoop MapReduce because it allows in-memory cluster computing and implements an advanced execution engine.

What is meant by Apache Spark Lazy Evaluation?

In Apache Spark, there are two types of RDD operations:

I) Transformations

II) Actions

We can define new RDDs at any time, but Apache Spark computes them only lazily, that is, the first time they are used in an action. Lazy evaluation seems unusual at first, but it makes a lot of sense when you are working with large data (Big Data).

Put simply, lazy evaluation in Spark means that execution will not start until an action is triggered; in Apache Spark, lazy evaluation comes into the picture when a transformation occurs.

Consider a case where we define a text file and then filter the lines that include the “CISCO” client name. If Apache Spark were to load and store all the lines in the file as soon as we wrote lines = sc.textFile(filePath), the Spark context would waste a lot of storage space, given that we then immediately filter out many lines. Instead, once Spark sees the whole chain of transformations, it can compute just the data needed for the result. Hence, for a first() action, Apache Spark scans the file only until it finds the first matching line; it doesn’t even read the whole file.
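
A minimal sketch of this behaviour in PySpark (the file path and the “CISCO” filter are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="LazyEvaluationDemo")

# Transformations only build up the lineage; nothing is read yet
lines = sc.textFile("/hdfs/path/sample.txt")
cisco_lines = lines.filter(lambda line: "CISCO" in line)

# The action triggers execution; Spark stops at the first matching line
print(cisco_lines.first())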

Advantages Of Lazy Evaluation in Spark Transformations:

Some advantages of lazy evaluation in Spark are below:

  • Increases manageability: With Spark lazy evaluation, users can divide the program into smaller operations, and Spark reduces the number of passes over the data by grouping transformations.
  • Increases speed: Lazy evaluation in Spark saves round trips between the driver and the cluster, which speeds up the process.
  • Reduces complexity: The two complexities of any operation are time and space complexity, and using Spark lazy evaluation we can overcome both, because an action is triggered only when the data is required.

Simple Example:

In Spark, lazy evaluation is illustrated by the below code written in Scala, where the expression is evaluated when it is used rather than when it is declared.

With Lazy:

scala> val sparkList = List(1, 2, 3, 4)

scala> lazy val output = sparkList.map(x => x * 10)

scala> println(output)

Output:

List(10, 20, 30, 40)



Latest: Hadoop Admin Interview Questions for 3 to 15 Years of Experience

Nowadays, Hadoop administration is one of the emerging skills. The questions below are mid-level interview questions:





1. Explain your projects according to your resume and the different types of distributions you used.

2. Explain High Availability for the Namenode.

3. Explain Kerberos, Ranger, and Knox, scenario based.

4. Questions about any scripting language you know, like Python or shell scripting.

5. Difference between the Namenode and CLDB (Container Location Database) in MapR.

6. How many ZooKeepers are used in your project? Why is it always an odd number? Can you please explain?

7. How do you resolve a heartbeat issue, and can you explain the process to resolve it?

8. Which issue did you recently resolve in the cluster, for example with Hive or the HBase Master, and how did you resolve it?

9. Difference between Cloudera, MapR, and Hortonworks with examples?

10. Why was the Secondary Namenode concept introduced into Hadoop? Explain.

11. Explain the step-by-step process of a Hortonworks installation (no need to explain the prerequisites).

Latest PL/SQL (Mandatory) Interview Questions for Freshers/Experienced

Latest PL/SQL interview questions asked in the technical round, for all levels.



PL/SQL Interview Questions for Freshers/Experienced:

1. How can you reduce query execution time in SQL tuning?

2. What is the major difference between stored procedures (SPs) and triggers?

3. Explain analytical functions in PL/SQL.

4. Which one executes faster, TRUNCATE or DELETE?

5. Coming to anchored declarations, explain %TYPE and %ROWTYPE.

6. Explain the different cursor attributes.

7. Write a query to select duplicate values from a table?

8. Difference between ROWNUM and ROWID, with an example?

9. Can you explain the difference between the implicit cursor and explicit cursor?
10. Write a query to get the first 10 records from a table.

11. Explain table type variables.

12. Can you explain the difference between Analytical and aggregate functions?

 

Spark Performance Tuning with pictures





Spark Execution with a simple program:

Basically, a Spark program consists of a single Spark driver process and a set of executor processes spread across the nodes of the cluster.

Performance tuning of Spark is about finding bottlenecks, measured using Big Data environment metrics such as block-time analysis. Spark runs on an in-memory cache, so avoiding unnecessary network and disk I/O plays a key role in performance.

For example, take two clusters: one cluster with 25 machines and 750 GB of data, and a second cluster with 75 machines and 4.5 TB of raw data. Network communication was largely irrelevant to the performance of these workloads; network optimization reduced job completion time by only about 5%. Finally, the data was serialized and compressed.

Apache Spark supports transformations like groupByKey and reduceByKey, which have shuffle dependencies: Spark executes a shuffle, which transfers data around the cluster. The operations below form a simple pipeline:

sc.textFile("/hdfs/path/sample.txt")
  .map(mapFunc)          # map transformation
  .flatMap(flatMapFunc)  # flatMap is another transformation
  .filter(filterFunc)    # filter transformation
  .count()               # the count() action triggers execution

The above code executes a single action, count(), which depends on a sequence of transformations on an RDD derived from the sample.txt file.

Suppose the code needs to count how many times each character appears in all the words that appear more than 1,000 times in a given text file. Below is the code in Scala:

val tokens = sc.textFile(args(0)).flatMap(_.split(' '))
val wordCounts = tokens.map((_, 1)).reduceByKey(_ + _)
val filtered = wordCounts.filter(_._2 >= 1000)
val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
charCounts.collect()

The above code breaks down into three main stages; the reduceByKey operations result in stage boundaries.

 

 

 

GCP VS AWS

A few years ago, hardly anyone knew about cloud computing, but now the world has moved to cloud computing and it plays a major role in the IT industry.

Cloud computing is a general term for the on-demand delivery of compute, databases, applications, storage, and many other IT services through the internet. But how do you decide which cloud service provider is best? Which cloud service provider is the cheapest or the most expensive? Which has the types of services you need?

Today we compare Amazon Web Services (AWS) and Google Cloud Platform (GCP). Nowadays, the three most preferred cloud services are Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

Here we will discuss the following services on Google Cloud Platform and Amazon Web Services:

The Major Difference between AWS and Google Cloud Platform

AWS:

Amazon Web Services is the most preferred cloud platform; AWS is the leader in cloud computing services because it has been offering IaaS (Infrastructure as a Service) since 2006. AWS has already built a powerful global network to provide virtual hosts for the IT industry.

Its data centers are fiber-linked and arranged all over this global network.

Amazon Web Services mainly focuses on security, automation, programmability, etc. AWS CodeBuild is an extensible, fully managed build service that provides continuous integration and continuous delivery (CI/CD). It helps with automatic scaling, grows on demand with your customization, and supports different versions.

AWS CodeDeploy: CodeDeploy delivers the working package to every instance according to pre-configured parameters, including EC2 instances and on-premises servers.




AWS Compute: Amazon EC2 (Elastic Compute Cloud), containers, AWS Batch, Auto Scaling, AWS Lambda, Amazon VPC.

AWS Storage: Amazon Simple Storage Service(S3), Elastic Block storage, AWS Storage Gateway.

Database: Amazon DynamoDB, Amazon ElastiCache, Amazon Redshift.

Migration: AWS  Migration Hub, AWS Database Migration Service, AWS Snowball.

Networking & Content Delivery: Amazon VPC (Virtual Private Cloud), CloudFront, Route 53.

Developer Tools: AWS CodeStar, AWS CodeBuild, AWS CodeDeploy, AWS X-Ray, AWS Tools & SDKs.

Management Tools: Amazon CloudWatch, AWS CloudTrail, AWS Config, AWS Managed Services, AWS Management Console.

Security, Identity & Compliance: AWS Identity and Access Management(IAM), Amazon Cloud Directory, Amazon Inspector, AWS Key Management Service.

Big Data and Machine Learning  & Artificial intelligence products:

Amazon Web Services (AWS) is widely used for building Big Data systems due to its integration with DevOps tools like Kubernetes, Docker, etc. AWS supports Hadoop platforms like the Hortonworks and Cloudera distributions in a Big Data environment, along with AWS Lambda, which is a good match for Big Data analysis tasks. Its artificial intelligence services give developers the ability to add intelligence to applications through an API call to pre-trained services; for example, Amazon Lex offers the same advanced deep learning functionality for ASR (automatic speech recognition) and natural language understanding that is used for Amazon Alexa, enabling developers to build conversational applications.

Google Cloud Platform:

Google Cloud Platform offers a wide variety of services and solutions on the same software and hardware infrastructure that Google uses for YouTube and Gmail. GCP has one of the largest and most advanced computer networks, storage for applications, and services such as Stackdriver Monitoring, Stackdriver Debugger, Stackdriver Logging, and Security Scanner.

Some management tools for the Google Cloud Platform environment are described below:

Google Compute Engine: Google Compute Engine allows users to launch virtual machines in the cloud. Its VMs boot quickly, come with persistent storage, and offer consistent performance. Virtual servers are available in different configurations, including various machine sizes.

Google Deployment Manager: Google Cloud Deployment Manager lets you declare all the resources needed for your application, which makes it useful for DevOps teams. You can deploy many resources at one time, in parallel, and view them in a hierarchical view in the Google Cloud Console.

GCP Cloud Console: The GCP Cloud Console gives a detailed day-to-day view of the Cloud Platform, covering web applications, virtual machines, data analysis, data store, networking, and developer services. It is scalable and helps diagnose production issues in applications.

Google Cloud Platform offers Compute Engine instances with up to 96 vCPUs and 634 GB of RAM, and storage/disk volumes with sizes up to 64 TB.




Coming to network information, GCP is subject to a cap of 2 Gbits/second (Gbps) for better performance, which may increase up to 16 Gbps; the Google Cloud Platform provides a very good network system in the cloud.
GCP provides security for applications with strong authentication factors. After reports that the NSA had infiltrated data center connections, Google encrypts stored data as well as the traffic between its data centers. A relational database service is also available that provides data encryption as an option across multiple availability zones.

  • Predictions and Facts:

It is predicted that Infrastructure as a Service (IaaS) is currently growing at around 24%. It is clear nowadays that the market for cloud computing is growing at different rates across providers.

  • Market Share (AWS vs Google Cloud):

In terms of present market share, the top competitors in cloud computing are AWS, Google Cloud Platform, Microsoft Azure, etc. Comparatively, AWS's share is high compared to all the other cloud services.

  • Service Comparison:

Coming to service comparisons between Google Cloud and AWS, here are the various services offered by each.

IaaS in GCP is Google Compute Engine; in AWS it is Amazon Elastic Compute Cloud (EC2).

PaaS in GCP is Google App Engine; in AWS it is AWS Elastic Beanstalk.

Containers in GCP run on Google Kubernetes Engine; in AWS, on Amazon Elastic Container Service.

Finally, serverless functions in GCP are Google Cloud Functions; in AWS, AWS Lambda.

  • Storage Services:

Coming to storage services, file storage in AWS is Amazon Elastic File System; in GCP it is ZFS/Avere.

Object storage in AWS is Amazon Simple Storage Service (S3); in GCP it is Google Cloud Storage.

  • Management Services:

For monitoring, AWS uses Amazon CloudWatch, while GCP uses Stackdriver Monitoring.

For deployment, AWS uses AWS CloudFormation, while GCP uses Google Cloud Deployment Manager.

Pricing Comparison:

Google Cloud Platform is the clear winner on the cost of services. First, GCP provides a $300 credit on the free tier account for 12 months. AWS also provides a low-cost one-month free tier, depending on how much you spend on machines. For example, a 2-CPU-core, 8 GB RAM instance on GCP is priced at about $50 per month, while an AWS instance with the same configuration is priced at about $69 per month.

Summary: Amazon Web Services offers support through AWS documentation and AWS Forums. Google Cloud Platform provides support through Google Cloud documentation and Cloud forums. For billing and pricing, use the AWS Simple Monthly Calculator and the Google Cloud Platform Pricing Calculator. Both are very good cloud computing services in the present market.