How to generate PPK from PEM and open AWS console




Many technical people find converting a PEM file to a PPK file confusing, so here it is broken down into steps that are easy to understand.

Log in to your AWS account with your credentials and click Launch at Step 7 (Review Instance Launch); a window then appears as shown in the image below.

Then choose whether to use an existing key pair or create a new one.

First, download the PEM file from the AWS account, whether you create a new key pair or use an existing one.

Here, choose an existing key pair, select its name, and tick the acknowledgement checkbox.

After that, launch the instance as per your requirement.

Download PuTTY Key Generator (PuTTYgen) from the official PuTTY website, then load the PEM file as shown in the snapshot below.

First load the PEM file, then click the Generate button.

Note: While generating, provide some randomness by moving the mouse over the blank area; otherwise, the PPK file will not be generated.

Then save the generated key with Save private key (or Save public key if you also need the public half); the private key is the PPK file that PuTTY uses.

After generating the PPK file, proceed to PuTTY.

Note: PuTTY Key Generator is only used to generate the key files; the actual connection is made with PuTTY.
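
If you prefer the command line, the same PEM-to-PPK conversion can be done with the puttygen tool on Linux (package names vary by distribution; on Debian/Ubuntu it is putty-tools); a minimal sketch, assuming an illustrative key name mykey.pem:

puttygen mykey.pem -O private -o mykey.ppk    # read the PEM private key, write a PuTTY .ppk private key

The resulting mykey.ppk can then be browsed from PuTTY exactly as described below.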

Open PuTTY, then enter the IP address and port number as per the machine details.

Here, give the IPv4 address or the complete hostname (check it with the "hostname" command on the Linux machine). Don't give the IPv6 address.



Next, in the Category pane, click the SSH option -> Auth -> Browse, and select the PPK file for authentication, as shown in the PuTTY snapshot below.

Here, SSH stands for Secure Shell, the key-based authentication protocol used for secure access to network services.

After selecting the SSH option, go to the Auth option; a Browse button appears, so simply browse to the PPK file and then click the Open button.

Finally, the command prompt (terminal) console opens; enter the username, and when the security alert asks Yes or No, click Yes.
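
As a side note, on Linux or macOS the PPK conversion is not needed at all; the PEM file can be used directly with the ssh client. A minimal sketch, assuming an Amazon Linux instance (default user ec2-user) and illustrative key and address values:

chmod 400 mykey.pem                          # the key must not be world-readable
ssh -i mykey.pem ec2-user@<public-ipv4-or-hostname>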


Launch AWS Instance




Here is how to launch a Free Tier Amazon Web Services instance in simple steps for beginners, how to connect to the machine, and how to generate the PEM file.

Step 1: Log in to your AWS account: click on AWS Management Console, then enter your credentials.

Step 2: Click on Launch Virtual Machine EC2 (Amazon Elastic Compute Cloud).

Step 3: In the Choose AMI step, go with the Free Tier eligible option, select Amazon Linux 2 AMI (HVM), SSD Volume Type, and then click the Select button.




Step 4: Choose the Instance Type. Here we selected General purpose, t2.micro (1 GB memory and 1 vCPU), and then clicked the Review and Launch button.

Note: If you don't need to configure the instance further, go directly to Review and Launch from the dashboard.

Step 5: Click Configure Instance to choose whether you need one or more instances, then click Next: Add Storage.

Step 6: Click Add Storage. It acts like the hard disk of the computer, so choose the size for the machine.

Step 7: Click Add Tags if you need them; otherwise, nothing needs to be configured here.

Step 8: Next, go to Configure Security Group to secure the machine. It provides strong security; choose the rule types you need, such as SSH or any other.

Step 9: Click the Review and Launch button, and the AWS Free Tier machine will launch directly.

Step 10: In the Select an existing key pair or create a new key pair dialog, select a key pair and give a specific name for the PEM file.

Step 11: Start the AWS Instance.

Step 12: After the machine launches successfully, go and check its status, and click on the Instance ID.

Step 13: After completing the above steps, start the Amazon Web Services instance and connect with the PEM file. From Windows we must first convert the PEM file into a PPK file with PuTTY Key Generator, then connect with PuTTY; it is simple to use.
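
The same launch can also be scripted with the AWS CLI; here is a minimal sketch, where the AMI ID, key pair name, and security group ID are illustrative placeholders that must be replaced with your own values:

aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t2.micro --key-name my-key-pair --security-group-ids sg-xxxxxxxx --count 1

# check the state of the launched instances
aws ec2 describe-instances --query 'Reservations[].Instances[].[InstanceId,State.Name]' --output table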


Prerequisites for MapR Installation on CentOS




In the Hadoop ecosystem, we mostly prefer three Big Data distributions:

1. Cloudera Distribution Hadoop

2. Hortonworks Data Platform

3. MapR Distribution Platform

Cloudera Distribution Hadoop comes as a free Express edition and an Enterprise edition with a trial of up to 60 days.

Coming to the Hortonworks Data Platform, it is a completely open-source platform for production, development, and testing environments.

Finally, the MapR distribution platform is a full enterprise edition, but MapR 3 is available as a free version with fewer features compared to MapR 5 and MapR 7.

How to install the MapR free version on a pseudo cluster:

Before installing MapR, configure the prerequisites as below:

——-Prerequisites——–

1. Configure the hostname as an FQDN (for example, mapr.hadoop.com) by using the setup command, and after that check your hostname using hostname -f

2. vi /etc/hosts (map each node's IP address to its FQDN)

3. hostname <your Fully Qualified Domain Name>

4. vim /etc/selinux/config ===> set SELINUX=disabled
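
The change in /etc/selinux/config only takes effect after a reboot; to check the current status and disable SELinux for the running session as well, the following commands can be used:

getenforce       # shows Enforcing, Permissive, or Disabled
setenforce 0     # switch to Permissive immediately (the config file covers the next boot)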

——-Disable Firewalls and IPTables——-

If firewalls and iptables are enabled, some ports are not allowed, so we must disable them.

1. service iptables save

2. service iptables stop

3. chkconfig iptables off

4. service ip6tables save

5. service ip6tables stop

6. chkconfig ip6tables off
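
To confirm that both services are stopped and will stay off after a reboot, a quick check such as the following can be run:

service iptables status
chkconfig --list iptables
chkconfig --list ip6tables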

—– Enable NTP service for machines —–

NTP (Network Time Protocol) is a networking protocol for clock synchronization between computers over packet-switched networks.

1. yum -y install ntp ntpdate ntp-doc

2. chkconfig ntpd on

3. vi /etc/ntp.conf

4. server 0.rhel.pool.ntp.org

5. server 1.rhel.pool.ntp.org

6. server 2.rhel.pool.ntp.org

7. ntpq -p

8. date (all machines must show the same date and time; otherwise, errors will occur)
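
Note that the steps above enable ntpd at boot but do not start it for the current session; starting the service and checking synchronization would look like this:

service ntpd start
ntpstat          # or use ntpq -p, as in step 7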


—— Install some additional packages in Linux OS —-

Here we will install Java 1.8 and Python:

1. yum -y install java-1.8.0-openjdk-devel

2. yum -y install python perl expect expectk

—- Set up passwordless SSH on all nodes from the master node ——

For passwordless authentication between the master and slave nodes:

1. ssh-keygen -t rsa

2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3. ssh-copy-id root@<FQDN> (repeat for each node, e.g. FQDN1, FQDN2)
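
A quick way to verify the setup is to run a remote command from the master node; if no password prompt appears, passwordless SSH is working (FQDN1 here stands for one of your node names):

ssh root@FQDN1 hostname -f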

—– Additional Linux configuration: Transparent Huge Pages (THP) and swappiness —-

1. echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

2. echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

3. sysctl -w vm.swappiness=10
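
These settings do not survive a reboot on their own; a common approach is to append the THP commands to /etc/rc.local and the swappiness value to /etc/sysctl.conf, for example:

echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled' >> /etc/rc.local
echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag' >> /etc/rc.local
echo 'vm.swappiness=10' >> /etc/sysctl.conf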

Set up the EPEL repository for installing additional packages on the system

Here we add the EPEL repository so that additional packages can be installed on the CentOS machine:

1. Download the EPEL release RPM:

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

2. Install it with rpm -Uvh:

rpm -Uvh epel-release-6-8.noarch.rpm



HBase Table (Single & Multiple) data migration from one cluster to another cluster



HBase single table migration from one cluster to another cluster:

Here are the simple steps to migrate a single HBase table from an existing cluster to a new cluster:

Step 1: First, export the HBase table data into an HDFS (Hadoop Distributed File System) path on the source cluster.

Step 2: After that, copy the exported HBase table data from the source cluster to the destination cluster using the distcp command (distcp is the standard command for copying data from one cluster to another).

Step 3: Then create the HBase table in the destination (target) cluster.

Step 4: After that, import the copied data from HDFS into the HBase table on the destination cluster.

Source Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Driver export <hbase_table_name> <source_hdfs_path>

2. hadoop distcp hdfs://<source_cluster_ipaddress>:8020/<source_hdfs_path> hdfs://<destination_cluster_ipaddress>:8020/<destination_hdfs_path>

Destination Cluster:

1. hbase org.apache.hadoop.hbase.mapreduce.Import <hbase_table_name> <hbase_table_hdfs_path>
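
Putting the four steps together, a worked example might look like the following; the table name my_table, the HDFS path, the cluster IP addresses, and the column family cf are all illustrative placeholders:

# on the source cluster: export the table to HDFS
hbase org.apache.hadoop.hbase.mapreduce.Export my_table /tmp/my_table_export

# copy the exported files to the destination cluster
hadoop distcp hdfs://10.0.0.1:8020/tmp/my_table_export hdfs://10.0.0.2:8020/tmp/my_table_export

# on the destination cluster: create the table, then import the data
echo "create 'my_table', 'cf'" | hbase shell
hbase org.apache.hadoop.hbase.mapreduce.Import my_table /tmp/my_table_export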

HBase multiple table migration from one cluster to another cluster:

Now that we know how to migrate a single HBase table, multiple-table migration from one cluster to another can be done in a simple manner with the steps below.

We use a pair of script files, so migrating multiple HBase tables is simply a matter of the following steps (a sketch of the export script is shown after the list):


Step 1: Place hbase-export.sh and hbase-table.txt on the source cluster.

Step 2: Place hbase-import.sh and hbase-table.txt on the destination cluster.

Step 3: List all the tables in the hbase-table.txt file.

Step 4: Create all the HBase tables on the destination cluster.

Step 5: Execute hbase-export-generic.sh on the source cluster.

Step 6: Execute hbase-import.sh on the destination cluster.
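
The scripts themselves are not included in this post; as a rough idea of what hbase-export.sh might contain, here is a minimal bash sketch that loops over the table names in hbase-table.txt and exports each one (the export path /tmp/hbase_export is an assumption, and hbase-import.sh would mirror it with the Import class):

#!/bin/bash
# hbase-export.sh (sketch): export every table listed in hbase-table.txt
while read -r table; do
  [ -z "$table" ] && continue    # skip empty lines
  hbase org.apache.hadoop.hbase.mapreduce.Export "$table" "/tmp/hbase_export/$table"
done < hbase-table.txt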



Summary: I tried HBase data migration from one cluster to another in a Cloudera Distribution Hadoop environment. Both single-table and multiple-table data migration are very simple for Hadoop administrators as well as Hadoop developers. The same steps apply to the Hortonworks distribution as well.

Replication Factor in Hadoop




How the replication factor comes into the picture:

The backup mechanism in a traditional distributed system:

The traditional backup mechanism did not provide high availability; such a system followed a shared architecture.

A file request first goes to the master node, and the file is divided according to the block size. This is a continuous process, but if node 1 (slave 1) fails, the work has to move to another node (slave 2).

Replication Factor:

The replication factor is the number of times the data is duplicated on different slave machines to achieve high availability of processing.

Replication is a backup mechanism, a failover mechanism, and a fault-tolerance mechanism.

In Hadoop, the default replication factor is 3. There is no need to configure it.

Hadoop 1.x :
Replication Factor is 3
Hadoop 2.x:
Replication Factor is also 3.

In Hadoop, the minimum replication factor is 1. This is possible for a single-node Hadoop cluster.

In Hadoop, the maximum replication factor is 512.

If the replication factor is 3, then a minimum of 3 slave nodes is required.

If the replication factor is 10, then 10 slave nodes are required.



Here is the simple rule for the replication factor:

'N' Replication Factor = 'N' Slave Nodes

Note: If the configured replication factor is 3 but only 2 slave machines are used, the actual replication factor is also only 2.

How to configure Replication in Hadoop?

It is configured in the hdfs-site.xml file.

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>5</value>
</property>
</configuration>
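
The replication factor can also be changed per file or per directory at runtime, without touching hdfs-site.xml, using the standard HDFS shell; for example (the path is illustrative):

hdfs dfs -setrep -w 3 /user/data/file.txt      # change the replication of one file and wait for it to complete
hdfs fsck /user/data/file.txt -files -blocks   # verify the replication of its blocks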

Design Rules Of Replication In Hadoop:

1. In Hadoop, replication applies only to data in the Hadoop Distributed File System (HDFS), not to the metadata.

2. Keep one replica per slave node, as per the design.

3. Replication happens only on the Hadoop slave nodes, not on the Hadoop master node (because the master node only manages the metadata on its own; it does not hold the data).

Only storage is duplicated in Hadoop, not processing, because processing is always unique.

Summary: In Hadoop, the replication factor plays a major role as the data backup mechanism. The default replication factor is always 3, except in a single-node cluster environment.


Blocksize in Hadoop



How data is stored on HDFS:

BLOCK:

A block is the individual storage unit on the Hadoop Distributed File System.

In Hadoop 1.x the default block size is 64 MB.

In Hadoop 2.x the default block size is 128 MB.

When a file request comes to the Hadoop cluster, these are the steps:

Step 1: Only the Hadoop master node receives the file request.

Step 2: Based on the block size configured at that time, the data is divided into a number of blocks.

How to configure “Blocksize” in Hadoop?

/usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value> <!-- example value: 128 MB, in bytes -->
</property>
</configuration>
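
The block size can also be overridden per file at write time, which is handy for testing; on Hadoop 2.x, where the property is called dfs.blocksize, a sketch with an illustrative path looks like this:

hdfs dfs -D dfs.blocksize=67108864 -put A.log /user/data/    # write A.log with a 64 MB block size
hdfs fsck /user/data/A.log -files -blocks                    # list the blocks the file was split into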

How to store data in HDFS:

Assume that we have A.log, B.log, and C.log files:

Scenario 1:

A.log -> 200 MB -> 200/64 -> 64 MB + 64 MB + 64 MB + 8 MB (the remaining data)

Scenario 2:

B.log -> 192 MB -> 192/64 -> 64 MB + 64 MB + 64 MB

Design Rules of Blocksize:

1. Irrespective of the file size: each and every file gets its own dedicated set of blocks in Hadoop.

2. Except for the last block: all the remaining blocks of a file hold an equal volume of data.

The Hadoop master node only looks at the block size at the time of dividing the data into blocks, not at the time of reading the data, because at read time only the metadata matters.


Apache SQOOP in Hadoop



Apache Sqoop:

Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases. It is mostly used to import/export data from an RDBMS to HDFS and vice versa. Sqoop works with relational databases such as Teradata, Oracle, MySQL, etc.

In short, Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Where is Sqoop used?

Developers feel that transferring data between relational database systems and HDFS is not the interesting part; the interesting work starts after the data is loaded into HDFS. Without a tool like Sqoop, they end up writing custom scripts to transfer data in and out of Hadoop.

If MapReduce programs had to do this job directly, the database server would experience a very high load from a large number of concurrent connections while the MapReduce programs were running, leading to performance issues.

Apache Sqoop makes this possible with a single command line. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

What does Sqoop do?

1. Imports sequential data sets from the mainframe, addressing the growing need to move data from mainframes to HDFS.

2. Data import: moves data from external stores into Hadoop to optimize the cost-effectiveness of combined data storage and processing.

3. Fast data copies: from external systems into Hadoop.

4. Parallel data transfer: faster performance and optimal system utilization.

5. Load balancing: offloads excessive storage and processing load to other systems.

Apache Sqoop latest version:

Latest stable release 1.4.7


Sqoop Architecture:

The Sqoop command submitted by the end user is parsed by Sqoop, which launches a Hadoop MapReduce job to import or export the data; Sqoop just imports and exports the data and does not do any aggregations. The map-only job launches multiple mappers, depending on the number defined by the user on the command line. Each mapper creates a connection to the database using JDBC and fetches the part of the data assigned to it.

SQOOP Architecture diagram:

Sqoop only imports and exports the data.
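
For reference, a typical Sqoop import and export pair might look like the following; the MySQL host, database, table names, and HDFS paths are illustrative placeholders:

# import a table from MySQL into HDFS using 4 parallel mappers
sqoop import --connect jdbc:mysql://dbhost/mydb --username dbuser -P --table employees --target-dir /user/hadoop/employees --num-mappers 4

# export the HDFS data back into another MySQL table
sqoop export --connect jdbc:mysql://dbhost/mydb --username dbuser -P --table employees_backup --export-dir /user/hadoop/employees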


From Human Neurons to Artificial Neurons continuation




After the simple neuron and the firing rule, the remaining topics are covered below:

3. Pattern Recognition:

After the simple neuron and firing rules, an important application of neural networks is pattern recognition. Pattern recognition can be implemented by using a feed-forward neural network that has been trained accordingly: during training, the network is trained to associate outputs with input patterns. When the network is used, it identifies the input pattern and tries to output the associated output pattern. The power of neural networks comes to life when a pattern that has no associated output is given as an input. In this case, the network gives the output that corresponds to the trained input pattern that is least different from the given pattern.

The example above is trained to recognize the patterns T and H. The associated output patterns are all black and all white, respectively.

(Figure: the T and H input patterns with their associated all-black and all-white outputs.)

Above, white squares are represented by 1 and black squares by 0; the truth tables for the 3 neurons after generalization are given below.

Top Neuron:

Middle Neuron:

Bottom Neuron:



From the above tables, it can be seen that the following associations can be extracted:

(Figure: an input pattern close to 'T' and its associated output.)

In this case, it is obvious that the output should be all black, since the input pattern is almost the same as the 'T' pattern.

(Figure: an input pattern close to 'H' and its associated output.)

In this case, it is obvious that the output should be all white, since the input pattern is almost the same as the 'H' pattern.

(Figure: an ambiguous input pattern and its associated output.)

In the case above, the top row is 2 errors away from a T and 3 from an H, so the top output is black. The middle row is 1 error away from both T and H, so the output is random.

The bottom row is 1 error away from T and 2 away from H. Therefore the output is black. The total output of the network is still in favor of the T shape.

4. A more complicated Neuron:

A more sophisticated neuron is the McCulloch and Pitts model. It differs from the previous model in that the inputs are 'weighted': the effect that each input has on decision making depends on the weight of that particular input.
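
To make the weighting concrete: the neuron adds up its weighted inputs and fires only if the sum reaches a threshold T. For a three-input neuron this looks like:

X = x1*w1 + x2*w2 + x3*w3
if X >= T, the neuron fires; otherwise it does not.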




From Human Neurons to Artificial Neurons




Going from human neurons to artificial neurons is a little tough to understand, but we build these neural networks by first trying to deduce the essential features of neurons and their interconnections, and then typically programming a computer to simulate these features. However, because our knowledge of neurons is incomplete and our computing power is limited, our models are necessarily gross idealizations of real networks of neurons.

The Neuron model:

An Engineering Approach


1. A Simple Neuron:

An artificial neuron is a device with many inputs and one output. The neuron has two modes of operation: the training mode and the using mode. In the training mode, the neuron can be trained to fire (or not) for particular input patterns. In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output of the artificial neuron. If the input pattern does not belong to the taught list of input patterns, the firing rule is used to determine whether to fire or not.

2. Firing Rules:

The firing rule is an important concept in neural networks and accounts for their high flexibility. A firing rule determines whether a neuron should fire for any given input pattern, not only for the input patterns on which the node was trained.

A simple firing rule can be implemented by using the Hamming distance technique.

The rule goes as follows: take the collection of training patterns for a node, some of which cause it to fire (the 1-taught set) and others which prevent it from doing so (the 0-taught set). A pattern not in the collection causes the node to fire if, on comparison, it has more input elements in common with the nearest pattern in the 1-taught set than with the nearest pattern in the 0-taught set. If there is a tie, the pattern remains in the undefined state.

Example: take a 3-input neuron that is trained to output 1 when the input (X1, X2, X3) is 101 or 111, and to output 0 when the input is 000 or 001. The resulting truth table is shown below:

As an example of applying the firing rule, take the pattern 010. It differs from 000 in 1 element, from 001 in 2 elements, from 101 in 3 elements, and from 111 in 2 elements. Therefore, the nearest pattern is 000, which belongs to the 0-taught set, so the rule requires that the neuron does not fire when the input is 010. The pattern 011, on the other hand, is equally distant from two taught patterns that have different outputs, and consequently its output stays undefined (0/1).


The difference between the two truth tables is called the generalization of the neuron. The firing rule therefore gives the neuron a sense of similarity and enables it to respond sensibly to patterns not seen during training.

Hive: SortBy Vs OrderBy Vs DistributeBy Vs ClusterBy



Sort By:

Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: if the column is of numeric type, the sort order is numeric; if the column is of string type, the sort order is lexicographical. SORT BY orders the data at each of the 'N' reducers, but each reducer can have overlapping ranges of data.

Output: N or more sorted files with overlapping ranges.

Example Query for SortBy

SELECT key, value FROM source SORT BY key ASC, value DESC

Order By:

This is similar to ORDER BY in the SQL language. In Hive, ORDER BY guarantees total ordering of the data, but for that it has to be pushed through a single reducer, which is normally unacceptable for large data sets; therefore, in strict mode (hive.mapred.mode=strict), Hive makes it compulsory to use LIMIT with ORDER BY so that the reducer doesn't get overwhelmed.

Ordering: total ordering of the data.

Output: a single output file, i.e., fully ordered.

Example Query for OrderBy

SELECT key, value FROM source ORDER BY key ASC, value DESC

Distribute By:

Apache Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY column values will go to the same reducer.



DISTRIBUTE BY ensures that each of the N reducers gets non-overlapping ranges of the column, but it does not sort the output of each reducer.
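
Following the pattern of the earlier sections, an example query for Distribute By (using the same illustrative table and columns):

SELECT key, value FROM source DISTRIBUTE BY key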

Example:

When we Distribute By x on the following 5 rows going to 2 reducers:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Cluster By:

CLUSTER BY is a combination of both DISTRIBUTE BY and SORT BY. CLUSTER BY x ensures that each of the N reducers gets non-overlapping ranges of x, then sorts by those ranges at the reducers.

Ordering: Global ordering between multiple reducers.

Output: N or more sorted files with non-overlapping ranges.
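
Again following the pattern of the earlier sections, an example query for Cluster By; note that CLUSTER BY key is shorthand for DISTRIBUTE BY key SORT BY key:

SELECT key, value FROM source CLUSTER BY key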

Example:

Referring to the same example as above, if we use CLUSTER BY x, the two reducers will further sort the rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4