Hadoop Architecture vs MapR Architecture





In a Big Data environment, Hadoop plays a major role in storage and processing. MapR is a distribution that provides services for the Hadoop ecosystem. The Hadoop and MapR architectures differ mainly at the storage level and in their naming conventions.

For example, in Hadoop a single storage unit is called a block, whereas in MapR it is called a container.

Hadoop VS MapR

Architecturally, the two differ as follows:
The Hadoop architecture is based on a master node (NameNode) and slave nodes (DataNodes). It uses HDFS for storage and MapReduce for processing.




The MapR architecture takes a native approach: storage can be accessed in a SAN, NAS, or HDFS style, and data is written directly to disk without going through a JVM. If a hypervisor or virtual machine crashes, data is still pushed directly to the hard disk, and if a server goes down, the cluster re-syncs that node's data. MapR has its own file system, the MapR File System, for storage, and it uses MapReduce for processing in the background.

There is no NameNode in the MapR architecture. It relies entirely on the CLDB (Container Location Database), which holds a lot of information about the cluster. The CLDB is installed on one or more nodes for high availability.

This is very useful for failover, because it keeps recovery time down to just a few seconds.

In the Hadoop architecture, the cluster size is specified for the master and slave nodes, whereas in MapR the default CLDB size is 32 GB per cluster.




 

In Hadoop Architecture:

NameNode
Blocksize
Replication

 

In MapR Architecture:

Container Location DataBase
Containers
Mirrors

Summary: The MapR architecture is built on the same architecture as Apache Hadoop and includes all the core components of the distribution. The Big Data space has several distributions, such as Cloudera and Hortonworks, but MapR is an enterprise edition. MapR is a stable distribution compared with the others, and it provides default security for all services.

Complete MapR Installation on a Linux Machine

Once the prerequisite setup is complete, go straight through the actual steps for installing MapR on a Linux machine.

Actual steps for MapR installation:

Step 1:  fdisk -l

A powerful and widely used command that lists the disk partition tables.

Step 2: cat /etc/yum.repos.d/mapr_ecosystem.repo

Display the MapR ecosystem repo file to confirm it is installed and up to date

Step 3: cat /etc/yum.repos.d/mapr_installer.repo

Display the MapR installer repo file to confirm it is installed and up to date

Step 4:  cat /etc/yum

Check the yum repository configuration

Step 5: cat /etc/yum.repos.d/mapr_core.repo

Display the MapR core repo file to confirm it is installed and up to date

Step 6: yum clean all

Clean the cached yum metadata and packages

Step 7: yum update

Update the installed packages

Step 8: yum list | grep mapr

List the available packages and filter for MapR packages with grep

Step 9: rpm --import http://package.mapr.com/releases/pub/maprgpg.key

Import the MapR public GPG key

Step 10: yum install mapr-cldb mapr-fileserver mapr-webserver mapr-resourcemanager mapr-nodemanager mapr-nfs mapr-gateway mapr-historyserver

Install the MapR CLDB, file server, web server, ResourceManager, NodeManager, NFS, gateway, and history server with the single command above.

Step 11: yum install mapr-zookeeper

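Install MapR ZooKeeper, which coordinates the cluster services

Step 12: ls -l /opt/mapr/roles

List the installed MapR roles

Step 13: rpm -qa | grep mapr

List the installed MapR packages

Step 14: id mapr

Verify that the mapr user exists and show its user and group IDs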

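Step 15: hostname -i

Check the host's IP address (hostname -f shows the fully qualified domain name)

Step 16: /opt/mapr/server/configure.sh -N training -C 192.0.0.0 -Z 192.0.0.0:5181

Configure the node with the cluster name (-N), the CLDB node (-C), and the ZooKeeper node and port (-Z), using your own IP address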
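The configure.sh call in Step 16 can be wrapped so that the cluster name and hosts live in variables. The following is only a minimal sketch: the function name build_configure_cmd is hypothetical, the sample values are the placeholders from Step 16, and the function prints the command (a dry run) instead of executing it.

```shell
#!/bin/sh
# Hypothetical helper (not part of MapR): print the configure.sh command
# for a single-node cluster so it can be reviewed before running it as root.
# Arguments: cluster name, CLDB host, ZooKeeper host.
# Port 5181 is the ZooKeeper port used in Step 16.
build_configure_cmd() {
    cluster_name="$1"
    cldb_host="$2"
    zk_host="$3"
    echo "/opt/mapr/server/configure.sh -N ${cluster_name} -C ${cldb_host} -Z ${zk_host}:5181"
}

# Dry run with the placeholder values from Step 16.
build_configure_cmd training 192.0.0.0 192.0.0.0
```

Printing the command first makes it easy to spot a wrong host or a missing port before anything on the node is changed.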

Step 17: cat /root/maprdisk.txt

Check the disk list file

Step 18: /opt/mapr/server/disksetup -F /root/maprdisk.txt

Set up the disks listed in the disk list file for MapR

Step 19: service mapr-zookeeper start

Start the MapR ZooKeeper service

Step 20: service mapr-zookeeper status

Check the status of the MapR ZooKeeper service

Step 21: service mapr-warden start

Start the MapR Warden service

Step 22: service mapr-warden status

Check the status of the MapR Warden service

Step 23: maprcli node cldbmaster

Show which node is the current CLDB master
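After Warden starts, the CLDB may need some time before maprcli node cldbmaster answers. One way to handle that is a small retry loop. This is a sketch under assumptions: wait_until is a hypothetical helper, and the retry count and one-second pause are arbitrary choices, not MapR recommendations.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds or the
# given number of attempts is used up. Returns 0 on success, 1 otherwise.
wait_until() {
    attempts="$1"
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# On the cluster node this could be used as:
#   wait_until 60 maprcli node cldbmaster
# Here it is demonstrated with a command that always succeeds.
wait_until 3 true && echo "ready"
```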

Step 24: maprcli license showid

Show your MapR license ID

Step 25: https://<ipaddress>:8443

Open <IP address>:8443 in a web browser and check whether it is working

Step 26: hadoop fs -ls /

List the files at the root of the Hadoop file system

Summary: The steps above complete a MapR installation on a single-node Linux cluster, with an explanation of each command.

MapR Installation steps on AWS

This section covers MapR installation on an Amazon Web Services (AWS) machine, in simple steps, for a Hadoop environment.

Step 1: Log in with your AWS credentials, connect to the instance, and switch to the root user.

[ec2-user@ip----~]$ sudo su -

Step 2: Stop the iptables service

[root@ip---- ~]# service iptables stop

Step 3: Disable iptables at boot

[root@ip----- ~]# chkconfig iptables off

Step 4: Edit the SELinux configuration

[root@ip----~]# vim /etc/selinux/config

Step 5: In the file, replace enforcing with disabled, then save and exit

SELINUX=disabled
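The manual edit in Steps 4 and 5 can also be scripted with sed. The sketch below assumes a hypothetical helper named disable_selinux; it defaults to /etc/selinux/config but is demonstrated here on a temporary file so that nothing on the system changes. The new setting takes effect after a reboot.

```shell
#!/bin/sh
# Hypothetical helper: set SELINUX=disabled in an SELinux config file.
# Defaults to /etc/selinux/config; pass another path to try it on a copy.
disable_selinux() {
    config_file="${1:-/etc/selinux/config}"
    # Replace whatever value the SELINUX= line has; sed keeps a .bak backup.
    # The pattern does not touch SELINUXTYPE= because it requires "SELINUX=".
    sed -i.bak 's/^SELINUX=.*/SELINUX=disabled/' "$config_file"
}

# Demonstration on a temporary file with sample content.
demo_file="$(mktemp)"
printf '# sample config\nSELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$demo_file"
disable_selinux "$demo_file"
grep '^SELINUX=' "$demo_file"
```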

Step 6: Change to the yum repos directory using the command below

[root@ip----~]# cd /etc/yum.repos.d/

Step 7: Edit the MapR ecosystem repo file.

[root@ip----yum.repos.d]# vi mapr_ecosystem.repo

Put the following lines into the above file

[MapR_Ecosystem]
name = MapR Ecosystem Components
baseurl = http://package.mapr.com/releases/MEP/MEP-3.0.4/redhat
gpgcheck = 0
enabled = 1
protected = 1

Step 8: Edit the MapR installer repo file.

[root@ip----yum.repos.d]# vi mapr_installer.repo

Step 9: Edit the MapR core repo file.

[root@ip----yum.repos.d]# vi mapr_core.repo

Put the following lines into the above file

[MapR_Core]
name = MapR Core Components
baseurl = http://archive.mapr.com/releases/v5.0.0/redhat/
gpgcheck = 1
enabled = 1
protected = 1
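Instead of typing the repo contents into vi, the two files shown in Steps 7 and 9 can be written with here-documents. This is a minimal sketch: write_mapr_repos is a hypothetical helper, and the installer repo from Step 8 is left out because its contents are not shown here. The demonstration writes into a temporary directory; for real use, the target would be /etc/yum.repos.d as root.

```shell
#!/bin/sh
# Hypothetical helper: write the ecosystem and core repo files from
# Steps 7 and 9 into the given directory.
write_mapr_repos() {
    repo_dir="$1"
    cat > "${repo_dir}/mapr_ecosystem.repo" <<'EOF'
[MapR_Ecosystem]
name = MapR Ecosystem Components
baseurl = http://package.mapr.com/releases/MEP/MEP-3.0.4/redhat
gpgcheck = 0
enabled = 1
protected = 1
EOF
    cat > "${repo_dir}/mapr_core.repo" <<'EOF'
[MapR_Core]
name = MapR Core Components
baseurl = http://archive.mapr.com/releases/v5.0.0/redhat/
gpgcheck = 1
enabled = 1
protected = 1
EOF
}

# Demonstration in a temporary directory so the system is not modified.
demo_dir="$(mktemp -d)"
write_mapr_repos "$demo_dir"
ls "$demo_dir"
```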

Step 10: List the configured yum repositories

[root@ip----- yum.repos.d]# yum repolist

(this lists the repositories that are enabled)

Step 11: Search for the MapR packages.

[root@ip------ yum.repos.d]# yum list all | grep mapr

(this displays all packages related to mapr)

Step 12: Import the MapR GPG key

[root@ip----- yum.repos.d]# rpm --import http://package.mapr.com/releases/pub/maprgpg.key

Step 13: Install the MapR CLDB, file server, web server, ResourceManager, and NodeManager

[root@ip------ yum.repos.d]# yum install mapr-cldb mapr-fileserver mapr-webserver mapr-resourcemanager mapr-nodemanager

Step 14: Install mapr Zookeeper

[root@ip------ yum.repos.d]# yum install mapr-zookeeper

Step 15: List the installed MapR roles

[root@ip----- yum.repos.d]# ls -l /opt/mapr/roles/

Step 16: Search the installed RPM packages for MapR by using grep.

[root@ip------ yum.repos.d]# rpm -qa | grep mapr

(displays installed packages related to mapr)

Step 17: Add a group for the mapr user

[root@ip------ yum.repos.d]# groupadd -g 5000 mapr

Step 18: Add the mapr user to that group

[root@ip------ yum.repos.d]# useradd -g 5000 -u 5000 mapr

Step 19: Set a password for the mapr user

[root@ip------ yum.repos.d]# passwd mapr

(enter a password of your choice for the mapr user)

Step 20: Verify the mapr user and group IDs

[root@ip------ yum.repos.d]# id mapr

Step 21: Check the fully qualified domain name using the command below

[root@ip------ yum.repos.d]# hostname -f

Step 22: Check disk availability

[root@ip------ yum.repos.d]# fdisk -l

(this shows the disks available on the machine; select the second disk for MapR)

Step 23: Add the second disk to the MapR disk list file.

[root@ip----- yum.repos.d]# vi /root/maprdisk.txt

(enter the second disk's device name here, then save and exit)
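The disk list file can also be generated from the shell rather than edited by hand. The sketch below assumes the file holds one device path per line; write_disk_list is a hypothetical helper and /dev/xvdb is only an example device name, so use the device that fdisk -l reports on your machine.

```shell
#!/bin/sh
# Hypothetical helper: write a disk list file with one device path per line.
# Usage: write_disk_list <file> <device> [<device> ...]
write_disk_list() {
    list_file="$1"
    shift
    : > "$list_file"    # start from an empty file
    for device in "$@"; do
        echo "$device" >> "$list_file"
    done
}

# Demonstration with a temporary file; /dev/xvdb is an assumed device name.
demo_list="$(mktemp)"
write_disk_list "$demo_list" /dev/xvdb
cat "$demo_list"
```

For the installation itself, the target file would be /root/maprdisk.txt.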

Step 24: Configure the node with the cluster name, CLDB host, and ZooKeeper host.

[root@ip----- yum.repos.d]# /opt/mapr/server/configure.sh -N training -C ip--------.ap-southeast-1.compute.internal -Z ip------.ap-southeast-1.compute.internal:5181

Step 25: Check the contents of the disk list file

[root@ip------ yum.repos.d]# cat /root/maprdisk.txt

Step 26: Download the EPEL release RPM

[root@ip------ ~]# wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

Step 27: Install the Extra Packages for Enterprise Linux (EPEL) release package

[root@ip------ ~]# rpm -Uvh epel-release-6*.rpm

Step 28: Start the ZooKeeper service

[root@ip------ ~]# service mapr-zookeeper start

Step 29: Start the Warden service

[root@ip-1----- ~]# service mapr-warden start

Step 30: Check which node is the CLDB master

[root@ip----- ~]# maprcli node cldbmaster

Then open the MapR Control System (MCS) in a web browser using your machine's IP address, as in the example below.
Example: https://192.168.0.0:8443

Adding Hive Service in MapR

After a successful installation of the MapR distribution, we need to add services such as Hive, Sqoop, Spark, and Impala. Here we add the Hive service to MapR with a few simple commands for a Hadoop environment.

Add Hive Service in MapR :

Follow the commands below to set up the Hive services:

Step 1: Install the MapR Hive packages with yum.

[root@master1 ~]# yum install mapr-hive mapr-hiveserver2 mapr-hivemetastore mapr-hivewebhcat

In this step yum loads its plugins (fastestmirror, refresh-packagekit, security) and sets up the install process.

The following MapR Hive service packages are installed:
mapr-hive (noarch)
mapr-hivemetastore
mapr-hiveserver2
mapr-hivewebhcat

Step 2: Install MySQL server as the external database for multiple users.

[root@master1 ~]# yum install mysql-server

yum downloads the following RPM files for the MySQL server:

mysql-5.1.73-8.el6_8.x86_64.rpm
mysql-server-5.1.73-8.el6_8.x86_64.rpm
perl-DBD-MySQL-4.013-3.el6.x86_64.rpm
perl-DBI-1.609-4.el6.x86_64.rpm

Step 3: Check the MySQL service status

[root@master1 ~]# service mysqld status

Step 4: Start MySQL service by using below command:

[root@master1 ~]# service mysqld start

After starting the MySQL service, log in to the MySQL CLI as root

# mysql -u root -p

Step 5: Grant all privileges.

mysql> grant all privileges on *.* to 'your name'@'localhost' identified by 'your name';

Step 6: Flush all privileges.

mysql> flush privileges;

Step 7: Exit from the MySQL CLI

mysql> exit

Step 8: Edit hive-site.xml with the full metastore configuration

[root@master1 ~]# vi /opt/mapr/hive/hive-2.1/conf/hive-site.xml
<configuration>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>your name</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your name</value>
<description>password to use against metastore database</description>
</property>

<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9089</value>
</property>

</configuration>

Step 9: Export the metastore port number.

[root@master1 ~]# export METASTORE_PORT=9089

Step 10: Initialize the MySQL database schema

[root@master1 ~]# /opt/mapr/hive/hive-2.1/bin/schematool -dbType mysql -initSchema

Step 11: Log in to the MySQL CLI with your credentials

[root@master1 ~]# mysql -u name -p
Enter password:

Step 12: Check the databases

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| test               |
+--------------------+

Step 13: Exit from MySQL CLI

mysql> exit
Bye

Step 14: Install the MySQL Connector/J package for the JDBC connection

[root@master1 ~]# yum -y install mysql-connector-java

Step 15: Start the metastore service

[root@master1 ~]# /opt/mapr/hive/hive-2.1/bin/hive --service metastore --start

Step 16: Start the Hive CLI:

[root@master1 ~]# hive
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Prerequisites for MapR Installation on CentOS

In the Hadoop ecosystem, three Big Data distributions are most commonly preferred:

1.Cloudera Distribution Hadoop

2.Hortonworks Data Platform

3.MapR Distribution Platform

Cloudera's distribution platform is available as a free version, as Express, and as an Enterprise edition with a trial of up to 60 days.

The Hortonworks Data Platform is a completely open source platform for production, development, and testing environments.

Finally, the MapR distribution platform is a full enterprise edition, but a free version, M3, is available with fewer features than M5 and M7.

How to install the MapR free version on a pseudo cluster:

Before installing MapR, configure the prerequisites below:

——-Prerequisites——–

1.Configure the hostname as an FQDN (for example, mapr.hadoop.com) by using the setup command, then check it with hostname -f

2.vi /etc/hosts

3.hostname <your fully qualified domain name>

4.vim /etc/selinux/config ===> SELINUX=disabled

——-Disable Firewalls and IPTables——-

If firewalls and iptables are enabled, they block some of the ports that MapR needs, so they must be disabled.

1.service iptables save

2.service iptables stop

3.chkconfig iptables off

4.service ip6tables save

5.service ip6tables stop

6.chkconfig ip6tables off

—– Enable NTP service for machines —–

NTP (Network Time Protocol) is a networking protocol for clock synchronization between computers over packet-switched networks.

1.yum -y install ntp ntpdate ntp-doc

2.chkconfig ntpd on

3.vi /etc/ntp.conf (add the server lines in items 4 to 6 to this file)

4.server 0.rhel.pool.ntp.org

5.server 1.rhel.pool.ntp.org

6.server 2.rhel.pool.ntp.org

7.ntpq -p

8.date (all machines must show the same date and time; otherwise errors will occur)
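The server lines in items 4 to 6 can be appended to the NTP configuration from a script, which makes it easy to repeat on every node. The following is a minimal sketch: add_ntp_server is a hypothetical helper that skips a line if it is already present, and it is demonstrated on a temporary file rather than on /etc/ntp.conf.

```shell
#!/bin/sh
# Hypothetical helper: add a "server <host>" line to an ntp.conf-style file
# unless exactly the same line is already there.
add_ntp_server() {
    conf_file="$1"
    ntp_host="$2"
    if ! grep -qxF "server ${ntp_host}" "$conf_file"; then
        echo "server ${ntp_host}" >> "$conf_file"
    fi
}

# Demonstration on a temporary file with the three pool servers listed above.
demo_conf="$(mktemp)"
for host in 0.rhel.pool.ntp.org 1.rhel.pool.ntp.org 2.rhel.pool.ntp.org; do
    add_ntp_server "$demo_conf" "$host"
done
add_ntp_server "$demo_conf" 0.rhel.pool.ntp.org    # repeated call adds nothing
cat "$demo_conf"
```

Because the helper checks for an existing line first, it can be run several times without duplicating entries.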

—— Install some additional packages in Linux OS —-

Here we install Java 1.8 and Python

1.yum -y install java-1.8.0-openjdk-devel

2.yum -y install python perl expect expectk

—- Set up passwordless SSH on all nodes from the master node ——

This enables passwordless authentication between the master and slave nodes

1.ssh-keygen -t rsa

2.cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3.ssh-copy-id root@<FQDN> (run once for each node, for example FQDN1 and FQDN2)
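When there are several nodes, the ssh-copy-id call can be generated in a loop. This sketch only prints the commands (a dry run) so they can be reviewed first; print_copy_id_cmds and the host names node1.hadoop.com and node2.hadoop.com are hypothetical examples.

```shell
#!/bin/sh
# Hypothetical helper: print one ssh-copy-id command per node (dry run).
# Usage: print_copy_id_cmds <fqdn> [<fqdn> ...]
print_copy_id_cmds() {
    for node in "$@"; do
        echo "ssh-copy-id root@${node}"
    done
}

# Example host names; replace them with the FQDNs of your own nodes.
print_copy_id_cmds node1.hadoop.com node2.hadoop.com
```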

—–Additional Linux configuration: Transparent Huge Pages (THP) and swappiness—-

1.echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

2.echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

3.sysctl vm.swappiness=10
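To confirm that the THP setting took effect, the active value can be read back: the kernel lists all options in the file and marks the active one with square brackets, for example "always madvise [never]". The sketch below uses a hypothetical helper, active_thp_setting, and is demonstrated on a temporary file with sample content rather than on the real /sys path.

```shell
#!/bin/sh
# Hypothetical helper: print the active value from a THP settings file.
# Defaults to the RHEL/CentOS 6 path used in the steps above.
active_thp_setting() {
    thp_file="${1:-/sys/kernel/mm/redhat_transparent_hugepage/enabled}"
    # Keep only the text between the square brackets.
    sed 's/.*\[\(.*\)\].*/\1/' "$thp_file"
}

# Demonstration with sample content in a temporary file.
demo_thp="$(mktemp)"
echo 'always madvise [never]' > "$demo_thp"
active_thp_setting "$demo_thp"
```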

Set up the EPEL repository for installing additional packages on the system

The EPEL repository provides additional packages for the CentOS machine

1.wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

2.rpm -Uvh epel-release-6*.rpm

MapR Architecture

MapR Architecture:

Before Hadoop was introduced in 2007, there was no single data platform that could provide a scalable architecture for handling fast-growing data with a unified security model.

There are four important pillars of a data platform

1.Distributed Metadata

2.Variety of Protocols and API support

3.Variety of Data persistence like objects, files, tables and event queues.

4.Security

Distributed Metadata:

A centralized metadata service leads to a number of restrictions, as listed below:

1.Creates a single point of failure

2.Creates a hotspot that limits the scalability of the cluster

3.Limits sharing of data artifacts

4. Limits the number of data artifacts that can be stored in the cluster.

MapR has built a distributed metadata service from the ground up that removes all these restrictions.

The CLDB (Container Location Database) serves as MapR's level-1 metadata service and maintains metadata about volumes, containers, and nodes across the entire cluster.

Metadata about data artifacts such as objects, files, tables, topics, and directories is maintained as level-2 metadata, which is stored in the name container.

Variety of APIs and Protocol Support:

The MapR Data Platform makes the same data available through different APIs, so different applications can use different APIs:

1.HDFS API

2.S3 API

3.NFS

4.POSIX

5.OJAI API

6.CDC API

Variety of Data persistence:

The MapR data container is the unit of storage allocation and management. Each container stores a variety of data elements such as objects, files, tables, and directories.

It supports two types of data elements:

1.File chunks

2.Key – Value stores

These two data elements are the building blocks in MapR: file chunks are spread across containers, directories are built over key-value stores, and tables are built on top of files and key-value stores with an index.

The MapR Data Platform was architected to solve most data problems for the enterprise and to remove the need for separate data tools.

The heart of the MapR data platform is the Data Container.

And Data Container provides:

1.Different data persistence models, such as files, tables, objects etc.

2.Distributed scale-out storage

3.Data loss prevention

4.Failure resilience and disaster recovery

MapR

What is MapR?

MapR is one of the Big Data distributions. It is a complete enterprise distribution for Apache Hadoop, designed to improve Hadoop's reliability, performance, and ease of use.

Why MapR?

1. High Availability:

MapR provides high-availability features such as self-healing, which means there is no NameNode in the architecture.

It also offers JobTracker high availability and NFS. MapR achieves this by distributing its file system metadata.

2. Disaster Recovery:

MapR provides a mirroring facility that lets users define policies and mirror data automatically, within a single-node or multi-node cluster or between on-premise and cloud infrastructure.

3. Record Performance:

MapR set a world performance record at a cost of only $9, compared with the earlier cost of $5M, with a time of 54 seconds. It also handles large clusters, such as 2,200 nodes.

4.Consistent Snapshots:

MapR is the only Big Data distribution that provides consistent, point-in-time recovery, because of its unique read-write storage architecture.

5. Complete Data Protection:

MapR has its own security system for data protection at the cluster level.

6.Compression:

MapR automatically compresses the files in the cluster behind the scenes.

7.Unbiased Open Source:

MapR is a completely unbiased open source distribution.

8. Real multitenancy, including YARN

9.Enterprise-grade NoSQL

10. Read and Write file system:

MapR has a read-write file system.

MapR Ecosystem Packs (MEP):

The “MapR Ecosystem” is the set of open source projects included in the MapR Platform, and a “pack” is a bundled set of MapR Ecosystem projects with specific versions.

MapR Ecosystem Packs are mostly released every quarter, and yearly as well.

A single version of MapR may support multiple MEPs, but only one at a time.

Typically, Hadoop ecosystem and other open source components such as Spark and Hive are included in MapR Ecosystem Packs, along with tools such as the following:

Collectd
Elasticsearch
Grafana
Fluentd
Kibana
Open TSDB

MapR Vs Cloudera Vs Hortonworks

Three Big Data distributions are the most familiar in the present market.

1.Cloudera

2.Hortonworks

3.MapR

 

Cloudera and HDP (Hortonworks Data Platform) are open source, with enterprise editions also available, whereas MapR is a complete enterprise distribution for Apache Hadoop, designed to improve Hadoop's reliability, performance, and ease of use.

                     Hortonworks   Cloudera           MapR

Manageability:

Management Tools     Ambari        Cloudera Manager   MapR Control System
Volume Support       No            No                 Yes
Heat map, Alarms     Yes           Yes                Yes
Alerts               Yes           Yes                Yes
REST API             Yes           Yes                Yes

High Availability:

Hortonworks  - Single failure recovery
Cloudera     - Single failure recovery
MapR         - Self healing across multiple failures

Replication:

Hortonworks - Data
Cloudera    - Data
MapR        - Data + Metadata

Disaster Recovery:

Hortonworks - No
Cloudera    - File Copy Scheduling
MapR        - Mirroring

Upgrading:

Hortonworks - Planned downtime
Cloudera    - Rolling Upgrades
MapR        - Rolling Upgrades

Summary: Big Data and analytics are among today's fastest-emerging technologies. The main Big Data distributions are Cloudera, HDP, and MapR, each with its own special features and with open source and enterprise editions. MapR is used mostly in the banking and finance sectors. Cloudera is used widely, in both enterprise and open source settings, and Hortonworks is used in much the same way as Cloudera.