How to Install PyCharm (Python) on Linux/Ubuntu and How to Create a Project




PyCharm Installation on Linux:

Step 1: Download the tarball from the official JetBrains PyCharm website. There are two editions: Professional (for both scientific and web Python development) and Community (for Python development). Here we simply download the Community edition.

Step 2: After downloading the tarball, extract it using the command below, then locate the extracted folder in the directory:

tar -xzvf pycharm-community-2019.2.tar.gz

Step 3: Go to the /home/sreekanth/Downloads/pycharm-community-2019.2/bin folder.

Step 4: Run the “pycharm.sh” file from the bin folder using the command below:

bash pycharm.sh

 

Step 5: The PyCharm window then opens automatically, like the image below:

If you want to read each tip, click “Next Tip”; otherwise click “Close”.

Step 6: At the top left, click File -> New Project to create a new project.

Step 7: After clicking “New Project”, choose the project location for your application, like in the snapshot below.




After choosing a location, click the “Create” button.

Step 8: After creating your project and finishing your code, execute it and check the results in the “Python Console”. If you want to write sample programs, you can write and execute them there directly.

Example: >>> print("Hello"), then press Enter and you will see the result.
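For instance, a couple of statements you might type in the Python Console (a minimal sketch; the variable name is just illustrative):

>>> print("Hello")
Hello
>>> greeting = "Hello from PyCharm"
>>> print(greeting.upper())
HELLO FROM PYCHARM

Each line runs as soon as you press Enter, which makes the console handy for quick experiments before putting code into a project file.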

Summary: The above steps install the PyCharm Community edition on a Linux or Ubuntu operating system, with the process shown step by step in pictures.

Permission Denied Error in Hive While Creating a Database in the Hadoop Eco-System

I installed the Hive service on top of the Hadoop eco-system and then tried to create a database, but I got the error below and found a solution as well.



Permission Denied Error in Hive:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> set hive.auto.convert.join.noconditionaltask=false;
hive> create database myhive;
FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=hadoop, access=WRITE, inode="/user": hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:224)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:149)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4891)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:669)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

The above error is a permission issue in the Hive component.

Resolution:

To resolve the permission issue for the hadoop user in HDFS, follow the steps below and change the permissions for the user with the chown/chmod commands:

Step 1: Log in as hduser and then execute the commands below one by one.
Step 2: sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
Step 3: sudo -u hdfs hadoop fs -chmod g+w /tmp
Step 4: sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
Step 5: sudo -u hdfs hadoop fs -chown -R hadoop /user/hive/warehouse
Step 6: sudo chmod 777 /var/lib/hive/metastore
Step 7: cd /var/lib/hive/metastore/metastore_db/
Step 8: sudo rm *.lck

Summary: After applying the above resolutions, the error in Hive is resolved and everything works fine now.

What is the Apache Spark Eco-System | Spark SQL | Spark Streaming | GraphX





What is Apache Spark?
Spark is a fast, easy-to-use, and flexible in-memory data processing and compute framework. It can run on top of the Hadoop eco-system or in the cloud, accessing diverse data sources including HDFS, HBase, and other services.

Different Key Features of Spark:
1. Fast

2. General purpose

3. Scalable

4. Fault-tolerant

What is the Spark Engine?

The Spark engine is responsible for scheduling, distributing, and monitoring data applications across the cluster.

What is RDD?

RDD stands for Resilient Distributed Dataset. RDDs are designed to be fault-tolerant and represent data distributed across a cluster, where the chance of a node failing grows in proportion to the number of nodes in the cluster.

RDD supports two operations (see the sketch below):
1. Transformations

2. Actions
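As a minimal sketch of the two operation types (assuming a local PySpark installation; the numbers are just illustrative), a transformation builds a new RDD lazily while an action triggers the actual computation:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

numbers = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a Python list
squares = numbers.map(lambda x: x * x)      # transformation: nothing runs yet
total = squares.reduce(lambda a, b: a + b)  # action: triggers execution, returns 55

print(total)
sc.stop()

Nothing is computed until the reduce action is called, which is what makes lineage-based recovery of lost partitions possible.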




What is Hive on Spark?

Hive on Spark means Hive uses Apache Spark as its execution engine; Hive execution is pointed at Spark with the configurations below:

hive> set spark.home=/location/to/Spark_Home;

hive> set hive.execution.engine=spark;

Hive on Spark supports Spark on YARN mode by default.

Spark Eco-System

1. Spark SQL – for structured data processing

2. Spark Streaming – for live data streaming

3. GraphX – for graph computation

4. MLlib – for machine learning

5. SparkR – for using Spark from R

What is Spark SQL?

Spark SQL, earlier known as Shark, is the Spark module for structured data processing. It lets Spark execute relational SQL queries on data, and at its core it supports and builds on RDDs.
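As a rough illustration of running a relational SQL query with Spark SQL (a sketch assuming PySpark 2.x; the table and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# build a small DataFrame and register it as a temporary SQL view
df = spark.createDataFrame([(1, "Hyderabad"), (2, "Bangalore")], ["id", "city"])
df.createOrReplaceTempView("cities")

# execute a relational SQL query on the structured data
spark.sql("SELECT city FROM cities WHERE id = 2").show()

spark.stop()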

What is Spark Streaming?

Apache Spark Streaming supports processing of live data. It is an extension of the core Spark API that allows stream processing of continuous live data streams. For example, data from sources such as HDFS, Flume, or Kafka is streamed in, processed, and finally written out to file systems.
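A minimal sketch using the DStream API (assumptions: a local PySpark installation and a text source on localhost:9999, for example one started with netcat; the host and port are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, 5)                      # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # continuous live data stream
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's word counts

ssc.start()
ssc.awaitTermination()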

What is Spark GraphX?

Spark GraphX is Spark's component for graph processing; it is used to build and transform graphs. The component enables programmers to reason about graph-structured data at scale.

What is Spark MLlib?

Spark MLlib is a scalable machine learning library provided by Apache Spark. It provides easy-to-use algorithms for different use cases such as clustering, collaborative filtering, etc.
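For instance, a small clustering sketch with the DataFrame-based ML API (an assumption here, since the post does not show code; the points and column names are made up):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

data = spark.createDataFrame([(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"])
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(data)

model = KMeans(k=2, seed=1).fit(features)   # group the points into two clusters
model.transform(features).show()            # adds a "prediction" column per row

spark.stop()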

 

Opera Failed to Uninstall on Windows 10





The Opera browser was installed automatically on my Windows 10 operating system, so now I am trying to uninstall it.

Opera Browser Uninstall on Windows 10

Click the Windows button; the Opera browser shows up as below:

While uninstalling, I got the error below (snapshot included for clarity):

Opera failed to uninstall: Unable to uninstall Opera.exe.
Please make sure Opera is not running and try again.

Resolutions:

Step 1: Click the Windows button, search for Opera, and right-click the application. Then select the Uninstall option; it will redirect you to the Control Panel's “Uninstall a program” page.

Step 2: If the above option is not available, go directly to Control Panel -> Uninstall a program, choose the Opera browser application, right-click it, and click the Uninstall button. In the options that appear, tick “Delete my Opera user data” and click the Uninstall button.






Step 3: After clicking the Uninstall button, you will get a Yes/No window. Choose Yes and Opera will be completely uninstalled from the Windows 10 operating system. If it is still not uninstalled, restart Windows 10 and start again from scratch with every step.

How to Remove (Uninstall) the WebDiscover Browser on Windows [Virus/Malware] with Pictures




What is WebDiscover Browser in Windows?

It is an unwanted browser that comes bundled with other software and provides shortcuts to sites like Facebook, YouTube, etc. It is downloaded automatically from the internet. It is a customized Google Chromium-based browser and it changes the search engine automatically. On some Windows 7 systems it shows up on top of the desktop window, like the image below:
Picture 1:

How to Remove WebDiscover Browser

Here is the step-by-step process to uninstall the WebDiscover browser, with pictures.
Step 1: Open “Control Panel” in your operating system, whether it is Windows 7 or Windows 10. I am going with Windows 10.

Step 2: After opening Control Panel, go to the Programs option and select “Uninstall a program” to remove WebDiscover.

Step 3: Select the WebDiscover Browser, right-click on it, and simply uninstall the software.

Step 4: After the uninstall completes, go to the Local Disk (C:) program files and delete the WebDiscover folder. If the folder cannot be deleted completely, restart the system and delete the folder again.

Step 5: If you have antivirus software, scan the entire system.

When the WebDiscover browser is installed automatically on a Windows operating system, it makes some common changes to your machine. Mostly it changes the web browser's homepage to the WebDiscover homepage (like picture 1) and changes the search engine as well. It also changes the new-tab functionality to launch a modified search portal and loads extra Mozilla add-ons or Chrome extensions.




Summary: The WebDiscover browser is just another browser for searching, but it installs itself as a default application without human interaction, so we uninstall it from the Windows operating system with these simple steps.

Most frequently asked HBase Interview Questions and Answers





1. When should you use HBase, and what are the key components of HBase in the Hadoop eco-system?

In the Hadoop eco-system, HBase should be used when the big data application has a variable schema, when data is stored in the form of collections, and when the application demands key-based access for storing and retrieving data. The Region Server monitors the regions, and the HBase Master is responsible for monitoring the region servers.
ZooKeeper takes care of the coordination and configuration between the HBase Master component and the client. Catalog tables: there are two catalog tables, -ROOT- and .META.

2. What are the different operational commands in HBase at the record level and the table level?
At the record level – put, get, increment, scan, and delete.
At the table level – describe, list, drop, disable, and scan.

3. Explain the difference between the RDBMS data model and the HBase data model in a Big Data environment.

A. In a Big Data environment, RDBMS is a schema-based database model.
B. HBase is a schema-less database model.
C. RDBMS doesn't have support for built-in partitioning in data modeling.
D. HBase has automated partitioning in data modeling.




4. What is the difference between HBase and Hive in Hadoop?

HBase and Hive are both Hadoop-based technologies, but they are different: Hive is a data summarization and SQL layer on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop.

HBase supports 4 primary operations: put, get, scan, and delete (a small Python sketch follows below), whereas Hive lets you use SQL to run MapReduce jobs.
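As an illustration of those four operations from Python, here is a minimal sketch using the third-party happybase library (an assumption, not part of HBase itself; it talks to HBase through the Thrift server, and the table and column names are made up):

import happybase

# assumes an HBase Thrift server is running on localhost (default port 9090)
connection = happybase.Connection("localhost", port=9090)
table = connection.table("employee")

table.put(b"row1", {b"info:name": b"Sreekanth"})   # put a cell
print(table.row(b"row1"))                          # get the row back
for key, data in table.scan():                     # scan the table
    print(key, data)
table.delete(b"row1")                              # delete (writes a tombstone marker)

connection.close()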

5. What are the different types of tombstone markers in HBase for deletion?
In HBase, there are three types of tombstone markers for deletion:

A. Family Delete Marker, B. Version Delete Marker, C. Column Delete Marker.
6. Explain the process of row deletion in HBase on top of Hadoop.

In HBase, the delete command does not immediately remove the cells; instead, the cells are made invisible by setting a tombstone marker, and they are physically removed later during major compaction.

How to Install Kafka on Linux/Ubuntu (Single-Node Cluster) with Pictures

Apache Kafka is a distributed messaging system. Here is the step-by-step process to install Apache Kafka on a Linux/Ubuntu operating system.



Prerequisites:

Kafka requires ZooKeeper and Java to run. JDK 1.7 or above is mandatory for the Kafka installation; install it using the commands below:

$ sudo apt update
$ sudo apt install default-jdk

Step 1: Download the Kafka binary files from the official Apache website:

https://archive.apache.org/dist/kafka/

Step 2: Extract the tarball using the command below:

$ tar -xzvf kafka_2.12-0.10.1.1.tgz


Step 3: After extraction, we can see the Kafka directory.

Step 4: Update the KAFKA_HOME & PATH variables in the bashrc file:

export KAFKA_HOME=/home/your_path/INSTALL/kafka_2.12-0.10.1.1
export PATH=$PATH:$KAFKA_HOME/bin


Step 5: After the bashrc changes, open a new terminal and check them using the command below:

$ echo $KAFKA_HOME






After installing Apache Kafka on Linux/Ubuntu, start the Kafka server. Before starting the Kafka server, start the ZooKeeper server on your single-node cluster using the commands below:

$ cd /usr/local/kafka
$ bin/zookeeper-server-start.sh config/zookeeper.properties

After starting the ZooKeeper server, start the Kafka server:

$ bin/kafka-server-start.sh config/server.properties

After starting the Kafka server, create topics and then try message passing from a producer to a consumer (see the sketch below). The above steps cover the Kafka installation for a single-node/pseudo-distributed cluster setup on top of the Hadoop ecosystem.
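Topics can be created with the kafka-topics.sh script in the bin folder; for the producer-to-consumer message passing, here is a minimal sketch using the third-party kafka-python package (an assumption, not part of the Kafka distribution; the topic name and broker address are placeholders):

from kafka import KafkaProducer, KafkaConsumer

# send a message to a topic on the local broker (default port 9092)
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello from the producer")
producer.flush()

# read the messages back from the beginning of the topic
consumer = KafkaConsumer("test-topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)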

Summary: Apache Kafka installation on a Linux/Ubuntu operating system is very simple to do and use. If you need the installation on Cloudera, you need to download a separate bundle in Cloudera Manager to set it up in a multi-node cluster. On Hortonworks, Kafka is installed through Ambari.

 

Most frequently asked Apache PIG Interview Questions and Answers[Updated]

1. Are there any problems which can only be solved by MapReduce and cannot be solved by Apache Pig? In what scenarios are MapReduce jobs more useful than Pig?




Take a scenario where we want to count the population in two cities. We have a data set with records from a number of different cities, and we want to count the population of two of them using MapReduce. Let us assume one is Hyderabad and the other is Bangalore. I need to make the key for Hyderabad behave the same as the key for Bangalore, so that the population data of these two cities reaches the same reducer. The idea is to instruct the MapReduce program: whenever you find a city named “Hyderabad” or a city named “Bangalore”, create an alias name that is common to both cities, so that a common key is created for both and the records get passed to the same reducer. For this, we have to write a custom partitioner.

In MapReduce, when you create the ‘key’ for a record, you take the ‘city’ as the key. Whenever the MapReduce framework comes across a different city, it treats it as a different key, so you need a customized partitioner: if city = Hyderabad or Bangalore, route them through the same hashcode. You cannot create such a custom partitioner in Pig; with Pig you cannot direct the execution engine to customize the partitioner. In these types of scenarios, MapReduce works better than Apache Pig.

2. What is the difference between MapReduce and Apache Pig?

In the Hadoop eco-system, to process data with MapReduce you need to write the entire logic for operations like join, group, filter, etc.
Pig has built-in functions for these operations compared to MapReduce.
In terms of code, about 20 lines of Pig Latin equal roughly 400 lines of Java.
Pig gives higher productivity compared to MapReduce programming.
MapReduce needs much more effort in writing code.

3. Why should we use the ‘distinct’ keyword in Pig scripts?

In Pig scripts, the distinct keyword is very simple: it removes duplicate records. Distinct works only on entire records, not on individual fields, as in the example below:

users = load 'daily' as (email, name);
uniq_users = distinct users;

4. What is the difference between Pig and SQL?
Apache Pig and SQL have a lot of differences; here are the main ones:

Pig is procedural             SQL is declarative
Pig suits OLAP workloads      SQL covers OLAP + OLTP workloads
Schema is optional in Pig     Schema is mandatory in SQL

Unable to Integrate Hive with Spark and different resolutions




How to integrate (connect) Hive and Spark:

Here we provide solutions for how to integrate (connect) the Hive database with Spark in Hadoop development.
The first time we tried to connect Hive and Spark, we got the error below and found different types of resolutions for different modes.

Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke
the "BONECP" plugin to create a ConnectionPool gave an error: The specified
datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please
check your CLASSPATH specification, and the name of the driver.

Different types of solutions for the above error:

Resolution 1:

1. Download the MySQL connector Java JAR file from the official Maven repository, e.g. the link below:
https://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.21
2. Paste the JAR file into the jars folder inside the Spark installation directory.

Resolution 2:

Without a JDBC driver:

1. Go to hive-site.xml and set the hive.metastore.uris property in that XML file.
2. Import org.apache.spark.sql.hive.HiveContext, as it can perform SQL queries over Hive tables, and then define the sqlContext as in the code below (a rough PySpark equivalent is sketched after this list):
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. Finally, verify the tables in Spark SQL.
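A rough PySpark equivalent of the Scala snippet above (a sketch, assuming the same hive-site.xml is visible on Spark's conf path):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-spark-check")
sqlContext = HiveContext(sc)           # picks up hive-site.xml to reach the metastore

sqlContext.sql("SHOW TABLES").show()   # verify that the Hive tables are visible in Spark SQL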

Resolution 3:





Use Beeline for the Hive and Spark connection instead of the Hive CLI. Beeline provides higher security and connects directly to a remote HiveServer2; check the two commands below for Beeline with HiveServer2 configuration.

Step 1: ./bin/beeline
Step 2: !connect jdbc:hive2://remote_hive:10000

Hadoop Cluster Interview Questions and Answers Updated

1. In which directory is Hadoop installed?




Apache Hadoop and Cloudera have the same directory structure. Hadoop is installed in /usr/lib/hadoop.

2. What are the three modes in which Hadoop can be run?
In the Hadoop eco-system, the three different modes in which Hadoop can be run are:
A. Standalone (local) mode
B. Pseudo-distributed mode
C. Fully distributed mode

3. What are the features of standalone mode in a Hadoop environment?
In the Hadoop eco-system, standalone mode has no daemons and everything runs in a single JVM (Java Virtual Machine). There is no DFS (Distributed File System); it utilizes the local file system. Standalone mode is suitable only for running MapReduce programs during development.

4. Does Hadoop follow the UNIX pattern?
Yes, Hadoop follows UNIX Pattern.

5. What are the features of pseudo-distributed mode in a Hadoop environment?
In the Hadoop eco-system, pseudo-distributed mode is used for both QA and development environments. In pseudo-distributed mode, all the daemons run on the same machine.




6. Can we call VMs Pseudos?
No. A VM is not the same as a pseudo-distributed setup: a VM is just a virtual machine, whereas pseudo-distributed mode is a specific Hadoop configuration.

7. What are the features of fully distributed mode in a Hadoop environment?
In the Hadoop eco-system, fully distributed mode is used for production, development, and QA environments, where a number of machines form the Hadoop cluster.

8. What are the default port numbers of the NameNode, JobTracker, and TaskTracker in the Hadoop eco-system?
The port number of the NameNode is 50070, of the JobTracker 50030, and of the TaskTracker 50060.

9. Which Hadoop configuration files are required for a Hadoop installation?
There are three files to configure for Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
These files are located in the hadoop/conf/ directory.

10. What happens if you get a “connection refused” Java exception when you check with hadoop fsck?
It means that the NameNode is not running on your machine.

11. What does /etc/init.d do?
/etc/init.d is where daemon (service) start/stop scripts are placed; it is used to start and stop these daemons and to see their status.