How to Remove (Uninstall) WebDiscover Browser on Windows [Virus/Malware] with Pictures

What is WebDiscover Browser in Windows?

WebDiscover is an unwanted browser that typically arrives bundled with other software downloads and installs itself automatically from the internet. It is a customized build of the Chromium browser (the open-source base of Google Chrome), and it changes the default search engine automatically. On some Windows 7 systems it pins itself to the top of the desktop window, as in the image below:
Picture 1:

How to Remove WebDiscover Browser

Here is the step-by-step process to uninstall the WebDiscover browser, with pictures.
Step 1: Open the “Control Panel” in your operating system, whether it is Windows 7 or Windows 10. I am going with Windows 10.

Step 2: In the Control Panel, go to the Programs option and select Uninstall a program.

Step 3: Select WebDiscover Browser in the list, right-click on it, and simply uninstall the software.

Step 4: After the uninstall completes, go to Local Disk (C:) > Program Files and delete the WebDiscover folder. If the folder cannot be deleted completely, restart the system and delete it again.

Step 5: If you have antivirus software installed, scan the entire system.

When the WebDiscover browser installs itself on a Windows operating system, it makes some common changes to your machine. It usually changes the web browser's homepage to the WebDiscover homepage (as in Picture 1) and changes the search engine as well. It also repoints the new-tab page to its modified search portal and loads extra Mozilla add-ons or Chrome extensions.

Summary: WebDiscover is just one more browser for searching the web, but it installs itself by default without human interaction, so it is best to uninstall it from the Windows operating system using the simple steps above.

Most frequently asked HBase Interview Questions and Answers

1. When should you use HBase, and what are the key components of HBase in the Hadoop ecosystem?

In the Hadoop ecosystem, HBase should be used when a big data application has a variable schema, when data is stored in the form of collections, and when the application demands key-based access while storing and retrieving data. The Region Server monitors the regions, and the HBase Master is responsible for monitoring the region servers.
Zookeeper takes care of the coordination and configuration between the HBase Master component and the client. There are two catalog tables: -ROOT- and .META.

2. What are the different operational commands in HBase at the record level and the table level?
Record-level commands: put, get, increment, scan and delete.
Table-level commands: describe, list, drop, disable and scan.
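As a sketch, the two groups of commands look like this in the HBase shell (the table `t1`, row key and column names here are hypothetical):

```
hbase> put 't1', 'row1', 'cf:a', 'value1'    # record level: write a cell
hbase> get 't1', 'row1'                      # record level: read a row
hbase> incr 't1', 'row1', 'cf:c', 1          # record level: increment a counter
hbase> delete 't1', 'row1', 'cf:a'           # record level: delete a cell
hbase> describe 't1'                         # table level: show schema
hbase> disable 't1'                          # table level: required before drop
hbase> drop 't1'                             # table level: remove the table
```

Note that a table must be disabled before it can be dropped.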

3. Explain the difference between the RDBMS data model and the HBase data model in a big data environment.

A. In a big data environment, an RDBMS is a schema-based database model.
B. HBase is a schemaless database model.
C. An RDBMS has no built-in support for partitioning in its data modeling.
D. HBase provides automated partitioning in its data modeling.

4. What is the difference between HBase and Hive in Hadoop?

HBase and Hive are both Hadoop-based technologies, but they serve different purposes: Hive is a data-summarization layer on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop.

HBase supports four primary operations: put, get, scan and delete, whereas Hive lets you express SQL queries that run as MapReduce jobs.

5. What are the different types of tombstone markers in HBase for deletion?
In HBase, there are three types of tombstone markers for deletion:

A. Family Delete Marker
B. Version Delete Marker
C. Column Delete Marker
6. Explain the process of row deletion in HBase on top of Hadoop?

In HBase, the delete command does not actually remove data from the cells immediately; rather, the cells are made invisible by setting a tombstone marker, and the data is physically removed later during a major compaction.
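The tombstone behaviour can be observed from the HBase shell: a normal scan hides the deleted cell, while a raw scan still shows the delete marker until a major compaction runs (the table and column names are hypothetical):

```
hbase> put 't1', 'row1', 'cf:a', 'value1'
hbase> delete 't1', 'row1', 'cf:a'
hbase> scan 't1'                               # the deleted cell is invisible
hbase> scan 't1', {RAW => true, VERSIONS => 5} # the tombstone marker is still visible
```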

How to Install Kafka in Linux/Ubuntu (Single node cluster) with Pictures

Apache Kafka is a distributed messaging system. Here is the step-by-step process to install Apache Kafka on a Linux/Ubuntu operating system.


Kafka requires Zookeeper and Java to run. JDK 1.7 or above is mandatory for the Kafka installation; install it using the below commands:

$ sudo apt update
$ sudo apt install default-jdk

Step 1: Download the Kafka binary files from the official Apache website.

Step 2:  Extract the tarball using the below command:

$ tar -xzvf kafka_2.12-

Step 3: After extraction, you will see the Kafka directory.

Step 4: Update the KAFKA_HOME and PATH variables in the .bashrc file:

 export KAFKA_HOME=/home/your_path/INSTALL/kafka_2.12-
 export PATH=$PATH:$KAFKA_HOME/bin

Step 5: After the .bashrc changes, open a new terminal and check them using the below command:

$ echo $KAFKA_HOME
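Steps 4 and 5 together can be sketched as below; the install path is a hypothetical placeholder, so substitute the directory your tarball actually extracted to:

```shell
# Hypothetical install path; replace with your actual extraction directory.
export KAFKA_HOME=/home/your_path/INSTALL/kafka_2.12-x.y.z
export PATH=$PATH:$KAFKA_HOME/bin

# Verify the variable is visible in the current shell.
echo "$KAFKA_HOME"
```

In a new terminal this only works after the `.bashrc` edit, because `export` lines in the current shell are not persisted by themselves.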

After installing Apache Kafka on Linux/Ubuntu, start the Kafka server. Before starting the Kafka server, start the Zookeeper server on your single-node cluster using the below commands:

$ cd /usr/local/kafka
$ bin/zookeeper-server-start.sh config/zookeeper.properties

After the Zookeeper server is up, start the Kafka server:

$ bin/kafka-server-start.sh config/server.properties

After the Kafka server is running, you can create topics and then pass messages from a producer to a consumer. The steps above cover a Kafka installation for a single-node/pseudo-distributed setup on top of the Hadoop ecosystem.
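As a sketch of that next step, a topic can be created and exercised with the console producer and consumer. The topic name `test` and the default `localhost` ports are assumptions matching a fresh single-node setup (newer Kafka versions replace `--zookeeper` with `--bootstrap-server` for topic creation):

```shell
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic test
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
```

Messages typed into the producer terminal should appear in the consumer terminal.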

Summary: Apache Kafka installation on a Linux/Ubuntu operating system is very simple, and so is using it. If you need it on Cloudera, you have to download a separate bundle and set it up through Cloudera Manager in a multi-node cluster; on Hortonworks, Kafka is installed through Ambari.


Most frequently asked Apache PIG Interview Questions and Answers[Updated]

1. Are there any problems which can only be solved by MapReduce and cannot be solved by Apache PIG? In what scenarios are MapReduce jobs more useful than PIG?

Take a scenario where we want to count the population in two cities. We have a data set from a sensor with a list of different cities, and we want to count the population of two of them using MapReduce. Let us assume one is Hyderabad and the other is Bangalore. I need the key for Hyderabad to be treated the same as the key for Bangalore, so that the population data of these two cities reaches one reducer. The idea is to instruct the MapReduce program: whenever you find a city with the name “Hyderabad” or a city with the name “Bangalore”, create an alias that is a common name for these two cities, so that both cities share a common key and get passed to the same reducer. For this, we have to write a custom partitioner.

In MapReduce, when you create the ‘key’ for a record you would take ‘city’ as the key, and whenever the framework comes across a different city it treats it as a different key; to route Hyderabad and Bangalore together you need a customized partitioner that maps both to the same hashcode. You cannot create such a custom partitioner in Pig: Pig is a scripting layer, not a framework in which we can direct the execution engine to customize the partitioner. In these types of scenarios, MapReduce works better than Apache PIG.

2. What is the difference between MapReduce and Apache PIG?

In the Hadoop ecosystem, MapReduce requires you to write the entire logic for operations like join, group, filter, etc.
Pig has built-in functions for these operations, unlike MapReduce.
In terms of code size, roughly 20 lines of Pig Latin equal about 400 lines of Java.
Pig gives high productivity compared to MapReduce programming.
MapReduce needs much more effort in writing the code.

3. Why should we use the ‘distinct’ keyword in PIG scripts?

In Pig scripts the distinct keyword is very simple: it removes duplicate records. Distinct works only on entire records, not on individual fields, as in the example below:
users = load 'daily' as (email, name);
uniq = distinct users;   -- de-duplicates whole (email, name) records
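A distinct script like this can be sketched end-to-end in Pig's local mode; the script name, the input file `daily`, and the comma delimiter are assumptions for illustration:

```shell
$ cat > distinct_users.pig <<'EOF'
users = load 'daily' using PigStorage(',') as (email:chararray, name:chararray);
uniq = distinct users;
dump uniq;
EOF
$ pig -x local distinct_users.pig
```

Local mode (`-x local`) runs against the local file system, so no Hadoop cluster is needed to try it.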

4. What is the difference between Pig and SQL?
Apache Pig and SQL have many differences; here are the main ones:

Pig is procedural, whereas SQL is declarative.
Pig is used for OLAP workloads, whereas SQL covers both OLAP and OLTP.
In Pig the schema is optional, whereas SQL requires a schema.

Unable to Integrate Hive with Spark and different resolutions

How to integrate (connect) Hive and Spark:

Here are solutions for how to integrate (connect) the Hive database with Spark in Hadoop development.
The first time we tried to connect Hive and Spark, we got the error below, and we found different types of resolutions for different modes.

Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke 
the "BONECP" plugin to create a ConnectionPool gave an error: The specified 
datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please 
check your CLASSPATH specification, and the name of the driver.

Different types of solutions for the above error:

Resolution 1:

1. Download the MySQL Connector/J jar file from the official Maven repository.
2. Paste the jar file into the jars folder present in the Spark installation directory.

Resolution 2:

Without a JDBC driver:

1. Go to hive-site.xml and set the hive.metastore.uris property in that file.
2. Import org.apache.spark.sql.hive.HiveContext, as it can perform SQL queries over Hive tables, then define the sqlContext value like the below code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. Finally, verify the tables in Spark SQL.
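The verification step can be sketched from spark-shell as follows; this assumes Spark 1.x (where HiveContext is the Hive entry point) and that the edited hive-site.xml is on Spark's conf path:

```shell
$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("show tables").collect().foreach(println)
```

If the metastore is reachable, the Hive tables are listed; otherwise the CLASSPATH/driver error above reappears.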

Resolution 3:

Go with beeline for the Hive and Spark connection instead of the Hive CLI. Beeline provides higher security and connects to a remote server directly; check the below two commands for beeline with the HiveServer2 configuration.

Step 1: ./bin/beeline
Step 2: !connect jdbc:hive2://remote_hive:10000

Hadoop Cluster Interview Questions and Answers Updated

1. In which directory is Hadoop installed?

Apache Hadoop and Cloudera have the same directory structure. Hadoop is installed in /usr/lib/hadoop.

2. What are the three modes in which Hadoop can be run?
In the Hadoop ecosystem, the three different modes Hadoop can run in are:
A. Standalone (local) mode
B. Pseudo-distributed mode
C. Fully distributed mode

3. What are the features of Standalone mode in a Hadoop environment?
In the Hadoop ecosystem, Standalone mode is the mode with no daemons: everything runs in a single JVM (Java Virtual Machine). There is no DFS (Distributed File System); it utilizes the local file system instead. Standalone mode is suitable only for running MapReduce programs during development.

4. Does Hadoop follow the UNIX pattern?
Yes, Hadoop follows UNIX Pattern.

5. What are the features of Pseudo distributed mode in a Hadoop environment?
In the Hadoop ecosystem, pseudo-distributed mode is used for both the QA and development environments. In pseudo-distributed mode, all the daemons run on the same machine.

6. Can we call VMs as Pseudos?
No, VMs are not Pseudos: a VM is a general-purpose virtual machine, while pseudo-distributed mode is specific to the Hadoop environment.

7. What are the features of Fully distributed mode in a Hadoop environment?
In the Hadoop ecosystem, the fully distributed mode is used for the Production, Development and QA environments, where we have a number of machines forming the Hadoop cluster.

8. What are the default port numbers of Namenode, job tracker and task tracker in Hadoop eco-system?
The port number of the Namenode is “50070”, for the job tracker “50030”, and for the task tracker “50060”.

9. What are the Hadoop configuration files for a Hadoop installation?
There are three files to configure for Hadoop: core-site.xml, hdfs-site.xml and mapred-site.xml.
These files are located in the hadoop/conf/ directory.

10. What happens if you get a “connection refused java exception” when you run Hadoop fsck?
It means that the Namenode is not running on your machine.

11. What does /etc/init.d do?
/etc/init.d is the directory where the daemons' init scripts are placed; it is used to start and stop daemons and to check the status of these daemons.

Most Frequently Asked Apache Storm Interview Questions and Answers

Top 5 Apache Storm Interview Questions:

1. What is the difference between Apache Kafka and Apache Storm?

Apache Kafka is a distributed and robust messaging system that can handle a large volume of data and allows messages to be passed from one endpoint to another. Data streams are partitioned and spread over a cluster of machines to support streams larger than any single machine's capability.

Apache Storm, on the other hand, is a real-time message processing system with which we can edit or manipulate data in real time. Storm typically pulls data from Kafka and applies the required manipulations, making it easy to process streams of data in real time.

2. What are the key benefits of using Storm for Real-Time Processing?

Real fast: a benchmark has clocked Apache Storm at over a million tuples processed per second per node.

Fault-tolerant: Apache Storm detects faults automatically and restarts the failed workers.

Easy to operate: operating Apache Storm is very easy.

3. Does Apache Storm act as a Proxy server?

Yes, Apache Storm can also act as a proxy by using the mod_proxy module, which implements a proxy, gateway or cache.

4. How can you kill a topology in Apache Storm?

Simply run: storm kill {stormname}

Give storm kill the same name that you used when submitting the topology.
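As a sketch (the topology name `my-topology` is hypothetical; `-w` is Storm's optional wait time, in seconds, before the workers are killed):

```shell
$ storm kill my-topology -w 10
```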

5. What are the common configurations in Apache Storm?

In Apache Storm there are many different configurations you can set on a topology. Here are some common ones that are set for a topology:

  1. Config.TOPOLOGY_WORKERS: sets the number of worker processes to use to execute the topology.
  2. Config.TOPOLOGY_ACKER_EXECUTORS: sets the number of executors that track tuple trees and detect when a spout tuple has been fully processed; when this variable is not set (null), it defaults to the number of workers.
  3. Config.TOPOLOGY_MAX_SPOUT_PENDING: sets the maximum number of spout tuples that can be pending on a single spout task at once.
  4. Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS: the maximum amount of time a spout tuple has to be fully processed before it is considered failed.
  5. Config.TOPOLOGY_SERIALIZATIONS: lets you register more serializers with Storm using this config, so that you can use custom types within tuples.


Most Typical Hive Interview Questions and Answers


1. Does Hive support record level Insert, delete or Update?

Hive does not support record-level insert, delete or update, and it does not provide transactions either. However, users can use CASE statements and the built-in functions of Hive to achieve an insert, update or delete effect.

2. What kind of data warehouse applications is suitable for Hive?

Basically, Hive is not a full database; it is a data summarization tool in the Hadoop ecosystem. Hive suits applications where:

I) Fast response times are not required
II) The data is not changing rapidly
III) Relatively static data is analyzed

3. How can the columns of a table in Hive be written to a File?

In Hive, using the awk command in the Hive shell, the output of a HiveQL query can be written to a file.

Example: hive -S -e "describe table_name" | awk -F '\t' '{print $1}' > ~/output
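Since the pipeline only depends on awk, it can be sketched offline with a fabricated two-line stand-in for the describe output (the file names and contents here are hypothetical, purely to show the column extraction):

```shell
# Fake `describe table_name` output: tab-separated column name and type.
printf 'id\tint\nname\tstring\n' > /tmp/describe_out.txt

# Same awk filter as above: keep only the first (column-name) field.
awk -F '\t' '{print $1}' /tmp/describe_out.txt > /tmp/columns.txt

# /tmp/columns.txt now contains the two column names, one per line.
cat /tmp/columns.txt
```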

4. What is the difference between ORDER BY and SORT BY in Hive?

In Hive, SORT BY sorts the data within each reducer; any number of reducers can be used for SORT BY operations.
ORDER BY sorts all of the data together, which has to pass through one reducer. Thus ORDER BY in Hive uses a single reducer and guarantees total order in the output, while SORT BY only guarantees ordering of the rows within each reducer.
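A quick sketch in the Hive shell (the `sales` table and `amount` column are hypothetical):

```
hive> SELECT * FROM sales SORT BY amount;    -- ordered within each reducer only
hive> SELECT * FROM sales ORDER BY amount;   -- one reducer, total order
```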

5. Whenever I run a Hive query from a different directory, it creates a new metastore_db; please explain the reason for it.

Whenever you run Hive in embedded mode, it creates a local metastore in the current working directory, and before creating it Hive checks whether a metastore already exists there. This behaviour is defined in the hive-site.xml configuration file by the property

"javax.jdo.option.ConnectionURL" with the default value "jdbc:derby:;databaseName=metastore_db;create=true"

6. Is it possible to use the same metastore for multiple users, in the case of an embedded Hive?

No, with the embedded metastore it is only usable by a single user at a time. To share a metastore between multiple users, you must configure a standalone metastore backed by a database like PostgreSQL or MySQL.

Kafka Interview Questions and Answers


1. What is Kafka?

Kafka is an open-source message broker project coded in Scala and Java. Kafka was originally developed by LinkedIn and was open-sourced in early 2011.

2. Which are the components of Kafka?

The major components of Kafka are:

Topic: a group of messages that belong to the same type.

Producer: using the producer, one can publish messages to a topic.

Consumer: pulls data from the brokers.

Brokers: the servers where the published messages are stored.

3. What role does Zookeeper play in a cluster of Kafka?

Kafka is an open-source, distributed system built to use Zookeeper. The basic responsibility of Zookeeper is to build coordination between the different nodes in a cluster. Zookeeper stores the periodically committed offsets, so that if any node fails, it can recover from the previously committed offset. Zookeeper is also responsible for configuration management, leader detection, detecting whether any node leaves or joins the cluster, and synchronization.

4. Distinguish between the Kafka and Flume?

Flume’s major use-case is ingestion into Hadoop: it is incorporated with Hadoop’s monitoring system, file formats, file systems and utilities, and it is used for Hadoop integration. Flume is the best option when you have non-relational data sources. Kafka, in contrast, is a distributed publish-subscribe messaging system. Kafka was not developed specifically for Hadoop, so using Kafka to read and write data to Hadoop is considerably more involved than with Flume. Kafka is a highly reliable and scalable enterprise messaging system for connecting multiple systems.

5. Is it possible to use Kafka without Zookeeper?

No, it is impossible to use Kafka without Zookeeper, because it is not possible to bypass Zookeeper and connect directly to the server. If Zookeeper is down, Kafka will not be able to serve any client request.

6. How to start a Kafka Server?

Since Kafka uses Zookeeper, we have to start the Zookeeper server first. One can use the convenience script packaged with Kafka to get a single-node Zookeeper instance:

> bin/zookeeper-server-start.sh config/zookeeper.properties

Now the Kafka server can start:

> bin/kafka-server-start.sh config/server.properties

Spark & Scala Interview Questions and Answers

1. What is Scala, what is its importance, and what is the difference between Scala and other programming languages (Java/Python)?

Scala is a powerful language for developing big data applications. Scala provides several benefits that achieve significant productivity, and it helps to write robust code with fewer bugs. Apache Spark is written in Scala, so Scala is a natural fit for developing Spark applications.

2. What is an RDD? Tell me in brief.

A Spark RDD (Resilient Distributed Dataset) is the primary abstraction in the Spark API. An RDD is a collection of partitioned data elements that can be operated on in parallel. RDDs support properties like immutability, cacheability, type inference, and lazy evaluation.

Immutable: RDDs are immutable data structures; once created, an RDD cannot be modified.

Partitioned: the data in an RDD is partitioned across the distributed cluster of nodes. However, multiple Cassandra partitions can be mapped to one single RDD partition.

Fault tolerance: RDDs are designed to be fault-tolerant. The RDD data is stored across a large distributed cluster, so there is a chance of node failure in that cluster, by which we could lose the partitioned data on that node.

RDDs handle node failure automatically. Spark maintains the metadata and lineage details of each RDD, so using that information the lost partitions can be recomputed from the data on other nodes.

Interface: RDD provides a uniform interface for processing data from a variety of data sources such as HDFS, HBase, Cassandra, MongoDB, and others. The same interface can also be used to process data stored in memory across a cluster of nodes.

In-memory: the RDD class provides the API for enabling in-memory cluster computing; Spark allows RDDs to be cached or persisted in memory.

3. How do you register a temporary table in Spark SQL?

When we create a DataFrame by loading data through the SQLContext object, we can register it as a temporary table with df.registerTempTable("table_name") (createOrReplaceTempView in Spark 2.x). It is treated as a temporary table because the scope of the registration is limited to the particular session.

4. How to count the number of lines in Scala?

In the Scala programming language, we can count lines using the getLines.size property of a Source:

Example: val countLines = source.getLines.size