Real-time Twitter data analysis using Hadoop ecosystem




In the era of the web, social media has become an integral part of modern society. People use social media to share their opinions and to keep up to date with current trends on a daily basis. Twitter is one of the most renowned social media platforms and receives an enormous number of tweets every day. This information can be used for economic, industrial, social or government purposes by arranging and analyzing the tweets as per our demand.
Since Twitter contains an enormous volume of data, storing and processing this data is a complex problem. In this work, we propose a method for finding recent trends in tweets and perform sentiment analysis on real-time tweets. The analysis is done using Hadoop ecosystem tools, namely Apache Hive and Apache Pig.

Introduction

Consider the popular micro-blogging forum Twitter, where many tweets on widely varying topics are posted every minute; analyzing these tweets and identifying what they express is tedious. In a situation where an organization wants to know its customers' opinions about the products it manufactures, it is very difficult for the organization to keep track of each and every opinion. In this scenario, sentiment analysis plays a very important role. Sentiment analysis identifies the polarity of a text, i.e. the attitude of the user toward a specific topic, and it greatly helps in understanding a person's behavior toward that topic. Importantly, most of the data from the web is in unstructured or semi-structured format. The main challenge lies in processing this kind of data efficiently so that the information it contains can be extracted for further study.

Traditional data analysis techniques have gradually failed to perform analysis effectively on larger data sets. The powerful framework that has recently proved efficient in processing large data sets is Hadoop, which provides both distributed processing and distributed storage of huge data sets. The main core components of the Hadoop framework are MapReduce and HDFS. MapReduce is a programming model for processing large data sets, and HDFS is the Hadoop Distributed File System, which stores data in the form of blocks and distributes them across clusters.

Apart from the core components, Apache also delivers different tools/components to satisfy developers' needs. These tools, along with the core components of Hadoop, are called the Hadoop ecosystem.

The real-time Twitter data is extracted using Apache Flume (https://flume.apache.org/). The extracted data is in JSON (JavaScript Object Notation) format, in which data is depicted as key/value pairs separated by commas. The data stored in HDFS is analyzed using the data access components of the Hadoop ecosystem, Apache Pig (https://pig.apache.org) and Apache Hive (https://hive.apache.org).

Data access components of the Hadoop ecosystem

The two key data access elements of the Hadoop ecosystem are Apache Pig and Apache Hive. Hadoop's basic programming layer is MapReduce, and these components ease the writing of complex Java MapReduce programs. Apache Pig is an abstraction over MapReduce; it provides a high-level procedural data flow language for processing the data, known as Pig Latin. Programmers use Pig Latin to write Pig scripts that analyze the data and are internally converted to MapReduce jobs. Pig reduces the complexity of writing long lines of code, and besides the built-in operators, users can develop their own functions.

Apache Hive is a data warehousing software that addresses how data is structured and queried. Hive has a declarative SQL-like language, HiveQL or HQL, for querying data. Traditional SQL queries can easily be expressed in HiveQL. In Hive, queries are implicitly converted to mapper and reducer jobs. The advantages of using Hive are the features it provides: it is fast, scalable, extensible, etc. People who are not good at programming can also opt for it to analyze data on HDFS.

Related work

In this part, the literature survey focuses on fetching tweets, Hadoop being used efficiently as a feedback and recommendation system, and sentiment analysis of Twitter data.

Fetching tweets

Nowadays, people use social media sites to express their opinions toward some issue or product. Twitter is one of the social media sites that generate an enormous amount of data. Collecting and processing this huge amount of data is challenging. The Twitter Application Programming Interface provides a streaming API to stream real-time data.

Because of a few restrictions set by Twitter on the streaming API, one can download only a limited number of tweets in a given time frame. Therefore, we require an efficient tool to retrieve huge amounts of data from Twitter. Apache Flume is such a tool for retrieving a huge amount of real-time data from Twitter.

This section describes the general framework for capturing and analyzing tweets streamed in real time. As a first step, real-time tweets are collected from Twitter. These tweets are retrieved using the Twitter streaming API, as shown in Figure 1. Flume (https://flume.apache.org/) (Fernandes & Rio D'Souza) is responsible for communicating with the Twitter streaming API and retrieving tweets matching certain trends or keywords. The tweets retrieved by Flume are in JSON format and are passed on to HDFS. This semi-structured Twitter data is given as input to the Pig module as well as the Hive module, which convert the nested JSON data into a structured form suitable for analysis.




Figure 1: System model for capturing and analyzing the tweets.
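The article does not reproduce the Flume configuration itself. A minimal agent definition in the style of the Cloudera Flume Twitter example is sketched below; the source class, credential values, channel sizing and HDFS path are placeholders and assumptions, not the authors' actual settings.

# Sketch of a Flume agent streaming tweets into HDFS (all values are placeholders)
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = <comma-separated trends or keywords>

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream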

Finding recent trends

A trend is a subject of many posts on social media for a short duration of time. Finding recent trends means processing the large amount of data collected over the required period of time.

Finding popular hashtags using Apache Pig

To find the popular hashtags in the given tweets, the tweets are loaded into the Apache Pig module, where they are passed through a series of Pig scripts that find the popular hashtags. The following are the steps to determine the popular hashtags in tweets:

a) Loading the Twitter data into Pig

The raw Twitter data is in JSON format and consists of map data types, that is, data with key and value pairs. For analysis, the tweets stored in HDFS are loaded into the Pig module.
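The article does not show the load statement. A minimal Pig Latin sketch is given below; it assumes the elephant-bird JsonLoader (a commonly used loader for nested Twitter JSON, not named in the article), and the jar names and HDFS path are placeholders.

-- register the JSON loader and its dependencies (jar names/paths are assumptions)
REGISTER 'elephant-bird-pig.jar';
REGISTER 'elephant-bird-core.jar';
REGISTER 'elephant-bird-hadoop-compat.jar';
REGISTER 'json-simple.jar';

-- each record becomes a single map holding the nested JSON of one tweet
load_tweets = LOAD '/user/flume/tweets/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;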

Algorithm 1: Finding popular hashtag

Data: dataset=collection of tweets

Result: popular hashtag

Load tweets from HDFS to Hadoop ecosystem module

for each tweet in module

feature = extract(id, hashtag text)

end

for each feature

count_id = Count(id) per hashtag text

end

popular_hashtag = max(count_id)
b) Feature extraction

This step is known as preprocessing: Twitter messages contain many fields such as id, text, entities, language, time zone, etc. To find popular hashtags, we extract the tweet id and the entities field, which has a member hashtags. This member is used for further analysis together with the tweet id.
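Continuing the sketch above, the projection of the tweet id and the entities member could look like the following (the exact field shapes and casts depend on the loader and are assumptions):

-- keep only the tweet id and the 'entities' object that holds the hashtags
extract_details = FOREACH load_tweets GENERATE (chararray) myMap#'id' AS id, (map[]) myMap#'entities' AS entities;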

c) Extract hashtags

Each hashtag object holds two fields, text and indices, where the text field contains the hashtag. So, to find popular hashtags, we extract the text field. The output of this part is the hashtag followed by the tweet id.

For example, GST; 11769715070618401
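Continuing the sketch, the flatten-and-project step might be written roughly as follows (relation and field names are assumptions, and the casts may need adjustment for the chosen loader):

-- 'hashtags' is an array of objects; FLATTEN gives one row per hashtag object
hashtag_objects = FOREACH extract_details GENERATE id, FLATTEN(entities#'hashtags') AS (tag:map[]);
hashtag_text = FOREACH hashtag_objects GENERATE (chararray) tag#'text' AS hashtag, id;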

d) Counting hashtags

After performing all the above steps, we have hashtags and tweet ids. To find the popular hashtags, we first group the relation by hashtag and then count the number of times each hashtag appears. The hashtags that appear the highest number of times are categorized as popular hashtags or recent trends.
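The grouping and counting described above could then be expressed roughly as:

-- group by hashtag, count occurrences and keep the most frequent ones
grouped = GROUP hashtag_text BY hashtag;
hashtag_count = FOREACH grouped GENERATE group AS hashtag, COUNT(hashtag_text) AS cnt;
popular = ORDER hashtag_count BY cnt DESC;
top_hashtags = LIMIT popular 10;
DUMP top_hashtags;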

Finding recent trends using Apache Hive

Recent trends in real-time tweets can also be found using Hive queries. Since the tweets gathered from Twitter are in JSON format, we have to use a JSON input format to load the tweets into Hive. We have used the Cloudera Hive JsonSerDe for this purpose.

This jar file has to be available in Hive to process the data. It can be added using the following command:

ADD JAR <path to jar file>;

The following steps are performed to find the recent trends:

a) Loading and Feature extraction

The tweets gathered from Twitter are stored in HDFS. In order to work with the data stored in HDFS using HiveQL, an external table is first created, which stores the table definition in the Hive metastore. The following figure shows the query used to create the Twitter table. This query not only creates a schema to store the tweets, but also extracts the required fields, such as id and entities.






Figure 2: Query to create a table in Hive.
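A query of roughly this shape, assuming the Cloudera JsonSerDe mentioned above (the class name com.cloudera.hive.serde.JSONSerDe and the jar path are assumptions) and a hypothetical HDFS location, creates the table and exposes the id and entities fields:

ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;   -- hypothetical jar path

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  entities STRUCT<hashtags: ARRAY<STRUCT<text: STRING>>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';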

b) Extracting Hashtags

In order to extract the actual hashtags from entities, we created another table that contains the id and the list of hashtags. Since multiple hashtags can be present in one tweet, we used a UDTF (User Defined Table-generating Function) to extract each hashtag onto a new row. The outcome of this phase is id and hashtag.
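A sketch of this step using Hive's built-in explode() UDTF (table and column names are assumptions):

CREATE TABLE hashtag_rows AS
SELECT id, ht.text AS hashtag
FROM tweets
LATERAL VIEW explode(entities.hashtags) htable AS ht;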

c) Counting hashtag

After performing all the above steps, we have the id and the hashtag text. A Hive query is written to count the hashtags.
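The counting query could then look like:

SELECT hashtag, COUNT(*) AS cnt
FROM hashtag_rows
GROUP BY hashtag
ORDER BY cnt DESC
LIMIT 10;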

Sentiment analysis using Apache Pig

The tweets from trending topics fetched using Apache Flume are stored in Hadoop's file system. In order to perform sentiment analysis, this data has to be loaded into Apache Pig. The following are the steps to detect the sentiment in tweets using Pig:

a) Extracting the tweets

The Twitter data loaded into Apache Pig is in a highly nested JSON format that consists not only of tweets, but also of other details such as images, URLs, Twitter user id, tweet id, user profile description, the location from where the tweets were posted, tweet posting time, etc. Sentiment analysis is done only on the tweet text. As a preprocessing step for sentiment analysis, we extract the Twitter id and the tweet text from the JSON Twitter data.
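A minimal Pig Latin sketch of this preprocessing step, under the same assumptions as in the hashtag section (elephant-bird JsonLoader, hypothetical HDFS path):

load_tweets = LOAD '/user/flume/tweets/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
-- keep only the tweet id and the tweet text
tweet_text = FOREACH load_tweets GENERATE (chararray) myMap#'id' AS id, (chararray) myMap#'text' AS text;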

Algorithm 2: Sentiment Classification
Data: dataset: = Collection of tweets, list of dictionary words with rating

Result: classification: = positive, negative or neutral

Notations: d_r: dictionary_word_rating, T:Tweets, C:Corpus

While T in C do

while words in tweet do

if word == any word in dictionary then

word_rating = d_r;

continue;

end

end

avg_rating = avg(word_rating)

if avg_rating ≥ 0.0 then

Given tweet is positive

end

else if avg_rating < 0.0 then

Given tweet is negative

end

else

Given tweet is neutral

end

end



b) Tokenizing the tweets

Sentiment analysis means finding the polarity of sentiment words. In order to find sentiment words, we have to split the whole sentence into words. The process of splitting a stream of text into words is known as tokenization. All the tweets collected from the previous step are tokenized and split into words. This tokenized list of words is fed as input for further processing of sentiment analysis.
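Continuing the sketch, tokenization can use Pig's built-in TOKENIZE and FLATTEN (lower-casing is added here as an assumption, to simplify dictionary matching):

-- TOKENIZE splits the text into a bag of words; FLATTEN emits one row per word
tokens = FOREACH tweet_text GENERATE id, FLATTEN(TOKENIZE(LOWER(text))) AS word;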

c) Sentiment word detection

In order to find the sentiment words among the tokenized tweets, we created a dictionary of sentiment words. This dictionary consists of a list of sentiment words rated from +5 to −5 depending on their meaning. Words rated from 1 to 5 are considered positive and words rated from −5 to −1 are considered negative. With the help of this dictionary, the tokenized words are rated. To rate the tokenized words, we performed a map-side join between the tokenized words and the contents of the dictionary.
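A sketch of the dictionary lookup; the dictionary file layout (word, tab, rating) and its path are assumptions, and the replicated join is Pig's map-side join:

dictionary = LOAD '/user/data/sentiment_dictionary.txt' USING PigStorage('\t') AS (dict_word:chararray, rating:int);
-- left outer map-side join: words not found in the dictionary get a null rating
rated = JOIN tokens BY word LEFT OUTER, dictionary BY dict_word USING 'replicated';
word_rating = FOREACH rated GENERATE tokens::id AS id, tokens::word AS word, dictionary::rating AS rating;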

d) Classification of tweets

After performing all the above steps, we have the tweet id, the tweet and its associated word ratings. We group all the rated words by tweet id. We then have to calculate the average rating of each tweet. The average rating, AR, of a tweet is calculated using formula (1), where SWR is the sum of the word ratings and TWP is the total number of words in the tweet.
AR = SWR / TWP (1)

Based on the calculated average rating, we can classify the tweets into positive and negative tweets. Tweets rated above zero are considered positive and tweets rated below zero are treated as negative. Tweets that do not contain any sentiment words are considered neutral.
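The grouping, averaging and classification could then be written as follows (the thresholds follow Algorithm 2, and tweets with no dictionary matches fall out as neutral):

grouped = GROUP word_rating BY id;
-- AR = SWR / TWP: sum of matched word ratings over the total number of words in the tweet
tweet_rating = FOREACH grouped GENERATE group AS id, (double) SUM(word_rating.rating) / COUNT(word_rating) AS ar;
classified = FOREACH tweet_rating GENERATE id, (ar is null ? 'neutral' : (ar >= 0.0 ? 'positive' : 'negative')) AS sentiment;
DUMP classified;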

Sentiment analysis using Apache Hive

The tweets on trending topics stored in HDFS are used for sentiment analysis and processed using HiveQL. The steps are discussed in the following sections.

a) Loading the tweets and feature extraction

In order to perform sentiment analysis, we first need to load the tweets into the Hive warehouse. To do that, we create an external Hive table over the same HDFS directory where the tweets are stored. For sentiment analysis we only need the tweet id and text, so we create a table with these two fields. The figure below shows the table created to store the Twitter data.
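A hedged sketch of such a table definition, reusing the same (assumed) JsonSerDe class and HDFS location as before:

CREATE EXTERNAL TABLE tweets_text (
  id BIGINT,
  text STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';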

Figure 3: Hive table to store Twitter data.

b) Tokenizing the tweets

In order to find the sentiment words, each tweet is split into words using one of the built-in Hive UDFs. A Hive table is created to store the tweet id and the array of words present in each tweet. As multiple words are present in the array, we used a built-in UDTF to extract each word from the array and create a new row for every word. Another table is created to store the id and the word.
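A sketch of this step; split() and explode() are built-in Hive functions, while the table names and the simple whitespace split are assumptions:

CREATE TABLE tweet_words AS
SELECT id, lower(word) AS word
FROM tweets_text
LATERAL VIEW explode(split(text, ' ')) wtable AS word;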

c) Sentiment word detection

Sentiment analysis is done using a dictionary-based method. A table is created to store the contents of the dictionary.

In order to rate the tokenized words, the tokenized words have to be mapped against the loaded dictionary. We perform a left outer join between the table that contains id and word and the dictionary table: if a word matches a sentiment word in the dictionary, a rating is given to the matched word, otherwise a NULL value is assigned. A Hive table is created to store id, word and rating.
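A sketch of the dictionary table and the left outer join (file layout and paths are assumptions):

CREATE EXTERNAL TABLE dictionary (
  word STRING,
  rating INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/data/dictionary';

CREATE TABLE word_rating AS
SELECT t.id, t.word, d.rating
FROM tweet_words t
LEFT OUTER JOIN dictionary d ON (t.word = d.word);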

d) Classification of tweets

After performing all the above steps, we have id, word and rating. A group-by operation is then performed on id to group all the words belonging to one tweet, after which the average of the ratings given to the words in a tweet is computed. Based on the average ratings, the tweets are classified into positive and negative.
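The final grouping and classification might then be expressed as follows, again computing AR = SWR / TWP and treating tweets with no matched words as neutral:

SELECT id,
       SUM(rating) / COUNT(*) AS avg_rating,
       CASE
         WHEN SUM(rating) IS NULL THEN 'neutral'
         WHEN SUM(rating) / COUNT(*) >= 0.0 THEN 'positive'
         ELSE 'negative'
       END AS sentiment
FROM word_rating
GROUP BY id;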

Performance evaluation

We have used the Pig Latin and HiveQL languages to process real-time tweets. Depending on the type of data and the purpose, developers can choose either Pig or Hive. Some differences are: Pig is mainly used by researchers and programmers, whereas Hive is an ETL (Extract-Transform-Load) tool used by data analysts to generate reports. Hive runs on the server side of the cluster, while Pig operates on the client side. In Hive, we need to define the table beforehand and store the schema details, whereas Pig has no dedicated metadata database or schemas. Pig supports complex data types such as Map, Tuple and Bag and provides high-level operators, for example for ordering and filtering, which help in structuring semi-structured data, whereas Hive is more suitable for structured data. Tweets are stored in a semi-structured format, and we processed them using both Pig and Hive. Both give the same results, but Pig executed faster than Hive. The execution of Pig is faster because its architectural design supports nested data structures and provides high-level operators for processing semi-structured data.

Execution time for finding recent trends

Finding recent trends is the first stage of the proposed work. The steps followed to find the recent trends were discussed above. We executed this method on both Pig and Hive and measured the execution time. The figure below shows the time taken for execution by Pig and Hive: the solid line indicates the time taken for counting hashtags by the Hive framework and the dashed line indicates the time taken by the Pig framework. From the figure, we can conclude that the Hive module takes longer to execute than Pig.

Figure 4: Execution time for finding recent trends under Pig vs Hive.