Text mining in big data analysis

This is my first blog post, and I would like to start by sharing what I know about text mining.

Wondering why the word “mining” appears in text analysis? Text mining is essentially the extraction of useful information from a pool of unstructured text. Its purpose is to process unstructured textual information so that the information it contains becomes accessible to statistical algorithms. It also helps in summarizing documents based on the words they contain. Since most of the information stored or available online is in the form of text, text mining has high commercial potential.

Software that can be used for text mining includes SAS, R, RapidMiner and IBM LanguageWare. Since I have worked on text mining using R, I would like to share my experience. R has a package called “tm” (short for text mining); once it is installed, the mining operations can be done without any hassle. I also used R to generate a word cloud from the processed text. A word cloud represents the words used in a text document pictorially, so that the most frequently repeated words can be easily identified.

In general, text mining is nothing but “turning text into numbers”, which can later be used in predictive data mining projects.
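To make “turning text into numbers” concrete, here is a minimal sketch in base R. The three sentences are made-up examples, and the variable names (`docs`, `vocab`, `tdm`, `word_freq`) are my own choices for illustration. It counts how often each word occurs in each document, which is the same kind of table the tm package builds for you.

```r
# Three toy "documents" (made up for illustration).
docs <- c("text mining turns text into numbers",
          "numbers feed predictive models",
          "mining text is fun")

# Tokenise: lower-case, then split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: every distinct word across all documents.
vocab <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per word, one column per document.
tdm <- vapply(tokens,
              function(words) tabulate(match(words, vocab), nbins = length(vocab)),
              integer(length(vocab)))
rownames(tdm) <- vocab
colnames(tdm) <- paste0("doc", seq_along(docs))

# Overall frequency of each word, most frequent first.
word_freq <- sort(rowSums(tdm), decreasing = TRUE)

print(tdm)
print(head(word_freq, 3))
```

Each column of `tdm` is now a numeric description of one document, and that is the form statistical algorithms can work with.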

Many organizations use text mining to derive business insights from text-based content such as Word documents, email and posts on social media platforms like Facebook, Twitter and LinkedIn. You may have heard of the term “sentiment analysis”; if not, let me explain. It is one of the important applications of text mining. Large amounts of textual data gathered from social networks such as Facebook or Twitter, or even WhatsApp chats with your best friends, can be used as a data source for sentiment analysis. The result helps in understanding public opinion on a given topic, and it can also be used to gauge feedback (positive or negative) on a product, helping retail businesses make decisions accordingly.
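As a rough illustration of the idea, here is a minimal word-counting sketch of sentiment scoring in base R. The word lists and the feedback sentences are tiny hand-made assumptions, not a published lexicon, and `sentiment_score` is a hypothetical helper name. Real projects use much larger lexicons or trained models.

```r
# Tiny hand-made word lists (assumptions for illustration only).
positive_words <- c("good", "great", "love", "excellent", "happy")
negative_words <- c("bad", "poor", "hate", "terrible", "sad")

# Score = number of positive words minus number of negative words.
sentiment_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

# Made-up product feedback.
feedback <- c("I love this phone, the camera is great",
              "Terrible battery and poor support",
              "It arrived on Monday")

scores <- vapply(feedback, sentiment_score, integer(1), USE.NAMES = FALSE)
labels <- ifelse(scores > 0, "positive",
                 ifelse(scores < 0, "negative", "neutral"))

print(data.frame(feedback, scores, labels))
```

A score above zero is read as positive feedback, below zero as negative, and zero as neutral. This simple counting ignores negation ("not good") and sarcasm, so treat it only as a starting point.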

Now I will briefly walk through one application of text mining: Twitter analysis using R.

Twitter analysis helps us understand public opinion on any given topic.

I did it using R; let me explain the code.

Several packages need to be installed and loaded in R before we get to the coding part:


library(twitteR)   # Provides an interface to the Twitter web API.
library(RCurl)     # HTTP client used to communicate with the web server.
library(RJSONIO)   # Converts data to and from JavaScript Object Notation (JSON).
library(stringr)   # Consistent helpers for common string manipulation.
library(tm)        # A framework for text mining applications within R.
library(wordcloud) # Generates word clouds.
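The library() calls fail if a package has not been installed yet. Here is a small sketch for checking that first. `missing_packages` is a hypothetical helper name of my own, and the install line is left commented out because it assumes the packages are still available from your configured repository.

```r
# Packages used in this post.
required <- c("twitteR", "RCurl", "RJSONIO", "stringr", "tm", "wordcloud")

# Return the packages from `pkgs` that cannot be found on this machine.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!found]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
} else {
  message("All required packages are installed.")
}
```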

Twitter provides an API for retrieving tweet text, but a consumer key and consumer secret, along with an access token and access secret, are required to use it. These can be obtained from https://apps.twitter.com.

Follow these steps to obtain the keys:

Log in > create an application > fill in the essential details

The screenshot is given for assistance.

# Replace the placeholders with the values from your own Twitter application; never publish real keys.
consumer_key    <- "YOUR_CONSUMER_KEY"     # consumer key
consumer_secret <- "YOUR_CONSUMER_SECRET"  # consumer secret
access_token    <- "YOUR_ACCESS_TOKEN"     # access token
access_secret   <- "YOUR_ACCESS_SECRET"    # access token secret
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)  # sets up authorisation with Twitter
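Keys typed into a script are easy to leak when the script is shared. One common alternative is to keep them in environment variables and read them at run time. The sketch below assumes four variable names (`TWITTER_CONSUMER_KEY` and so on) that I made up for illustration, and `read_twitter_keys` is a hypothetical helper, not part of twitteR.

```r
# Read the four Twitter credentials from environment variables.
# The variable names are assumptions; use whatever names you set yourself.
read_twitter_keys <- function() {
  vars <- c(consumer_key    = "TWITTER_CONSUMER_KEY",
            consumer_secret = "TWITTER_CONSUMER_SECRET",
            access_token    = "TWITTER_ACCESS_TOKEN",
            access_secret   = "TWITTER_ACCESS_SECRET")
  keys <- vapply(vars, Sys.getenv, character(1), unset = "")
  if (any(keys == "")) {
    stop("Missing environment variables: ",
         paste(vars[keys == ""], collapse = ", "))
  }
  keys
}

# Usage (commented out because it needs real keys and the twitteR package):
# keys <- read_twitter_keys()
# setup_twitter_oauth(keys[["consumer_key"]], keys[["consumer_secret"]],
#                     keys[["access_token"]], keys[["access_secret"]])
```

The variables can be set in a `.Renviron` file or in the shell before starting R, so the script itself never contains the secrets.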

test_tweets_d <- searchTwitter("obama", n = 25, lang = "en")  # searches Twitter for the given term, here "obama"
test_tweets_d_text <- sapply(test_tweets_d, function(x) x$getText())  # sapply applies a function to each element of a list or vector
y <- unlist(test_tweets_d_text, recursive = TRUE)  # flattens the result into a plain character vector
docs <- Corpus(VectorSource(y))  # creates a corpus from the tweet text
dtm <- TermDocumentMatrix(docs)  # creates a term-document matrix (terms in rows, documents in columns)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)  # total frequency of each term, most frequent first
d <- data.frame(word = names(v), freq = v)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))  # draws a word cloud of the tweet text


The word cloud generated from Twitter is as shown.
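The code above passes the raw tweets straight into the corpus, so links, @mentions and very common words can end up in the cloud. Here is a minimal base R sketch of a cleaning step that could run before building the corpus. The two tweets are made up, `clean_tweet` is a hypothetical helper, and the stop-word list is a tiny hand-made assumption.

```r
# Normalise one or more tweets: lower-case, then strip links, mentions,
# the retweet marker, and everything that is not a letter.
clean_tweet <- function(x) {
  x <- tolower(x)
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)  # drop links
  x <- gsub("@\\w+", " ", x, perl = TRUE)          # drop @mentions
  x <- gsub("\\brt\\b", " ", x, perl = TRUE)       # drop the retweet marker
  x <- gsub("[^a-z ]", " ", x, perl = TRUE)        # keep letters and spaces only
  x <- gsub(" +", " ", x, perl = TRUE)             # collapse repeated spaces
  trimws(x)
}

# Made-up tweets for illustration.
tweets <- c("RT @user: Obama speaks at 5pm! https://t.co/abc #news",
            "Obama news: more at http://example.com")

# Tiny hand-made stop-word list (an assumption, not a standard list).
stop_words <- c("a", "and", "at", "in", "is", "more", "of", "the", "to")

cleaned <- clean_tweet(tweets)
words <- unlist(strsplit(cleaned, " "))
words <- words[!words %in% stop_words]
freq <- sort(table(words), decreasing = TRUE)

print(cleaned)
print(freq)
```

The `cleaned` vector can be passed to `Corpus(VectorSource(cleaned))` in place of the raw text. With tm loaded, similar cleaning can also be done on the corpus itself using `tm_map()` with transformations such as `removePunctuation` and `removeWords` together with `stopwords("english")`.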




















