Text mining in big data analysis
This is my first blog post, and I would like to start by sharing what I know about text mining.
Wondering why the word “mining” appears in text analysis? Well, text mining basically means extracting useful information from a pool of unstructured text. Its purpose is to process unstructured text so that the information it contains becomes accessible to statistical algorithms. It also helps in summarizing documents based on the words they contain. Since most of the information stored or available online is in the form of text, text mining has high commercial potential.
Some of the software tools that can be used for text mining are SAS, R, RapidMiner, IBM LanguageWare, etc. Since I have worked on text mining using R, I would like to share my experience. R has a package called “tm” (short for text mining); once it is installed, the mining operations can be done without any hassle. I also used R to generate a word cloud from the processed text. A “word cloud” is a pictorial representation of the words used in a text document, drawn so that the most frequently repeated words can be easily identified!
In general, text mining is nothing but “turning text into numbers”, which can later be used in predictive data mining projects.
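To make “turning text into numbers” concrete, here is a minimal sketch in base R that counts how often each word occurs in each document. The two documents are made up for illustration, and the splitting rule (anything that is not a letter separates words) is a simplifying assumption.

```r
# Toy documents (made up for illustration)
docs <- c(d1 = "text mining turns text into numbers",
          d2 = "numbers feed predictive models")

# Split each document into lower-case words
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: every distinct word across all documents
vocab <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per word, one column per document
tdm <- sapply(tokens, function(words) {
  as.vector(table(factor(words, levels = vocab)))
})
rownames(tdm) <- vocab

print(tdm)
```

Each cell holds a count, for example the word “text” appears twice in d1 and not at all in d2. A matrix like this is the kind of numeric input a statistical algorithm can work with.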
Many organizations use text mining to derive business insights from text-based content such as Word documents, email and postings on social media streams like Facebook, Twitter and LinkedIn. I hope you have heard of the term “sentiment analysis”; if not, let me explain. It is one of the important applications of text mining: a large amount of textual data gathered from social networks such as Facebook and Twitter, or in fact even WhatsApp chats with your besties, can be used as a data source for sentiment analysis. The result helps in understanding public opinion on a given topic and can also be used to gauge feedback (positive or negative) on a product, thus helping retail businesses make decisions accordingly.
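As a rough illustration of the idea, one very simple way to score sentiment is to count positive and negative words. The sketch below is only that: the two word lists and the sample reviews are hypothetical, and real projects normally rely on published lexicons or trained models.

```r
# Tiny hand-made word lists (hypothetical, for illustration only)
positive_words <- c("good", "great", "love", "excellent", "happy")
negative_words <- c("bad", "poor", "hate", "terrible", "sad")

# Score = number of positive words minus number of negative words
sentiment_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

reviews <- c("I love this phone, the camera is great",
             "Terrible battery and poor support",
             "It arrived on Tuesday")

scores <- sapply(reviews, sentiment_score, USE.NAMES = FALSE)
print(scores)  # 2 -2 0
```

A score above zero suggests positive feedback, below zero negative, and zero neutral. Word counting like this ignores negation and sarcasm, so treat it as a starting point.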
Now, I will briefly tell you about one application of text mining: Twitter analysis using R.
Twitter analysis helps us in understanding public opinion on any given topic.
I did it using R; let me explain the code to you!
Several packages need to be installed in R and loaded before we get to the coding part:
library(twitteR) # Provides an interface to the Twitter web API.
library(RCurl) # General HTTP client; handles requests to and responses from the web server.
library(stringr) # Helper functions for string manipulation.
library(tm) # A framework for text mining applications within R.
library(wordcloud) # For generating the word cloud.
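The library() calls above fail if a package has not been installed yet. Here is a small sketch for checking which of them are missing; the helper name missing_packages is my own, and the install step assumes the packages are available from your configured CRAN mirror.

```r
# Packages loaded by the library() calls in this post
required <- c("twitteR", "RCurl", "stringr", "tm", "wordcloud")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing <- missing_packages(required)
print(missing)

# Uncomment to install them (needs internet access and a CRAN mirror):
# if (length(missing) > 0) install.packages(missing)
```

Running this once before the library() calls saves you from hitting a "there is no package called ..." error halfway through.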
Twitter provides an API for fetching tweets, but using it requires a consumer key and a consumer secret (plus an access token and an access token secret), which can be obtained from https://apps.twitter.com.
Follow these steps to obtain the keys:
Log in > create an application > fill in the essential details
The screenshot is given for assistance.
consumer_key = 'CcuMJOVXpumqbSfD50avfKAsL' # consumer key
consumer_secret = 'KcmuacvfzyyJ68sfFdmyiJaUuz1gf9ah95LPw78pZ1aU3VLY3j' # consumer secret
access_token = '2243190881-8Hu6RYBcxt9hfuCUivOnnY1azf1nplTTQ0rwtUk' # access token
access_secret = 'YOUR_ACCESS_TOKEN_SECRET' # access token secret (placeholder: substitute your own)
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret) # sets up authorisation for the session
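Keys typed directly into a script are easy to leak when the script is shared. As an alternative, here is a minimal sketch that reads a key from an environment variable instead. The helper get_credential and the variable name TWITTER_CONSUMER_KEY are hypothetical names of my own choosing, and the dummy value is set only so the sketch runs on its own.

```r
# Hypothetical helper: read a credential from an environment variable
# and stop early with a clear message if it is not set.
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set")
  }
  value
}

# Demo only: set a dummy value so this sketch is self-contained.
# In practice the variable would be set outside the script.
Sys.setenv(TWITTER_CONSUMER_KEY = "dummy-key")

consumer_key <- get_credential("TWITTER_CONSUMER_KEY")
```

The other three values can be read the same way and then passed to setup_twitter_oauth() exactly as before.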
test_tweets_d = searchTwitter("obama", n=25, lang="en") # searches Twitter for the given term, here "obama", and returns up to 25 English tweets
test_tweets_d_text <- sapply(test_tweets_d, function(x) x$getText()) # sapply applies a function to each element of a list or vector; here it extracts the text of each tweet
y = unlist(test_tweets_d_text, recursive = TRUE) # flattens the result into a plain character vector
docs <- Corpus(VectorSource(y)) # creates a corpus with one document per tweet
dtm <- TermDocumentMatrix(docs) # creates a term-document matrix (terms in rows, tweets in columns)
m <- as.matrix(dtm) # converts it to an ordinary matrix
v <- sort(rowSums(m), decreasing=TRUE) # total count of each term, most frequent first
d <- data.frame(word = names(v), freq=v) # word/frequency table used for plotting
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2")) # creates a word cloud of the text from the tweets
The word cloud generated from the tweets is as shown.
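Raw tweets usually carry links, mentions and punctuation, which can clutter a word cloud. Here is a small base R sketch for cleaning tweet text before it is turned into a corpus. The function name clean_tweet and the sample tweet are made up for illustration, and dropping hashtags entirely is just one possible choice.

```r
# Hypothetical helper: remove common tweet noise before counting words
clean_tweet <- function(text) {
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)  # links
  text <- gsub("[@#]\\w+", " ", text, perl = TRUE)       # mentions and hashtags
  text <- gsub("\\bRT\\b", " ", text, perl = TRUE)       # retweet marker
  text <- gsub("[^A-Za-z ]", " ", text, perl = TRUE)     # anything that is not a letter or space
  text <- tolower(text)
  trimws(gsub("\\s+", " ", text, perl = TRUE))           # collapse repeated spaces
}

sample_tweet <- "RT @someone: Text mining is FUN!! https://example.com #rstats"
print(clean_tweet(sample_tweet))  # "text mining is fun"
```

Since gsub() works on whole vectors, the function could be applied to all tweets at once, for example y <- clean_tweet(y) just before the Corpus() call.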