Twitter tweet analysis

Hi there! Welcome to the week 3 session.

Today we will discuss Twitter tweet analysis.

We will analyze the following tweet trends:

  • Distribution of tweets over time (by year, month, week, weekday, etc.)
  • Tweeting habits, such as how many tweets are posted around midnight
  • Categorizing tweets: tweets with hashtags, retweets, and replies
  • Analyzing the number of characters used per tweet

Prerequisites:

  1. A Windows/Mac/Linux machine with r-base and RStudio installed (if you don’t have them yet, you can refer to my previous post and get them on your PC).
  2. A basic understanding of R data types and syntax.
  3. And finally, YOU.

The very first thing to do is create the data set for our analysis. For this we need to download our Twitter archive. Follow the instructions below to get yours.

Step 1:

Navigate to your Twitter account settings page by following this link.

Step 2:

Request your Twitter archive by clicking the Request Your Archive link.

Twitter will send the archive by email. Check the inbox of the email address associated with your Twitter account and download the archive file.

Step 3:

Extract the zipped file and find the tweets.csv file. Copy it to your working directory.

By default, RStudio uses your Documents folder as the working directory, but you can change it by running the setwd() command in RStudio.

Example:

setwd("C:/Users/Sharath/Downloads")
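
Before reading the file, it can help to confirm that R really sees it. Here is a minimal sketch using only base R functions; the file name tweets.csv is the one from Step 3, and nothing here changes your data.

```r
# Show the directory R is currently working in.
getwd()

# TRUE once tweets.csv has been copied there; FALSE means read.csv() would fail.
file.exists("tweets.csv")

# List any CSV files R can see in the working directory.
list.files(pattern = "\\.csv$")
```

If file.exists() returns FALSE, either copy the file into the directory shown by getwd() or point setwd() at the folder that holds it.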

So, now we have the data source. Let us jump into the R code.

We need three packages: ggplot2, lubridate, and scales.

Install them first.

install.packages("ggplot2")
install.packages("lubridate")
install.packages("scales")
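
If you rerun the script often, you may prefer to install only what is missing. The sketch below is one possible way to do that; missing_packages() is a hypothetical helper name I made up for illustration, not part of any package.

```r
# Hypothetical helper: returns the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2", "lubridate", "scales")
to_install <- missing_packages(needed)

# Install only in an interactive session (such as RStudio) and only if needed.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

The interactive() guard is a design choice for this sketch: it keeps a non-interactive run from trying to reach CRAN without a mirror configured.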

Let us load those libraries.

library(ggplot2)
library(lubridate)
library(scales)

Read the data from tweets.csv:

tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
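
To see what read.csv() does with blank fields, you can try it on a tiny made-up CSV. The two rows below are invented for illustration, and the column subset is an assumption based on the column names used later in this post; your real tweets.csv has more columns.

```r
# Two made-up rows using column names that appear later in this post.
toy_csv <- 'in_reply_to_status_id,timestamp,text,retweeted_status_id
,2016-01-15 04:30:00 +0000,"Hello, #rstats",
42,2016-01-16 18:05:12 +0000,"@someone thanks!",'

toy <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# Column types as R inferred them; text stays character, not factor.
str(toy)

# Blank id fields are read as NA, which the later is.na() checks rely on.
is.na(toy$in_reply_to_status_id)
```

Running str(tweets) on your own data in the same way is a quick check that the columns used below are present.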

Convert the timestamp column to a date-time object, then shift it to your local time zone (America/Chicago here; substitute your own):

tweets$timestamp <- ymd_hms(tweets$timestamp)
tweets$timestamp <- with_tz(tweets$timestamp, "America/Chicago")
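
The effect of these two lines is easier to see on a single value. In this sketch the timestamp is made up for illustration; ymd_hms() parses it as UTC by default, and with_tz() then shows the same instant on the Chicago clock.

```r
library(lubridate)

# One made-up timestamp, parsed as UTC (the default time zone of ymd_hms()).
utc_time <- ymd_hms("2016-01-15 04:30:00")

# The same instant, displayed on Chicago's clock (UTC-6 in January).
local_time <- with_tz(utc_time, "America/Chicago")

local_time
hour(local_time)         # 22: locally the tweet falls on the previous evening
utc_time == local_time   # TRUE: only the display zone changed, not the instant
```

This matters for the plots that follow, because the hour and weekday of a tweet are read from the local clock time.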

Now let us analyze your tweeting trend, such as when you tweet the most.

#basic histogram showing the distribution of my tweets over time
ggplot(data = tweets, aes(x = timestamp)) +
geom_histogram(aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Time") + ylab("Number of tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
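
If you do not set a bin width, geom_histogram() picks a default number of bins for you. Here is a small sketch of setting it yourself, assuming made-up timestamps; on a date-time axis the bin width is expressed in seconds.

```r
library(ggplot2)

# Made-up timestamps: one every 12 hours through 2015 (illustration only).
toy <- data.frame(
  timestamp = seq(as.POSIXct("2015-01-01", tz = "UTC"),
                  as.POSIXct("2015-12-31", tz = "UTC"), by = "12 hours")
)

# For a date-time x axis the bin width is given in seconds, so this is 30 days.
thirty_days <- 30 * 24 * 60 * 60

p <- ggplot(toy, aes(x = timestamp)) +
  geom_histogram(binwidth = thirty_days) +
  xlab("Time") + ylab("Number of tweets")
p
```

Roughly monthly bins tend to read well for an archive spanning several years, but the right width depends on how much you tweet.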

 

#tweets by year
ggplot(data = tweets, aes(x = year(timestamp))) +
  geom_histogram(breaks = seq(2007.5, 2016.2, by = 1), aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

 

#group by week days
#(geom_bar, because wday(label = TRUE) returns a factor and geom_histogram needs a continuous x)
ggplot(data = tweets, aes(x = wday(timestamp, label = TRUE))) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Day of the Week") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

 

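#chi-square test to test the distribution of my tweets over week days
#(null hypothesis: every day of the week is equally likely)
chisq.test(table(wday(tweets$timestamp, label = TRUE)))

#ratio of the average count on Monday-Thursday to the average count on Sunday, Friday and Saturday
myTable <- table(wday(tweets$timestamp, label = TRUE))
mean(myTable[c(2:5)])/mean(myTable[c(1,6,7)])

#chi-square test against a 5:4 ratio (Monday-Thursday get 5 parts each, the other days 4 parts each)
chisq.test(table(wday(tweets$timestamp, label = TRUE)), p = c(4, 5, 5, 5, 5, 4, 4)/32)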
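
To make the p argument less mysterious, here is a sketch with made-up weekday counts (they are not real results). The counts are chosen to match the 4:5 ratio exactly, so the second test has nothing left to explain.

```r
# Made-up weekday counts, Sunday first to match the default order of wday().
counts <- c(Sun = 40, Mon = 50, Tue = 50, Wed = 50, Thu = 50, Fri = 40, Sat = 40)

# Null hypothesis 1: every day is equally likely.
uniform_test <- chisq.test(counts)

# Null hypothesis 2: Monday to Thursday get 5 parts each, the other days 4 parts.
p_weighted <- c(4, 5, 5, 5, 5, 4, 4) / 32
weighted_test <- chisq.test(counts, p = p_weighted)

sum(p_weighted)           # 1, as chisq.test() requires
uniform_test$statistic    # greater than 0: the counts are not perfectly even
weighted_test$statistic   # 0: the toy counts match the 4:5 ratio exactly
```

With your own data, a small p-value in the first test and a larger one in the second would suggest the weighted pattern describes your week better than a flat one.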

#tweets by months
#(geom_bar, because month(label = TRUE) returns a factor)
ggplot(data = tweets, aes(x = month(timestamp, label = TRUE))) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Month") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

#chi-square test to test the distribution of my tweets over months
chisq.test(table(month(tweets$timestamp, label = TRUE)))

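#fetch time of tweet (seconds since local midnight) and add it to existing tweet holder
tweets$timeonly <- as.numeric(tweets$timestamp - trunc(tweets$timestamp, "days"), units = "secs")

#timestamps with zero minutes and zero seconds most likely carry no real time of day; mark them as missing
tweets$timeonly[minute(tweets$timestamp) == 0 & second(tweets$timestamp) == 0] <- NA
mean(is.na(tweets$timeonly))

#reinterpret the seconds as a date-time on 1970-01-01 (UTC) so that ggplot2 can draw a time axis
class(tweets$timeonly) <- "POSIXct"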
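
The time-of-day trick is easier to follow on a single value. This is a base R sketch with one made-up tweet time; it spells out the unit of the difference explicitly, which is an assumption on my side about what the step is meant to compute (seconds since local midnight).

```r
# One made-up tweet time on Chicago's clock (illustration only).
stamp <- as.POSIXct("2016-01-14 22:30:00", tz = "America/Chicago")

# Seconds elapsed since local midnight; units = "secs" pins the unit explicitly.
secs <- as.numeric(stamp - trunc(stamp, "days"), units = "secs")
secs   # 81000, i.e. 22.5 hours

# Relabel the number as a date-time on 1970-01-01 so only the clock time matters.
timeonly <- secs
class(timeonly) <- "POSIXct"
format(timeonly, "%H:%M", tz = "UTC")   # "22:30"
```

Every tweet ends up on the same dummy date, so a histogram of these values shows the distribution over the 24 hours of a day.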

#number of tweets by time
ggplot(data = tweets, aes(x = timeonly)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") +
  scale_x_datetime(breaks = date_breaks("3 hours"),
                   labels = date_format("%H:00")) +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

 

#late night tweets by year
latenighttweets <- tweets[(hour(tweets$timestamp) < 6),]
ggplot(data = latenighttweets, aes(x = timestamp)) +
geom_histogram(aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Time") + ylab("Number of tweets") + ggtitle("Late Night Tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

 

 

#number of tweets with hashtags
ggplot(tweets, aes(factor(grepl("#", text)))) +
  geom_bar(fill = "midnightblue") +
  theme(legend.position = "none", axis.title.x = element_blank()) +
  ylab("Number of tweets") +
  ggtitle("Tweets with Hashtags") +
  scale_x_discrete(labels = c("No hashtags", "Tweets with hashtags"))
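
If you want the numbers behind this bar chart, the same grepl() test can be tabulated directly. A minimal sketch with made-up tweet texts:

```r
# Made-up tweet texts (illustration only).
texts <- c("Learning #rstats today", "Good morning", "#ggplot2 is fun", "Lunch time")

# TRUE where the text contains a "#" character anywhere.
has_hashtag <- grepl("#", texts, fixed = TRUE)

table(has_hashtag)   # counts of FALSE and TRUE
mean(has_hashtag)    # share of tweets with a hashtag: 0.5 for this toy vector
```

On your own data, replace texts with tweets$text. Note that this simple test counts any "#" character, not only well-formed hashtags.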

 

#number of tweets retweeted
ggplot(tweets, aes(factor(!is.na(retweeted_status_id)))) +
  geom_bar(fill = "midnightblue") +
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") +
  ggtitle("Retweeted Tweets") +
  scale_x_discrete(labels=c("Not retweeted", "Retweeted tweets"))

 

 

#number of replied tweets
ggplot(tweets, aes(factor(!is.na(in_reply_to_status_id)))) +
geom_bar(fill = "midnightblue") +
theme(legend.position="none", axis.title.x = element_blank()) +
ylab("Number of tweets") +
ggtitle("Replied Tweets") +
scale_x_discrete(labels=c("Not in reply", "Replied tweets"))

#categorize tweets under types
tweets$type <- "tweet"
tweets$type[!is.na(tweets$retweeted_status_id)] <- "RT"
tweets$type[!is.na(tweets$in_reply_to_status_id)] <- "reply"
#set the level order explicitly (tweet, reply, RT) instead of relying on alphabetical sorting
tweets$type <- factor(tweets$type, levels = c("tweet", "reply", "RT"))
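
The categorization step can be tried on a toy data frame first. In this sketch the three rows and their id values are made up, and the assignment goes by column name rather than by column position, which is my own choice for the example.

```r
# Three made-up tweets: a plain tweet, a retweet, and a reply.
toy <- data.frame(
  text = c("plain", "a retweet", "a reply"),
  retweeted_status_id = c(NA, 101, NA),
  in_reply_to_status_id = c(NA, NA, 202)
)

# Start with "tweet" everywhere, then overwrite retweets and replies.
toy$type <- "tweet"
toy$type[!is.na(toy$retweeted_status_id)] <- "RT"
toy$type[!is.na(toy$in_reply_to_status_id)] <- "reply"

# Fix the level order so plots and legends list the types consistently.
toy$type <- factor(toy$type, levels = c("tweet", "reply", "RT"))
table(toy$type)
```

Because the reply rule runs last, a row that had both ids set would end up labeled as a reply.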

#plot with types tweeting, retweeting, and replying
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
  geom_histogram() +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))


#proportion of tweets among them
#(geom_histogram bins the timestamps first; position = "fill" then scales each bin to proportions)
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
  geom_histogram(position = "fill") +
  xlab("Time") + ylab("Proportion of tweets") +
  scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))

#calculate characters per tweet (nchar is vectorized, so no sapply is needed)
tweets$charsintweet <- nchar(tweets$text)
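
Here is a quick sketch of what the character count looks like on a few made-up texts, using base R only:

```r
# Made-up tweet texts, including an empty one (illustration only).
texts <- c("Hello", "Learning #rstats today", "")

# nchar() works on the whole vector at once and returns one count per text.
chars <- nchar(texts)
chars            # 5 22 0

# Minimum, median, mean and maximum length in one call.
summary(chars)
```

Running summary(tweets$charsintweet) on your own data gives a quick numeric companion to the histogram.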

#plot char per tweet
ggplot(data = tweets, aes(x = charsintweet)) +
  geom_histogram(aes(fill = ..count..), binwidth = 8) +
  theme(legend.position = "none") +
  xlab("Characters per Tweet") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

We will discuss Twitter sentiment analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.


 
