Web Scraping using R

Hi all! As I promised last time, here is a more technical blog. It presents a part of my learning that deals with extracting movie ratings and other information from websites (I chose IMDB, as I personally find it to be the best). I have included the complete code as well, so feel free to write to me if anything in it needs clarifying.

I remember the other day I was in a fix: the weekend was around the corner and I could not decide which movie to watch. This might sound silly to most of you, but believe me, I was having a tough time zeroing in on one. That is when I decided to pay a visit to what I consider the bible for all movie buffs: IMDB. Being a computer science graduate, I take pride in writing programs that automate my daily work. As I mentioned in my previous blog, I was recently introduced to the R language for performing various sorts of analysis, so I decided to use it here. It turned out to be all I needed to write a script that automatically extracts the relevant information from the website. All it takes is a few lines of code and voila!

From a more technical point of view, let us dive into the code and understand its various intricacies.

I used the “rvest” package in R to retrieve the movie information.

So let’s start by loading the rvest package into the current R session.

library(rvest)

I am a huge DiCaprio fan and was eagerly waiting for the movie “The Revenant” to arrive. Hence it is only natural that I tried extracting information about this movie.

First, you need to specify the URL of the movie’s page and read it in before you can proceed:

the_revenant <- read_html("http://www.imdb.com/title/tt1663202/?ref_=inth_ov_tt")

Once the page has been read, the open-source tool “SelectorGadget” can be used to figure out which CSS selector matches the data you need. If you want to extract the movie rating, the desired selector is “strong span”. Type the following code in your R console to obtain the movie rating:

the_revenant %>%
  html_node("strong span") %>%   # first node matching the selector
  html_text() %>%                # its text content
  as.numeric()                   # convert the text to a number

A rating of 8.7/10... Not bad, DiCaprio, I thought to myself.

In case you are overwhelmed by the code above, let me break it down. I used html_node() to find the first node that matches the selector, then extracted its contents with html_text(), and finally converted the result to a numeric type with as.numeric().
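To see the difference between html_node() and its plural cousin html_nodes() without hitting the network, here is a small sketch on a made-up HTML fragment. The markup is my own invention for illustration and is not IMDB's actual page structure.

```r
library(rvest)

# Hypothetical fragment, invented for illustration (not IMDB's real markup)
page <- read_html("<div><strong><span>8.7</span></strong> <span>first</span> <span>second</span></div>")

# html_node() returns only the first node that matches the selector
rating <- page %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

# html_nodes() returns every matching node, so html_text() gives a character vector
all_spans <- page %>%
  html_nodes("span") %>%
  html_text()

print(rating)
print(all_spans)
```

So use html_node() when you expect a single value (like a rating) and html_nodes() when you expect several (like a cast list).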

Now that you understand the code a fair bit, let us proceed to obtain some more information about the movie.

 

To extract the movie cast, type the following chunk of code:

the_revenant %>%
  html_nodes("#titleCast .itemprop span") %>%   # all nodes matching the selector
  html_text()

 

To retrieve the titles and authors of the recent message board postings, which are stored in the second table on the page, type the following:

the_revenant %>%
  html_nodes("table") %>%   # every table on the page
  .[[2]] %>%                # keep the second one
  html_table()              # parse it into a data frame
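The same pick-the-second-table pattern can be tried on a tiny made-up page. This is only a sketch: the table contents and names such as "alice" are hypothetical and have nothing to do with IMDB's real message boards.

```r
library(rvest)

# Hypothetical page with two tables, invented for illustration
page <- read_html("
  <table><tr><th>Key</th><th>Value</th></tr><tr><td>a</td><td>1</td></tr></table>
  <table>
    <tr><th>Title</th><th>Author</th></tr>
    <tr><td>Best scene?</td><td>alice</td></tr>
    <tr><td>Oscar chances</td><td>bob</td></tr>
  </table>")

posts <- page %>%
  html_nodes("table") %>%   # both tables, in document order
  .[[2]] %>%                # select the second table
  html_table()              # header row (th cells) becomes the column names

print(posts)
```

Indexing by position is fragile: if the site adds or removes a table, .[[2]] will silently point at something else, so it is worth checking the column names of the result.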

 

Now, let us extract the first review:

review <- the_revenant %>%
  html_nodes("#titleUserReviewsTeaser p") %>%   # paragraphs inside the review teaser
  html_text()

review # this prints the review on your screen
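Scraped text often comes with stray whitespace, and a selector may match nothing if the page layout changes. Below is a minimal sketch of handling both, again on invented markup; the "reviews" id is my own placeholder, not something from IMDB.

```r
library(rvest)

# Hypothetical markup standing in for a review teaser (not IMDB's real HTML)
page <- read_html("<div id='reviews'><p>   A gripping survival story.   </p><p>Second review.</p></div>")

# html_node() keeps only the first paragraph; trim = TRUE strips surrounding whitespace
first_review <- page %>%
  html_node("#reviews p") %>%
  html_text(trim = TRUE)

# A selector that matches nothing yields NA instead of an error
missing_review <- page %>%
  html_node("#no-such-id p") %>%
  html_text()

print(first_review)
print(is.na(missing_review))
```

Checking for NA before using the result is a cheap way to notice that a selector has stopped matching.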

 

So there you go, fellas. One might dare say that there is nothing too difficult to comprehend here. It certainly goes to show that R can be used in our daily activities to save us from breaking a sweat.

References:

  • A few of the code snippets used in this blog were inspired by this source.
  • For another example of web scraping (again IMDB), click here.

 
