I have been working on how R and Hadoop can be integrated and used together. After a long verification process, I found four possible ways to use R and Hadoop together for Big Data analytics. RHadoop is one of them.
Hadoop is a disruptive Java-based programming framework that supports the processing of large data sets in a distributed computing environment, while R is a programming language and software environment for statistical computing and graphics.
RHadoop is an open source project developed by Revolution Analytics that provides client-side integration of R and Hadoop. It is a collection of five R packages that allow users to manage and analyse data with Hadoop. Before each release, the packages are tested on recent releases of the Cloudera and Hortonworks Hadoop distributions, and they should be broadly compatible with open source Hadoop and MapR's distribution.
RHadoop consists of the following packages:
| Package | Description |
|---|---|
| rhdfs | Provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. Install this package only on the node that will run the R client. |
| rhbase | Provides basic connectivity to the HBase distributed database, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase from within R. Install this package only on the node that will run the R client. |
| plyrmr | Enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr2, it relies on Hadoop MapReduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the MapReduce details. Install this package on every node in the cluster. |
| rmr2 | Allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster. Install this package on every node in the cluster. |
| ravro | Adds the ability to read and write Avro files from the local and HDFS file systems, and adds an Avro input format for rmr2. Install this package only on the node that will run the R client. |
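Setting up RHadoop is a complicated task because the RHadoop packages depend on other R packages. Working with RHadoop means installing R, together with rmr2, plyrmr and their dependencies, on each data node of the Hadoop cluster. The remaining packages (rhdfs, rhbase and ravro) are only needed on the node that runs the R client.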
Setting up RHadoop on Ubuntu 14.04
Prerequisites for installing RHadoop on Ubuntu 14.04
- Make sure the Java and Hadoop binaries are installed on the machine
$ java -version
$ hadoop version
In the terminal, type the commands below to install R and its development packages.
sudo apt-get install r-base
sudo apt-get install r-base-core
sudo apt-get install r-base-dev
Download RStudio Desktop/Server from https://www.rstudio.com/products/rstudio/ .
Double-click the downloaded file to install RStudio on the system.
- Thrift 0.9.0
Thrift is needed for installing rhbase. If you do not use HBase, you can skip the Thrift installation.
Install Thrift 0.9.0 rather than 0.9.1. I first installed Thrift 0.9.1 (the latest version at the time) and found that it did not work well with the rhbase installation. It was then a painful process to figure out the reason, uninstall 0.9.1 and install 0.9.0.
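Since rhbase is sensitive to the Thrift version, it can save time to confirm which version is on the PATH before building rhbase. The following is a minimal sketch, not an official check: the helper name `is_thrift_090` is made up for illustration, and it assumes that `thrift -version` prints a line of the form `Thrift version 0.9.0`.

```shell
# Hypothetical helper: succeeds when the given `thrift -version` output reports 0.9.0
is_thrift_090() {
  case "$1" in
    "Thrift version 0.9.0"*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v thrift >/dev/null 2>&1; then
  found="$(thrift -version)"
  if is_thrift_090 "$found"; then
    echo "OK: $found"
  else
    echo "Unexpected Thrift version: $found (this guide expects 0.9.0)"
  fi
else
  echo "thrift is not installed or not on the PATH"
fi
```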
- Check that the JAVA_HOME, HADOOP_HOME and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them.
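As an illustration, the entries in $HOME/.bashrc could look like the sketch below. All three paths are assumptions and need to be adjusted to wherever Java and Hadoop actually live on your machine.

```shell
# Lines to add to $HOME/.bashrc. Every path here is an assumed location.
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"   # assumed JDK directory
export HADOOP_HOME="/usr/local/hadoop"                 # assumed Hadoop install directory
export HADOOP_CMD="$HADOOP_HOME/bin/hadoop"            # the hadoop launcher script
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin"
```

After editing the file, run `source $HOME/.bashrc` (or open a new terminal) so that the variables take effect.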
- Set Environment variables in R/RStudio:
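One possible way to do this, sketched here under assumptions, is to write the variables to the per-user `.Renviron` file that R reads at startup. This helps when RStudio is launched from the desktop and does not pick up $HOME/.bashrc. The paths are assumed locations and must be adapted. The `HADOOP_STREAMING` entry is not part of the list above: it is included because rmr2 looks for it, and the jar path and version shown are placeholders.

```shell
# R reads this file at startup; R_ENVIRON_USER overrides the default location.
renviron="${R_ENVIRON_USER:-$HOME/.Renviron}"

# Append only once, so that re-running the snippet does not duplicate entries.
if ! grep -q '^HADOOP_CMD=' "$renviron" 2>/dev/null; then
  cat >> "$renviron" <<'EOF'
# Assumed paths; adjust them to your installation
JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
HADOOP_HOME=/usr/local/hadoop
HADOOP_CMD=/usr/local/hadoop/bin/hadoop
HADOOP_STREAMING=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar
EOF
fi
```

Within a running R session, the same effect for that session only can be had with `Sys.setenv()`, for example `Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")`.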
Installing RHadoop Packages:
a) Using Terminal
Start an R session in the terminal and install the RHadoop package dependencies in one go:
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc"), dependencies = TRUE, repos = "http://cran.rstudio.com/")
Alternatively, if you have downloaded the .tar.gz source archives, you can install the RHadoop package dependencies one by one:
install.packages("~/rJava_0.9-8.tar.gz", repos = NULL, type = "source")
b) Using RStudio. If you need a particular package or library and prefer to install it from within RStudio, follow the steps below:
- Start RStudio.
- Click the Packages tab in the bottom-right section and then click Install. The Install Packages dialog box appears.
- In the Install Packages dialog, type the name of the package you want to install in the Packages field and then click Install. This installs the package you typed, or shows a list of matching packages based on the text you entered.
Other packages, such as rhdfs and rhbase, can be installed in a similar way.
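The RHadoop packages themselves are distributed by the project as source archives rather than through CRAN, so from a terminal they are typically installed from the downloaded .tar.gz files with `R CMD INSTALL`. The sketch below assumes the archives are in the current directory. The function name `install_rhadoop_archives` and the file names, including the version numbers, are placeholders for whatever you downloaded.

```shell
# Hypothetical helper: installs each given source archive, skipping missing files
install_rhadoop_archives() {
  for archive in "$@"; do
    if [ -f "$archive" ]; then
      R CMD INSTALL "$archive"
    else
      echo "skipping $archive: file not found"
    fi
  done
}

# Placeholder file names. HADOOP_CMD should already be set before installing rhdfs.
install_rhadoop_archives rhdfs_1.0.8.tar.gz rmr2_3.3.1.tar.gz
```

Taking the archive names as arguments keeps the helper reusable on the other cluster nodes, where only rmr2 and plyrmr are needed.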
After installing the RHadoop packages, you will be able to access the Hadoop file system from R/RStudio.
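A quick way to confirm the setup is to list the HDFS root directory from R. The sketch below wraps that in a shell function (the name `rhdfs_smoke_test` is made up here). It assumes that rhdfs is installed, that `HADOOP_CMD` is exported, and that HDFS is running. `hdfs.init()` and `hdfs.ls()` are functions from the rhdfs package.

```shell
# Hypothetical helper: loads rhdfs in a one-off R process and lists the HDFS root
rhdfs_smoke_test() {
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found: install R first"
    return 1
  fi
  # Load rhdfs, initialise the HDFS connection, and print the root listing
  Rscript -e 'library(rhdfs); hdfs.init(); print(hdfs.ls("/"))'
}

rhdfs_smoke_test || echo "rhdfs smoke test failed: check HADOOP_CMD and that HDFS is running"
```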
Thank you for reading.