Before starting with RHDFS, let's take a look at what HDFS and R are, and at the connection between the two: what RHDFS is and how it works.
Hadoop Distributed File System (HDFS) is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favour of improved performance for the applications at hand.
It is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.
What is R?
R is a language and environment for statistical computing and graphics. The term "environment" is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has the following features:
- An effective data handling and storage facility
- A suite of operators for calculations on arrays, in particular matrices
- A large, coherent, integrated collection of intermediate tools for data analysis
- Graphical facilities for data analysis and display either directly at the computer or on hard-copy
- A well-developed, simple and effective programming language (called 'S') which includes conditionals, loops, user-defined recursive functions, and input and output facilities. (Indeed, most of the system-supplied functions are themselves written in the S language.)

I find R very interesting because it is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.
In the areas of interactive data analysis, general purpose statistics and predictive modelling, R has gained massive popularity due to its classification, clustering and ranking capabilities. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis.
What is RHDFS?
RHDFS is an R package that provides basic connectivity to the Hadoop Distributed File System. With it, R programmers can browse, read, write, and modify files stored in HDFS.
The following functions are part of this package:
- File manipulation: hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
- File read/write: hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
- Directory: hdfs.dircreate, hdfs.mkdir
- Utility: hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
- Initialization: hdfs.init, hdfs.defaults
Pre-requisites of RHDFS
- This package has a dependency on rJava
- Access to HDFS via this R package depends on the HADOOP_CMD environment variable. As you can see in the screenshot below, HADOOP_CMD points to the full path of the Hadoop binary. If this variable is not set properly, the package will fail when hdfs.init() is invoked.
Below are a few commands I have used in RStudio for accessing the Hadoop FileSystem.
- The HADOOP_CMD environment variable should point to the Hadoop binary.
- Use library(rhdfs) to load the rhdfs package into RStudio.
To initialize the connection to HDFS, type hdfs.init() and press Enter.
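Putting these setup steps together, a minimal session might look like the following sketch. The Hadoop path shown is an example; adjust it to match your own installation.

```r
# Point HADOOP_CMD at the full path of the hadoop binary
# (example path; yours will differ depending on where Hadoop is installed)
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")

library(rhdfs)   # loads the rhdfs package (which depends on rJava)
hdfs.init()      # connects the R session to HDFS; fails if HADOOP_CMD is wrong
```

Setting HADOOP_CMD from inside R with Sys.setenv() is an alternative to exporting it in your shell profile before launching RStudio; either way, it must be set before hdfs.init() is called.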
Once the above command has executed, you have access to the Hadoop FileSystem and can work with HDFS from the R console or RStudio.
Type hdfs.ls("/user") to list the files in '/user'.
To view the files under '/user/hduser', type hdfs.ls("/user/hduser").
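For example (the paths here are the ones from this walkthrough; your cluster's layout may differ):

```r
hdfs.ls("/user")          # list entries under /user
hdfs.ls("/user/hduser")   # list the contents of hduser's home directory

# hdfs.ls() returns a data frame (with columns such as permission, owner,
# group, size, modtime and file), so you can work with it like any data frame
files <- hdfs.ls("/user/hduser")
files$file                # just the full HDFS paths
```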
hdfs.put() is used to upload files from the local file system to the Hadoop FileSystem, as shown in the screenshot below.
I am uploading a folder named logs from local FileSystem to HDFS.
If it returns TRUE, the folder/file has been uploaded successfully; otherwise, you will have to look into the error that occurred and rectify it.
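The upload step, with a check on the return value, can be sketched as follows (the local and HDFS paths are example values):

```r
# hdfs.put(src, dest) copies a local file or folder into HDFS
# and returns TRUE on success
ok <- hdfs.put("/home/hduser/logs", "/user/hduser/logs")

if (ok) {
  message("Upload succeeded")
} else {
  message("Upload failed -- check the error message and HDFS permissions")
}
```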
hdfs.get() is used to download files from the Hadoop FileSystem to the local file system, as shown in the screenshot below.
If you want to delete a folder or file, use the command hdfs.rm().
In the screenshot below you can see that the logs folder has been removed and no longer exists in HDFS.
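The download and delete steps can be sketched together like this (paths assumed from the example above):

```r
# Download the logs folder from HDFS to the local file system
hdfs.get("/user/hduser/logs", "/home/hduser/logs_copy")

# Remove the folder from HDFS, then confirm it is gone
hdfs.rm("/user/hduser/logs")
hdfs.exists("/user/hduser/logs")   # FALSE once the delete has gone through
```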
Going a bit more into RHDFS, the next topic is reading a CSV file using RHDFS.
I have created a new directory rhdfs under deepika in the Hadoop FileSystem, in which I shall upload all RHDFS-related work. As you can see in the screenshot below, I have uploaded a new CSV file named sample1.csv into the Hadoop FileSystem.
Here is a small code snippet showing how to read the CSV data from HDFS using rhdfs:
f = hdfs.file("/user/hduser/deepika/rhdfs/sample1.csv", "r")  # open the file for reading
m = hdfs.read(f)      # read the raw bytes from HDFS
c = rawToChar(m)      # convert the raw bytes to a character string
hdfs.close(f)         # release the file handle
In the screenshot, you can see that the sample1.csv file is read and the raw data is converted to characters so that it is in a human-readable format.
The output is displayed in tabular format, as depicted in the screenshot below:
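To go one step further and turn the character data into an actual data frame, the text can be parsed with read.csv() via a text connection. This is a sketch, assuming sample1.csv is small enough for a single hdfs.read() call and has a header row:

```r
f <- hdfs.file("/user/hduser/deepika/rhdfs/sample1.csv", "r")
raw <- hdfs.read(f)   # for a small file, one read returns the whole contents
hdfs.close(f)

# rawToChar() yields one long string of CSV text;
# textConnection() lets read.csv() parse that string as if it were a file
df <- read.csv(textConnection(rawToChar(raw)))

head(df)   # first rows of the parsed table
str(df)    # column names and types
```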
As I am still new to R and RStudio, these are a few basic commands I have tried in order to get started with RHDFS. I shall share more detailed information in upcoming blogs.
Thanks for Reading.