Introduction to Apache Thrift

Hello readers,

Firstly, let me begin with a brief definition of Apache Thrift: it is a software framework for scalable cross-language services development that combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi and other languages.

Apache Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.

Architecture

Thrift includes a complete stack for creating clients and servers. The top part of the stack is the code generated from the Thrift definition file; from this file the client and server processor code is generated. In contrast to built-in types, user-defined data structures are also exchanged through the generated code. The protocol and transport layers are part of the runtime library, so with Thrift it is possible to define a service and then change the protocol and transport without recompiling the code. Besides the client part, Thrift includes server infrastructure to tie protocols and transports together, such as blocking, non-blocking, and multi-threaded servers. The underlying I/O part of the stack is implemented differently for each language.


Installation of Apache Thrift

 First install the dependencies and then you are ready to install Thrift.

Install the languages with which you plan to use Thrift. To use it with Java, for example, install a Java JDK of your choice and check that it is installed on the system.

Install Java:

$ sudo apt-get install openjdk-8-jdk -y

$ java -version


To use Thrift with Java you will also need to install Apache Ant:

$ sudo apt-get install ant


Install required tools and libraries:

 Ubuntu

$ sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev

CentOS 5 / RHEL 5

$ sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel


1) Download Apache Thrift version 0.9.0 rather than 0.9.1 (0.9.1 did not work well for the rhbase installation).

 


2) Copy the downloaded file into the desired directory and untar it:


$ tar -xvf thrift-0.9.0.tar.gz

3) Build Thrift according to the instructions. Update PKG_CONFIG_PATH in .bashrc by typing the following command in the terminal:

$ sudo nano ~/.bashrc

  • paste the following line at the end

export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/home/hduser/hadoop/thrift-0.9.0/lib/cpp

  • Then source the bashrc via

source ~/.bashrc

  • Verify that the pkg-config path is correct by typing this in the terminal:

$ pkg-config --cflags thrift

-I/usr/local/include/thrift

  • Copy the Thrift library:

$ sudo cp /usr/local/lib/libthrift-0.9.0 /usr/lib/


4) Get some core development tools that are not direct dependencies:

$ sudo apt-get install git-core make

5) The following pulls in openjdk-6-jre-headless to provide /usr/bin/java:

$ sudo apt-get install maven2


6) Pull in the main Thrift dependencies:


$ sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev

7) Add curl since we’ll need it shortly:

$ sudo apt-get install curl


8) Python support is enabled by default (although it is broken here since the Python.h header is not present). The following configures the build with C++ and Java:

$ cd thrift-0.9.0

$ ./configure --prefix=/usr --without-python JAVA_PREFIX=/usr/share/java

9) Build:


$ make


10) Install:

$ sudo make install

After this command finishes, Apache Thrift will be installed on the system. I had to install Thrift in order to install rhbase on my Ubuntu system, as Thrift is the main dependency of rhbase for building services. But you can install Thrift whenever there is a need to build services between languages such as C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, etc.

Thank you for Reading.

 



Introduction to RHadoop

Hello readers,

I have been working on how R and Hadoop can be integrated and used together. After a rather hard process of verification, I finally arrived at the possible ways of using R and Hadoop together for performing Big Data analytics; RHadoop is one of the four different ways of using Hadoop and R together.

Hadoop is a disruptive Java-based programming framework that supports the processing of large data sets in a distributed computing environment, while R is a programming language and software environment for statistical computing and graphics.


RHadoop is an open source project developed by Revolution Analytics that provides client-side integration of R and Hadoop. RHadoop is a collection of five R packages that allow users to manage and analyse data with Hadoop. The packages are tested (always before a release) on recent releases of the Cloudera and Hortonworks Hadoop distributions and should have broad compatibility with open source Hadoop and MapR’s distribution.

RHadoop consists of the following packages:

  • rhdfs: Provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R. Install this package only on the node that will run the R client.
  • rhbase: Provides basic connectivity to the HBase distributed database, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase from within R. Install this package only on the node that will run the R client.
  • plyrmr: Enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr, it relies on Hadoop MapReduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the MapReduce details. Install this package on every node in the cluster.
  • rmr2: Allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster (a small example follows this list). Install this package on every node in the cluster.
  • ravro: Adds the ability to read and write Avro files from the local and HDFS file systems, and adds an Avro input format for rmr2. Install this package only on the node that will run the R client.
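To give a flavour of what rmr2 code looks like before getting into the installation, here is a minimal word-count sketch. The HDFS input path is hypothetical, and it assumes rmr2 is installed and the HADOOP_CMD and HADOOP_STREAMING variables are set as described in the next section.

library(rmr2)

# map: split each line of text into words and emit (word, 1) pairs
# reduce: sum the counts for each word
wordcount <- function(input) {
  mapreduce(
    input = input,
    input.format = "text",
    map = function(k, lines) keyval(unlist(strsplit(lines, " ")), 1),
    reduce = function(word, counts) keyval(word, sum(counts))
  )
}

# hypothetical HDFS path; from.dfs() pulls the result back into the R session
result <- from.dfs(wordcount("/user/hduser/books"))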

Setting up RHadoop is a complicated task, as RHadoop has dependencies on other R packages. Working with RHadoop means installing R and the RHadoop packages, along with their dependencies, on each data node of the Hadoop cluster.

Setting up RHadoop on Ubuntu 14.04

Prerequisites for installing RHadoop on Ubuntu 14.04

  1. Make sure Java and Hadoop binaries are installed in the machine

$ java -version


$ hadoop version


  2. R

In the terminal type the below commands to install the necessary R packages.

sudo apt-get install r-base

sudo apt-get install r-base-core

sudo apt-get install r-base-dev

  3. RStudio

Download RStudio Desktop/Server from https://www.rstudio.com/products/rstudio/.

Double-click on the downloaded file to install RStudio on the system.

  4. Thrift 0.9.0

Thrift is needed for installing rhbase. If you do not use HBase, you can skip the Thrift installation.

Install Thrift 0.9.0 instead of 0.9.1. I first installed Thrift 0.9.1 (which was the latest version at that time) and found it did not work well for the rhbase installation. It was then a painful process to figure out the reason, uninstall 0.9.1 and install 0.9.0.

Installation


  1. Check that the JAVA_HOME, HADOOP_HOME and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them.
  2. Set environment variables in R/RStudio:

Sys.setenv(HADOOP_HOME="/home/hduser/hadoop/apache-hadoop")

Sys.setenv(HADOOP_CMD="/home/hduser/hadoop/apache-hadoop/bin/hadoop")

Sys.setenv(HADOOP_STREAMING="/home/hadoop/apache-hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar")

Installing RHadoop Packages :

a) Using Terminal

Install the RHadoop package dependencies in one go using:

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc"), dependencies=TRUE, repos="http://cran.rstudio.com/")

Alternatively, if you have downloaded the .tar.gz files, you can install the RHadoop package dependencies separately, one by one, using:

install.packages("~/rJava_0.9-8.tar.gz", repos = NULL, type = "source")

b) If your work in R requires a particular package/library to be installed in RStudio, you can follow the instructions below to do so:

  1. Run RStudio.
  2. Click on the Packages tab in the bottom-right section and then click on Install; an Install Packages dialog box will appear.
  3. In the Install Packages dialog, type the name of the package you want to install in the Packages field and then click Install. This will install the package you searched for, or give you a list of matching packages based on the text you typed.

Other packages such as rhdfs, rhbase, etc. can be installed in a similar way.
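Once rhbase is installed, a quick way to check that it can talk to HBase through the Thrift server is the small sketch below. It assumes the HBase Thrift server is running on its default port 9090 on the local machine; adjust the host and port for your setup.

library(rhbase)

# connect through the HBase Thrift server and list the existing tables
hb.init(host = "localhost", port = 9090)
hb.list.tables()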

After installing the RHadoop packages, you will be able to access the Hadoop file system via R/RStudio.

Thank you for reading.


Introduction to RHDFS

Hi readers,

Before starting with RHDFS, let’s have a look at what HDFS and R are, and at the connection between the two, i.e., what RHDFS is and how it works.

HDFS

Hadoop Distributed File System (HDFS) is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favour of improved performance for the applications at hand.

It is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.

HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.

What is R?

R is a language and environment for statistical computing and graphics. The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has the following features :

  • An effective data handling and storage facility
  • A suite of operators for calculations on arrays, in particular matrices
  • A large, coherent, integrated collection of intermediate tools for data analysis
  • Graphical facilities for data analysis and display either directly at the computer or on hard-copy
  • A well-developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user-defined recursive functions and input and output facilities. (Indeed, most of the system-supplied functions are themselves written in the S language.)

I find R to be very interesting as it is very much like a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.

In the areas of interactive data analysis, general purpose statistics and predictive modelling, R has gained massive popularity due to its classification, clustering and ranking capabilities. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis.

RHDFS


This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS.

The following functions are part of this package:

  • File manipulation: hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
  • File read/write: hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
  • Directory: hdfs.dircreate, hdfs.mkdir
  • Utility: hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
  • Initialization: hdfs.init, hdfs.defaults

Pre-requisites of RHDFS

  • This package has a dependency on rJava
  • Access to HDFS via this R package depends on the HADOOP_CMD environment variable, which must point to the full path of the Hadoop binary. If this variable is not set properly, the package will fail when the hdfs.init() function is invoked (see the short sketch after this list).
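Pulling these pieces together, a minimal initialization sketch looks like the following (the Hadoop path is the one used earlier in this series; substitute the path to your own hadoop binary):

# point HADOOP_CMD at the hadoop binary before loading rhdfs
Sys.setenv(HADOOP_CMD = "/home/hduser/hadoop/apache-hadoop/bin/hadoop")
library(rhdfs)
hdfs.init()   # this call fails if HADOOP_CMD is not set correctly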

Below are a few commands which I have used in RStudio for accessing the Hadoop file system.

  • The HADOOP_CMD environment variable should point to the Hadoop binary.
  • Use library(rhdfs) to load the rhdfs package into RStudio.


To initialize the HDFS connection, type hdfs.init() and press Enter.


Once the above command is executed you will have access to the Hadoop file system, i.e., you will be able to access HDFS from the R console or RStudio.


Type hdfs.ls("/user") to list the files in /user.


To view the files under /user/hduser, type hdfs.ls("/user/hduser").


hdfs.put() is used for uploading files from the local file system to the Hadoop file system. Here I am uploading a folder named logs from the local file system to HDFS.

If it returns TRUE, the folder/file has been uploaded successfully; otherwise you will have to look into the error that has occurred and rectify it.


hdfs.get() is used for downloading files from the Hadoop file system to the local file system.

If you want to delete a folder/file, use the command hdfs.rm(). After running it, you can check with hdfs.ls() that the logs folder has been removed and no longer exists in HDFS.
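Putting the file-transfer commands together, a minimal sketch looks like the following. The paths are assumptions based on the folder names mentioned above; substitute your own local and HDFS paths.

hdfs.put("/home/hduser/logs", "/user/hduser/logs")       # upload the local logs folder to HDFS; returns TRUE on success
hdfs.get("/user/hduser/logs", "/home/hduser/logs_copy")  # download it back to the local file system
hdfs.rm("/user/hduser/logs")                             # remove the folder from HDFS
hdfs.ls("/user/hduser")                                  # the logs folder should no longer be listed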

Going a bit more into RHDFS, the next topic is reading a CSV file using RHDFS.

I have created a new directory rhdfs under deepika in the Hadoop file system, in which I shall upload all the RHDFS-related work, and I have uploaded a new CSV file named sample1.csv into it.

Here is a small code snippet showing how to read the CSV data from HDFS using rhdfs:

f = hdfs.file("/usr/hduser/deepika/rhdfs/sample.csv", "r")

m = hdfs.read(f)

c = rawToChar(m)

Here the sample.csv file is read and the raw data is converted to characters so that it is in a format we can understand. The output can then be displayed in tabular format.
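One common way to get from the raw character vector to that tabular view is to parse it with read.csv; this is a sketch rather than part of the original session:

# 'c' is the character vector produced by rawToChar() above
df <- read.csv(textConnection(c))
head(df)   # shows the contents of the CSV in tabular form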

As I am still new to R and RStudio, these are a few basic commands which I have tried in order to get started with RHDFS. I shall share more detailed information in upcoming blog posts.

Thanks for Reading.


Components of HADOOP ECOSYSTEM

Hello Everyone,


Big Data has been the buzzword circulating in the IT industry since 2008. You probably know this already; if not, let me tell you that the amount of data being generated by the social network, manufacturing, retail, stock, telecom, insurance, banking and healthcare industries is way beyond our imagination.

Before the advent of Hadoop, storage and processing of big data was a big challenge. But now that Hadoop is available, companies have realized the business impact of Big Data and how understanding this data will drive their growth. For example:

  • Banking sectors have a better chance to understand loyal customers, loan defaulters and fraud transactions.
  • Retail sectors now have enough data to forecast demand.
  • Manufacturing sectors need not depend on costly mechanisms for quality testing; capturing sensor data and analysing it reveals many patterns.
  • E-commerce and social networks can personalize pages based on customer interests.
  • Stock markets generate a humongous amount of data; correlating it from time to time reveals valuable insights.

Big Data has many useful and insightful applications.

Hadoop is the straight answer for processing Big Data. The Hadoop ecosystem is a combination of technologies which together are well suited to solving business problems.

In this blog post, I’ll give an overview of that ecosystem. If you’re new to Hadoop, this could be an easy introduction. If you are steeped in some Hadoop technology, this post could give you the opportunity to look around the ecosystem for things you aren’t specialized in.

We all know Hadoop is a framework which deals with Big Data, but unlike other frameworks it is not a single, simple framework; it has its own family of projects for processing different things, tied together under one umbrella called the Hadoop ecosystem.

Hadoop Ecosystem can be divided into various categories as mentioned below :

  1. Core Hadoop
  2. Data Access
  3. Data Storage
  4. Interaction-Visualization-Execution-Development
  5. Data Intelligence
  6. Data Serialization
  7. Data Integration
  8. Management, Monitoring and Orchestration


Let’s try to understand each component in detail:


1. Core Hadoop

The first category, Core Hadoop, comprises four core components:

Hadoop Common

The Apache Foundation has a pre-defined set of utilities and libraries that can be used by other modules within the Hadoop ecosystem. For example, if HBase and Hive want to access HDFS, they need to make use of the Java archives (JAR files) that are stored in Hadoop Common.

Hadoop Distributed File System (HDFS)

The default big data storage layer for Apache Hadoop is HDFS. Hadoop Distributed File System (HDFS) is a distributed file system. It is built to make sure data is local to the machine processing it. HDFS is the “Secret Sauce” of Apache Hadoop components as users can dump huge datasets into HDFS and the data will sit there nicely until the user wants to leverage it for analysis. It is the key tool for managing Big Data and supporting analytic applications in a scalable, cheap and rapid way. Hadoop is usually used on low-cost commodity machines, where server failures are fairly common. To accommodate a high failure environment the file system is designed to distribute data throughout different servers in different server racks making the data highly available.

HDFS operates on a Master-Slave architecture model where the NameNode acts as the master node for keeping a track of the storage cluster and the DataNode acts as a slave node summing up to the various systems within a Hadoop cluster. HDFS component creates several replicas of the data block to be distributed across different clusters for reliable and quick data access.

Nokia deals with more than 500 terabytes of unstructured data and close to 100 terabytes of structured data. Nokia uses HDFS for storing all the structured and unstructured data sets as it allows processing of the stored data at a petabyte scale.

MapReduce

MapReduce is a Java-based system, built on a programming model introduced by Google, in which the actual data from the HDFS store gets processed efficiently. MapReduce is the distributed, parallel computing programming model for Hadoop.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

Skybox has developed an economical imaging satellite system for capturing videos and images from any location on earth. Skybox uses Hadoop to analyse the large volumes of image data downloaded from the satellites. The image processing algorithms of Skybox are written in C++. Busboy, a proprietary framework of Skybox, makes use of built-in code from the Java-based MapReduce framework.

YARN

YARN, which stands for Yet Another Resource Negotiator, is the resource management layer of Hadoop; it took over the resource management and job scheduling duties of the original Hadoop MapReduce engine. It is a great enabler for dynamic resource utilization on the Hadoop framework, as users can run various Hadoop applications without having to bother about increasing workloads.

YARN combines a central resource manager that reconciles the way applications use Hadoop system resources with node manager agents that monitor the processing operations of individual cluster nodes.  Running on commodity hardware clusters, Hadoop has attracted particular interest as a staging area and data store for large volumes of structured and unstructured data intended for use in analytics applications. Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can’t wait for batch jobs to finish.

Yahoo has close to 40,000 nodes running Apache Hadoop with 500,000 MapReduce jobs per day taking 230 compute years extra for processing every day. YARN at Yahoo helped them increase the load on the most heavily used Hadoop cluster to 125,000 jobs a day when compared to 80,000 jobs a day which is close to 50% increase.

The above-listed core components of Apache Hadoop form the basic distributed Hadoop framework. Let me tell you, there are several other Hadoop components that form an integral part of the Hadoop ecosystem, with the intent of enhancing the power of Apache Hadoop in some way or the other, like providing better integration with databases, making Hadoop faster, or developing novel features and functionalities.

Here are some of the eminent Hadoop components used by enterprises extensively –

2. Data Access Components of Hadoop Ecosystem 

Under this category, we have Hive, Pig, HCatalog and Tez which are explained below :

Hive

Hive is a data warehouse management and analytics system that is built for Hadoop. It was initially developed by Facebook, but soon after became an open-source project and has been used by many other companies ever since.

Apache Hive supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3 Filesystem. It provides an SQL-like language called HiveQL with schema on read and transparently converts queries to map/reduce, Apache Tez and Spark jobs.

Hive also enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore. By default, it stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used. It supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a columnar fashion).

Hive simplifies Hadoop at Facebook with the execution of 7500+ Hive jobs daily for Ad-hoc analysis, reporting and machine learning.

Pig

Pig is a convenient tool developed by Yahoo for analysing huge data sets efficiently and easily. It provides a programming language that simplifies the common tasks of working with Hadoop, like loading data, expressing transformations on the data, and storing the final results. Pig’s built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.

Similar to Hive, Pig also deals with structured data, using the Pig Latin language. It is an alternative for programmers who love scripting and do not want to use Java, Python or SQL to process data. A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data and run as MapReduce programs in the backend to produce the output. The most outstanding feature of Pig programs is that their structure is open to considerable parallelization, making it easy to handle large data sets.

The personal healthcare data of an individual is confidential and should not be exposed to others, so it should be masked to maintain confidentiality. But healthcare data is so huge that identifying and removing personal health information is crucial. Apache Pig can be used under such circumstances to de-identify health information.

Hive & Pig overlap :

You would choose Pig in scenarios where you are importing and transforming data and you would like to be able to see the intermediate data.

Apache Pig is somewhat similar to Apache Hive, though some users say that it is easier to transition to Hive rather than Pig if you come from an RDBMS SQL background. However, both platforms have a place in the market. Hive is more optimised for running standard queries and is easier to pick up, whereas Pig is better for tasks that require more customisation. Where Hive is used for structured data, Pig excels at transforming semi-structured and unstructured data.

HCatalog

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.

HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).

Tez

Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. It is an application framework that allows a complex directed acyclic graph of tasks to process data. It is built on top of YARN and is a substitute for MapReduce in some scenarios.

Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. It is not meant directly for end-users – in fact it enables developers to build end-user applications with much better performance and flexibility.

However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.

3. Data Storage Component of Hadoop Ecosystem

HBase and Cassandra come under this category,

HBase

Hadoop Database, or HBase, is a non-relational (NoSQL) database that runs on top of HDFS. It was created for large tables which have billions of rows and millions of columns, with fault tolerance and horizontal scalability, and it is based on Google’s Bigtable. Hadoop itself can perform only batch processing, where data is accessed in a sequential manner; for random access to huge data, HBase is used.

HBase is a column-oriented database that uses HDFS for underlying storage of data. HBase supports random reads and also batch computations using MapReduce. With the HBase NoSQL database, enterprises can create large tables with millions of rows and columns on commodity hardware. The best practice is to use HBase when there is a requirement for random ‘read or write’ access to big datasets.

Facebook is one of the largest users of HBase, with its messaging platform built on top of HBase in 2010. HBase is also used by Facebook for streaming data analysis, its internal monitoring system, the Nearby Friends feature, search indexing and scraping data for their internal data warehouses.

Cassandra

Apache Cassandra is a column-oriented NoSQL data store; it is the right choice when you need scalability and high availability without compromising on performance.

It offers capabilities that relational databases and other NoSQL databases simply cannot match, such as continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centres.

Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Its support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for de-normalization and materialized views, and powerful built-in caching.

4. Data Interaction-Visualization-Execution-Development

Spark

Apache Spark is actually complementary to Hadoop. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance. Spark Core is the foundation of the overall project. It combines SQL, streaming and complex analytics together seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3. It is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing.

The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

5. Data Intelligence Components of Hadoop Ecosystem

Mahout

Apache Mahout is a platform for running scalable machine learning algorithms leveraging Hadoop distributed computing. Mahout is an open source machine learning library written in Java. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. In its current incarnation, these scalable machine learning implementations in Mahout are written in Java, and some portions are built upon Apache’s Hadoop distributed computation project.

It is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable; that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion.

Every organization’s data are diverse and particular to their needs. However, there is much less diversity in the kinds of analyses performed on that data. The Mahout project is a library of Hadoop implementations of common analytical computations. Use cases include user collaborative filtering, user recommendations, clustering and classification.

6. Data Serialization Components of Hadoop Ecosystem

Thrift and Avro, which are meant for serialization, come under this category.

Apache Thrift

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

It allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.

Thrift includes a complete stack for creating clients and servers. The top part of the stack is the code generated from the Thrift definition file; from this file the client and server processor code is generated, and in contrast to built-in types, user-defined data structures are exchanged through the generated code as well.

Avro

Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project. It is a serialization system for efficient, cross-language RPC and persistent data storage. It is a framework for performing remote procedure calls and data serialization. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.

In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro.

7. Data Integration Components of Hadoop Ecosystem

This category includes Sqoop, Flume and Chukwa,

Sqoop : SQL + HADOOP = SQOOP

Apache Sqoop handles Hadoop data movement involving both relational and non-relational data sources. It provides bi-directional data transfer between Hadoop HDFS and your favourite relational database. For example, you might be storing your application data in a relational store such as Oracle; now you want to scale your application with Hadoop, so you can migrate the Oracle database data to Hadoop HDFS using Sqoop.

This component is used for importing data from external sources into related Hadoop components like HDFS, HBase or Hive. It can also be used for exporting data from Hadoop to other external structured data stores. Sqoop parallelizes data transfer, mitigates excessive loads, allows data imports and efficient data analysis, and copies data quickly.

When we import structured data from a table (RDBMS) into HDFS, a file is created in HDFS which we can process either with a MapReduce program directly or with Hive or Pig. Similarly, after processing data in HDFS, we can store the processed structured data back into another table in the RDBMS by exporting it through Sqoop.

The online marketer Coupons.com uses the Sqoop component of the Hadoop ecosystem to enable transmission of data between Hadoop and the IBM Netezza data warehouse, and pipes the results back into Hadoop using Sqoop.

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive. Or they can feed business dashboards served ongoing data by Apache HBase.

A Twitter source connects through the streaming API and continuously downloads tweets (called events). These tweets are converted into JSON format and sent to the downstream Flume sinks for further analysis of tweets and retweets to engage users on Twitter.

Chukwa

Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. It is an open source data collection system for monitoring large distributed systems. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework, and it inherits Hadoop’s scalability and robustness. It also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results, in order to make the best use of the collected data.

It aims to provide a flexible and powerful platform for distributed data collection and rapid data processing. The goal is to produce a system that’s usable today, but that can be modified to take advantage of newer storage technologies (HDFS appends, HBase, etc.) as they mature. In order to maintain this flexibility, Chukwa is structured as a pipeline of collection and processing stages, with clean and narrow interfaces between stages. This will facilitate future innovation without breaking existing code.

8. Monitoring, Management and Orchestration components of Hadoop Ecosystem

OOZIE

Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs. It is a Java Web-Application that runs in a Java servlet-container – Tomcat and uses a database to store:

  • Workflow definitions
  • Currently running workflow instances, including instance states and variables

It is a workflow scheduler system to manage Hadoop jobs. It is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs, and it is implemented as a Java web application that runs in a Java servlet container. Hadoop basically deals with Big Data, and a programmer often wants to run many jobs in a sequential manner, for example the output of job A is the input to job B, the output of job B is the input to job C, and the final output is the output of job C. To automate this sequence we need a workflow, and to execute it we need an engine, which is what Oozie is used for.

The American video game publisher Riot Games uses Hadoop and the open source tool Oozie to understand the player experience.

Zookeeper

Zookeeper provides centralized services for Hadoop cluster configuration management, synchronization and group services. For example, think about how a global configuration file works on a Web application; Zookeeper is like that configuration file, but at a much higher level. It provides primitives such as distributed locks that can be used for building the highly scalable applications. It is used to manage synchronization for cluster.

With a growing family of services running as part of a Hadoop cluster, there’s a need for coordination and naming services. As computing nodes can come and go, members of the cluster need to synchronize with each other, know where to access services, and know how they should be configured. This is the purpose of Zookeeper.

Ambari

One of the most useful tools you’ll use if you’re administering a Hadoop cluster is Ambari, which allows administrators to install, manage and monitor Hadoop clusters with a simple web interface. It provides an easy-to-follow wizard for setting up a Hadoop cluster of any size.

Ambari makes Hadoop management simpler by providing a consistent, secure platform for operational control. Ambari provides an intuitive Web UI as well as a robust REST API, which is particularly useful for automating cluster operations. With Ambari, Hadoop operators get the following core benefits:

  • Simplified Installation, Configuration and Management
  • Centralized Security Setup
  • Full Visibility into Cluster Health
  • Highly Extensible and Customizable

By now you will have realized that Hadoop is a very rich ecosystem. Hadoop has gained its popularity due to its ability to store, analyze and access large amounts of data quickly and cost-effectively through clusters of commodity hardware. It would not be wrong to say that Apache Hadoop is actually a collection of several components and not just a single product.

Thanks for reading.


Integrating R and Hadoop

Hello Everyone,

In this blog, I am going to discuss different ways of integrating R and Hadoop.

Firstly, let me tell you that Hadoop and R are a natural match and are quite complementary in terms of visualization and analytics of big data.

Let’s look at an outline of the ways R and Hadoop can be integrated to scale data analytics to Big Data analytics. Before that, I shall mention the knowledge I have about R and Hadoop. I have been familiar with Hadoop for a long time; as I have a keen interest in Big Data, the terms related to it have always fascinated me. But R is something very new to me, and since it is being integrated (in a way, connected) with Hadoop, I researched R as well.

So what is R?

R is a language and environment for statistical computing and graphics. The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has

  • an effective data handling and storage facility,

  • a suite of operators for calculations on arrays, in particular matrices,

  • a large, coherent, integrated collection of intermediate tools for data analysis,

  • graphical facilities for data analysis and display either directly at the computer or on hard-copy

  • a well-developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most of the system supplied functions are themselves written in the S language.)

I find R to be very interesting as it is very much like a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.

In the areas of interactive data analysis, general purpose statistics and predictive modelling, R has gained massive popularity due to its classification, clustering and ranking capabilities. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis.

I have been working with Hadoop for quite some time now.

Hadoop is an open-source framework that allows one to store and process big data in a distributed environment across clusters of computers using simple programming models. Hadoop is written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be automatically handled by the framework. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

The two main concepts associated with Hadoop are the Hadoop FileSystem (HDFS) and the MapReduce processing engine. Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. The HDFS provides the storage whereas the MapReduce executes the programs. It implements MapReduce for scalable, reliable and distributed computing. It is a framework to support distributed processing of large datasets across clusters of computers with the help of a simple programming model. The advent of Hadoop was inspired by Google’s MapReduce and GFS papers.

Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed. This approach takes advantage of data locality— nodes manipulating the data they have access to— to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

Hadoop can run in three different modes- standalone, pseudo-distributed and fully distributed operation modes.

When it comes down to the processing of large data sets, Hadoop’s MapReduce programming allows for the processing of such large volumes of data in a completely safe and cost-effective manner. Hadoop also triumphs over relational database management systems when it comes to the processing of large data clusters. Finally, many businesses have already realized the promise that Hadoop holds, and it is clear that its value to businesses will grow as unstructured data keeps growing.

From this we get to know that Hadoop is a disruptive Java-based programming framework that supports the processing of large data sets in a distributed computing environment, while R is a programming language and software environment for statistical computing and graphics.

Now for bringing R and Hadoop together, the information that I have gathered is that Hadoop and R complement each other quite well in terms of visualization and analytics of big data.

The most common way to link R and Hadoop is to use HDFS (potentially managed by Hive or HBase) as the long-term store for all data, and use MapReduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode, enrich, and sample data sets from HDFS into R. Data analysts can then perform complex modelling exercises on a subset of prepared data in R.

Let’s look up ahead on R and Hadoop integration.

There are four different ways of using Hadoop and R together:

1. RHadoop

RHadoop is a great open source solution for R and Hadoop provided by Revolution Analytics. RHadoop is bundled with a set of R packages to manage and analyse data with the Hadoop framework.

Its three core packages are rmr, rhdfs and rhbase. The rmr package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file management in R, and rhbase provides HBase database management from within R. Each of these primary packages can be used to analyse and manage Hadoop framework data better.

2. ORCH

ORCH stands for Oracle R Connector for Hadoop. ORCH can be used on the Oracle Big Data Appliance or on non-Oracle Hadoop clusters. It is a collection of R packages that provide the relevant interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables. Additionally, ORCH also provides predictive analytics techniques that can be applied to data in HDFS files.

3. RHIPE

RHIPE (R and Hadoop Integrated Programming Environment) is an R package which provides an API to use Hadoop; it is essentially similar to RHadoop but with a different API, and it is specially designed around Divide and Recombine (D&R) techniques for analysing large datasets.

4. Hadoop streaming

Hadoop Streaming is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer. Using the streaming system, one can develop working Hadoop jobs with just enough knowledge of Java to write two shell scripts that work in tandem.
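As a concrete flavour of this approach, the sketch below is a word-count mapper written in R that could be used as a streaming mapper; the file name is hypothetical, and a matching reducer script would sum the emitted counts. It reads lines from standard input and writes tab-separated key/value pairs to standard output, which is all Hadoop Streaming requires of an executable.

#!/usr/bin/env Rscript
# mapper.R -- a minimal Hadoop Streaming mapper in R
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in unlist(strsplit(line, split = " "))) {
    if (nchar(word) > 0) cat(word, "\t1\n", sep = "")
  }
}
close(con)

Such a script, together with a matching reducer, would be passed to the Hadoop streaming jar via its -mapper and -reducer options.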

I will have to say that the combination of R and Hadoop is emerging as a must-have toolkit for people working with statistics and large data sets. However, certain Hadoop enthusiasts have raised a red flag while dealing with extremely large Big Data fragments. They claim that the advantage of R is not its syntax but the exhaustive library of primitives for visualization and statistics. These libraries are fundamentally non-distributed, making data retrieval a time-consuming affair. This is an inherent flaw with R, and if you choose to overlook it, R and Hadoop in tandem can still work wonders.
