Big Data has been the buzzword circulating in the IT industry since 2008. If you don't already know this, let me tell you: the amount of data being generated by social networks, manufacturing, retail, stocks, telecom, insurance, banking and healthcare is far beyond what we could once have imagined.
Before the advent of Hadoop, storing and processing big data was a serious challenge. Now that Hadoop is available, companies have realized the business impact of Big Data and how understanding this data drives growth. For example:
- Banks can better identify loyal customers, loan defaulters and fraudulent transactions.
- Retailers now have enough data to forecast demand.
- Manufacturers no longer need to depend on costly mechanisms for quality testing; capturing and analysing sensor data reveals many patterns.
- E-commerce sites and social networks can personalize pages based on customer interests.
- Stock markets generate a humongous amount of data, and correlating it over time reveals valuable insights.
Big Data has many useful and insightful applications.
Hadoop is the standard answer for processing Big Data. The Hadoop ecosystem is a combination of technologies, each well suited to solving a particular class of business problems.
In this blog post, I’ll give an overview of that ecosystem. If you’re new to Hadoop, this could be an easy introduction. If you are steeped in one Hadoop technology, this post could give you the opportunity to look around the ecosystem at the things you aren’t specialized in.
We all know Hadoop is a framework that deals with Big Data, but unlike other frameworks it is not a single, simple tool: it has its own family of projects for processing different kinds of workloads, tied together under one umbrella called the Hadoop Ecosystem.
The Hadoop Ecosystem can be divided into the categories below:
- Core Hadoop
- Data Access
- Data Storage
- Data Intelligence
- Data Serialization
- Data Integration
- Management, Monitoring and Orchestration
Let’s try to understand each component in detail:
1. Core Hadoop
Core Hadoop, the first category, comprises four core components.

Hadoop Common
The Apache Foundation provides a pre-defined set of utilities and libraries that can be used by the other modules within the Hadoop ecosystem. For example, if HBase and Hive want to access HDFS, they need to make use of the Java archives (JAR files) stored in Hadoop Common.
Hadoop Distributed File System (HDFS)
HDFS is the default big data storage layer for Apache Hadoop: a distributed file system built to keep data local to the machines processing it. HDFS is the “secret sauce” of the Apache Hadoop components, as users can dump huge datasets into HDFS and the data will sit there until the user wants to leverage it for analysis. It is the key tool for managing Big Data and supporting analytic applications in a scalable, cheap and rapid way. Hadoop usually runs on low-cost commodity machines, where server failures are fairly common. To accommodate such a high-failure environment, the file system distributes data across different servers in different racks, keeping the data highly available.
HDFS operates on a master-slave architecture model: the NameNode acts as the master, keeping track of the storage cluster, while the DataNodes act as slaves running on the various machines within the Hadoop cluster. HDFS creates several replicas of each data block and distributes them across the cluster for reliable and quick data access.
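The replica-placement idea can be sketched in a few lines of plain Python. This is an illustration only, not HDFS's actual placement policy; the function and node names are made up for the example. It simply spreads the copies of a block across racks so that one rack failure cannot lose all replicas.

```python
# Illustrative sketch of HDFS-style block replication (NOT the real
# placement policy): each block is copied to `replication` distinct
# nodes, round-robining over racks for fault tolerance.
from itertools import cycle

def place_replicas(block_id, nodes_by_rack, replication=3):
    """Pick `replication` distinct nodes, spreading across racks."""
    placements = []
    racks = cycle(sorted(nodes_by_rack))  # visit racks in turn
    while len(placements) < replication:
        rack = next(racks)
        for node in nodes_by_rack[rack]:
            if node not in placements:
                placements.append(node)
                break
    return {"block": block_id, "replicas": placements}

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
print(place_replicas("blk_001", cluster))
# {'block': 'blk_001', 'replicas': ['n1', 'n3', 'n5']}
```

Notice that the three replicas land on three different racks, which is the property that keeps the data available when a whole rack goes down.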
Nokia deals with more than 500 terabytes of unstructured data and close to 100 terabytes of structured data. Nokia uses HDFS for storing all the structured and unstructured data sets as it allows processing of the stored data at a petabyte scale.
MapReduce
MapReduce is a Java-based processing system, based on a programming model originally described by Google, in which the data stored in HDFS gets processed efficiently. MapReduce is the distributed, parallel computing programming model for Hadoop.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
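The mapper/reducer decomposition described above can be shown with the classic word-count example. The sketch below runs in a single process in plain Python; real Hadoop MapReduce distributes exactly these phases (map, shuffle, reduce) across a cluster, and the function names here are illustrative only.

```python
# A minimal, single-process sketch of the MapReduce word-count pattern.
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every record.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Scaling this to thousands of machines is, as the paragraph above says, essentially a configuration change: the mapper and reducer functions stay the same while the framework partitions the input and the shuffle.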
Skybox has developed an economical image satellite system for capturing videos and images from any location on earth. Skybox uses Hadoop to analyse the large volumes of image data downloaded from the satellites. The image processing algorithms of Skybox are written in C++. Busboy, a proprietary framework of Skybox makes use of built-in code from java based MapReduce framework.
YARN (Yet Another Resource Negotiator)
YARN is Hadoop's cluster resource management layer, introduced in Hadoop 2 to take over the scheduling and resource-management duties that were previously baked into the MapReduce runtime. It is a great enabler of dynamic resource utilization on the Hadoop framework, as users can run various Hadoop applications without having to worry about increasing workloads.
YARN combines a central resource manager that reconciles the way applications use Hadoop system resources with node manager agents that monitor the processing operations of individual cluster nodes. Running on commodity hardware clusters, Hadoop has attracted particular interest as a staging area and data store for large volumes of structured and unstructured data intended for use in analytics applications. Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can’t wait for batch jobs to finish.
Yahoo has close to 40,000 nodes running Apache Hadoop, executing some 500,000 MapReduce jobs per day. YARN helped Yahoo increase the load on its most heavily used Hadoop cluster from about 80,000 to 125,000 jobs a day, an increase of more than 50%.
The above listed core components of Apache Hadoop form the basic distributed Hadoop framework. There are several other Hadoop components that form an integral part of the ecosystem, each enhancing the power of Apache Hadoop in some way: providing better integration with databases, making Hadoop faster, or adding novel features and functionality.
Here are some of the eminent Hadoop components used by enterprises extensively –
2. Data Access Components of Hadoop Ecosystem
Under this category we have Hive, Pig, HCatalog and Tez, which are explained below:
Hive is a data warehouse management and analytics system that is built for Hadoop. It was initially developed by Facebook, but soon after became an open-source project and is being used by many other companies ever since.
Apache Hive supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3 Filesystem. It provides an SQL-like language called HiveQL with schema on read and transparently converts queries to map/reduce, Apache Tez and Spark jobs.
Hive also enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore. By default, it stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used. It supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a columnar layout).
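Since HiveQL is close to standard SQL, a stand-in using Python's built-in sqlite3 can show the kind of query Hive would transparently compile into MapReduce or Tez jobs over HDFS files. The table and column names here are invented for the example, and sqlite is only a local substitute for a Hive warehouse.

```python
# Illustrative SQL, of the sort HiveQL expresses, run locally with
# sqlite3. In Hive, the same SELECT would execute as distributed jobs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "home"), ("u2", "home"), ("u1", "cart"), ("u3", "home")],
)

rows = conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC"
).fetchall()
print(rows)  # [('home', 3), ('cart', 1)]
```

The appeal of Hive is exactly this: analysts write the familiar GROUP BY, and the engine worries about turning it into parallel work.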
Hive simplifies Hadoop at Facebook with the execution of 7500+ Hive jobs daily for Ad-hoc analysis, reporting and machine learning.
Pig is a convenient tool developed by Yahoo for analysing huge data sets efficiently and easily. It provides a programming language that simplifies the common tasks of working with Hadoop, like loading data, expressing transformations on the data, and storing the final results. Pig’s built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
Similar to Hive, Pig also deals with structured data, using the Pig Latin language. It is an alternative for programmers who prefer scripting and don’t want to use Java, Python or SQL to process data. A Pig Latin program is made up of a series of operations, or transformations, applied to the input data; behind the scenes these run as MapReduce jobs to produce the output. The most outstanding feature of Pig programs is that their structure is open to considerable parallelization, making it easy to handle large data sets.
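The "series of transformations" shape of a Pig Latin script (LOAD, FILTER, FOREACH, GROUP, STORE) can be mirrored in plain Python on a toy log. This is only a sketch of the dataflow style; in Pig, each step would be expressed in Pig Latin and compiled to MapReduce work.

```python
# Pure-Python sketch of a Pig-style pipeline over a toy log.
raw_logs = [
    "2024-01-01 ERROR disk full",
    "2024-01-01 INFO started",
    "2024-01-02 ERROR timeout",
]

# FILTER: keep only error lines.
errors = [line for line in raw_logs if " ERROR " in line]

# FOREACH ... GENERATE: project out the date field.
dates = [line.split()[0] for line in errors]

# GROUP + COUNT: errors per day.
per_day = {}
for d in dates:
    per_day[d] = per_day.get(d, 0) + 1
print(per_day)  # {'2024-01-01': 1, '2024-01-02': 1}
```

Each step consumes the previous step's relation and produces a new one, which is exactly why Pig pipelines parallelize so naturally.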
An individual’s personal healthcare data is confidential and should not be exposed to others; it should be masked to maintain confidentiality. But healthcare data sets are so huge that identifying and removing personal identifiers is itself a serious challenge. Apache Pig can be used under such circumstances to de-identify health information.
Hive & Pig overlap:
You would choose Pig in scenarios where you are importing and transforming data and would like to be able to see the intermediate results.
Apache Pig is somewhat similar to Apache Hive, though some users say that it is easier to transition to Hive rather than Pig if you come from an RDBMS SQL background. However, both platforms have a place in the market. Hive is more optimised for running standard queries and is easier to pick up, whereas Pig is better for tasks that require more customisation. Where Hive is used for structured data, Pig excels at transforming semi-structured and unstructured data.
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.
HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).
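The partitioning rule just stated (one partition per distinct key value, holding every row with that value) is easy to sketch in Python. The function and field names are illustrative; HCatalog itself tracks partitions as metadata over HDFS directories rather than in memory.

```python
# Sketch of the partitioning idea behind HCatalog tables: for a given
# value of the partition key, one partition holds all rows with that
# value.
from collections import defaultdict

def partition_rows(rows, key):
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)

rows = [
    {"country": "US", "user": "a"},
    {"country": "IN", "user": "b"},
    {"country": "US", "user": "c"},
]
parts = partition_rows(rows, "country")
print(sorted(parts))     # ['IN', 'US']
print(len(parts["US"]))  # 2
```

Queries that filter on the partition key can then skip every partition whose key value does not match, which is the main performance benefit of partitioned tables.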
Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. It allows a complex directed acyclic graph (DAG) of tasks to process data, and it is built on top of YARN as a substitute for MapReduce in some scenarios.
Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. It is not meant directly for end-users – in fact it enables developers to build end-user applications with much better performance and flexibility.
However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.
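The dataflow-graph idea behind Tez can be illustrated with a toy DAG runner: tasks declare their upstream dependencies and run once those are done, rather than being forced into a rigid map-then-reduce chain. This is a simplified topological execution in plain Python, not Tez's actual scheduler, and all names are invented.

```python
# Toy DAG execution in dependency order (illustrative only).
def run_dag(tasks, deps):
    """tasks: name -> fn(list_of_upstream_outputs); deps: name -> upstream names."""
    done, order = {}, []
    while len(done) < len(tasks):
        for name, fn in tasks.items():
            if name not in done and all(d in done for d in deps.get(name, [])):
                done[name] = fn([done[d] for d in deps.get(name, [])])
                order.append(name)
    return done, order

tasks = {
    "load":   lambda _:   [1, 2, 3, 4],
    "square": lambda ins: [x * x for x in ins[0]],
    "total":  lambda ins: sum(ins[0]),
}
deps = {"square": ["load"], "total": ["square"]}
results, order = run_dag(tasks, deps)
print(results["total"], order)  # 30 ['load', 'square', 'total']
```

Because the graph is explicit, an engine like Tez can fuse stages and avoid writing intermediate results to disk between every pair of steps, which is where much of its speed-up over plain MapReduce comes from.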
3. Data Storage Component of Hadoop Ecosystem
HBase and Cassandra come under this category.
HBase (Hadoop Database) is a non-relational (NoSQL) database that runs on top of HDFS. Based on Google’s Bigtable, it was created for large tables with billions of rows and millions of columns, offering fault tolerance and horizontal scalability. Hadoop itself can perform only batch processing, with data accessed in a sequential manner; for random access to huge data sets, HBase is used.
HBase is a column-oriented database that uses HDFS for underlying storage of data. HBase supports random reads and also batch computations using MapReduce. With the HBase NoSQL database, an enterprise can create large tables with millions of rows and columns on commodity hardware. HBase is the best fit when there is a requirement for random read or write access to big datasets.
Facebook is one of the largest users of HBase, having built its messaging platform on top of HBase in 2010. HBase is also used by Facebook for streaming data analysis, its internal monitoring system, the Nearby Friends feature, search indexing and scraping data for its internal data warehouses.
Apache Cassandra is a column-oriented NoSQL data store, and it is the right choice when you need scalability and high availability without compromising performance.
It offers capabilities that relational databases and other NoSQL databases simply cannot match, such as continuous availability, linear-scale performance, operational simplicity and easy data distribution across multiple datacenters.
Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Its support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for de-normalization and materialized views, and powerful built-in caching.
4. Data Interaction-Visualization-Execution-Development
Apache Spark is actually complementary to Hadoop. It is a lightning-fast cluster computing technology, designed for fast computation, that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark Core is the foundation of the overall project. Spark combines SQL, streaming and complex analytics seamlessly in the same application to handle a wide range of data processing scenarios. It runs on top of Hadoop or Mesos, standalone, or in the cloud, and can access diverse data sources such as HDFS, Cassandra, HBase, or S3. Spark takes the MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing.
The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
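The RDD-style programming model that makes this possible can be mimicked with a toy class in plain Python. This sketch is eager and single-machine, so it illustrates only the shape of the API (chained transformations over data held in memory), not Spark itself; the class and method names mirror Spark's but the implementation is invented.

```python
# Toy, single-machine imitation of Spark's RDD-style chained API.
class ToyRDD:
    def __init__(self, data):
        self._data = list(data)  # intermediate results stay in memory

    def map(self, fn):
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    def reduce(self, fn):
        result = self._data[0]
        for x in self._data[1:]:
            result = fn(result, x)
        return result

total = (ToyRDD(range(1, 6))
         .map(lambda x: x * 2)        # 2, 4, 6, 8, 10
         .filter(lambda x: x > 4)     # 6, 8, 10
         .reduce(lambda a, b: a + b))
print(total)  # 24
```

Because each transformation's result stays in memory rather than being written back to disk between stages, iterative algorithms that revisit the same data repeatedly run far faster than they would as a chain of MapReduce jobs.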
5. Data Intelligence Components of Hadoop Ecosystem
Apache Mahout is a platform for running scalable machine learning algorithms that leverages Hadoop’s distributed computing. It is an open-source machine learning library written in Java. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. In its current incarnation, these scalable machine learning implementations are written in Java, and some portions are built upon the Apache Hadoop distributed computation project.
It is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable; that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion.
Every organization’s data are diverse and particular to their needs. However, there is much less diversity in the kinds of analyses performed on that data. The Mahout project is a library of Hadoop implementations of common analytical computations. Use cases include user collaborative filtering, user recommendations, clustering and classification.
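One of the use cases listed above, collaborative filtering for user recommendations, can be sketched at toy scale. This is only the spirit of user-based collaborative filtering, with a deliberately crude similarity measure (count of commonly liked items); Mahout's real implementations use proper similarity metrics and run distributed.

```python
# Tiny user-based recommendation sketch (illustrative, not Mahout).
def most_similar_user(target, others):
    # Crude similarity: number of commonly liked items.
    return max(others, key=lambda u: len(others[u] & target))

def recommend(user_items, target_user):
    target = user_items[target_user]
    others = {u: s for u, s in user_items.items() if u != target_user}
    neighbour = most_similar_user(target, others)
    # Recommend what the most similar user likes that the target doesn't.
    return sorted(others[neighbour] - target)

likes = {
    "alice": {"hadoop", "hive", "pig"},
    "bob":   {"hadoop", "hive", "spark"},
    "carol": {"cassandra"},
}
print(recommend(likes, "alice"))  # ['spark']
```

The point of Mahout is that this same "find similar users, recommend their items" computation still works when the likes table has billions of rows, because the library expresses it as distributed jobs.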
6. Data Serialization Components of Hadoop Ecosystem
Thrift and Avro, which are meant for serialization, come under this category.
Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code that can be used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.
Thrift includes a complete stack for creating clients and servers. The top of the stack is the code generated from the Thrift definition file: the services declared there yield client and processor code, and the user-defined data structures it declares are likewise serialized and sent by the generated code, just as the built-in types are.
Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project: a system for efficient, cross-language RPC and persistent data storage. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro.
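The core idea, a JSON-described schema travelling with compactly encoded binary data, can be imitated with `json` and `struct` from the Python standard library. The encoding below is NOT the real Avro wire format and the record layout is invented; it only shows why a fixed, schema-driven binary body is so much smaller than the equivalent JSON text.

```python
# Schema-with-binary-data sketch (NOT real Avro encoding).
import json
import struct

# The "schema" is declared as JSON, as Avro schemas are.
schema = {"name": "reading",
          "fields": [("sensor_id", "int"), ("temp", "float")]}

def encode(record):
    # Fixed-layout binary body: a 4-byte int plus a 4-byte float.
    return struct.pack("<if", record["sensor_id"], record["temp"])

def decode(blob):
    sensor_id, temp = struct.unpack("<if", blob)
    return {"sensor_id": sensor_id, "temp": round(temp, 2)}

record = {"sensor_id": 7, "temp": 21.5}
blob = encode(record)
print(len(blob), "bytes binary vs", len(json.dumps(record)), "bytes JSON")
print(decode(blob))
```

Because the reader already knows the schema, none of the field names need to be repeated in every record, which is the essence of Avro's compactness.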
7. Data Integration Components of Hadoop Ecosystem
This category includes Sqoop, Flume and Chukwa.
Sqoop: SQL + HADOOP = SQOOP
Apache Sqoop handles Hadoop data movement involving both relational and non-relational data sources. It provides bi-directional data transfer between Hadoop HDFS and your favourite relational database. For example, you might be storing your application data in a relational store such as Oracle; when you want to scale your application with Hadoop, you can migrate the Oracle data into HDFS using Sqoop.
This component is used for importing data from external sources into Hadoop components like HDFS, HBase or Hive. It can also be used for exporting data from Hadoop to other external structured data stores. Sqoop parallelizes data transfer, mitigates excessive loads, enables efficient data imports and analysis, and copies data quickly.
When we import structured data from an RDBMS table into HDFS, a file is created in HDFS that we can process either with a MapReduce program directly or with Hive or Pig. Similarly, after processing data in HDFS, we can store the processed structured data back into another RDBMS table by exporting it through Sqoop.
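What a Sqoop import does conceptually can be sketched as: read rows from a relational table, write them out as delimited text of the kind MapReduce, Hive or Pig would then consume from HDFS. Here sqlite3 and the table name stand in for a real RDBMS, and this is not Sqoop's code, only the shape of the data movement.

```python
# Conceptual sketch of a Sqoop-style table import (illustrative only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "asha"), (2, "ravi")])

# "Import": dump each row as a comma-delimited line, the way Sqoop
# would write part files into HDFS for downstream jobs.
lines = [",".join(str(col) for col in row)
         for row in conn.execute("SELECT id, name FROM customers ORDER BY id")]
print(lines)  # ['1,asha', '2,ravi']
```

Sqoop's value-add over this trivial loop is that it splits the SELECT across many parallel mappers, so a multi-terabyte table can be moved in a fraction of the time.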
The online marketer Coupons.com uses the Sqoop component of the Hadoop ecosystem to transfer data between Hadoop and the IBM Netezza data warehouse, piping the results back into Hadoop with Sqoop.
Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability and many failover and recovery mechanisms, and it uses a simple, extensible data model that allows for online analytic applications.
Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive, or they can feed business dashboards that Apache HBase serves with ongoing data.
A Twitter source, for example, connects through the streaming API and continuously downloads tweets (called events). These tweets are converted into JSON format and sent to downstream Flume sinks for further analysis of tweets and retweets, for example to engage users on Twitter.
Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. It is an open-source data collection system for monitoring large distributed systems. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework, and inherits Hadoop’s scalability and robustness. It also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, in order to make the best use of the collected data.
It aims to provide a flexible and powerful platform for distributed data collection and rapid data processing. The goal is to produce a system that’s usable today, but that can be modified to take advantage of newer storage technologies (HDFS appends, HBase, etc.) as they mature. In order to maintain this flexibility, Chukwa is structured as a pipeline of collection and processing stages, with clean and narrow interfaces between stages. This will facilitate future innovation without breaking existing code.
8. Monitoring, Management and Orchestration components of Hadoop Ecosystem
Apache Oozie is a server-based workflow scheduling system for managing Hadoop jobs. It is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:
- Workflow definitions
- Currently running workflow instances, including instance states and variables
Oozie is a workflow engine specialized in running workflow jobs whose actions execute Hadoop MapReduce and Pig jobs. Hadoop deals with big data, and a programmer often wants to run many jobs in sequence: the output of job A becomes the input to job B, the output of job B is the input to job C, and the final result is the output of job C. To automate this sequence we need a workflow, and an engine to execute it; that is what Oozie provides.
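The job-A to job-B to job-C chaining just described can be sketched as a plain-Python workflow runner, in which each "action" consumes the previous action's output, the way an Oozie workflow wires Hadoop jobs together. The action names and logic are invented for illustration; real Oozie workflows are XML definitions executed on the cluster.

```python
# Toy workflow runner chaining actions, Oozie-style (illustrative).
def run_workflow(actions, initial_input):
    data = initial_input
    for name, action in actions:
        data = action(data)  # output of one job feeds the next
    return data

workflow = [
    ("job_a", lambda text: text.split()),            # tokenize
    ("job_b", lambda words: [w.upper() for w in words]),
    ("job_c", lambda words: ",".join(words)),        # final output
]
print(run_workflow(workflow, "oozie chains jobs"))  # OOZIE,CHAINS,JOBS
```

On top of simple chains like this, Oozie also supports forks, joins and decision nodes, plus time- and data-triggered coordinators, which is what makes it a scheduler rather than just a script.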
The American video game publisher Riot Games uses Hadoop and the open source tool Oozie to understand the player experience.
Zookeeper provides centralized services for Hadoop cluster configuration management, synchronization and group services. Think of how a global configuration file works in a web application: Zookeeper is like that configuration file, but at a much higher level. It provides primitives such as distributed locks that can be used for building highly scalable applications and for managing synchronization across the cluster.
With a growing family of services running as part of a Hadoop cluster, there’s a need for coordination and naming services. As computing nodes can come and go, members of the cluster need to synchronize with each other, know where to access services, and know how they should be configured. This is the purpose of Zookeeper.
Ambari is one of the most useful tools you’ll use if you’re administering a Hadoop cluster: it allows administrators to install, manage and monitor Hadoop clusters through a simple web interface, and it provides an easy-to-follow wizard for setting up a Hadoop cluster of any size.
Ambari makes Hadoop management simpler by providing a consistent, secure platform for operational control. Ambari provides an intuitive Web UI as well as a robust REST API, which is particularly useful for automating cluster operations. With Ambari, Hadoop operators get the following core benefits:
- Simplified Installation, Configuration and Management
- Centralized Security Setup
- Full Visibility into Cluster Health
- Highly Extensible and Customizable
By now, you can see that Hadoop is a very rich ecosystem. Hadoop has gained its popularity through its ability to store, analyze and access large amounts of data quickly and cost-effectively on clusters of commodity hardware. It would not be wrong to say that Apache Hadoop is really a collection of several components rather than a single product.
Thanks for reading.