Hadoop for distributed storage and processing of big data

Before starting with Hadoop, let’s first understand big data. As new technologies and devices emerge, the amount of data produced by mankind grows rapidly every year.

Big data refers to large datasets of both structured and unstructured data that are difficult to process using traditional databases. It includes the data produced by devices and applications such as flight black boxes, social media, search engines, and power grids. Big data technology seeks to transform all this raw data into meaningful and actionable insights for the enterprise. Its key characteristics are the “three V’s”: volume (size), velocity (speed), and variety (type). Apache Hadoop is one of the frameworks used to store and process this big data.

Apache Hadoop is a framework for scalable and reliable distributed data storage and processing. It allows for the processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from a single server to thousands of machines, aggregating the local computation and storage of each server. Rather than relying on expensive hardware, the Hadoop software detects and handles any failures that occur, allowing it to achieve high availability on top of inexpensive commodity computers, each individually prone to failure.

The Hadoop framework includes the following four modules:

  • Hadoop Common
  • HDFS
  • MapReduce
  • YARN

Hadoop Common

These are Java libraries and utilities required by the other Hadoop modules. They provide filesystem- and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.

HDFS

The Hadoop Distributed File System is a scalable and reliable distributed storage system that aggregates the storage of every node in a Hadoop cluster into a single global file system. HDFS stores individual files in large blocks, allowing it to efficiently store very large or numerous files across multiple machines and access individual chunks of data in parallel, without needing to read the entire file into a single computer’s memory. Reliability is achieved by replicating the data across multiple hosts, with each block of data being stored, by default, on three separate computers. If an individual node fails, the data remains available and an additional copy of any blocks it holds may be made on new machines to protect against future failures. This approach allows HDFS to dependably store massive amounts of data.
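
To make this concrete, here is a minimal sketch of writing a file to HDFS and asking the NameNode where its blocks live, using Hadoop’s Java FileSystem API. The NameNode address hdfs://namenode:9000 and the path /user/demo/sample.txt are illustrative assumptions, not values from this article.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; substitute your cluster's fs.defaultFS.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS transparently splits large files into blocks.
            Path file = new Path("/user/demo/sample.txt"); // illustrative path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Ask the NameNode which DataNodes hold each block of the file.
            FileStatus status = fs.getFileStatus(file);
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("block stored on: " + String.join(", ", block.getHosts()));
            }
        }
    }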

MapReduce

MapReduce is the programming model that allows Hadoop to efficiently process large amounts of data. MapReduce breaks large data processing problems into multiple steps, namely a set of Maps and Reduces that can each be worked on at the same time (in parallel) on multiple computers in a reliable, fault-tolerant manner. MapReduce is designed to work with data stored in HDFS. It has two tasks, namely:

The Map Task: This is the first task, which takes the input data and converts it into a set of intermediate data, where individual elements are broken down into tuples (key/value pairs).

The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples.
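
The standard word-count example (not from this article, but the usual illustration of the model) shows both tasks using the org.apache.hadoop.mapreduce API: the mapper emits a (word, 1) tuple for each word, and the reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map task: break each input line into words and emit (word, 1) tuples.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: combine all counts for one word into a smaller set of tuples.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }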

The MapReduce architecture is shown in the figure below.

[Figure: MapReduce architecture]

YARN

YARN is an acronym for Yet Another Resource Negotiator. It provides a resource-management framework for any processing engine, including MapReduce, and allows new processing frameworks to harness Hadoop’s distributed scale. YARN transforms Hadoop from a single-application into a multi-application data system.
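
As a sketch of where an application meets YARN, here is a driver for the word-count classes above. Assuming the cluster’s mapreduce.framework.name property is set to yarn, submitting this job hands resource allocation to YARN’s ResourceManager, which schedules the map and reduce tasks in containers across the cluster. The input and output paths are passed as command-line arguments.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // On a YARN cluster, the ResourceManager allocates containers
            // for the tasks; the client simply waits for completion.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }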

Advantages of Hadoop:

Let’s have a look at some of the advantages of Hadoop.

  • Cost effective

    Hadoop also offers a cost-effective storage solution for businesses’ exploding data sets. Since it uses commodity hardware to store the data, the cost is lower than that of a traditional system.

  • Scalable

    Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Hadoop enables businesses to run applications on thousands of nodes involving thousands of terabytes of data.

  • Flexible

    Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. Hadoop can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis, and fraud detection.

  • Fast

    Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’ data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing.

  • Resilient to failure

    A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of a failure there is another copy available for use (see the sketch after this list).
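
As a small illustration of that replication, the sketch below reads a file’s replication factor (3 by default) and raises it for an especially important file. The file path is again an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/sample.txt"); // illustrative path

            // Each block of this file is stored on this many DataNodes (default 3).
            FileStatus status = fs.getFileStatus(file);
            System.out.println("replication factor: " + status.getReplication());

            // Ask HDFS to keep two extra copies of an especially important file;
            // the NameNode creates the additional replicas in the background.
            fs.setReplication(file, (short) 5);
        }
    }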
