This blog provides a brief introduction to HBase. So let’s get started!!
Since the 1970s, relational database management systems (RDBMS) have dominated the data landscape. But as businesses collect, store and process more and more data, relational databases are harder and harder to scale. HBase is a database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns. It is designed to run on a cluster of commodity servers and to automatically scale as more servers are added, while retaining the same performance. It is fault tolerant precisely because data is divided across servers in the cluster and stored in a redundant file system such as the Hadoop Distributed File System (HDFS).
When it comes to data storage, we first think of relational databases, with their structured data storage and sophisticated query engines. But a relational database has a drawback: improving its performance becomes increasingly costly as the data size grows.
HBase, on the other hand, is designed to provide scalability and partitioning to enable efficient data structure serialization, storage and retrieval.
The main differences between Relational Databases and HBase are as follows:
| Relational Database | HBase |
|---|---|
| Based on a fixed schema. | Schema-less. |
| Row-oriented data store. | Column-oriented data store. |
| Designed to store normalized data. | Designed to store denormalized data. |
| No built-in support for partitioning. | Supports automatic partitioning. |
Where to Use HBase??
- Apache HBase is used when you need random, real-time read/write access to Big Data.
- It hosts very large tables on top of clusters of commodity hardware.
- Apache HBase is a non-relational database modelled after Google’s Bigtable. Just as Bigtable runs on top of the Google File System, Apache HBase runs on top of Hadoop and HDFS.
Features of HBase:
HBase is a key/value store. Specifically it is a Sparse, Consistent, Distributed, Multidimensional and Sorted map.
- Sparse: HBase stores key -> value mappings, and a “row” is nothing more than a grouping of these mappings that share the same row key. Unlike NULL in most relational databases, absent information needs no storage: there is simply no cell for a column that has no value. It also means that every value carries all its coordinates with it.
- Multi-dimensional: The key itself has structure. Each key consists of the following parts: row key, column family, column, and timestamp. The row key and value are just bytes, so you can store anything in a cell that you can serialize into a byte array.
- Consistent: All the changes with the same rowkey are atomic. A reader will always read the last written and committed values.
- Distributed: A key feature of HBase is that the data can be spread over hundreds or thousands of machines and reach billions of cells. HBase manages the load balancing automatically.
- Sorted: These cells are sorted by the key. This is a very important property as it allows for searching, rather than just retrieving a value for a known key.
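To make the "sparse, multidimensional, sorted map" idea concrete, here is a minimal in-memory sketch in Python. It is not the HBase API: the names `CellKey`, `sort_key` and `latest`, and the sample rows, are made up for illustration. It assumes cells are ordered by row key, column family and column in ascending order, with the newest timestamp first.

```python
from typing import Dict, List, Optional, Tuple

# A cell key carries all of its coordinates:
# (row key, column family, column, timestamp).
CellKey = Tuple[bytes, bytes, bytes, int]


def sort_key(key: CellKey):
    """Order by row, family and column ascending, then newest timestamp first."""
    row, family, column, timestamp = key
    return (row, family, column, -timestamp)


# Sparse: only cells that actually hold a value exist.
# "user2" has no email cell at all, rather than a stored NULL.
cells: Dict[CellKey, bytes] = {
    (b"user2", b"info", b"name", 1): b"Bob",
    (b"user1", b"info", b"name", 1): b"Alice",
    (b"user1", b"info", b"email", 2): b"alice@new.example",
    (b"user1", b"info", b"email", 1): b"alice@old.example",
}

# Sorted: all cells in key order, whatever order they were written in.
sorted_keys: List[CellKey] = sorted(cells, key=sort_key)


def latest(row: bytes, family: bytes, column: bytes) -> Optional[bytes]:
    """Return the newest value for a column, or None if no cell exists."""
    for key in sorted_keys:
        if key[:3] == (row, family, column):
            return cells[key]  # first match is the newest version
    return None
```

Because the keys are sorted, all cells of one row sit next to each other, which is what makes range scans possible.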
HBase partitions the key space into tables. Each table declares one or more column families, and each column family defines the storage properties for an arbitrary set of columns.
The principal operations supported by HBase are Put (add some data), Delete (“delete” some data), Scan (retrieve some cells) and Get (which is just a special case of Scan).
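The four operations can be sketched with a toy, single-process table in Python. This is an illustration only, not the HBase client API: `TinyTable` and its method signatures are hypothetical, a column is a single byte string, and versions are ignored. It shows two points from the text: a Delete writes a marker rather than erasing data in place, and a Get is a Scan over a one-row range.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple


class TinyTable:
    """Toy table with rows kept sorted by row key (hypothetical, not HBase)."""

    def __init__(self) -> None:
        self._row_keys: List[bytes] = []  # sorted row keys
        self._rows: Dict[bytes, Dict[bytes, Optional[bytes]]] = {}

    def put(self, row: bytes, column: bytes, value: Optional[bytes]) -> None:
        if row not in self._rows:
            bisect.insort(self._row_keys, row)  # keep row keys sorted
            self._rows[row] = {}
        self._rows[row][column] = value

    def delete(self, row: bytes, column: bytes) -> None:
        # "Delete" stores a marker (None here) that hides the cell from reads.
        self.put(row, column, None)

    def scan(
        self, start_row: bytes = b"", stop_row: Optional[bytes] = None
    ) -> Iterator[Tuple[bytes, Dict[bytes, bytes]]]:
        """Yield (row, live cells) for start_row <= row < stop_row, in key order."""
        first = bisect.bisect_left(self._row_keys, start_row)
        for row in self._row_keys[first:]:
            if stop_row is not None and row >= stop_row:
                break
            live = {c: v for c, v in self._rows[row].items() if v is not None}
            if live:  # rows whose cells are all deleted are skipped
                yield row, live

    def get(self, row: bytes) -> Dict[bytes, bytes]:
        # Get as a special case of Scan: row + b"\x00" is the smallest key
        # greater than row, so the range contains exactly one row key.
        for _, cells in self.scan(row, row + b"\x00"):
            return cells
        return {}


table = TinyTable()
table.put(b"row-b", b"cf:q", b"2")
table.put(b"row-a", b"cf:q", b"1")
table.put(b"row-c", b"cf:q", b"3")
table.delete(b"row-c", b"cf:q")
```

After these calls a full scan returns `row-a` and `row-b` in key order, and `row-c` no longer shows up because its only cell is hidden by the delete marker.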
HBase is a non-relational, strongly consistent, distributed key-value store with automatic data versioning. It scales horizontally by adding servers to a cluster, and it provides fault tolerance so that data is not lost when servers fail.
Architectural Overview of HBase:
HBase is a distributed database, meaning it is designed to run on a cluster of dozens to possibly thousands or more servers. As a result it is more complicated to install than a single RDBMS running on a single server.
The typical problems of distributed computing such as coordination and management of remote processes, locking, data distribution, network latency and number of round trips between servers have to be resolved.
HBase makes use of several other mature technologies, such as Apache Hadoop and Apache ZooKeeper, to solve many of these issues.
Figure 1 shows the major architectural components in HBase.
Components of HBase Architecture:
- HMaster: A cluster has one active HBase master (HMaster) and multiple region servers. HBase can also run with several masters, in which case exactly one is active and the others stand by. The HMaster assigns regions to each HRegionServer when HBase starts, and it handles administrative operations on tables, such as creating, altering and deleting them, along with the related coordination activities. The HMaster also maintains the table metadata. Reads and writes of individual rows are served by the region servers rather than by the HMaster.
- Regions: HBase tables are partitioned into multiple regions with each region storing a range of the table’s rows, and multiple regions are assigned by the master to a region server.
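- Region Server: A region server is the worker process that serves reads and writes for the regions assigned to it. In a typical deployment it runs alongside an HDFS data node. When a Region Server (RS) receives a write request, it directs the request to the specific Region that owns the row. Each Region stores a set of rows, and the data of a row can be separated into multiple column families (CFs).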
- MemStore and HFile: Regions contain an in-memory data store (MemStore) and a persistent data store (HFile).
The MemStore is a write buffer: it holds the most recent writes to a region in memory, sorted by key, until they are flushed to disk as a new HFile. Reads consult the MemStore as well as the HFiles, so freshly written data is visible immediately. Because the MemStore lives in the region server's memory, its contents are also recorded in the write-ahead log so that they survive a crash.
HFiles form the lowest level of HBase's architecture. They are the storage files in which HBase keeps its data on disk, in sorted order, so that it can be read back quickly and efficiently.
- WAL: All the regions on a region server share a reference to the write-ahead log (WAL) which is used to store new data that hasn’t yet been persisted to permanent storage and to recover from region server crashes.
- ZooKeeper: HBase utilizes ZooKeeper (a distributed coordination service) to manage region assignments to region servers, and to recover from region server crashes by loading the crashed region server’s regions onto other functioning region servers.
- HDFS: The HDFS component is the Hadoop Distributed File System, a distributed, fault-tolerant and scalable file system which guards against data loss by dividing files into blocks and replicating them across the cluster; it is where HBase actually stores data.
- Java APIs: Clients interact with HBase via one of several available APIs, including a native Java API.
- External APIs: Clients can also interact with HBase via a REST-based interface and an Apache Thrift RPC gateway. Older releases also included an Apache Avro gateway.
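The write path through these components can be sketched as a toy model in Python: the write is appended to the WAL, routed to the region that owns the row key, buffered in that region's MemStore, and flushed to a sorted file once the buffer is full. This is a simplified illustration under stated assumptions, not HBase code. The class names, the per-region `flush_threshold` counted in cells, and the use of Python lists as "HFiles" are all hypothetical, and column families, timestamps and HDFS are left out.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class Region:
    """Toy region serving rows from start_key up to the next region's start key."""

    def __init__(self, start_key: bytes, flush_threshold: int = 2) -> None:
        self.start_key = start_key
        self.flush_threshold = flush_threshold  # hypothetical: flush after N cells
        self.memstore: Dict[bytes, bytes] = {}  # in-memory write buffer
        self.hfiles: List[List[Tuple[bytes, bytes]]] = []  # immutable sorted "files"

    def flush(self) -> None:
        # Persist the buffer as one sorted, immutable file, then clear it.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}


class RegionServer:
    def __init__(self, start_keys: List[bytes]) -> None:
        if b"" not in start_keys:
            raise ValueError("the first region must start at the empty key")
        self.regions = [Region(key) for key in sorted(start_keys)]
        self.wal: List[Tuple[bytes, bytes]] = []  # one log shared by all regions

    def region_for(self, row: bytes) -> Region:
        # The owning region is the last one whose start key is <= row.
        keys = [region.start_key for region in self.regions]
        return self.regions[bisect.bisect_right(keys, row) - 1]

    def put(self, row: bytes, value: bytes) -> None:
        self.wal.append((row, value))   # 1. log first, for crash recovery
        region = self.region_for(row)   # 2. route to the owning region
        region.memstore[row] = value    # 3. buffer in memory
        if len(region.memstore) >= region.flush_threshold:
            region.flush()              # 4. persist as a new sorted file

    def get(self, row: bytes) -> Optional[bytes]:
        region = self.region_for(row)
        if row in region.memstore:      # newest data lives in the MemStore
            return region.memstore[row]
        for hfile in reversed(region.hfiles):  # then newest file first
            for key, value in hfile:
                if key == row:
                    return value
        return None


server = RegionServer([b"", b"m"])
server.put(b"apple", b"1")
server.put(b"banana", b"2")  # second cell in the first region triggers a flush
server.put(b"zebra", b"3")   # routed to the region starting at b"m"
```

In this run the first region ends up with one flushed file holding `apple` and `banana`, while `zebra` is still only in the second region's MemStore and in the WAL.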
How the Components Work Together:
- As shown in figure 2, the region servers and the active HMaster each connect to ZooKeeper with a session. Each Region Server creates an ephemeral node.
- ZooKeeper is used to coordinate shared state information for members of distributed systems. It maintains the ephemeral nodes of active sessions via heartbeats.
- The HMaster monitors these nodes to discover available region servers and to detect region server failures.
- Each HMaster tries to create an ephemeral node of its own. ZooKeeper determines which one was first and uses it to make sure that only one master is active. The active HMaster sends heartbeats to ZooKeeper, and the inactive HMaster listens for notifications of the active HMaster's failure.
- If a region server or the active HMaster fails to send a heartbeat, its session expires and the corresponding ephemeral node is deleted.
- The active HMaster listens for region server failures and recovers the affected regions. The inactive HMaster listens for active HMaster failure, and if the active HMaster fails, the inactive HMaster becomes active.
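The heartbeat and ephemeral-node mechanics above can be mimicked with a small in-memory model in Python. This is only a sketch of the idea and does not use the ZooKeeper API: `ToyCoordinator`, the node paths `/master` and `/rs/...`, the server names and the three-second timeout are all hypothetical, and time is passed in explicitly instead of being read from a clock.

```python
from typing import Dict, List, Optional


class ToyCoordinator:
    """In-memory stand-in for ZooKeeper sessions and ephemeral nodes."""

    def __init__(self, session_timeout: float) -> None:
        self.session_timeout = session_timeout
        self.last_heartbeat: Dict[str, float] = {}  # session owner -> last heartbeat
        self.ephemeral: Dict[str, str] = {}         # node path -> owning session

    def heartbeat(self, owner: str, now: float) -> None:
        self.last_heartbeat[owner] = now

    def create_ephemeral(self, path: str, owner: str) -> bool:
        """Create the node if it does not exist yet; only the first creator wins."""
        if path in self.ephemeral:
            return False
        self.ephemeral[path] = owner
        return True

    def expire_sessions(self, now: float) -> None:
        """Drop sessions with overdue heartbeats and delete their ephemeral nodes."""
        expired = {
            owner
            for owner, seen in self.last_heartbeat.items()
            if now - seen > self.session_timeout
        }
        for owner in expired:
            del self.last_heartbeat[owner]
        self.ephemeral = {
            path: owner for path, owner in self.ephemeral.items() if owner not in expired
        }

    def live_region_servers(self) -> List[str]:
        return sorted(o for p, o in self.ephemeral.items() if p.startswith("/rs/"))

    def active_master(self) -> Optional[str]:
        return self.ephemeral.get("/master")


zk = ToyCoordinator(session_timeout=3.0)
for name in ("master-1", "master-2", "rs-1", "rs-2"):
    zk.heartbeat(name, now=0.0)
zk.create_ephemeral("/rs/rs-1", "rs-1")
zk.create_ephemeral("/rs/rs-2", "rs-2")
first = zk.create_ephemeral("/master", "master-1")   # first creator becomes active
second = zk.create_ephemeral("/master", "master-2")  # loses and stays on standby

# By t=5 only master-2 and rs-2 have kept sending heartbeats.
zk.heartbeat("master-2", now=5.0)
zk.heartbeat("rs-2", now=5.0)
zk.expire_sessions(now=5.0)

# The standby master sees the master node is gone and takes over.
took_over = zk.create_ephemeral("/master", "master-2")
```

In a real cluster the standby master and the active master are notified of node deletions rather than polling, but the outcome is the same: the nodes of failed processes disappear, and whoever is watching them reacts.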
HBase is a new way of thinking about data storage and processing. The SQL-like process of extracting and transforming the data in a monolithic system is replaced with a divide-and-conquer approach, in which the database supports Create, Read, Update, Delete (CRUD) operations, while complex transformations are delegated to external components designed for parallel processing.
HBase lives on top of the Hadoop Distributed File System (HDFS). It is a distributed, column-oriented database and uses HDFS for storage.
It provides real-time, random read/write access to huge datasets. HBase can be used from many languages, including Ruby, Python, C#, R and Java, and its shell can run HBase script files.