HBase Fundamentals

Sai Geetha M N
Jun 3, 2021

HBase is a NoSQL DB that uses some capabilities of the Hadoop ecosystem to provide its features.

NoSQL DBs (a.k.a Not Only SQL) are non-tabular stores that store data very differently from Relational Databases. The main types are document DBs, key-value stores, columnar DBs and Graph DBs. They provide flexible schemas and high levels of scalability.

In this article, I introduce HBase and its associated concepts - its basic features, the data model and the various ways to access HBase. Also, I talk about some criteria that need to be considered while deciding whether HBase suits your use case or not.

In a subsequent article, I will get into the architecture, which will equip you with a stronger understanding of HBase and empower you to make the right choices and good designs.

We all agree that with greater power comes greater responsibility, and this is true when you deal with NoSQL databases as well.

What is HBase?

A very basic definition of HBase is that it is a distributed database management system that runs on top of Hadoop. Some of its important characteristics are:

Distributed - as it is stored on HDFS
Scalable - it scales with the number of nodes in the cluster, and that could be any number
Fault-Tolerant - as it relies on the fault-tolerant capability of HDFS
Low-Latency - it provides real-time access to read and update (using row keys)
Structured - it supports a loose data structure that allows flexibility while retaining the advantages of structured data

It is a columnar database that allows ACID compliance at a row level. Each of these features will be understood as we go along.

Why HBase?

As we saw, HBase is a NoSQL DB that works on top of HDFS, the file system that is part of the Hadoop ecosystem. Why do we need an extra database on Hadoop when we already have engines like Hive that serve the purpose of querying Hadoop data?

You can even access data on Hadoop using Spark SQL, Spark, map-reduce jobs etc. But there are some inherent limitations of data stored on HDFS.

HBase was designed to use the HDFS storage but with the main focus of overcoming its limitations, such as:

No Random Access
High Latency
Not ACID Compliant
No updates supported
Totally unstructured data, i.e. HDFS does not impose any structure on the data that is ingested into it.

HBase supports fast random access, serves reads with very low latency, is ACID-compliant at a row level, supports updates and allows for a flexible structure. It brings in some level of structure but does not impose it like relational databases. Isn't it fantastic that all the above limitations have been overcome in HBase?

So it is often used as a complementary tool with other tools on the Hadoop stack when you need the above set of features.

HBase as a Columnar Store

HBase falls in the category of columnar databases among the NoSQL databases. Let us understand why.

It fundamentally stores data in a columnar way. This is best understood by taking an example of data in a relational DB and seeing how it gets stored in HBase.

A relational DB has a fixed schema and supports a 2-dimensional data model with rows and columns as shown here:

Here the attributes of a particular entity i.e. employee, are predefined as columns. Irrespective of whether you have a value for that attribute or not, the column exists and the value may be null or not.

HBase stores the same data very differently. Irrespective of how many attributes of data you have for a specific entity, it stores each attribute in a row of its own, so the data flows down into a columnar structure. It stores only 3 fields for every attribute, as shown here (actually 5 once the column family and timestamp are counted, which we will see later):

i.e. a Unique Id, Name of the Column (a.k.a column qualifier) and Value.

If you notice the way data is stored, you can keep adding any number of column names and associated values for each id. There is no compulsion to have the same column names for every id or the same number of columns for every id.
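As a rough illustration of this idea, here is a small Python sketch that flattens relational-style rows into (id, column name, value) entries and simply skips attributes that have no value. It is a toy model in plain Python, not the HBase API, and the sample rows and the helper name `to_cells` are made up for the example.

```python
# Toy illustration only: plain Python, not the HBase API.
# Hypothetical relational rows; None marks an attribute with no value.
relational_rows = [
    {"id": "1", "name": "Asha", "department": "Sales", "phone": None},
    {"id": "2", "name": "Ravi", "department": None, "phone": "555-0100"},
]


def to_cells(rows, id_field="id"):
    """Flatten rows into (id, column name, value) triples, skipping empty attributes."""
    cells = []
    for row in rows:
        for column, value in row.items():
            if column == id_field or value is None:
                continue  # nothing is stored for a missing value
            cells.append((row[id_field], column, value))
    return cells


cells = to_cells(relational_rows)
for cell in cells:
    print(cell)
```

Each id ends up with only the columns it actually has, so the two ids here carry different sets of column names and no entry exists for the missing ones.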

From this you can already see two advantages:

1. Wide-column sparse tables are possible, without wastage of space:

Since you do not have to have an entry for a column that has no value for an entity, you can support wide-column tables which are very sparse in nature. Every entity will have only those columns and values that exist - stored. No space is wasted for the innumerable columns that have no values.

Some entities may have hundreds of columns with values while some may have just a few. Consider the example of patient data. Healthy patients may have very few attributes with values, while those with a disease may have a lot of columns that have values. HBase is a fantastic store for this type of use case.

2. Dynamic attributes: This means that each entity can have a unique set of columns totally different from the other entities. Considering the same patient data, some may have parameters related to heart disease while some may have parameters related to say cancer. In both cases, the column names may be very different. This flexibility is allowed by HBase.

While unique column names may run into hundreds or thousands, the actual data per entity may be only a small subset of these columns.

Note that while we call HBase a columnar DB, it does not store all data of the same column together. Instead, it stores all data of the same column family together, and hence it is better understood as "column-family" oriented. The concept of a column family is introduced in the next section.

HBase Data Model

Having understood the basic idea of HBase data storage, let us understand the complete data model of HBase.

HBase uses a 4-dimensional model as shown here:

It consists of

Row Key: which is the unique identifier for an entity
Column Family: a way of grouping columns together, for various optimization reasons. Its significance will be understood as we go along.
Column/Column qualifier: This is the way to identify a value or an attribute of an entity
Timestamp: this acts as the version number for values stored in a column

Every attribute of an entity has these 4 aspects. Hence understanding this is very fundamental to understanding how HBase stores and manages data. This also helps in designing your HBase data models correctly.

What is a Row key?

A row key plays the role of an entity identifier and can be mapped to a primary key in the RDBMS world. This defines the unique id by which you identify an entity in HBase.

So, against each row key, you can have multiple columns and column families. This is the first dimension.

Now, what is a Column family?

A column family is a logical group of columns. There is no straightforward equivalent to this in the RDBMS world. This is the second dimension.

Then, a column has exactly its traditional meaning: it stores a value for an attribute of the entity. This is the third dimension.

Finally, what is the timestamp doing in the data model? If you want to keep track of multiple updates to the value of the same column, what can you do? You can store each update along with the timestamp of the update. Then you get to see every update, ordered by timestamp. So, the timestamp acts like the version number for the updates to values in a column. This is the fourth dimension.

If we put all of the above ideas together and represent them logically, it would look like this:

For each row key, you have a set of column families. Within each column family, you can have any number of columns.

For each column, you can have values that are updated multiple times based on the timestamp when the value was inserted.

So, if you mention a row key, a column family, a column name and a specific timestamp, you can get exactly one value. If you don't specify a timestamp, it defaults to the latest value.

Therefore, every value you retrieve is addressed by the 4 dimensions: the row key, the column family, the column and the timestamp (which defaults to the latest if you leave it out).
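To make the 4-dimensional lookup concrete, here is a minimal Python sketch that models a table as nested dictionaries. This is only a mental model, not how the HBase client API looks; the names `table` and `get_value`, the keys and the integer timestamps are all placeholders I have assumed for illustration.

```python
# Toy model only: nested dicts standing in for an HBase table, not the HBase API.
# table[row_key][column_family][column][timestamp] = value
table = {
    "row-1": {
        "cf1": {
            "col-a": {100: "old value", 200: "new value"},
            "col-b": {100: "only value"},
        },
    }
}


def get_value(table, row_key, family, column, timestamp=None):
    """Return one value; without a timestamp, return the latest version."""
    versions = table[row_key][family][column]
    if timestamp is None:
        timestamp = max(versions)  # the newest timestamp wins by default
    return versions[timestamp]


latest = get_value(table, "row-1", "cf1", "col-a")
older = get_value(table, "row-1", "cf1", "col-a", timestamp=100)
print(latest, "|", older)
```

Naming all four coordinates pins down exactly one value, and dropping the timestamp falls back to the most recent one.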

To take an example of employee data: suppose one employee's data is divided into official and personal data, because these would logically be retrieved (to view or update) at different points in time. You would then have two column families, "Official" and "Personal", to represent this. It would be logically stored as shown here:

In the column families, you could have any number of columns per employee. This employee has columns 'Department', 'Designation', 'Worklevel' as official data and 'Name' and 'Address' are the columns in the 'Personal' column family.

Assume the designation was updated at some point. Then both the old and the new designation are stored, as "Engineer" and "Senior Engineer", against the specific timestamps when they were inserted into the database column 'Official:Designation'. This is the way you refer to a column in HBase. You mention it as <columnFamily>:<column>.

This is purely a logical representation. Physically, it is stored in a columnar structure, as explained in the "HBase as a Columnar Store" section.
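The step from the logical view to the columnar one can be sketched in a few lines of Python. This is a simplified picture, not HBase internals: the helper `flatten`, the timestamps and every value other than the two designations are assumptions made for the example. The sort puts the newest version of a column first, which is my understanding of how HBase orders versions.

```python
# Simplified sketch, not HBase internals: the logical view of one employee,
# flattened into one cell per (row key, column family, column, timestamp, value).
# Timestamps and all values except the two designations are made up.
logical_row = {
    "0001": {
        "Official": {
            "Department": {1000: "Platform"},
            "Designation": {1000: "Engineer", 2000: "Senior Engineer"},
        },
        "Personal": {
            "Name": {1000: "Asha"},
        },
    }
}


def flatten(rows):
    """Return one 5-field cell per stored value, in sorted order."""
    cells = []
    for row_key, families in rows.items():
        for family, columns in families.items():
            for column, versions in columns.items():
                for timestamp, value in versions.items():
                    cells.append((row_key, family, column, timestamp, value))
    # Order by row key, family and column ascending, newest timestamp first.
    cells.sort(key=lambda c: (c[0], c[1], c[2], -c[3]))
    return cells


cells = flatten(logical_row)
for cell in cells:
    print(cell)
```

Each printed line carries the 5 fields mentioned earlier: the row key, the column family, the column, the timestamp and the value.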

To design the HBase Schema, we need to understand a few characteristics of each of the 4 dimensions of the data model.

Row Key:

This uniquely identifies a row and acts as the Primary Key
Rows are always stored sorted by row key, in ascending order (keys are compared byte by byte)
It is used to partition data into what are called Regions. More on that will be discussed in a later article on HBase architecture.
It can be a single value or a composite value, e.g. 'EmployeeId' alone or 'EmployeeId,GovernmentId'
This is what helps in retrieving the data and is the main way by which data in HBase is accessed
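Because rows are kept sorted by key, the shape of the key decides which range reads are cheap. The Python sketch below imitates that with a sorted list and binary search; it is not the HBase API. The helpers `composite_key` and `range_scan`, the zero-padding width and the sample ids are assumptions of mine, and the padding is one common way to make numeric ids sort in numeric order when keys are compared as text.

```python
import bisect


def composite_key(employee_id, government_id, width=6):
    """Build a composite row key; zero-padding keeps numeric ids in numeric order."""
    return f"{employee_id:0{width}d},{government_id}"


def range_scan(sorted_keys, start, stop):
    """Return keys in [start, stop), using binary search on the sorted key list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Keys compared as text: "10" sorts before "2", which is rarely what you want.
unpadded = sorted(["9", "10", "2"])

# The same ids as padded composite keys sort in numeric order.
padded = sorted(composite_key(n, "G" + str(n)) for n in (9, 10, 2))

print(unpadded)
print(padded)
print(range_scan(padded, "000005", "000010"))
```

The range read only touches the contiguous slice of keys between the start and stop values, which is why a well-chosen row key matters so much for access patterns.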

Column Family:

This has to be defined upfront for a table, at schema definition time, as shown below:

create 'employee','official', 'personal'

This is the way you define an HBase table where the first parameter after the keyword 'create' is the table name, followed by column-family names.

Column families are not flexible in the way columns are: adding, deleting or modifying one is a change to the table schema.
All rows have the same set of column families
Each column family can have different columns for each row
Columns can be dynamically added
Each column family is stored in its own set of data files
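The contrast between fixed column families and free-form columns can be sketched with a small in-memory class. This is a toy stand-in written in Python, not the HBase client: the class `ToyTable`, its `put` method and the 'hobby' and 'medical' names are hypothetical, and the per-family dictionaries only loosely mirror the separate data files.

```python
class ToyTable:
    """In-memory stand-in for an HBase table: fixed column families, free-form columns."""

    def __init__(self, name, *families):
        self.name = name
        # One separate store per column family, loosely mirroring separate data files.
        self.stores = {family: {} for family in families}

    def put(self, row_key, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family}")
        # Any qualifier is accepted; columns are created on first write.
        self.stores[family].setdefault(row_key, {})[qualifier] = value


employee = ToyTable("employee", "official", "personal")
employee.put("0001", "official:designation", "Engineer")
employee.put("0002", "personal:hobby", "chess")  # a column only this row has

try:
    employee.put("0003", "medical:bp", "120/80")  # family was never declared
except KeyError as error:
    print("rejected:", error)
```

New columns appear just by writing to them, while a write to an undeclared column family is refused until the schema itself is changed.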

Columns:

Columns are units within a column family
They need not be the same for every row
New columns are added on the fly
The way to retrieve a column is using ColumnFamily:ColumnName

Timestamp:

This is used as a version number for values stored in a column
The value for any version can be accessed
The value with the latest timestamp is retrieved by default
The number of versions to keep can be defined
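Here is a minimal Python sketch of what keeping a limited number of versions means. It is a toy model, not HBase behaviour in detail: the function `put_version`, the `max_versions` parameter, the timestamps and the "Trainee" value are all assumptions for illustration, and this sketch drops old versions immediately on write, which is a simplification.

```python
def put_version(versions, timestamp, value, max_versions=3):
    """Record a value under its timestamp and keep only the newest max_versions."""
    versions[timestamp] = value
    for old in sorted(versions, reverse=True)[max_versions:]:
        del versions[old]  # drop versions beyond the retention limit
    return versions


# Versions of one column for one row, keyed by timestamp.
designation = {}
put_version(designation, 1000, "Trainee", max_versions=2)
put_version(designation, 2000, "Engineer", max_versions=2)
put_version(designation, 3000, "Senior Engineer", max_versions=2)

print(designation)                    # only the two newest versions remain
print(designation[max(designation)])  # the latest version
```

With a limit of two, the third write pushes the oldest value out, and any retained version can still be read by its timestamp.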

There are more nuances about the data model that have to be understood to design HBase schemas optimally.

To put all this together and get a feel for how data is retrieved using these 4 dimensions of the data model, here is a small code snippet that you would run at the HBase shell:

get 'employee','0001','official:employeename'

Here 'employee' is the table name, '0001' is the employeeid which is the rowkey and the column name consists of the columnfamily:columnname. So the name of the employee with id 0001 will be retrieved as stored in the 'official' column family.

Having understood the data model of HBase, we should now understand when HBase is a good choice of DB and when to avoid it.

When HBase and When Not?

HBase can be looked at as an option, when

You want to support random reads or range reads
You want to update data on Hadoop
Your data is somewhat columnar in nature i.e. you want to store data in columns but the number and names of columns have to be very flexible
You are more interested in Consistency and are fine with a slight compromise on availability in the CAP Theorem

And of course, the common reasons for going with any NoSQL DB stand good:

You want to scale to very large data
You have very few access patterns to data, such that you can define one row key that can be used to retrieve data

You would not think of HBase as an option:

If your priority is availability over consistency
If data access paths are very complex and a row key is not sufficient to retrieve data
If you want to do complex aggregations of data
If you want the flexibility of joining multiple tables, as joins are not inherently supported
If you have small data that can be handled by a standard RDBMS
If you have data analytics or BI use cases

Different ways to connect to HBase:

There are multiple ways to connect to HBase. Let us have a look at them.

HBase Shell - a command-line interface
Through Java programmatically
Apache Hive
HBase Thrift Interface
HBase REST API
Apache Phoenix

HBase Shell: The easiest way to access HBase is through the HBase shell. It has the typical commands that help with DDL and DML actions.

HBase through Java: HBase also provides an HBase Java API to programmatically access HBase from Java. This is a first-class citizen in terms of interacting with HBase and hence powerful and preferred.

Apache Hive: Hive also provides an optional library for interacting with HBase. A bridge layer between the two systems is implemented in this library. This is useful when you want clients familiar with Hive to also work with HBase.

Thrift Interface: The HBase Thrift interface provides a way for languages other than Java to connect to HBase, by allowing clients to interact with HBase through Thrift servers.

REST API: HBase also provides a REST API through which CRUD operations can be done and many cluster-level details can be queried.

Apache Phoenix: You could use Apache Phoenix as a SQL engine over HBase and actually run SQL queries on top of HBase, which otherwise does not support SQL. As per the documentation, Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. This is quite a powerful tool, though it has a few limitations due to the inherent architectural choices made in developing HBase. It provides:

JDBC Drivers
Phoenix shell called sqlline.py
SQuirreL Client
ODBC connectivity

Conclusion

Having got a very basic understanding of HBase, I hope you are at least equipped to think whether HBase is a DB suitable for your use case or not. You are also aware of the complexity of the data model which should help you decide on whether you are ok with this level of complexity for the flexibility you will get from it.

You must now be equipped with some basic trade-offs that you will have to deal with when you choose to go with HBase.

There are many more things to consider to make a full-fledged decision, which I hope to bring to you in the next few articles in the coming weeks.
