HBase is a NoSQL DB that uses some capabilities of the Hadoop ecosystem to provide its features.
NoSQL DBs (a.k.a. Not Only SQL) are non-tabular stores that store data very differently from relational databases. The main types are document DBs, key-value stores, columnar DBs and graph DBs. They provide flexible schemas and high levels of scalability.
In this article, I introduce HBase and its associated concepts - its basic features, the data model and the various ways to access HBase. Also, I talk about some criteria that need to be considered while deciding whether HBase suits your use case or not.
In a subsequent article, I will get into the architecture, which will give you a stronger understanding of HBase and help you make the right choices and arrive at a good design.
We all agree that with great power comes great responsibility, and this is true when you deal with NoSQL databases as well.
What is HBase?
A very basic definition of HBase is that it is a distributed database management system that runs on top of Hadoop. Some of its important characteristics are:
Distributed - as it is stored on HDFS
Scalable - it scales with the number of nodes in the cluster, and that could be any number
Fault-Tolerant - as it relies on the fault-tolerant capability of HDFS
Low-Latency - it provides real-time access for reads and updates (using row keys)
Structured - it supports a loose data structure that allows flexibility while keeping the advantages of structured data
It is a columnar database that provides ACID compliance at the row level. Each of these features will become clearer as we go along.
As we saw, HBase is a NoSQL DB that works on top of HDFS, the file system that is part of the Hadoop ecosystem. Why do we need an extra database on Hadoop when we already have query engines like Hive that serve the purpose of querying Hadoop data?
You can even access data on Hadoop using Spark SQL, Spark, MapReduce jobs etc. But there are some inherent limitations of data stored on HDFS.
HBase was designed to use the HDFS storage but with the main focus of overcoming its limitations, such as:
No Random Access
Not ACID Compliant
No updates supported
Totally unstructured data, i.e. HDFS does not impose any structure on data that is ingested into it.
HBase allows random access, serves reads with very low latency, is ACID-compliant at the row level, supports updates and allows for a flexible structure. It brings in some level of structure but does not impose it the way relational databases do. Isn't it fantastic that all the above limitations have been overcome in HBase?
So it is often used as a complementary tool with other tools on the Hadoop stack when you need the above set of features.
HBase as a Columnar Store
HBase falls in the category of columnar databases among the NoSQL databases. Let us understand why.
It fundamentally stores data in a columnar way. This is best understood by taking an example of data in a relational DB and seeing how the same data gets stored in HBase.
A relational DB has a fixed schema and supports a 2-dimensional data model with rows and columns as shown here:
Here the attributes of a particular entity i.e. employee, are predefined as columns. Irrespective of whether you have a value for that attribute or not, the column exists and the value may be null or not.
HBase stores the same data very differently. Irrespective of how many attributes of data you have for a specific entity, it stores each attribute in a row of its own and the data flows down into a columnar structure. It stores only 3 columns for every attribute, as shown here (actually 5, once the column family and the timestamp are counted, which we will see later):
i.e. a unique id, the name of the column (a.k.a. column qualifier) and the value.
If you notice the way data is stored, you can keep adding any number of column names and associated values for each id. There is no compulsion to have the same column names for every id or the same number of columns for every id.
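To make the contrast concrete, here is a minimal Python sketch of the idea described above. It is only an illustration of the storage idea, not how HBase is implemented; the employee records, attribute names and the helper `to_cells` are all hypothetical.

```python
# Hypothetical employee records in a fixed relational schema.
# None marks an attribute that has no value for that employee.
relational_rows = [
    {"id": 1, "name": "Asha", "department": "Sales", "bonus": None, "phone": None},
    {"id": 2, "name": "Ravi", "department": None, "bonus": 500, "phone": None},
]

def to_cells(rows, key="id"):
    """Flatten fixed-schema rows into (id, column, value) triples, skipping nulls."""
    cells = []
    for row in rows:
        for column, value in row.items():
            if column == key or value is None:
                continue  # nothing is stored for a missing value
            cells.append((row[key], column, value))
    return cells

cells = to_cells(relational_rows)
for cell in cells:
    print(cell)

# Adding a brand-new attribute for one entity needs no schema change:
cells.append((2, "certification", "PMP"))
```

The relational layout reserves eight attribute slots for the two employees, while the flattened form stores only the four values that actually exist, plus whatever new attributes are appended later.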
From this you can already see two advantages:
1. Wide-column sparse tables are possible, without wastage of space: Since you do not need an entry for a column that has no value for an entity, you can support wide-column tables which are very sparse in nature. For every entity, only those columns that actually have values are stored. No space is wasted on the innumerable columns that have no values.
Some entities may have hundreds of columns with values while some may have just a few. Consider the example of patient data. Healthy patients may have very few attributes with values, while patients with a disease may have a lot of columns that have values. HBase is a fantastic store for this type of use case.
2. Dynamic attributes: This means that each entity can have a unique set of columns, totally different from the other entities. Considering the same patient data, some may have parameters related to heart disease while some may have parameters related to, say, cancer. In both cases, the column names may be very different. This flexibility is allowed by HBase.
While unique column names may run into hundreds or thousands, the actual data per entity may be only a small subset of these columns.
Note that while we call HBase a columnar DB, it does not store all data of the same column together. Instead, it stores all data of the same column family together, and hence it is better understood as "column-family" oriented. The concept of a column family is introduced in the next section.
HBase Data Model
Having understood the basic idea of HBase data storage, let us understand the complete data model of HBase.
HBase uses a 4-dimensional model as shown here:
It consists of
Row Key: which is the unique identifier for an entity
Column Family: a way of grouping columns together, for various optimization reasons. Its significance will be understood as we go along.
Column/Column qualifier: This is the way to identify a value or an attribute of an entity
Timestamp: this acts as the version number for values stored in a column
Every attribute of an entity has these 4 aspects. Understanding them is fundamental to understanding how HBase stores and manages data. It also helps in designing your HBase data models correctly.
What is a Row key?
A row key plays the role of an entity identifier and can be mapped to a primary key in the RDBMS world. This defines the unique id by which you identify an entity in HBase.
So, against each row key, you can have multiple columns and column families. This is the first dimension.
Now, what is a Column family?
A column family is a logical group of columns. There is no straightforward equivalent to this in the RDBMS world. This is the second dimension.
Then, a column has exactly the traditional meaning of a column: it stores the value of an attribute. This is the third dimension.
Finally, what is the timestamp doing in the data model? If you want to keep track of multiple updates to the value of the same column, what can you do? You can store each update along with the timestamp of the update. Then you can see every update, ordered by timestamp. So, the timestamp acts like the version number for the updates to values in a column. This is the fourth dimension.
If we put all of the above ideas together and represent them logically, it would look like this:
For each row key, you have a set of column families. Within each column family, you can have any number of columns.
For each column, you can have values that are updated multiple times based on the timestamp when the value was inserted.
So, if you mention a row key, a column family, a column name and a specific timestamp, you can get exactly one value. If you don't specify a timestamp, it defaults to the latest value.
Therefore, for every value you want to retrieve, you need to specify all 4 dimensions - the row key, the column family, the column and the timestamp (or leave the timestamp out to get the latest version).
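The lookup just described can be sketched with nested Python dictionaries. This is a toy model of the 4 dimensions under my own assumptions, not the HBase client API; the names `table`, `get`, `row1`, `cf1`, `colA` and `colB` are all made up for illustration.

```python
# table[row_key][column_family][qualifier][timestamp] = value
table = {
    "row1": {
        "cf1": {
            "colA": {100: "old", 200: "new"},  # two versions of the same column
            "colB": {150: "x"},
        }
    }
}

def get(table, row_key, family, qualifier, timestamp=None):
    """Return exactly one value; with no timestamp, return the latest version."""
    versions = table[row_key][family][qualifier]
    if timestamp is None:
        timestamp = max(versions)  # the largest timestamp is the newest version
    return versions[timestamp]

print(get(table, "row1", "cf1", "colA"))       # latest version
print(get(table, "row1", "cf1", "colA", 100))  # a specific, older version
```

Each level of nesting corresponds to one dimension, which is why all four coordinates are needed to pin down a single value.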
To take the example of employee data, suppose you had data about one employee, divided into official and personal data because the two would logically be retrieved (to view or update) at different points in time. You would then have two column families, "Official" and "Personal", to represent this. It would be logically stored as shown here:
In the column families, you could have any number of columns per employee. This employee has the columns 'Department', 'Designation' and 'Worklevel' in the 'Official' column family, and 'Name' and 'Address' in the 'Personal' column family.
Assume the designation was updated at some point. Then both the old and the new designation, "Engineer" and "Senior Engineer", are stored against the specific timestamps when they were inserted into the column 'Official:Designation'. This is the way you refer to a column in HBase. You mention it as <columnFamily>:<column>
This is purely a logical representation. Physically, the data is stored in a columnar structure, as explained in the "HBase as a Columnar Store" section.
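The following Python sketch shows how the logical, nested view of the employee could flatten into one cell per stored version. The column names and the two designations come from the example above; the row key `emp1`, the timestamps and the other values are hypothetical, and the ordering is only my approximation of how cells are laid out.

```python
# Logical view: row key -> column family -> column -> {timestamp: value}
employee = {
    "emp1": {  # hypothetical row key
        "Official": {
            "Department": {1000: "R&D"},
            "Designation": {1000: "Engineer", 2000: "Senior Engineer"},
            "Worklevel": {1000: "L2"},
        },
        "Personal": {
            "Name": {1000: "A. Kumar"},
            "Address": {1000: "12 Example Street"},
        },
    }
}

def flatten(table):
    """Yield one (row key, 'family:column', timestamp, value) tuple per version."""
    for row_key in sorted(table):
        for family in sorted(table[row_key]):
            for column in sorted(table[row_key][family]):
                versions = table[row_key][family][column]
                for ts in sorted(versions, reverse=True):  # newest version first
                    yield (row_key, f"{family}:{column}", ts, versions[ts])

cells = list(flatten(employee))
for cell in cells:
    print(cell)
```

Note that 'Official:Designation' produces two cells, one per version, while every other column produces one.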
To design the HBase Schema, we need to understand a few characteristics of each of the 4 dimensions of the data model.
Row Key
This uniquely identifies a row and acts as the primary key
Rows are always stored sorted by row key, in ascending (byte-wise lexicographic) order
It is used to partition data into what are called Regions. More on that will be discussed in a later article on HBase architecture.
It can be a single value or a composite value like 'EmployeeId' alone or 'EmployeeId,GovernmentId'
This is what helps in retrieving the data and is the main way by which data in HBase is accessed
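The sort order and composite keys have practical consequences, which this small Python sketch tries to show. It only imitates sorted byte-string keys; the zero-padding convention, the `|` separator and the helpers `composite_key` and `prefix_scan` are my own assumptions, not HBase APIs.

```python
import bisect

# Keys compare as byte strings, so "emp10" sorts before "emp2" and "emp9".
row_keys = sorted([b"emp9", b"emp10", b"emp2"])
print(row_keys)

# Zero-padding (a hypothetical convention) aligns numeric order with byte order.
padded = sorted(f"emp{n:05d}".encode() for n in (9, 10, 2))
print(padded)

def composite_key(employee_id, government_id):
    """Join two identifiers into one row key; the '|' separator is an assumption."""
    return f"{employee_id:05d}|{government_id}".encode()

def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, using the sort order to find the range.

    Assumes no key has the byte 0xff immediately after the prefix.
    """
    start = bisect.bisect_left(sorted_keys, prefix)
    stop = bisect.bisect_left(sorted_keys, prefix + b"\xff")
    return sorted_keys[start:stop]

keys = sorted([
    composite_key(7, "G-300"),
    composite_key(7, "G-100"),
    composite_key(12, "G-200"),
])
print(prefix_scan(keys, b"00007|"))
```

Because keys are kept sorted, everything that shares a leading key component sits next to each other, so the order of the parts in a composite key decides which lookups are cheap.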
Column Family
This has to be defined upfront for a table, at schema definition time, as shown below
create 'employee','official', 'personal'
This is the way you define an HBase table where the first parameter after the keyword 'create' is the table name, followed by column-family names.
Column families are not as flexible as columns: adding, deleting or modifying one means altering the table schema, so they are expected to change rarely.
All rows have the same set of column families
Each column family can have different columns for each row
Columns can be dynamically added
Each column family is stored in its own set of data files
Column
Columns are units within a column family
They need not be the same for every row
New columns are added on the fly
The way to retrieve a column is using ColumnFamily:ColumnName
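The difference between fixed column families and free-form columns can be mimicked with a toy Python class. This is a sketch of the behaviour listed above, not the HBase client API; the `Table` class, its `put` method and the sample values are hypothetical.

```python
class Table:
    """Toy table: column families are fixed at creation, columns are free-form."""

    def __init__(self, name, *families):
        self.name = name
        self.families = frozenset(families)  # declared once, like in 'create'
        self.rows = {}

    def put(self, row_key, column, value):
        """Store a value under 'family:column', rejecting undeclared families."""
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

employee = Table("employee", "official", "personal")
employee.put("emp1", "official:designation", "Engineer")
employee.put("emp2", "official:worklevel", "L2")   # a column emp1 does not have
employee.put("emp2", "personal:nickname", "Sam")   # new column, added on the fly

try:
    employee.put("emp1", "medical:bloodgroup", "O+")  # family was never declared
except KeyError as err:
    print("rejected:", err)
```

New columns appear simply by writing to them, while a write to an undeclared column family fails until the table definition itself is changed.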
Timestamp
This is used as a version number for values stored in a column