Sai Geetha M N
We looked at the basics of HBase in the previous article, last week. Today we will understand the Architecture of HBase.
We all agree that software architecture is a sort of plan of a system that helps in understanding the initial decisions taken and the trade-offs made, and gives an idea of how the system was intended to be used. It thus equips users of the software to use it appropriately, makes the whole system easier to understand and makes decision-making more efficient.
With this as the background, let us understand the HBase architecture, which will equip us to use HBase correctly and for the use cases it was meant for.
One of the important aspects of a distributed database is to provide high availability through robust failover mechanisms. Such databases are designed to run on commodity hardware, so failures are expected, and the architecture deals with these failures at both the storage and the compute levels, almost seamlessly.
How is storage level availability achieved?
HBase uses HDFS as the storage layer.
HDFS itself is a distributed file system with a default replication factor of 3, thus providing storage level resilience.
The HBase data is stored in what are termed HFiles on HDFS. Further, HFiles are split into blocks. So, if one node or disk that is serving some data goes down, there are two other copies of the same data that can be served instead. Hence data availability is assured.
Let us now see how compute availability is assured in the HBase architecture.
How is compute availability achieved?
A typical cluster architecture has two master servers, called HMasters, in an active-passive configuration. At any point, only one is active.
You have a cluster of three ZooKeeper nodes that help coordinate the various servers and provide centralized configuration management. They also track the health of the cluster. ZooKeeper keeps a session-based connection with each region server; the region servers act as the worker (data) nodes. The servers send heartbeats so that ZooKeeper knows they are all still running. The master keeps listening to ZooKeeper notifications to know which worker nodes are healthy. If one of the worker nodes goes down, the master assigns its work to another worker node. This way the clients see an almost seamless failover.
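The reassignment step can be pictured with a small simulation. This is only a sketch, not HBase code: the function name, the server names and the "least-loaded server" rule are assumptions made for illustration, since the article does not describe the policy the master actually uses.

```python
def reassign_regions(assignments, live_servers):
    """Move regions away from servers that are no longer alive.

    assignments: dict mapping region name -> region server name.
    live_servers: set of servers that still have a healthy session.
    The "least-loaded live server" rule is an assumption of this sketch.
    """
    if not live_servers:
        raise ValueError("no live region servers")

    # Regions on healthy servers stay where they are.
    new_assignments = {r: s for r, s in assignments.items() if s in live_servers}

    # Count how many regions each live server already holds.
    load = {s: 0 for s in live_servers}
    for server in new_assignments.values():
        load[server] += 1

    # Hand each orphaned region to the currently least-loaded live server.
    for region in sorted(assignments):
        if region not in new_assignments:
            target = min(sorted(load), key=load.get)
            new_assignments[region] = target
            load[target] += 1
    return new_assignments


# Hypothetical cluster: rs-b has stopped sending heartbeats.
before = {"r1": "rs-a", "r2": "rs-b", "r3": "rs-b", "r4": "rs-c"}
after = reassign_regions(before, {"rs-a", "rs-c"})
```

Only the regions that lived on the failed server move; everything else keeps its assignment, which is why clients of the other regions notice nothing.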
How is the availability of the master itself achieved? The passive master keeps listening to the ZooKeeper notifications, and if the active master goes down, it takes over and becomes the active HMaster. Thus the master's availability is also ensured.
The master's responsibilities are schema changes, cluster management and data administration.
Now let us delve a little deeper into the concept of Regions, Region Servers and how they manage HFiles.
HFiles, Regions & Region Servers
HBase has a concept of Regions served by Region servers. Let us understand this a little more. By now we know that HBase stores its data in the HFile format. All of these are closely related; let us see how.
HBase distributes the data by horizontally partitioning it into what are called regions. This implies that every HBase table has one or more regions.
The nodes in an HBase cluster are called region servers. The partitioned data or 'regions' are distributed onto these region servers. Note that one region server can have many regions on it.
More about Regions:
Each table has one or more regions depending on how many horizontal partitions have been created. There are various ways to decide on the partition strategy. However, if none is defined, HBase shards the table automatically as the data grows.
Each region is accessible only through one region server, and there might be many regions on each region server. One of the region servers is made 'in charge' of a region, and all reads and writes for that region are directed only through that region server, as shown in the figure.
Since it is horizontal partitioning, the splits are done based on sorted row keys. Each region has a start key and an end key that define the region's limits. The rows within a region too are ordered by row keys. This is shown in the example figure below.
Also, all the data in one region is not in one HFile. A separate HFile is created for every column family in each region. Therefore, if there are three regions and two column families, you will end up having at least 3 * 2 = 6 HFiles on the disk.
All of the above theory can be understood better with the example in the figure above.
Here you have two regions, region 1 and 2. The table's row key is composed of a year, a unique ID (probably a product ID) and a date. It has two column families named 'Input' and 'Forecast'.
Note that Region 1 has a start key of 2005... and an end key of 2006..., while Region 2 has a start key of 2007... and an end key of 2008... Everything between a region's two keys belongs to that region. Within each region, the Input column family is stored as one HFile and the Forecast column family is stored as another HFile.
This is how data is distributed in HBase using HFiles, regions and region servers.
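To make the idea of key ranges concrete, here is a minimal sketch of how a row key could be mapped to its region and region server. It is not the HBase client API: the start keys, server names and function names are all hypothetical, and the sketch simply assumes that regions are kept sorted by start key.

```python
from bisect import bisect_right

# Hypothetical regions, sorted by start key. The empty start key means
# "from the beginning of the table". Keys and server names are made up.
REGION_START_KEYS = ["", "2007", "2009"]
REGION_SERVERS = ["rs-a", "rs-b", "rs-a"]  # one server may hold many regions


def locate_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key that is greater than row_key;
    # the region just before that position owns the row.
    return bisect_right(REGION_START_KEYS, row_key) - 1


def locate_server(row_key):
    """Return the (hypothetical) region server in charge of row_key."""
    return REGION_SERVERS[locate_region(row_key)]
```

Because the regions are sorted by start key, finding the owner of a row is a binary search over the start keys rather than a scan over every region.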
Understanding HBase Reads
We know by now that rows in an HFile are sorted by rowkey. The HFile has an index (which is B+ tree-based). This index helps in the efficient retrieval of the rows.
These indexes are loaded into memory by region servers as shown here. So, when you read data, the region server searches the index in memory to find which block to read. Knowing which block to read, HBase directly accesses only that block on the disk and retrieves the required data.
Therefore, the lookup happens in memory and the disk reads are minimised to that extent. Within the block, the row keys are sorted hence easily retrieved (probably with the time complexity of a binary search).
What we have described so far is the case of a first-time read. However, if data has already been read once, HBase has the concept of a block cache where it caches the data that has already been read. We will understand this a little more after we look at the HBase writes as well, since we would also want to know how HBase reads modified/updated data.
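The first-time read described in this section can be sketched as two lookups: one in the in-memory index to pick a block, and one inside that block. This is a simplified illustration with made-up data; it uses a flat list of first keys rather than a tree-shaped index, and the names BLOCKS, BLOCK_INDEX and read_row are assumptions of the sketch.

```python
from bisect import bisect_left, bisect_right

# Hypothetical HFile contents: each block holds (row_key, value) pairs,
# sorted by row key. In a real system the blocks live on disk.
BLOCKS = [
    [("0000", "a"), ("0123", "b")],
    [("0200", "c"), ("0345", "d")],
    [("0400", "e"), ("0999", "f")],
]

# In-memory index: the first row key of every block.
BLOCK_INDEX = [block[0][0] for block in BLOCKS]


def read_row(row_key):
    """Return the value stored for row_key, or None if it is absent."""
    # Step 1: search the in-memory index to find the only candidate block.
    block_no = bisect_right(BLOCK_INDEX, row_key) - 1
    if block_no < 0:
        return None  # row_key sorts before the first key in the file

    # Step 2: "read" just that block and binary-search inside it.
    block = BLOCKS[block_no]
    keys = [key for key, _ in block]
    pos = bisect_left(keys, row_key)
    if pos < len(keys) and keys[pos] == row_key:
        return block[pos][1]
    return None
```

Only one block is touched per lookup, which is the point of keeping the index in memory.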
Understanding HBase Writes
Consider that you are now inserting a record into HBase. The Region server that is in charge of the row key that you are inserting has a write buffer in memory called MemStore.
So, it first writes into the MemStore. The data is kept sorted as it is written into the MemStore. When the MemStore fills up, it flushes that sorted data onto the disk. This flush is very fast because the data is already sorted, so the write to disk is purely sequential.
In the example given in the figure, the new row key is 0011, which gets inserted between 0000 and 0123, in memory. The sorting and insertion happen in the MemStore of that region. The data is flushed to the disk only when the MemStore is full. This is again the happy path of how HBase writes data.
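The write path above can be mimicked with a sorted in-memory buffer that is flushed when it fills up. This is a minimal sketch under stated assumptions: the class and attribute names are invented for illustration, "full" is modelled as a row count whereas a real MemStore is limited by size, and repeated writes to the same row key are not handled.

```python
from bisect import insort


class MemStore:
    """Toy write buffer that keeps rows sorted and flushes when full."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # max rows held in memory
        self.rows = []           # sorted list of (row_key, value)
        self.flushed_files = []  # each flush produces one sorted "file"

    def put(self, row_key, value):
        # insort places the new row at its sorted position in memory.
        insort(self.rows, (row_key, value))
        if len(self.rows) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The rows are already sorted, so they are written out in order.
        if self.rows:
            self.flushed_files.append(self.rows)
            self.rows = []


store = MemStore(flush_threshold=4)
store.put("0000", "a")
store.put("0123", "b")
store.put("0011", "c")  # lands between 0000 and 0123
keys_in_memory = [key for key, _ in store.rows]
```

Adding a fourth row would reach the threshold and move all four rows, still in sorted order, into a new entry of `flushed_files`.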
Now once this data is inserted in memory, how does the read work? In the earlier section, we said it will read from disk based on the block that the data exists in. Now, this data is not yet flushed to the disk. How will HBase read this?
HBase Reads for Unflushed Data:
HBase has a read cache called the Block Cache where frequently read data is kept, in memory. We just saw that HBase has a write cache too called the MemStore. There is data on the disk too.
So, how does HBase ensure that the data is read from all of these sources: the Block Cache, the MemStore and the disk? HBase reads the Block Cache first, then the MemStore to see if any updates have happened, and finally the disk. If it finds the row key in the Block Cache, it checks the MemStore to see if there are any updates. If yes, it merges the two and returns the data. If the row key is found in neither the Block Cache nor the MemStore, it looks up the data on the disk.
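A rough sketch of this read route follows. The three stores are plain dictionaries and the column names are made up. One simplifying assumption is labelled in the code: the sketch always fetches the older copy of the row (from the cache, else from disk) and then overlays any MemStore updates, so that unflushed changes win.

```python
def read(row_key, block_cache, memstore, disk):
    """Return the merged view of a row, or None if it exists nowhere.

    Each store is a dict of row_key -> {column: value}. This layout is
    an assumption of the sketch, not how HBase lays out its data.
    """
    # Older, already-flushed data: try the read cache before the disk.
    base = block_cache.get(row_key)
    if base is None:
        base = disk.get(row_key)  # the expensive path
        if base is not None:
            block_cache[row_key] = base  # keep it for the next read

    # Newer, unflushed data sits in the write buffer.
    updates = memstore.get(row_key)
    if base is None and updates is None:
        return None

    # Merge the two, letting the unflushed updates override older values.
    merged = dict(base or {})
    merged.update(updates or {})
    return merged


# Hypothetical state: one column of row 0011 was updated but not flushed.
cache = {}
mem = {"0011": {"Input:qty": "7"}}
hfiles = {"0011": {"Input:qty": "5", "Forecast:qty": "9"}}
row = read("0011", cache, mem, hfiles)
```

After this call the row is also present in `cache`, so a second read of the same row would not touch the disk.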
HBase Writes to Disk:
We saw that HBase flushes the data to disk every time the MemStore is full. So, small files get written to the disk. HBase has a process called compaction that is triggered at regular intervals, based on certain configuration parameters, to merge these small files into larger files. There are two kinds, called Minor and Major compactions.
A minor compaction merges several of the small files into one larger file, without modifying the main HFile of the region that already existed. A major compaction merges all of the files of a column family in a region, including the output of earlier minor compactions and the larger HFile that already exists, into a single HFile.
These compactions help in read performance immensely.
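At its core, a compaction is a merge of several sorted files into one sorted file. The sketch below shows only that merge step, under the assumption that when the same row key appears in more than one file, the value from the most recently flushed file wins. Versions, deletes and the choice of which files to compact are not modelled, and the function name is hypothetical.

```python
import heapq


def compact(files):
    """Merge sorted files into one sorted file.

    files: list of files, oldest first. Each file is a list of
    (row_key, value) pairs sorted by row_key.
    """
    # Tag every entry with the negated file age so that, for equal row
    # keys, the entry from the newest file sorts first.
    tagged = (
        [(key, -age, value) for key, value in one_file]
        for age, one_file in enumerate(files)
    )

    merged = []
    # heapq.merge streams the already-sorted inputs in overall sorted order.
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first seen = newest value
    return merged


# Three hypothetical flushed files, oldest first.
small_files = [
    [("0000", "a"), ("0123", "b")],
    [("0011", "c")],
    [("0123", "B"), ("0500", "d")],
]
compacted = compact(small_files)
```

With fewer files, a read has fewer places to look for a row, which is where the read performance gain comes from.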
So, what happens if a node goes down before the data in memory is flushed to the disk?
HBase has the concept of a Write Ahead Log (WAL), into which it writes even as it writes into the MemStore. A write is not complete until it has been written to both of these destinations. Since writes to the WAL are sequential, they are pretty fast.
Therefore, if a region server goes down before the data is flushed to the disk, the data is reconstructed from the WAL file. When a new region server takes its place (as assigned by the HMaster), it checks whether the regions it has to take care of are all up to date with the WAL file and, if not, updates its state before making itself available to clients.
This takes a little time during which these regions become temporarily unavailable. This is why HBase is said to give more importance to Consistency over Availability. Once the newly elected Region Server is ready to serve, that region is consistent and available. At this time, the rest of the regions and region servers are all continuing to serve the clients and hence the impact is only on those regions that were served by the node that went down.
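The recovery described above can be sketched as "log first, then replay". In this toy model the WAL is a Python list standing in for a durable log file, and the class and function names are invented for illustration.

```python
class RegionServer:
    """Toy region server with a durable log and a volatile write buffer."""

    def __init__(self, wal):
        self.wal = wal      # stand-in for a durable, append-only log
        self.memstore = {}  # in-memory only; lost if the server dies

    def put(self, row_key, value):
        # The write counts as complete only after both steps succeed.
        self.wal.append((row_key, value))  # sequential append
        self.memstore[row_key] = value


def recover(wal):
    """Bring up a replacement server and replay the log into its memory."""
    replacement = RegionServer(wal)
    for row_key, value in wal:
        # Replay in log order so that later writes overwrite earlier ones.
        replacement.memstore[row_key] = value
    return replacement


# Hypothetical failure: the server dies before flushing its MemStore.
log = []
server = RegionServer(log)
server.put("0011", "x")
server.put("0011", "y")
server.put("0123", "z")
del server  # the in-memory state is gone, the log survives

new_server = recover(log)
```

The replacement server serves clients only after the replay is finished, which corresponds to the short window of unavailability mentioned above.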
Having understood the HBase architecture of partitioning or sharding, regions, region servers, HFiles and the read and write paths of HBase, you are in a much better position to work with HBase and make the right decisions on HBase design for your use case.
We still need to see some best practices for HBase table design and certain heuristics that can act as some guidance if you are starting on HBase. This will be covered in the next article.