Sai Geetha M N
HBase Design - Guidelines & Best Practices
We looked at HBase Fundamentals and HBase Architecture over the last two weeks. Today I will look at a few best practices and guidelines that help in designing schemas suitably for HBase. Row key and column family design, regions and their sizing, region servers, parameter fine-tuning and caching strategies are all covered here.
A warning: this assumes you have working knowledge of HBase and are looking to improve its performance.
Your design considerations are very specific to each use case. Therefore it is imperative that you have answers to most of these questions on your requirements:
Is your use case a write-heavy use case or a read-heavy one?
Is it dealing with a continuous inflow of data in near-real-time?
Is it very large data that is gathering up very quickly?
Is the data storage to be cleared automatically based on some time?
Are you looking for a millisecond read response? Or a millisecond write response?
Do you have bulk writes happening at some times with strict SLAs on the write latency?
Do you have point queries or range queries?
Are you aware of all the access paths to your data?
Know that "Everything in Software Architecture is a tradeoff" and there is no right or wrong. It just works or doesn't as per the requirements of your stakeholders. That is all.
Having all of the above data (requirements) will help you deal with
Row key design
Column family design
Number of regions per server
Block cache versus MemStore ratio
HBase Parameter fine-tuning
Deleting or clearing data
And some guidelines to watch out for
Each of these is elaborated in the rest of this article.
As part of the Schema Design, the two very important aspects that are to be decided upfront while creating the table are the row key and the column families. Let us see the design considerations for the same.
(columns themselves are dynamic and do not require upfront design thinking )
Row Key Design
Your access paths or query paths are key in defining row keys. That is, the columns you would filter by in your 'where' clause, if you were to write a SQL statement, are the columns that need to be in your row key.
The row key can be a single column or a composite of multiple columns. For example, if you fetch your data by product id, then that should be your row key. If you fetch your data by a product id and a store id, then the row key has to be a composite of these two columns.
The order of querying also determines the order of the columns in the row key.
Note that whatever you query by or filter by should be part of the row key. Else HBase has to scan all the regions of the table leading to a full table scan. That is extremely inefficient and hence should be avoided at all costs.
However, also note that if you create a row key that is a composite of too many columns, the purpose is defeated: queries that filter on only some of those columns may still degenerate into full table scans. You have to find a trade-off in the number of columns that form the row key while still avoiding scanning the whole table.
Your row key design should finally ensure that you do not end up scanning the whole table for any query of yours.
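To make this concrete, here is a minimal Python sketch of a composite row key. The field names (product id, store id), the separator and the fixed widths are illustrative assumptions, not HBase requirements:

```python
# Illustrative composite row key for "fetch by product id and store id".
# The field names and fixed widths are assumptions for the example.

def make_row_key(product_id: str, store_id: str) -> bytes:
    # product_id leads because queries filter by product first;
    # zero-padding keeps the lexicographic byte order consistent.
    return f"{product_id:0>8}#{store_id:0>4}".encode("utf-8")

keys = [make_row_key("123", "7"), make_row_key("123", "12"), make_row_key("45", "1")]
# Sorted keys keep one product's rows contiguous, so a range scan
# over a single product touches a narrow slice of the table.
print(sorted(keys))
```

The padding matters because HBase sorts row keys lexicographically as bytes; without it, '123' would sort before '45'.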
How do you check this?
If you run an 'Explain plan' for your queries, you will see whether it is scanning the whole table (all the regions) or doing a skip scan or range scan. This is possible when you use a SQL engine like Apache Phoenix over HBase.
An example is shown here. For this query:

SELECT PGDC, SUM(SFC)
FROM LRF_PROMO_YRWKBPN_SPLIT
WHERE (ID LIKE '201701%' OR ID LIKE '201702%' OR ID LIKE '201703%' OR ID LIKE '201704%')
  AND PGDC = 'H'
GROUP BY PGDC;

the explain plan shows this:

| CLIENT 10-CHUNK 3946787 ROWS 3145728221 BYTES PARALLEL 10-WAY RANGE SCAN OVER DEV_RDF:LRF_PROMO_CVDIDX_YRWKBPN [0,'201701'] - [9,'201702'] |
| SERVER FILTER BY FA.PGDC = 'H' |
| SERVER AGGREGATE INTO DISTINCT ROWS BY [FA.PGDC] |
| CLIENT MERGE SORT |
This gives an insight that a range scan is happening over 10 chunks of data (out of the 64 regions it has) and the entire table is not being scanned.
Note that HBase does not support secondary indexes unless you use a SQL engine like Apache Phoenix over it. The above example is using Apache Phoenix and secondary indexes.
Column Family Design
The read and write patterns determine the column family design.
Remember that all the data in one column family in a region is stored together in one HFile. So, if data from one column family is to be read, then the entire block of data in that HFile is loaded into memory, which includes all the columns in that column family. Hence store all the data that is retrieved together in the same column family. A similar argument holds good for updates or writes too.
However, this does not mean you have a huge number of column families separating columns for a huge number of use cases. Use as few column families as possible, typically restricting it to 2 or at the most 3. Having too many column families can cause many files to stay open per region and can trigger 'compaction storms'. (Compaction is a process where small files merge into a larger one. )
Since flushing to disk and compactions are done on a per-region basis, even if one column family carries the bulk of the data, flushes are triggered on the adjacent column families too. Keep this in mind when designing column families: you should not have one column family with, say, hundreds of columns while another has just 5 to 10. When the large column family is forced to flush from memory to disk, the small column families are flushed along with it.
Regions Design & Optimization
By default, each table has only a single region. Suppose we have 4 tables, each with one region, and we also have 4 region servers; then each server will host one region, or rather one table. This is the default behaviour.
This does not help performance, as every read and write to a table goes to the same single server, which does not utilize the power of distributed processing. Therefore, based on the table profile and the data size, you would want to decide how many regions there should be per table.
HBase itself starts to split a table into multiple regions after a region reaches a default size. Depending on the version of HBase you are using, this default could vary. It is defined by the hbase.hregion.max.filesize property in hbase-site.xml. For the 0.90.x versions, the default is 256 MB and the maximum recommended is 4 GB. In HBase 1.x, this has increased to 10 GB.
But relying on auto-sharding leads to performance degradation during the auto-splitting and rebalancing period.
Hence the best practice here is to decide the partitions upfront if you are aware of the range of the row key used. Create the region splits upfront during the table creation itself.
create 'tableName', 'cfName', SPLITS => ['10', '20', '30', '40']
This gives the boundaries on the key where the regions have to be split.
Partition Sizing / Region Sizing
Recollect that regions are equivalent to horizontal partitions or shards in HBase. Just as partitioning should ensure that data is not skewed in a single partition or a few partitions, the distribution of data between the regions should be even. This leads to a lot of efficiencies.
In the create statement above, if there is very little data between 10 and 20 and a lot more between 20 and 30, then change the split boundaries to distribute the data evenly between these two regions. Maybe something like 10, 25, 30 turns out to be the right split.
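If you have a sample of existing row keys, the split boundaries can be derived rather than guessed. A small Python sketch, where the generated keys are a stand-in for keys sampled from real data:

```python
# Derive evenly loaded split points from a sample of row keys rather
# than guessing boundaries by hand. The sample below is a stand-in
# for keys sampled from real data.

def split_points(sorted_keys, n_regions):
    # Pick n_regions - 1 boundary keys so each region receives a
    # roughly equal share of the sampled keys.
    step = len(sorted_keys) / n_regions
    return [sorted_keys[int(i * step)] for i in range(1, n_regions)]

sample = sorted(f"{i:04d}" for i in range(1000))
print(split_points(sample, 4))   # -> ['0250', '0500', '0750']
```

The resulting boundaries can be passed as the SPLITS list at table creation.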
If the regions are evenly spread, the IO is balanced between the servers. The writes into the MemStore are balanced and the flushes to disk are also balanced. Similarly, the reads are spread out across the region servers, thus using the cluster capacity effectively.
Note that if the reads are distributed, cache evictions will be optimal too.
HBase sequential writes may suffer from region server hot-spotting if the row key is a monotonically increasing value. This can be overcome by salting, as described below.
If you are using Apache Phoenix, a salted table would be created like this:
CREATE TABLE my_table (a_key VARCHAR PRIMARY KEY, a_col VARCHAR) SALT_BUCKETS = 20;
This ensures that the sequences are not stored in the same region server and hence hot-spotting is avoided.
For example, if the current date is part of the row key and all the data coming in is with the current date, only one region with this row key would get all the traffic. To avoid this, salt your row key and then it gets distributed to different regions.
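The idea behind salting can be sketched in a few lines of Python. The bucket count and the "hash" (a simple byte sum) are illustrative only; Phoenix uses its own hash internally:

```python
# Salting sketch: prefix each key with hash(key) % buckets so that
# monotonically increasing keys (dates here) spread across regions.
# The byte-sum "hash" and bucket count are illustrative only.

N_BUCKETS = 4

def salted_key(row_key: str) -> str:
    bucket = sum(row_key.encode("utf-8")) % N_BUCKETS
    return f"{bucket:02d}-{row_key}"

days = ("20240101", "20240102", "20240103", "20240104")
# Consecutive days land in different buckets, hence different regions,
# so a single region server no longer takes all of today's writes.
print([salted_key(d) for d in days])
```

Because the salt prefix leads the key, rows are ordered by bucket first, which is exactly why range scans become costlier, as discussed next.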
However, salting has its own disadvantages when you want to retrieve data, especially a range of data based on the row key.
When to avoid salting?
If your table has no hot-spotting issues during write, then avoid salting altogether.
If you are trying to retrieve a contiguous range, without salting, you would have hit just one or a couple of regions and retrieved all the required data. This is because the data remains in a sorted order across regions and within regions. But with salting, the sorted order is not maintained across regions and you end up scanning all the regions for getting a range of row keys.
So, use salting only if necessary to avoid hot-spotting during writes. Hot-spotting becomes a problem only if you have a large continuous inflow of data, which exceeds the write capacity of a single server.
Keeping the regions sorted by row key avoids unnecessary scans across regions for a specific range of data.
Number of regions per server
How many regions can you have in one region server? This depends on multiple factors: how big your regions are, how fast you are writing into them, whether it is a continuous flow of data or bulk writes, how much you are reading back from the regions, and how much memory is available on your region server.
There are many heuristics around this, all of which need to be taken with a pinch of salt and need to be tried out for your specific use case.
If yours is a write-heavy use case with data flowing in continuously, then you probably want to keep the number of regions to 1 or a few per region server, so that you are not filling your MemStore too soon and flushing to disk too often.
However, if yours is a bulk-write use case, you want to increase the parallelism with which you write. More regions or partitions increase this parallelism, which may sometimes dictate having more regions per server. To explain with an example: if your HBase cluster has 16 nodes and you want to write, say, 500 GB to 1 TB of data in less than 10 minutes, you would want high levels of write parallelism. If you go with 1 region per server, you will have 16 regions for this table and only 16 parallel writes are possible. To increase the parallelism of writing, you would make the table have 32 or 64 regions, leading to 2 or 4 regions per server. The write time with 64 regions is lower than with 16 regions, in the case of bulk writes.
However, a word of caution on making the regions too small such that you have very little data in each region. This will cause the reads to underperform as many regions may have to be scanned to get the data of interest.
The common heuristic that I have read is about 100 to 200 regions per server, but practically what I have seen is that the performance is manageable up to 100 regions and beyond that, it starts deteriorating.
Block cache versus MemStore ratio
We have seen that HBase caches read data in Block cache and write data in MemStore. Since HBase uses the JVM heap, by default, 40% of the heap memory is allotted to block cache and 40% to memstore while the rest is used for its own execution purposes.
This can be fine-tuned to improve either your read or write performance.
Read-heavy use case: To get very good read performance, we all know caching is the key. The more read cache you have, the better the read performance.
You would ideally want all the data in a region to be available in the memory of the region server, so that the region servers do as few disk reads as possible. This can be achieved by increasing the block cache ratio to 60% or 70% knowing that you are compromising the write cache and hence the write performance.
Write heavy use case: This is like a corollary to the previous use case. If you increase the ratio of memstore, you would improve write performance at the cost of read performance.
For example, if your JVM has a heap of 32 GB, by default you have 12.8 GB as block cache and 12.8 GB as MemStore. So, if the data that is retrieved often fits within this size, it will be cached and efficiently served. Any data that is not in the cache will have to be read from disk, leading to slightly slower responses. This too can be mitigated to an extent by using SSDs to quicken the read time from disk.
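The arithmetic above can be captured in a small Python helper. The 0.4/0.4 defaults correspond, to the best of my knowledge, to the hfile.block.cache.size and hbase.regionserver.global.memstore.size properties; check the documentation of your HBase version:

```python
# Sketch of the on-heap split between block cache and MemStore.
# The 0.4/0.4 defaults mirror hfile.block.cache.size and
# hbase.regionserver.global.memstore.size (verify for your version).

def cache_sizes_gb(heap_gb, block_cache_ratio=0.4, memstore_ratio=0.4):
    return heap_gb * block_cache_ratio, heap_gb * memstore_ratio

print(cache_sizes_gb(32))             # default split: 12.8 GB each
print(cache_sizes_gb(32, 0.6, 0.2))   # a read-heavy tuning
```

The second call shows the read-heavy trade-off: growing the block cache to 60% of the heap at the expense of the MemStore.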
However, if you have many regions per server each with 10 GB of data, then this cache is too small to make a difference. Then, you have the bucket cache that can help, which is discussed later.
HBase Parameter fine-tuning
There are a whole host of parameters that can be tuned and can be a deep subject in itself. However, I would like to call out a couple of them that have helped in my explorations.
While creating a table, you could make it an IN_MEMORY table by setting this parameter to TRUE. Then the cache gives priority to keeping this table's data in memory. This, however, should not be used if the bucket cache is enabled.
To warm the block cache, set the below parameter while creating the table:
PREFETCH_BLOCKS_ON_OPEN => 'true'
The purpose is to warm the BlockCache as rapidly as possible after the cache is opened, using in-memory table data, and not counting the prefetching as cache misses. This is great for fast reads, but is not a good idea if the data to be preloaded cannot fit into the BlockCache.
We saw that by default, data gets cached in the block cache to improve read performance. Generally, the data in one region can be about 10 GB, and you have many regions per region server. So, if you have 10 regions in one region server, that would be 100 GB of data. However, due to GC limitations of the JVM heap, most often you are able to give a maximum of 32 GB to the heap and hence only 12.8 GB to the block cache.
In this case, the block cache would not be sufficient for 100 GB of data. To overcome this limitation, the bucket cache can be used. The off-heap implementation of the bucket cache allows for using memory off-heap.
If the BucketCache is enabled, it stores data blocks off-heap, leaving the on-heap cache free for storing indexes and Bloom filters. The physical location of the BucketCache storage can be either in memory (off-heap) or in a file stored in a fast disk.
Hence if you are using servers with large RAMs as the nodes of your HBase cluster, you could utilize the RAM beyond 32 GB as part of the bucket cache and could get your entire data almost into memory. This would enhance your read performance significantly.
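A rough sizing sketch makes this concrete, using the illustrative numbers from this article (10 regions of 10 GB each, about 12.8 GB of on-heap block cache):

```python
# Rough sizing: off-heap BucketCache needed to keep all of a region
# server's data in memory. Numbers follow the article's illustration
# (10 regions x 10 GB, ~12.8 GB of on-heap block cache).

def bucket_cache_needed_gb(regions_per_server, region_size_gb, on_heap_cache_gb):
    total_data = regions_per_server * region_size_gb
    return max(0.0, total_data - on_heap_cache_gb)

print(bucket_cache_needed_gb(10, 10, 12.8))   # ~87 GB of off-heap cache
```

On nodes with large RAM, provisioning this much BucketCache brings nearly the whole dataset into memory.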
Monitoring your cache hits and cache misses through tools like Grafana can go a long way in fine-tuning the cache usage.
In an ideal scenario, having a cache almost as large as the data you have on disk would give you the best performance. This is the case when all your data is queried equally and there is no small subset that is queried more often.
HBase has concepts of minor and major compactions. There are multiple configurations that define when they get triggered. Typically, the number of small store files defines the trigger point of a minor compaction.
Major compaction is slightly more involved and requires more resources. It not only merges all the small files into one large file but also cleans up old versions, deleted data, etc., and hence is very resource-intensive. You need to carefully ensure it triggers during non-peak hours of your cluster usage.
Data is spread across region servers. The regions served by a region server (as assigned by HMaster) should ideally be located locally in that server. This is the best-case scenario and provides the best performance. The percentage of data local to that server is termed data locality.
Strategies must be used to ensure close to 100% data locality. Whenever a large chunk of data is written, this locality percentage can go down drastically. Hence doing major compaction after a bulk write is almost always necessary to improve data locality.
If data is flowing in all through the day, then the locality could be slowly deteriorating and it should be fixed through major compactions at appropriate intervals.
The extent to which data locality improves read performance should not be underestimated.
Deleting or Clearing Data
It is imperative to keep only the data that you are going to need in HBase. Trying to archive all data in HBase for historical reasons is not advisable; the Hadoop file system can be used for that purpose.
Two simple ways of ensuring that data does not keep on growing in HBase are:
Setting the time-to-live (TTL) on your table or column family
Setting the number of versions of a data you want to store
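The combined effect of these two settings can be modelled in a few lines of Python; this is only a model of the behaviour for one cell, not HBase code:

```python
# Model of version + TTL retention for a single cell: keep at most
# max_versions of the newest values and drop anything older than
# ttl_seconds. A model of the behaviour only, not HBase code.

def retained(cell_versions, max_versions, ttl_seconds, now):
    # cell_versions: iterable of (timestamp, value) pairs, any order.
    fresh = [(ts, v) for ts, v in cell_versions if now - ts < ttl_seconds]
    fresh.sort(key=lambda tv: tv[0], reverse=True)   # newest first
    return fresh[:max_versions]

versions = [(100, "a"), (200, "b"), (300, "c"), (400, "d")]
print(retained(versions, max_versions=2, ttl_seconds=250, now=450))
```

Expired or excess versions are physically removed during major compaction, which is another reason to schedule compactions deliberately.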
For setting TTL at a column family level, you could do it this way at HBase shell:
alter 'tableName', NAME => 'cfname', TTL => 3000
The TTL value is in seconds, after which the data will be deleted automatically. This e