• Sai Geetha M N

HBase Design - Guidelines & Best Practices

We have looked at HBase Fundamentals and HBase Architecture in the last two weeks. Today I will look at a few best practices and guidelines that will help in designing the schemas suitably for HBase. The design of schemas, regions, parameters, region servers and their sizes, the caching strategies are all covered here.


A warning that this assumes you have HBase working knowledge and are dealing with improving its performance.


Your design considerations are very specific to each use case. Therefore it is imperative that you have answers to most of these questions on your requirements:

  • Is your use case a write-heavy use case or a read-heavy one?

  • Is it dealing with a continuous inflow of data in near-real-time?

  • Is it very large data that is gathering up very quickly?

  • Is the data storage to be cleared automatically based on some time?

  • Are you looking for a millisecond read response? or a write response?

  • Do you have bulk writes happening at some times with strict SLAs on the write latency?

  • Do you have point queries or range queries?

  • Are you aware of all the access paths to your data?

Know that "Everything in Software Architecture is a tradeoff" and there is no right or wrong. It just works or doesn't as per the requirements of your stakeholders. That is all.

Having all of the above data (requirements) will help you deal with

  • Schema Design

  • Row key design

  • Column family design

  • Regions design

  • Partition sizes

  • Salting

  • Number of regions per server

  • Block cache versus MemStore ratio

  • HBase Parameter fine-tuning

  • Caching Strategies

  • Compaction Strategies

  • Data Locality

  • Deleting or clearing data

  • And some guidelines to watch out for

Each of these is elaborated in the rest of this article.


Schema Design

As part of the Schema Design, the two very important aspects that are to be decided upfront while creating the table are the row key and the column families. Let us see the design considerations for the same.

(columns themselves are dynamic and do not require upfront design thinking )


Row Key Design

Your access paths or query paths are key in defining row keys. i.e. the columns that you will query by in your 'where' clause if you were to write a SQL statement, those are the columns that need to be in your row key.


Row key can be a single column or a composite of multiple columns. For example, if you fetch your data by product id, then that should be your row key. If you fetch your data by a product id and a store id, then the row key has to be a composite of these two columns.


The order of querying also determines the order of the columns in the row key.


Note that whatever you query by or filter by should be part of the row key. Else HBase has to scan all the regions of the table leading to a full table scan. That is extremely inefficient and hence should be avoided at all costs.


However, also note that if you create a row key that is a composite key of many columns, the purpose will be defeated and may lead to full table scans. You have to find a trade-off between the number of columns that form the row key and yet not scan the whole table.


Your row key design should finally ensure that you do not end up scanning the whole table for any query of yours.


How do you check this?

If you run an 'Explain plan' for your queries, you will see whether it is scanning the whole table - all the regions or is it doing a skip scan or range scan. This is possible when you use a SQL engine like Apache phoenix over HBase.


An example shown here:

#For this query:
SELECT PGDC, SUM(SFC) FROM LRF_PROMO_YRWKBPN_SPLIT WHERE (ID LIKE '201701%' OR ID LIKE '201702%' OR ID LIKE '201703%' OR ID LIKE '201704%') AND PGDC = 'H' GROUP BY PGDC;

The explain plan shows this:

| CLIENT 10-CHUNK 3946787 ROWS 3145728221 BYTES PARALLEL 10-WAY RANGE SCAN OVER DEV_RDF:LRF_PROMO_CVDIDX_YRWKBPN [0,'201701'] - [9,'201702']  |
|     SERVER FILTER BY FA.PGDC = 'H'  |                                
|     SERVER AGGREGATE INTO DISTINCT ROWS BY [FA.PGDC]  |                                                                                 
| CLIENT MERGE SORT   |                                                                                                                        

This gives an insight that a range scan is happening over 10 chunks of data (out of the 64 regions it has) and the entire table is not being scanned.


Note that HBase does not support secondary indexes unless you use a SQL engine like Apache Phoenix over it. The above example is using Apache Phoenix and secondary indexes.


Column Family Design


The read and write patterns determine the column family design.


Remember that all the data in one column family in a region is stored together in one HFile. So, if data from one column family is to be read, then the entire block of data in that HFile is loaded into memory, which includes all the columns in that column family. Hence store all the data that is retrieved together in the same column family. A similar argument holds good for updates or writes too.


However, this does not mean you have a huge number of column families separating columns for a huge number of use cases. Use as few column families as possible, typically restricting it to 2 or at the most 3. Having too many column families can cause many files to stay open per region and can trigger 'compaction storms'. (Compaction is a process where small files merge into a larger one. )


Since flushing to hard disk and compactions are done on a per region basis, even if one column family carries the bulk of the data, it causes flushes on the adjacent column families too. Keep this too in mind when you are designing column families. This means you should not have one column family with say, hundreds of columns while another column family with just 5 to 10 columns. Then, when it is forced to flush the large column family from memory to hard disk, it flushes the small column families too.


Regions Design & Optimization


Region Design

By default, each table has only a single region. Suppose, we have 4 tables, each with one region and suppose we also have 4 region servers, each server will have 1 region per server or rather one table per server. This is the default behaviour.


This does not help in performance as every read and write to a table goes to the same single server. This is not utilizing the power of distributed processing. Therefore, based on the table profile and the data size, you would want to decide how many regions should be there per table.


HBase itself starts to split a table into multiple regions after a default size of a region is reached. Depending on the version of HBase you are using, this default size could vary. It is defined by hbase.hregion.max.filesize property in the hbase-site.xml. For 0.90.x version a default is 256 MB and max recommended is 4 GB. In the HBase 1.x version, this has increased to 10 GB.


But relying on auto-sharding leads to performance degradation during the auto-splitting and rebalancing period.


Hence the best practice here is to decide the partitions upfront if you are aware of the range of the row key used. Create the region splits upfront during the table creation itself.


CREATE 'tableName', 'cfName'  SPLITS => ['10', '20', '30', '40']

This gives the boundaries on the key where the regions have to be split.


Partition Sizing / Region Sizing

Recollect that regions are equivalent to horizontal partitions or shards in HBase. Just as partitioning should ensure that data is not skewed in a single partition or a few partitions, the distribution of data between the regions should be even. This leads to a lot of efficiencies.


In the CREATE tableName statement above, if the data between 10 and 20 is very less and a lot more between 20 and 30, then change the split boundaries to make these two regions evenly distributed. May be something like 10, 25, 30 might turn out to be the right split.


If the regions are evenly spread, the IO is balanced between the servers. The writes into MemStore is balanced and the flushes to the disk are also balanced. Similarly, the reads are also spread out across the region servers thus using the cluster capacity effectively.


Note that if the reads distributed, cache evictions will be optimal too.


Salting

HBase sequential writes may suffer from region server hot-spotting if the row key is a monotonically increasing value. This can be overcome by salting. A detailed description of this solution is given here


If you are using Apache Phoenix, the salted table would be created like this

CREATE TABLE table (a_key VARCHAR PRIMARY KEY, a_col VARCHAR) SALT_BUCKETS = 20;

This ensures that the sequences are not stored in the same region server and hence hot-spotting is avoided.


For example, if the current date is part of the row key and all the data coming in is with the current date, only one region with this row key would get all the traffic. To avoid this, salt your row key and then it gets distributed to different regions.


However, salting has its own disadvantages when you want to retrieve data, esp, a data range based on a row key.


When to avoid salting?

If your table has no hot-spotting issues during write, then avoid salting altogether.


If you are trying to retrieve a contiguous range, without salting, you would have hit just one or a couple of regions and retrieved all the required data. This is because the data remains in a sorted order across regions and within regions. But with salting, the sorted order is not maintained across regions and you end up scanning all the regions for getting a range of row keys.


So, use salting only if necessary to avoid hot-spotting during writes. Hotspotting would become a problem only if you have a large inflow of continuous data into your servers, which beats the writing capacity of your server.


Keeping the regions sorted by row key will ensure it avoids unnecessary scans across regions for a specific range of data.


Number of regions per server

How many regions can you have in one region server? This depends on multiple factors like how big are your regions, how fast are you writing into your regions, whether is it a continuous flow of data or bulk data writes, how much are you reading back from the regions and what is the memory available on your region server.


There are many heuristics around this, all of which need to be taken with a pinch of salt and need to be tried out for your specific use case.


If you are a write-heavy use case with data flowing continuously, then you probably want to minimize the number of regions to 1 or a few per region server so that you are not filling your memstore too soon and flushing down to the disk too often.


However, if yours is a bulk write use case, you want to increase the parallelism with which you write. Hence more regions or partitions would increase the parallelism. This may dictate sometimes, that you have more regions per server. To explain this with an example, if your HBase cluster is a 16 nodes cluster, and you want to write same 500 GB to 1 TB of data in less than 10minutes, you would want high levels of parallelism in writing. If you go with 1 region per server, you will have 60 regions for this table and only 16 parallel writes are possible. To increase the oparallelismf writing you would make the table have 32 regions of 64 regions, leading to 4 regions per server. The write time for 64 regions is lower than write time for 16 regions, in he case of bulk writes.


However, a word of caution on making the regions too small such that you have very little data in each region. This will cause the reads to underperform as many regions may have to be scanned to get the data of interest.


The common heuristic that I have read is about 100 to 200 regions per server, but practically what I have seen is that the performance is manageable up to 100 regions and beyond that, it starts deteriorating.


Block cache versus MemStore ratio


We have seen that HBase caches read data in Block cache and write data in MemStore. Since HBase uses the JVM heap, by default, 40% of the heap memory is allotted to block cache and 40% to memstore while the rest is used for its own execution purposes.


This can be fine-tuned to improve either your read or write performance.


Read Heavy use case: To get very good read performance, we all know caching is the key. The more the read cache the better it is for read performance.