• HBase table scans
• Implementing filters
Let's get an insight into the listed advanced concepts of HBase.
Understanding keys
In HBase, we primarily have the following keys to handle data within the tables:
• Row Key: This provides a logical representation of an entire row, containing all the column families and column qualifiers
• Column Key: This is formed by combining the column family and the column qualifier
Logically, the data stored in cells is arranged in a tabular format, but physically, these tabular rows are stored as linear sets of the actual cells. These linear sets of cells contain all the real data inside them.
Additionally, multiple versions of the same cell are stored as separate cells in these linear sets, and each version is stored along with its own timestamp.
These linear sets of cells are sorted in descending order by their timestamp so that the HBase client always fetches the most recent value of the cell data.
The following diagram represents how data is stored physically on the disk:
Column Family: CF1
Row Key    Col-1    Col-2
ROW-1      David    982 765 2345
ROW-2
In HBase, the entire cell, along with the added structural information such as the row key and timestamp, is called the key value. Hence, each cell not only represents the column and data, but also the row key and timestamp stored.
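The ordering of key values can be sketched in plain Java without the HBase client library. The StoredCell class and its KEY_ORDER comparator are hypothetical names used only for illustration; they imitate the sort order described above (row key, then column key, then timestamp descending) and are not HBase's own classes. The "old value" entry is made-up sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a key value: the cell data plus its coordinates.
final class StoredCell {
    final String rowKey;
    final String columnKey; // column family and qualifier combined
    final long timestamp;
    final String value;

    StoredCell(String rowKey, String family, String qualifier, long timestamp, String value) {
        this.rowKey = rowKey;
        this.columnKey = family + ":" + qualifier;
        this.timestamp = timestamp;
        this.value = value;
    }

    // Row key ascending, column key ascending, timestamp descending (newest first).
    static final Comparator<StoredCell> KEY_ORDER =
        Comparator.comparing((StoredCell c) -> c.rowKey)
            .thenComparing((StoredCell c) -> c.columnKey)
            .thenComparing(Comparator.comparingLong((StoredCell c) -> c.timestamp).reversed());

    // Returns the cells as one linear, sorted sequence.
    static List<StoredCell> linearize(List<StoredCell> cells) {
        List<StoredCell> sorted = new ArrayList<>(cells);
        sorted.sort(KEY_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<StoredCell> cells = Arrays.asList(
            new StoredCell("ROW-1", "CF1", "Col-1", 1L, "old value"),
            new StoredCell("ROW-1", "CF1", "Col-2", 1L, "982 765 2345"),
            new StoredCell("ROW-1", "CF1", "Col-1", 2L, "David"));
        for (StoredCell c : linearize(cells)) {
            System.out.println(c.rowKey + "/" + c.columnKey + "/" + c.timestamp + " = " + c.value);
        }
    }
}
```

Because the newest version of a column sorts first, a reader that stops at the first matching cell sees the most recent value.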
While designing tables in HBase, we usually have two options to go for:
• Fewer rows with many columns (flat and wide tables)
• Fewer columns with many rows (tall and narrow tables)
Let's consider a use case where we need to store all the tweets made by a user in a single row. This approach might work for many users, but some users will have a very large number of tweets in their account. In HBase, regions are split only at row boundaries, so a single row is never divided across regions and an oversized row cannot be spread out. This reinforces the recommendation for tall and narrow tables that have fewer columns with many rows.
Hence, a better approach would be to store each tweet of a user in a separate row, where the row key is the combination of the user ID and the tweet ID. Having rows with fewer columns is just a logical representation; physically, at the disk level, it makes no difference, as all the values are stored in linear sets. Hence, whether the tweet ID is placed in the column qualifier or in the row key, each cell will ultimately contain a single tweet message.
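A composite row key of this kind might be built as follows. This is a minimal sketch that assumes numeric, non-negative user and tweet IDs encoded as fixed-width big-endian longs; the TweetRowKey class and its methods are hypothetical and use only the JDK.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: row key = 8 bytes of user ID followed by 8 bytes of tweet ID.
final class TweetRowKey {
    static byte[] of(long userId, long tweetId) {
        // ByteBuffer writes big-endian by default.
        return ByteBuffer.allocate(2 * Long.BYTES).putLong(userId).putLong(tweetId).array();
    }

    static long userId(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey).getLong(0);
    }

    static long tweetId(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey).getLong(Long.BYTES);
    }

    // Prefix shared by every row of one user.
    static byte[] userPrefix(long userId) {
        return ByteBuffer.allocate(Long.BYTES).putLong(userId).array();
    }

    static boolean belongsTo(byte[] rowKey, long userId) {
        return Arrays.equals(Arrays.copyOf(rowKey, Long.BYTES), userPrefix(userId));
    }

    public static void main(String[] args) {
        byte[] key = of(42L, 1001L);
        System.out.println("user=" + userId(key) + " tweet=" + tweetId(key)
            + " length=" + key.length);
    }
}
```

With a fixed-width user ID at the front, all the tweets of one user sort next to each other, so they can be read with a single range scan that starts at the user prefix.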
Consider another use case of processing streaming events, which is a classic example of time series data. The source of the streaming data could be anything, for example, stock exchange real-time feeds, data coming from a sensor, or data coming from the network monitoring system for the production environment. While designing the table structure for the time series data, we usually consider the event's time as a row key.
In HBase, rows are stored in regions, each of which holds a distinct, sorted range of rows delimited by a start key and a stop key. Sequentially increasing time series keys therefore all get written to the same region, so all the data is ingested by the single region server that hosts this region, leading to a hotspot. This concentration of writes limits the read/write performance of the cluster to the speed of a single server.
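The hotspot effect can be illustrated with a toy model of region boundaries. The sketch below assumes three regions with made-up start keys and uses a sorted map to find the region whose range contains a given row key. ToyRegionMap is a hypothetical class and not part of the HBase API.

```java
import java.util.TreeMap;

// Toy model: each region covers the keys from its start key up to the next start key.
final class ToyRegionMap {
    private final TreeMap<Long, String> regionsByStartKey = new TreeMap<>();

    void addRegion(long startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // The owning region is the one with the greatest start key <= rowKey.
    // Assumes the first region starts at Long.MIN_VALUE so every key has an owner.
    String regionFor(long rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        ToyRegionMap regions = new ToyRegionMap();
        regions.addRegion(Long.MIN_VALUE, "region-A");
        regions.addRegion(1_000L, "region-B");
        regions.addRegion(2_000L, "region-C");

        // Monotonically increasing event times all resolve to the last region.
        for (long eventTime = 2_500L; eventTime < 2_505L; eventTime++) {
            System.out.println(eventTime + " -> " + regions.regionFor(eventTime));
        }
    }
}
```

Every new event time is larger than all the previous ones, so it always falls into the region with the highest start key, while the other regions receive no writes.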
To keep the data from being written to a single region server, a simple solution is to prefix the row key with a nonsequential value so that the data is distributed over all the region servers instead of just one. The following approaches can be considered:
• Salting: A salt prefix can be added to the row key to ensure that the data is stored across all the region servers. For example, we can compute a salt number by taking the hash code of the timestamp modulo the number of region servers. The drawback of this approach is that the rows of a time range end up spread across the region servers, and this needs to be handled in the client code for the get() or scan() operation. An example of salting is shown in the following code:
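// Bytes is org.apache.hadoop.hbase.util.Bytes; timestamp is assumed to be a long
// floorMod keeps the salt non-negative even when the hash code is negative
int saltNumber = Math.floorMod(Long.hashCode(timestamp), <number of region servers>);
// Row key = salt prefix followed by the timestamp
byte[] rowkey = Bytes.add(Bytes.toBytes(saltNumber), Bytes.toBytes(timestamp));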
• Hashing: This approach is not suited for time series data, because hashing the timestamp destroys the ordering of consecutive values, so reading the data between two points in time with a range scan is no longer possible.
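The read-side cost of salting can be sketched as follows. The example assumes a fixed bucket count (BUCKETS, a hypothetical constant standing in for the number of region servers) and a one-byte salt. It uses only the JDK, so ByteBuffer takes the place of HBase's Bytes helper here, and all the class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class SaltedKeys {
    static final int BUCKETS = 4; // assumption: one bucket per region server

    // Deterministic, non-negative salt derived from the timestamp.
    static int saltFor(long timestamp) {
        return Math.floorMod(Long.hashCode(timestamp), BUCKETS);
    }

    // Row key used for writes: one salt byte followed by the 8-byte timestamp.
    static byte[] rowKey(long timestamp) {
        return rowKey(saltFor(timestamp), timestamp);
    }

    static byte[] rowKey(int salt, long timestamp) {
        return ByteBuffer.allocate(1 + Long.BYTES).put((byte) salt).putLong(timestamp).array();
    }

    // A time-range read needs one {start key, stop key} pair per bucket;
    // the client runs a scan for each pair and merges the results itself.
    static List<byte[][]> scanRanges(long fromTimestamp, long toTimestamp) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int salt = 0; salt < BUCKETS; salt++) {
            ranges.add(new byte[][] { rowKey(salt, fromTimestamp), rowKey(salt, toTimestamp) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long ts = 1_000L; ts < 1_004L; ts++) {
            System.out.println(ts + " -> bucket " + saltFor(ts));
        }
        System.out.println("scans per range read: " + scanRanges(1_000L, 2_000L).size());
    }
}
```

Consecutive timestamps land in different buckets, which spreads the writes, but a single logical range read turns into BUCKETS separate scans.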
HBase does not provide direct support for secondary indexes, but there are many use cases that require secondary indexes such as:
• A cell lookup using coordinates other than the row key, column family name, and qualifier
• Scanning a range of rows from the table ordered by the secondary index
Due to the lack of direct support, we can use the following approaches in HBase to create secondary indexes, which store a mapping between the new coordinates and the existing coordinates:
• Application-managed approach: This approach moves the responsibility completely into the application or client layer. It deals with a data table and one or more lookup/mapping tables. Whenever the code writes into the data table, it also updates the lookup tables. The main advantage of this approach is that it provides full control over mapping the keys, as the full mapping logic is written at the client's end. However, this liberty also carries a cost: some mappings are orphaned if a client process fails, and cleaning up the orphaned mappings (using MapReduce) is another overhead. The lookup/mapping tables also take up cluster space and consume processing power.
• Indexing solutions for HBase: Other indexing solutions are also available to provide secondary index support in HBase, such as the Lily HBase Indexer (http://ngdata.github.io/hbase-indexer/). This solution quickly indexes HBase rows into Solr and provides the ability to easily search for any content stored in HBase. Such solutions do not require a separate table for each index; rather, they maintain the indexes purely in memory. These solutions index the on-disk data, and during searches, only the in-memory index details are used to locate the data. The main advantage of this solution is that the index is never out of sync.
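The application-managed approach can be sketched with two sorted maps standing in for the data table and the lookup table. Everything here is a hypothetical in-memory model and not HBase client code: the e-mail column, the IndexedUsers class, and the index key layout (secondary value, a separator, then the primary row key) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class IndexedUsers {
    // Data table: row key -> e-mail address (the value we want to look up by).
    private final TreeMap<String, String> dataTable = new TreeMap<>();
    // Lookup table: e-mail + '\0' + row key -> row key.
    private final TreeMap<String, String> emailIndex = new TreeMap<>();

    private static String indexKey(String email, String rowKey) {
        return email + '\0' + rowKey;
    }

    // Every write updates both tables and removes the stale mapping, if any.
    void put(String rowKey, String email) {
        String previous = dataTable.put(rowKey, email);
        if (previous != null) {
            emailIndex.remove(indexKey(previous, rowKey));
        }
        emailIndex.put(indexKey(email, rowKey), rowKey);
    }

    // Range read over the lookup table: all index keys that start with email + '\0'.
    List<String> rowKeysByEmail(String email) {
        return new ArrayList<>(emailIndex.subMap(email + '\0', email + '\u0001').values());
    }

    public static void main(String[] args) {
        IndexedUsers users = new IndexedUsers();
        users.put("user-1", "a@example.com");
        users.put("user-2", "b@example.com");
        users.put("user-3", "a@example.com");
        System.out.println(users.rowKeysByEmail("a@example.com"));
    }
}
```

In this model both updates happen inside one method call. Against a real cluster the two writes are separate operations, so a client that fails between them leaves behind the orphaned mappings described above.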
HBase provides an advanced feature called coprocessors that can also be used to achieve a behavior similar to that of secondary indexes. Coprocessors provide a framework for flexible and generic extensions that run distributed computation directly within the HBase server processes.