HBase table scans

In the previous chapter, we looked at CRUD operations in HBase. Now, let's go a step further and discuss table scans. In HBase, table scans are similar to iterators in Java or nonscrollable cursors in the RDBMS world. A scan is useful when a query must read a range of rows, or the whole table, and return only the records that match given criteria. The scan() operation reads a defined portion of the data, much as the get() operation reads a single row, and filters are then applied to that portion to narrow the results further.

The org.apache.hadoop.hbase.client package provides the Scan class with the following constructors:

Constructor Description

Scan() The default scan constructor reads the entire HBase table, including all the column families and the respective columns

Scan(byte[] startRow) Creates a scan operation starting at the specified row

Scan(byte[] startRow, byte[] stopRow) Creates a scan operation for the range of rows specified, including the start row and excluding the stop row

Scan(byte[] startRow, Filter filter) Creates a scan operation starting at the specified row and also applies the filter

Scan(Get get) Builds a scan object with the same specifications as Get

Scan(Scan scan) Creates a new instance of this class while copying all values

The scan() operation looks similar to the get() operation, but the constructors show the difference. A get() operation takes a single row key, whereas a scan takes an optional startRow parameter: the row key at which the scan starts reading data from the HBase table. The start row is included in the results. Similarly, the optional stopRow parameter is the row key at which the scan stops reading, and the stop row is excluded from the results.

Hence, by setting the start and stop keys for a partial key scan, it is possible to iterate over subsets of rows. We can also take offset and limit parameters and apply them to the rows on the client side.
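Applying an offset and a limit on the client side can be sketched as follows. This is a minimal illustration in plain Java, not HBase client code: the OffsetLimitSketch class and its page() method are hypothetical names, and an ordinary Iterator of strings stands in for the rows a scanner would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: an Iterator<T> stands in for the rows of a scan.
class OffsetLimitSketch {
    // Skips the first `offset` rows, then collects at most `limit` rows.
    static <T> List<T> page(Iterator<T> rows, int offset, int limit) {
        List<T> page = new ArrayList<T>();
        for (int skipped = 0; skipped < offset && rows.hasNext(); skipped++) {
            rows.next(); // discarded on the client after it was already read
        }
        while (page.size() < limit && rows.hasNext()) {
            page.add(rows.next());
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> scanned = Arrays.asList("row-01", "row-02", "row-03", "row-04", "row-05");
        System.out.println(page(scanned.iterator(), 1, 2)); // [row-02, row-03]
    }
}
```

Because the skipped rows are still read before they are thrown away, a large offset is paid for in full; narrowing the start row, where possible, avoids that cost.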

The scan() operation does not look for an exact match for the values defined for startRow and stopRow. It starts at the first row key that is equal to or greater than the given startRow value. In case no start row is specified, reading starts from the beginning of the table. Similarly, the scan ends as soon as the current row key is equal to or greater than the stopRow value, and in case no stop row is specified, the scan reads the data until the end of the table.
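The range matching described above can be modeled with a sorted map. The sketch below is plain Java rather than HBase client code: RowRangeSketch and scanKeys() are hypothetical names, a TreeMap stands in for the table, and String ordering stands in for the byte-wise ordering of row keys (the two agree for ASCII keys).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model: a TreeMap keeps keys sorted, as an HBase table does.
class RowRangeSketch {
    // Returns the keys in [startRow, stopRow); a null bound means "unbounded".
    static List<String> scanKeys(TreeMap<String, String> table, String startRow, String stopRow) {
        List<String> keys = new ArrayList<String>();
        for (String key : table.navigableKeySet()) {
            if (startRow != null && key.compareTo(startRow) < 0) continue; // before the start row
            if (stopRow != null && key.compareTo(stopRow) >= 0) break;     // reached the stop row
            keys.add(key);
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<String, String>();
        for (String k : new String[] {"row-02", "row-04", "row-06", "row-08", "row-10"}) {
            table.put(k, "value");
        }
        // Neither bound exists as a key: the scan starts at row-04 and ends before row-10.
        System.out.println(scanKeys(table, "row-03", "row-09")); // [row-04, row-06, row-08]
        // No start row: reading begins at the first key of the table.
        System.out.println(scanKeys(table, null, "row-06"));     // [row-02, row-04]
        // Unpadded keys sort as text, so "row-10" comes before "row-5".
        System.out.println("row-10".compareTo("row-5") < 0);     // true
    }
}
```

The last line is the reason row keys that embed numbers are usually zero-padded: without padding, the sort order of the keys no longer follows the numeric order.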

The scan() operation also defines one more optional parameter called filter. This filter is an instance of the Filter class present in the org.apache.hadoop.hbase.filter package.

Filters limit data retrieval by adding limiting selectors to the get() or scan() operation. Filters will be discussed in detail in the following section.

Once we have created the Scan instance, the following methods can be used to further narrow down the results:

Method name Description

addFamily(byte[] family) Gets all columns from the specified family.

addColumn(byte[] family, byte[] qualifier) Gets the column from the specific family with the specified qualifier.

setTimeRange(long minStamp, long maxStamp) Gets versions of columns only within the specified timestamp range [minStamp, maxStamp), that is, including minStamp and excluding maxStamp.

setTimeStamp(long timestamp) Gets versions of columns with the specified timestamp.

setMaxVersions(int maxVersions) Gets up to the specified number of versions of each column. The default is 1, which returns only the latest cell value.

setFilter(Filter filter) Applies the specified server-side filter when performing the query.

setStartRow(byte[] startRow) Sets the start row of the scan.

setStopRow(byte[] stopRow) Sets the stop row.
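How the time range and the maximum number of versions combine for a single cell can be sketched as follows. This is a plain Java model, not HBase client code: VersionSelectionSketch, sample(), and select() are hypothetical names, and a TreeMap from timestamp to value stands in for the stored versions of one cell. The model assumes the range includes minStamp and excludes maxStamp, and that the newest versions are returned first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of the versions stored for one cell.
class VersionSelectionSketch {
    // Three versions of one cell, keyed by timestamp.
    static TreeMap<Long, String> sample() {
        TreeMap<Long, String> versions = new TreeMap<Long, String>();
        versions.put(100L, "v1");
        versions.put(200L, "v2");
        versions.put(300L, "v3");
        return versions;
    }

    // Keeps timestamps in [minStamp, maxStamp), newest first, up to maxVersions.
    static List<String> select(TreeMap<Long, String> versions,
                               long minStamp, long maxStamp, int maxVersions) {
        List<String> picked = new ArrayList<String>();
        for (String value : versions.subMap(minStamp, true, maxStamp, false)
                                    .descendingMap().values()) {
            if (picked.size() == maxVersions) break;
            picked.add(value);
        }
        return picked;
    }

    public static void main(String[] args) {
        // Default-like behavior: only the latest version.
        System.out.println(select(sample(), 0L, Long.MAX_VALUE, 1)); // [v3]
        // The upper bound 300 is excluded, so v3 is not returned.
        System.out.println(select(sample(), 100L, 300L, 3));         // [v2, v1]
    }
}
```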

As discussed, the Scan class has multiple constructors, but the HTable class has no scan() method to execute a scan. Instead, we call one of the getScanner() methods available in the HTable class to obtain a scanner and browse through the results.

Method Description

getScanner(byte[] family) Gets a scanner on the current table for the given family

getScanner(byte[] family, byte[] qualifier) Gets a scanner on the current table for the given family and qualifier

getScanner(Scan scan) Returns a scanner on the current table as specified by the Scan object

All the preceding methods return an instance of ResultScanner. This interface gives iterator-like access to the rows selected by the Scan instance. A scan does not fetch the complete results in a single call, as this could be a very heavy call to make. The following methods of ResultScanner provide the iterative behavior:

Method Description

close() Closes the scanner and releases any resources it has allocated

next() Grabs the next row's worth of values

next(int nbRows) Grabs between zero and nbRows results
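The contract of the two next() variants can be illustrated with a small stand-in. The ScannerSketch class below is hypothetical and written in plain Java: it returns strings and a List where the real ResultScanner returns Result objects and an array, and it assumes that next() signals the end of the scan by returning null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a scanner over a fixed list of rows.
class ScannerSketch {
    private final Iterator<String> rows;

    ScannerSketch(List<String> rows) {
        this.rows = rows.iterator();
    }

    // Returns the next row, or null once the scanner is exhausted.
    String next() {
        return rows.hasNext() ? rows.next() : null;
    }

    // Returns between zero and nbRows rows, fewer if the scan ends early.
    List<String> next(int nbRows) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < nbRows; i++) {
            String row = next();
            if (row == null) break;
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        ScannerSketch scanner = new ScannerSketch(Arrays.asList("row-1", "row-2", "row-3"));
        System.out.println(scanner.next());  // row-1
        System.out.println(scanner.next(5)); // [row-2, row-3]
        System.out.println(scanner.next());  // null
    }
}
```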

The next() method returns a Result instance that represents the row's contents. By default, each result holds a single row, and a fresh RPC call is made for each next() call. To avoid too many calls, the scanner supports row caching. Within the hbase-site.xml configuration file, we can set the following property:

<property>

<name>hbase.client.scanner.caching</name>

<value>5</value>

</property>

This property raises the row caching from the default value of 1 to 5 for all the scan calls. We can also set the caching for all the scans on one table using the setScannerCaching(int scannerCaching) method on an HTable instance, or for an individual scan using the setCaching(int caching) method on the Scan instance.
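The effect of row caching on the number of calls can be simulated without a cluster. The sketch below is plain Java, not HBase client code: CachingScannerSketch and fetchesToDrain() are hypothetical names, a List stands in for the server, and each refill of the local cache counts as one fetch. Calls that a real client makes to open or close the scanner, or to confirm that the scan has ended, are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a scanner that prefetches `caching` rows per fetch.
class CachingScannerSketch {
    private final List<String> serverRows; // stands in for the rows held by the server
    private final int caching;             // rows fetched per simulated call
    private final ArrayDeque<String> cache = new ArrayDeque<String>();
    private int position = 0;
    int fetchCalls = 0;

    CachingScannerSketch(List<String> serverRows, int caching) {
        this.serverRows = serverRows;
        this.caching = caching;
    }

    // Serves rows from the local cache and refills it only when it runs empty.
    String next() {
        if (cache.isEmpty() && position < serverRows.size()) {
            int end = Math.min(position + caching, serverRows.size());
            cache.addAll(serverRows.subList(position, end));
            position = end;
            fetchCalls++;
        }
        return cache.poll(); // null once every row has been returned
    }

    // Reads a whole simulated table and reports how many fetches it took.
    static int fetchesToDrain(int rowCount, int caching) {
        List<String> rows = new ArrayList<String>();
        for (int i = 0; i < rowCount; i++) {
            rows.add("row-" + i);
        }
        CachingScannerSketch scanner = new CachingScannerSketch(rows, caching);
        while (scanner.next() != null) {
            // drain the scanner
        }
        return scanner.fetchCalls;
    }

    public static void main(String[] args) {
        System.out.println(fetchesToDrain(10, 1)); // 10
        System.out.println(fetchesToDrain(10, 5)); // 2
    }
}
```

A larger caching value trades fewer round trips for more memory held on the client per fetch.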

This caching works at the row level and might not be a good option for rows containing hundreds of columns. To limit the number of columns returned on each next() call, we can use the following method of the Scan class:

void setBatch(int batch)

Using scanner caching and batch together provides control over the number of RPC calls required to scan the row key range selected.
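A back-of-the-envelope estimate shows how the two settings interact. The sketch below is a simplified model in plain Java with hypothetical names (ScanRpcEstimate, results(), fetchRoundTrips()). It assumes that every row has the same number of columns, that a row wider than the batch size is split into several results, and that caching counts results per fetch. It ignores the extra calls a real client needs to open and close the scanner, as well as any size-based limits.

```java
// Hypothetical estimate; real clients add calls for opening and closing the scanner.
class ScanRpcEstimate {
    // Number of results: each row is split into ceil(colsPerRow / batch) pieces.
    static long results(long rows, long colsPerRow, long batch) {
        long piecesPerRow = (colsPerRow + batch - 1) / batch;
        return rows * piecesPerRow;
    }

    // Number of fetches needed to transfer all results, `caching` results at a time.
    static long fetchRoundTrips(long rows, long colsPerRow, long batch, long caching) {
        long total = results(rows, colsPerRow, batch);
        return (total + caching - 1) / caching;
    }

    public static void main(String[] args) {
        // 10 rows of 20 columns each
        System.out.println(fetchRoundTrips(10, 20, 20, 1));  // 10
        System.out.println(fetchRoundTrips(10, 20, 10, 10)); // 2
        System.out.println(fetchRoundTrips(10, 20, 5, 5));   // 8
    }
}
```

In this model, a small batch with a small caching value multiplies the number of round trips, while a batch at least as wide as the row leaves caching as the only factor.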

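Let's take a look at a complete example of the Scan usage:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanExample {
        public static void main(String[] args) throws IOException {
            // Get an instance of the default configuration
            Configuration conf = HBaseConfiguration.create();
            // Get the table instance
            HTable table = new HTable(conf, "tab1");
            // Create the Scan instance
            Scan scan = new Scan();
            // Restrict the scan to the "greet" column of the "cf1" family
            scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("greet"));
            // Set the start row (inclusive)
            scan.setStartRow(Bytes.toBytes("row-05"));
            // Set the stop row (exclusive)
            scan.setStopRow(Bytes.toBytes("row-10"));
            // Get the scanner and iterate over the results
            ResultScanner scanner = table.getScanner(scan);
            for (Result res : scanner) {
                System.out.println("Row Value: " + res);
            }
            scanner.close();
            table.close();
        }
    }

Assuming the row keys in the table are zero-padded, this example returns the rows from row-05 up to, but not including, row-10, restricted to the greet column of the cf1 column family. The padding matters because HBase compares row keys byte by byte: an unpadded row-10 sorts before row-5, so a scan from row-5 to row-10 would not return the intended rows.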

From the document A practical guide to realizing the seamless potential of storing and managing high-volume, high-velocity data quickly and painlessly with HBase (pages 51-55)