DataPotion
Posted on July 8, 2022
One of the most useful functions available in HBase are filters. In this article, I will show you how to use filters to get lookup columns with specific values. Every example is shown in Hbase Shell, but they are also available in the API, so you can use it in your ETL application as well.
First, let's check which filters are available for us from Hbase Shell. To do this use show_filters
command in Hbase Shell:
As you can see there are multiple filters available but we are focusing on Column Filters, so 3 that are useful for us are ColumnValueFilter, ValueFilter, and SingleColumnValueFilters. All of these filters have very similar properties and all of them can be used for filtering columns.
Let's assume we have a table named myTable with the following contents.
Rowkey | myColumnFamily:columnA | myColumnFamily:columnB |
---|---|---|
row1 | value1 | |
row2 | value2 | |
row3 | value3 | valueX |
row3_special | value4 |
hbase(main):003:0> scan 'myTable'
ROW COLUMN+CELL
row1 column=myColumnFamily:columnA, timestamp=1656785481871, value=value1
row2 column=myColumnFamily:columnA, timestamp=1656785500397, value=value2
row3 column=myColumnFamily:columnA, timestamp=1656785504786, value=value3
special_row3 column=myColumnFamily:columnA, timestamp=1656785539020, value=value4
ColumnValueFilter
The first option to filter values is to use ColumnValueFilter.
As an argument we pass column family and column name in which we want to search, and
filtering argument in this example we search for a value that equals value1.
scan 'myTable', {FILTER => "ColumnValueFilter('myColumnFamily','columnA',=,'binary:value1')"}
ColumnValueFilter will return only columns that match our filtering statement.
hbase(main):010:0> scan 'myTable', {FILTER => "ColumnValueFilter('myColumnFamily','columnA',=,'binary:value1')"}
ROW COLUMN+CELL\n
row1 column=myColumnFamily:columnA,\n timestamp=1656785481871, value=value1\n
1 row(s)\n
Took 0.0188 seconds\n
hbase(main):011:0>
SingleColumnValueFilter
SingleColumnValueFilter returns an entire row when specified condition is matched. Besides that it works exactly like ColumnValueFilter.
This example also shows another form of filtering comparison - when only the substring of the value matches it returns this value.
hbase(main):022:0> scan 'myTable', {FILTER => "SingleColumnValueFilter('myColumnFamily','columnA',=,'substring:value')"}
ROW COLUMN+CELL\n
row1 column=myColumnFamily:columnA, timestamp=1656785481871, value=value1\n
row1 column=myColumnFamily:columnB, timestamp=1656802766312, value=valueX\n
row2 column=myColumnFamily:columnA, timestamp=1656785500397, value=value2\n
row3 column=myColumnFamily:columnA, timestamp=1656785504786, value=value3\n
special_row3 column=myColumnFamily:columnA, timestamp=1656785539020, value=value4\n
4 row(s)\n
Took 0.0253 seconds\n
hbase(main):023:0> scan 'myTable', {FILTER => "SingleColumnValueFilter('myColumnFamily','columnA',=,'substring:X')"}
ROW COLUMN+CELL
0 row(s)
Took 0.0047 seconds
⚠️WARNING ⚠️
If the column is not found every column from searched row will be filtered
ValueFilter
In ValueFilter we don't provide the column family and column name, just the filter statement. Which makes it search in every column possible. ValueFilter return only rows that match statement.
From ValueFilter documentation:
To test the value of a single qualifier when scanning multiple qualifiers, use link SingleColumnValueFilter
hbase(main):004:0> scan 'myTable', {FILTER => "ValueFilter(=,'binary:valueX')"}
ROW COLUMN+CELL
row1 column=myColumnFamily:columnB, timestamp=1656802766312, value=valueX
1 row(s)
Took 0.0106 seconds
Conclusion
- ValueFilters or specific column values in every column
- SingleValueColumnFilter returns only matched row in a specified column
- ColumnValueFilter returns every row in case it matched the row in a specified column
Posted on July 8, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.