HBase: having fun with the shell


The HBase shell is a full interactive JRuby shell (IRB) that lets you query your data or run admin commands on an HBase cluster. Because it is built on JRuby, it is also a powerful interactive scripting environment. This post does not walk through the commands available in the shell (documentation and articles on that are easy to find online); it focuses on what else the shell makes possible.

Add custom command

Currently, there is no easy way to extend the set of commands available in the HBase shell. To create a custom command, you have to add a new file in HBASE_CLIENT_HOME/lib/ruby/shell/commands/ and edit the main Ruby file “shell.rb”. I hope a system of extensions/plugins will arrive in the future.

Let’s create a simple command that prints “Hello world” in the console. Not very useful, I agree, but all the following tips can be packaged as custom commands: generating fake data for HBase, scanning and printing values serialized in Avro, deleting rows via a scan… We will come back to these later. For now, create a new Ruby file named hello_world.rb in HBASE_CLIENT_HOME/lib/ruby/shell/commands/ containing:

module Shell
  module Commands
    class HelloWorld < Command
      def help
        return <<-EOF
Print 'Hello world' in the output
Syntax : hello_world
For example:
    hbase> hello_world
EOF
      end

      def command()
        puts "Hello world!"
      end
    end
  end
end
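Note that the file name and the class name have to agree. As far as I can tell from shell.rb, the shell requires the file named after the command and then looks up a class whose name is the camel-cased command name. The helper below is a plain-Ruby sketch of that convention (the method name command_class_name is mine, not something the shell exposes):

```ruby
# Sketch of the naming convention only: 'hello_world' -> 'HelloWorld'.
# command_class_name is a hypothetical helper, not part of the HBase shell.
def command_class_name(command_name)
  command_name.to_s.gsub(/(?:^|_)(.)/) { Regexp.last_match(1).upcase }
end

%w[hello_world status table_help].each do |name|
  puts "#{name}.rb should define Shell::Commands::#{command_class_name(name)}"
end
```

So a command registered as hello_world must live in hello_world.rb and define the class HelloWorld, as in the listing above.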

Now we have to register this command in one of the command groups. As you can see, shell.rb already defines some groups:
– GENERAL HBASE SHELL COMMANDS
– TABLES MANAGEMENT COMMANDS
– DATA MANIPULATION COMMANDS
– HBASE SURGERY TOOLS
– CLUSTER REPLICATION TOOLS
– ONLINE CONFIGURATION TOOLS
and more…

For our great command (…), let’s add it in the group “GENERAL HBASE SHELL COMMANDS”:

...

Shell.load_command_group(
  'general',
  :full_name => 'GENERAL HBASE SHELL COMMANDS',
  :commands => %w[
    status
    version
    table_help
    whoami
    processlist
    hello_world
  ]
)

...

Open a new session and try it:

hbase(main):009:0> help 'general'
Command: hello_world
Print 'Hello world' in the output
Syntax : hello_world
For example:
    hbase> hello_world

Command: status
Show cluster status. Can be 'summary', 'simple', 'detailed', or 'replication'. The
default is 'summary'. Examples:

...

hbase(main):010:0> hello_world
Hello world!

Great!

Print a warning message for sessions opened in production

When a new session starts in the HBase shell, a banner is printed. On a production cluster, it may be useful to warn the user that the session will execute commands on a sensitive platform. Yes, I mean production. The default banner looks like this:

Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
Version 1.2.4, r67592f3d062743907f8c5ae00dbbe1ae4f69e5af, Tue Oct 25 18:10:20 CDT 2016

hbase(main):001:0>

The banner is defined in shell.rb:

def print_banner
  puts 'HBase Shell'
  puts 'Use "help" to get list of supported commands.'
  puts 'Use "exit" to quit this interactive shell.'
  print 'Version '
  command('version')
  puts
end

Let’s edit it and add a message printed in red:

def print_banner
  puts 'HBase Shell'
  puts red('***************************************************************************')
  puts red('This is the production, be aware...')
  puts red('***************************************************************************')
  puts 'Use "help" to get list of supported commands.'
  puts 'Use "exit" to quit this interactive shell.'
  print 'Version '
  command('version')
  puts
end

def red(text) ; "\e[31m#{text}\e[0m" ; end

Now if you open a new shell, the warning is printed in red right after the 'HBase Shell' line and before the usual help text.
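If the same shell.rb is deployed on several clusters, the warning probably should not appear everywhere. Here is a minimal sketch of one way to make it conditional, assuming an environment variable tells the shell where it is running. The variable name HBASE_SHELL_ENV and the helper warning_lines are my own inventions, HBase does not define them:

```ruby
# HBASE_SHELL_ENV is a hypothetical variable, not something HBase provides.
ENV_VARIABLE = 'HBASE_SHELL_ENV'

def red(text)
  "\e[31m#{text}\e[0m"
end

# Returns the extra banner lines for the given environment name
# (an empty list for anything other than 'production').
def warning_lines(environment)
  return [] unless environment == 'production'

  separator = '*' * 75
  [separator, 'This is the production, be aware...', separator].map { |line| red(line) }
end

# Inside print_banner, this would go right after: puts 'HBase Shell'
warning_lines(ENV[ENV_VARIABLE]).each { |line| puts line }
```

Keeping the decision in a small function that returns lines, rather than printing directly, makes it easy to check outside the shell.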

Generate fake data

This is not a real tool for generating fake data, but it is pretty simple:

ary = [1,2,3,4,5]
ary.each do |i|
   put 'ns:my_table', 'r'+i.to_s, 'col1', 'value'+i.to_s
end

The idea is simple: generate a bunch of HBase Puts inside a loop. If you want more realistic data, several websites generate fake data sets. All you have to do is parse the resulting file (CSV) and generate the Puts, or use the HBase ImportTsv tool.
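To illustrate the CSV route, here is a small sketch that turns CSV lines into the argument lists of the shell's put command. The sample data, the column family 'd', the 'r' + id row key scheme, and the helper name put_arguments are all assumptions made for the example. It runs in plain Ruby, so the actual put call is only shown as a comment:

```ruby
require 'csv'

# Made-up sample; in practice this would be read from a generated CSV file.
SAMPLE_CSV = "id,name,city\n1,Alice,Paris\n2,Bob,Lyon\n"

# Builds one [table, row key, column, value] list per non-id CSV cell.
# Table name and column family are assumptions for this sketch.
def put_arguments(csv_text, table = 'ns:my_table', family = 'd')
  arguments = []
  CSV.parse(csv_text, :headers => true).each do |line|
    row_key = "r#{line['id']}"
    line.to_h.each do |column, value|
      next if column == 'id'
      arguments << [table, row_key, "#{family}:#{column}", value]
    end
  end
  arguments
end

put_arguments(SAMPLE_CSV).each do |args|
  # Inside the HBase shell this line would be: put(*args)
  puts args.inspect
end
```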

Scan values serialized in Avro

Not all the data in our HBase table is text. If you scan the table, though, only text values are printed legibly. The ‘scan’ command in the HBase shell uses Bytes.toStringBinary to print values: it writes a printable representation of a byte array in which every non-printable character is hex-escaped in the format \x%02X, e.g. \x00, \x05, etc.
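To make the escaping rule concrete, here is a pure-Ruby imitation of it. This is my reading of the behaviour (printable ASCII is kept, except the backslash itself, everything else becomes \xNN), not the actual Java implementation, and to_string_binary is just a name picked for the sketch:

```ruby
# Imitation of the escaping described above, for illustration only.
# Takes an array of byte values (0..255, or signed Java-style bytes).
def to_string_binary(bytes)
  bytes.map do |byte|
    byte &= 0xFF
    printable = byte >= 0x20 && byte <= 0x7E && byte != 0x5C
    printable ? byte.chr : format('\\x%02X', byte)
  end.join
end

sample = [0x00, 0x16, 0x00, 0x0C] + 'Chrome'.bytes + [0x02, 0x02]
puts to_string_binary(sample)
```

The sample bytes are the ones behind the Avro value shown in the scan output further down.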

From the doc:

Besides the default ‘toStringBinary’ format, ‘scan’ supports custom formatting by column.
A user can define a FORMATTER by adding it to the column name in the scan specification. The FORMATTER can be stipulated:

  1. either as a org.apache.hadoop.hbase.util.Bytes method name (e.g, toInt, toString)
  2. or as a custom class followed by method name: e.g. ‘c(MyFormatterClass).format’.

This is useful for values like int, long, float… but not for Avro values. In the shell, Avro values look like this:

hbase(main):018:0> scan 'my_avro_table'
ROW                                           COLUMN+CELL

 row1                                         column=d:test, timestamp=1481976466987, value=\x00\x16\x00\x0CChrome\x02\x02

1 row(s) in 0.0590 seconds

Not very readable… And there is no formatter available for Avro.
To decode the values, you first need to find the Avro schema that was used when they were stored in HBase. For this example, I will use this one from my previous post about Parquet:

{
   "type":"record",
   "name":"AccessLog",
   "namespace":"fr.layer4",
   "fields":[
      {
         "name":"id",
         "type":[
            "int",
            "null"
         ]
      },
      {
         "name":"useragent",
         "type":[
            "string",
            "null"
         ]
      },
      {
         "name":"ip",
         "type":[
            "string",
            "null"
         ]
      },
      {
         "name":"path",
         "type":[
            "string",
            "null"
         ]
      }
   ]
}
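Before reaching for the Avro library, it helps to see what the bytes from the scan actually mean. In Avro's binary encoding, ints and lengths are zigzag varints, a string is a length followed by UTF-8 bytes, and a union is written as the branch index followed by the value. The decoder below is a hand-rolled sketch hard-wired to this one schema (all method names are mine), only meant to explain the bytes:

```ruby
require 'stringio'

# Reads one zigzag-encoded variable-length integer.
def read_varint(io)
  shift = 0
  raw = 0
  loop do
    byte = io.readbyte
    raw |= (byte & 0x7F) << shift
    break if (byte & 0x80).zero?
    shift += 7
  end
  (raw >> 1) ^ -(raw & 1)
end

# Reads a length-prefixed UTF-8 string.
def read_string(io)
  length = read_varint(io)
  io.read(length).force_encoding('UTF-8')
end

# Each field is a union [type, "null"]: branch 0 carries a value, branch 1 is null.
def read_nullable(io)
  read_varint(io).zero? ? yield(io) : nil
end

# Decodes one AccessLog record; field order follows the schema above.
def decode_access_log(bytes)
  io = StringIO.new(bytes.dup.force_encoding('BINARY'))
  {
    'id' => read_nullable(io) { |input| read_varint(input) },
    'useragent' => read_nullable(io) { |input| read_string(input) },
    'ip' => read_nullable(io) { |input| read_string(input) },
    'path' => read_nullable(io) { |input| read_string(input) }
  }
end

puts decode_access_log("\x00\x16\x00\x0CChrome\x02\x02").inspect
```

For the value seen in the scan, \x00\x16 is branch 0 followed by the int 11, \x00\x0CChrome is branch 0 followed by a 6-byte string, and each \x02 selects the null branch. For real use, let the Avro library do this work.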

Then create a Ruby (.rb) file containing:

import java.io.File
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.io.DecoderFactory

table = 'my_avro_table'.to_java_bytes

config = HBaseConfiguration.create
schema = Schema::Parser.new().parse(File.new("schema.avsc"))
reader = GenericDatumReader.new(schema)

scan = Scan.new()
htable = HTable.new(config, table)
scanner = htable.getScanner(scan)
scanner.each do |r|
 row_as_string = Bytes.toString(r.getRow)
 raw_content = r.getValue('d'.to_java_bytes,'test'.to_java_bytes)
 content = raw_content != nil ? reader.read(nil,DecoderFactory.new().binaryDecoder(raw_content, nil)).to_s : 'null'
 puts "#{row_as_string}     #{content}"
end
scanner.close
htable.close

And launch it:

row1     {"id": 11, "useragent": "Chrome", "ip": null, "path": null}
row2     {"id": 22, "useragent": "Chrome", "ip": null, "path": null}
row3     null
row4     {"id": 44, "useragent": "Chrome", "ip": null, "path": null}

And it’s done!

Delete values with a scan

If you have already worked with HBase, you know that this is an important missing feature. A bunch of qualifiers are no longer needed? You generated the wrong key for imported data? In HBase, you can scan to retrieve data, but you cannot scan to delete data (like a ‘DELETE FROM TABLE xxx WHERE …’). Here is a small example of a solution you can build with a Ruby script:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter
import org.apache.hadoop.hbase.util.Bytes

table = 'ns:my_table'.to_java_bytes
family = 'd'.to_java_bytes

config = HBaseConfiguration.create

scan = Scan.new()
scan.setStartRow "a".to_java_bytes
scan.setStopRow "c".to_java_bytes
scan.setCacheBlocks false
scan.setCaching 10
scan.addFamily family
filter = FirstKeyOnlyFilter.new()
scan.setFilter(filter)

htable = HTable.new(config, table)
scanner = htable.getScanner(scan)
scanner.each do |r|
 row_as_string = Bytes.toString(r.getRow)
 if row_as_string.start_with?('ab')
  puts "Deleting #{row_as_string}..."
  delete = Delete.new r.getRow()
  htable.delete delete
 else
  puts "Skip #{row_as_string}"
 end
end
scanner.close
htable.close

Feel free to play with the scan options, like the filters, the start/stop rowkey…
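Since deletes are hard to undo, I find it useful to do a dry run first. Here is a minimal sketch in plain Ruby that only decides which row keys would be deleted and groups them in batches. The sample keys and the helper name plan_deletes are made up. In the real script, each batch could be turned into a list of Delete objects and sent in a single call, since the table's delete method also accepts a list:

```ruby
# Dry run: select the row keys matching the prefix and group them in batches.
# Nothing is deleted here; this only works on plain strings.
def plan_deletes(row_keys, prefix, batch_size)
  row_keys.select { |key| key.start_with?(prefix) }.each_slice(batch_size).to_a
end

sample_keys = %w[aa1 ab1 ab2 ab3 b1] # made-up row keys
plan_deletes(sample_keys, 'ab', 2).each do |batch|
  puts "Would delete: #{batch.join(', ')}"
end
```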

Export to CSV

More precisely, “export small to medium amounts of HBase data to CSV”. Tools such as Hive, Spark or Pig can already export data from HBase to a delimited text file. The goal here is not to replace them, just to provide another way of extracting data out of HBase.

require 'csv'

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter
import org.apache.hadoop.hbase.util.Bytes

table = 'ns:my_table'.to_java_bytes
family = 'd'.to_java_bytes

config = HBaseConfiguration.create

scan = Scan.new()

htable = HTable.new(config, table)
scanner = htable.getScanner(scan)

CSV.open('extract.csv', 'w') do |csv_object|
 scanner.each do |r|
  results = Array.new
  results << Bytes.toString(r.getRow)
  results << Bytes.toString(r.getValue('d'.to_java_bytes,'the_qualifier'.to_java_bytes))
  csv_object << results
 end
end
scanner.close
htable.close

Not a production grade tool, but very helpful.
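One reason to go through Ruby's CSV library rather than joining values with commas: it takes care of quoting and of missing cells. The sketch below uses made-up rows instead of a scanner, and adds a header line, which the script above does not write (the helper name to_csv is mine):

```ruby
require 'csv'

# Builds CSV text from [row key, value] pairs, with a header line.
def to_csv(rows)
  CSV.generate do |csv|
    csv << %w[rowkey the_qualifier]
    rows.each { |row| csv << row }
  end
end

sample_rows = [
  ['row1', 'plain value'],
  ['row2', 'value, with a comma'], # gets quoted automatically
  ['row3', nil]                    # missing qualifier becomes an empty field
]
puts to_csv(sample_rows)
```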

Display information about regions

This is not something you will use every day, but this example shows how to scan the “hbase:meta” table and display information about the regions of each table:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes

config = HBaseConfiguration.create

htable = HTable.new(config, "hbase:meta")
scan = Scan.new()
scanner = htable.getScanner(scan)
tables = {}
scanner.each do |r|
 split = Bytes.toString(r.getRow).split(",")
 table_name = split[0]
 if not tables.has_key?(table_name)
  tables[table_name] = Array.new
 end
 tables[table_name].push(split[1] + "-" + split[2])
end
scanner.close
htable.close
tables.each do |table, regions|
 puts "Table #{table}: "
 regions.each  do |region|  
  puts "    Region #{region}"
 end
end

You should see something like this:

Table testv1:
    Region XXXXXXXXXXXXXXXXXX
    Region XXXXXXXXXXXXXXXXXX
    Region XXXXXXXXXXXXXXXXXX

Table testv2:
    Region XXXXXXXXXXXXXXXXXX
    Region XXXXXXXXXXXXXXXXXX
    Region XXXXXXXXXXXXXXXXXX
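One caveat with the split(",") above: a region's start key may itself contain a comma, in which case split[1] and split[2] no longer mean what the script expects. Here is a slightly more defensive sketch, assuming the row key layout is table name, start key, region id, separated by commas, and that the table name never contains one. The helper name and the sample row keys are invented for the example:

```ruby
# Splits an hbase:meta row key on its first and last commas.
# Returns nil when the key does not have at least two commas.
def parse_region_row(row_key)
  first = row_key.index(',')
  last = row_key.rindex(',')
  return nil if first.nil? || first == last

  {
    :table => row_key[0...first],
    :start_key => row_key[(first + 1)...last],
    :region_id => row_key[(last + 1)..-1]
  }
end

# Made-up row keys, the second one with commas inside the start key.
sample_rows = [
  'testv1,,1481976466987.0123abcd.',
  'testv1,key,with,commas,1481976467000.4567ef01.'
]
regions_by_table = Hash.new { |hash, key| hash[key] = [] }
sample_rows.each do |row|
  parsed = parse_region_row(row)
  regions_by_table[parsed[:table]] << "#{parsed[:start_key]}-#{parsed[:region_id]}"
end
puts regions_by_table.inspect
```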

Sources:
The Apache HBase Shell: official documentation
HBase Shell on Github

Credits:
“person-sky-silhouette-night” by snapwire is licensed under CC0 1.0 / Resized
