Sunday, January 27, 2013

Apache HBase Performance Considerations

Apache HBase is a column-oriented database in the Hadoop ecosystem, in which columns are grouped into column families. Since each column family can store a different type of data, it is important to understand the performance options, such as compression, that you can set at the column family level.

For an excellent writeup on the various column family performance options, see http://jimbojw.com/wiki/index.php?title=Understanding_HBase_column-family_performance_options
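
As a rough illustration of a per-family setting, here is a minimal sketch that pipes a create statement into the HBase shell. The table and family names ('blog', 'posts', 'images') are borrowed from the tutorial further down this page. HBASE_CMD and create_blog_table are hypothetical names of mine, and GZ is only one of the available codecs. The idea is that text compresses well, while JPEG images are already compressed and gain little.

```shell
# Path to the hbase launcher script; HBASE_CMD is a hypothetical override.
HBASE_CMD="${HBASE_CMD:-./hbase}"

# Hypothetical helper: create 'blog' with GZ compression on the text-heavy
# 'posts' family and no compression on 'images' (JPEGs are already compressed).
create_blog_table() {
  "$HBASE_CMD" shell <<'EOF'
create 'blog', {NAME => 'posts', COMPRESSION => 'GZ'}, {NAME => 'images', COMPRESSION => 'NONE'}
describe 'blog'
exit
EOF
}

# Usage, from the bin directory with HBase running:
#   create_blog_table
```

The describe call prints the resulting family settings, so you can confirm which compression each family ended up with.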

Apache HBase - a simple tutorial

Apache HBase is a column-oriented database in the Hadoop ecosystem. You can learn more about it on its website at http://hbase.apache.org/


HBase Operations

Step 1: Download HBase

I downloaded hbase-0.94.4, which was the latest version at the time of writing. You may get a later version.

Step 2: Unzip HBase

$> mkdir hbase
$> gunzip hbase-0.94.4.tar.gz
$> ls
hbase-0.94.4.tar

$> tar xvf hbase-0.94.4.tar

Now you should have a directory called hbase-0.94.4
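
The gunzip and tar steps above can also be combined, since tar's -z flag decompresses the archive itself. A minimal sketch, where extract_tarball and its optional target directory are hypothetical names of mine:

```shell
# Hypothetical helper: extract a gzipped tarball in one step.
# -x extract, -z run the archive through gunzip, -f read from the given file.
extract_tarball() {
  archive="$1"     # e.g. hbase-0.94.4.tar.gz
  dest="${2:-.}"   # optional target directory, defaults to the current one
  mkdir -p "$dest"
  tar -xzf "$archive" -C "$dest"
}

# Usage:
#   extract_tarball hbase-0.94.4.tar.gz
```

Unlike gunzip, this leaves the original .tar.gz file in place.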

$> cd hbase-0.94.4

Step 3:  Start HBase Daemon

$> cd bin
$> ./hbase-daemon.sh start master
starting master, logging to  .../hbase-0.94.4/bin/../logs/hbase-anil-master-2.local.out
$

Step 4:  Enter HBase Shell

$> ./hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.4, r1428173, Thu Jan  3 06:29:56 UTC 2013

hbase(main):001:0>


Step 5:  Create an HBase Table 


The table will be called "blog", with one column family called "posts" and another called "images".


hbase(main):007:0> create 'blog', 'posts', 'images'
0 row(s) in 1.0610 seconds


Step 6: Populate the HBase Table

hbase(main):009:0> put 'blog','firstpost','posts:title','My HBase Post'
0 row(s) in 0.0220 seconds

hbase(main):010:0> put 'blog','firstpost','posts:author','Anil'
0 row(s) in 0.0050 seconds

hbase(main):011:0> put 'blog','firstpost','posts:location','Chicago'
0 row(s) in 0.0070 seconds

hbase(main):012:0> put 'blog','firstpost','posts:content','HBase is cool'
0 row(s) in 0.0050 seconds

hbase(main):014:0> put 'blog','firstpost','images:header', 'first.jpg'
0 row(s) in 0.0060 seconds

hbase(main):015:0> put 'blog','firstpost','images:bodyimage', 'second.jpg'
0 row(s) in 0.0040 seconds



INFO ON HBASE CELL INSERTION FORMAT
NOTE:  Put a cell 'value' at specified table/row/column and optionally
timestamp coordinates.  To put a cell value into table 't1' at
row 'r1' under column 'c1' marked with the time 'ts1', do:
        hbase> put 't1', 'r1', 'c1', 'value', ts1



Step 7:  Verify the HBase Table Contents

hbase(main):016:0> get 'blog','firstpost'
COLUMN                CELL                                                    
 images:bodyimage     timestamp=1359347351382, value=second.jpg              
 images:header        timestamp=1359347324836, value=first.jpg                
 posts:author         timestamp=1359347197336, value=Anil                    
 posts:content        timestamp=1359347230734, value=HBase is cool            
 posts:location       timestamp=1359347210258, value=Chicago                  
 posts:title          timestamp=1359347161523, value=My HBase Post            
6 row(s) in 0.0350 seconds
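
The timestamp shown for each cell is its version. When you do not pass one to put, HBase uses the server clock in milliseconds since the Unix epoch. Here is a small sketch to make such a value readable. It assumes GNU date (BSD and macOS date take -r N instead of -d @N), and ms_to_utc is a hypothetical helper name.

```shell
# Hypothetical helper: convert an HBase cell timestamp (milliseconds since
# the Unix epoch) to a UTC date string. Assumes GNU date.
ms_to_utc() {
  seconds=$(( $1 / 1000 ))   # drop the millisecond part
  date -u -d "@${seconds}" '+%Y-%m-%d %H:%M:%S UTC'
}

ms_to_utc 1359347351382   # timestamp of images:bodyimage above
# prints 2013-01-28 04:29:11 UTC
```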

hbase(main):017:0>


Cleaning Up

To delete the HBase table you created above, first disable it and then drop it.


hbase(main):005:0> disable 'blog'
0 row(s) in 2.0560 seconds

hbase(main):006:0> drop 'blog'
0 row(s) in 1.0560 seconds

Troubleshooting

If you mistype the column family name in a put (for example 'image' instead of 'images'), you may see an error like this:

ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family image does not exist in region blog,,1359346963541.261ada3f5ada71f241759e6a062dc523. in table {NAME => 'blog', FAMILIES => [{NAME => 'images', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'posts', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
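
The error comes from the part of the column name before the colon. A column is written as family:qualifier, and the family must already exist in the table, while the qualifier can be anything. The sketch below only mimics that rule in plain shell and does not talk to HBase. FAMILIES and family_exists are hypothetical names of mine.

```shell
# Families created in Step 5; qualifiers after the colon are free-form.
FAMILIES="posts images"

# Hypothetical helper: succeed only if the family part of
# "family:qualifier" is one of the table's column families.
family_exists() {
  family="${1%%:*}"   # text before the first colon
  for f in $FAMILIES; do
    if [ "$f" = "$family" ]; then
      return 0
    fi
  done
  return 1
}

family_exists 'images:header' && echo "images:header is fine"
family_exists 'image:header'  || echo "image:header fails: no family 'image'"
```

In the HBase shell itself, running describe 'blog' lists the families that actually exist.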



HBase REST Server

If you want to access HBase through its REST server, follow the steps below.

Start the RegionServer

$ ./hbase-daemon.sh start regionserver
starting regionserver, logging to hbase-0.94.4/bin/../logs/hbase-anil-regionserver-2.local.out
$

Start HBase REST Server

$ ./hbase-daemon.sh start rest -p 50000

NOTE: You can use any free port. I use 50000 for the REST server.

When I go to http://localhost:50000, I see a list of my HBase tables.

When I go to http://localhost:50000/version, I get some version metadata.
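
The same URLs can be fetched from the command line. Here is a minimal sketch using curl, assuming the REST server is on port 50000 as above. HBASE_REST_URL and hbase_rest_get are hypothetical names of mine, and the 'blog' paths assume the table from the tutorial still exists.

```shell
# Base URL of the REST server; 50000 is the port chosen above.
HBASE_REST_URL="${HBASE_REST_URL:-http://localhost:50000}"

# Hypothetical helper: GET a path from the REST server and ask for JSON.
hbase_rest_get() {
  curl -s -H "Accept: application/json" "${HBASE_REST_URL}$1"
}

# Usage, with the REST server running:
#   hbase_rest_get /                 # list of tables
#   hbase_rest_get /version          # version metadata
#   hbase_rest_get /blog/schema      # column families of 'blog'
#   hbase_rest_get /blog/firstpost   # all cells of row 'firstpost'
```

Note that in JSON responses the row keys, column names, and cell values are base64-encoded, so they need decoding before they are readable.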

Stop HBase REST Server

$ ./hbase-daemon.sh stop rest -p 50000
stopping rest..

Stop HBase Master

$ ./hbase-daemon.sh stop master
stopping master.