Sunday, March 30, 2014

Working With Apache Solr

Background

Apache Solr is a search server built on Apache Lucene. You index data into Solr and then run queries against it.

Installing Apache Solr


Download Apache Solr from an Apache mirror. At the time of writing, the latest version was 4.7.0.

Start Apache Solr

You can start Solr with the default embedded Jetty instance by going to the example directory of your Solr installation.


$> java -jar start.jar

If there are no errors, your Solr instance is available at http://localhost:8983/solr/#/

You should see the default Solr Welcome Screen.

Exploring the collections


In the left-hand column, use the drop-down to choose the default collection "collection1".

You should see a screen for the collection.

Click on "query" on the left hand side.

You should see the query screen.

Press the "Execute Query" button.

You should see the query response in JSON format as follows:

{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "indent": "true",
      "q": "*:*",
      "_": "1396205101116",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

The response shows numFound: 0, i.e., we have no data.

This is because we have not yet fed Apache Solr any data to index.

Index some data


Now that Apache Solr is running, let us index some data.

In another command window, go to the solr-install-dir/example/exampledocs directory.


solr-4.7.0/example/exampledocs$ java -jar post.jar .
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
Indexing directory . (16 files, depth=0)
POSTing file books.csv
SimplePostTool: WARNING: Solr returned an error #400 Bad Request
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/update
POSTing file books.json
SimplePostTool: WARNING: Solr returned an error #400 Bad Request
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/update
POSTing file gb18030-example.xml
POSTing file hd.xml
POSTing file ipod_other.xml
POSTing file ipod_video.xml
POSTing file manufacturers.xml
POSTing file mem.xml
POSTing file money.xml
POSTing file monitor.xml
POSTing file monitor2.xml
POSTing file mp500.xml
POSTing file sd500.xml
POSTing file solr.xml
POSTing file utf8-example.xml
POSTing file vidcard.xml
16 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..
Time spent: 0:00:00.413
anil@anil:~/solr/solr-4.7.0/example/exampledocs$
Solr has now indexed the XML files from the example/exampledocs directory. (Note the two 400 errors above: post.jar sends everything with an XML content type, so books.csv and books.json were rejected.)
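
post.jar is handy at the command line, but Java applications typically index through SolrJ, Solr's Java client. Below is a minimal sketch, assuming the solr-solrj 4.7.0 jar and its dependencies are on the classpath; the document id and field values are made-up examples against the default example schema.

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexExample {
    public static void main(String[] args) throws SolrServerException, IOException {
        // The default example core answers at /solr (core name collection1).
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "myfirstdoc");                // "id" is required by the example schema
        doc.addField("name", "My first SolrJ document");

        server.add(doc);    // send the document to Solr
        server.commit();    // make it visible to searches
    }
}

After running it, pressing "Execute Query" in the admin screen again should show numFound increased by one.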

Testing the Indexed Data


Now that Solr has indexed some data, we can send queries.

In the Solr admin screen in the browser, click the "Execute Query" button again. This should return all the available data:
{
  "responseHeader": {
    "status": 0,
    "QTime": 5,
    "params": {
      "indent": "true",
      "q": "*:*",
      "_": "1396205796896",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 32,
    "start": 0,
    "docs": [
      {
        "id": "GB18030TEST",
        "name": "Test with some GB18030 encoded characters",
        "features": [
          "No accents here",
          "这是一个功能",
          "This is a feature (translated)",
          "这份文件是很有光泽",
          "This document is very shiny (translated)"
        ],
        "price": 0,
        "price_c": "0,USD",
        "inStock": true,
        "_version_": 1464027600530702300
      },
      {
        "id": "SP2514N",
        "name": "Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133",
        "manu": "Samsung Electronics Co. Ltd.",
        "manu_id_s": "samsung",
        "cat": [
          "electronics",
          "hard drive"
        ],
        "features": [
          "7200RPM, 8MB cache, IDE Ultra ATA-133",
          "NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor"
        ],
        "price": 92,
        "price_c": "92,USD",
        "popularity": 6,
        "inStock": true,
        "manufacturedate_dt": "2006-02-13T15:26:37Z",
        "store": "35.0752,-97.032",
        "_version_": 1464027600570548200
      },
      {
        "id": "6H500F0",
        "name": "Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300",
        "manu": "Maxtor Corp.",
        "manu_id_s": "maxtor",
        "cat": [
          "electronics",
          "hard drive"
        ],
        "features": [
          "SATA 3.0Gb/s, NCQ",
          "8.5ms seek",
          "16MB cache"
        ],
        "price": 350,
        "price_c": "350,USD",
        "popularity": 6,
        "inStock": true,
        "store": "45.17614,-93.87341",
        "manufacturedate_dt": "2006-02-13T15:26:37Z",
        "_version_": 1464027600579985400
      },
      {
        "id": "F8V7067-APL-KIT",
        "name": "Belkin Mobile Power Cord for iPod w/ Dock",
        "manu": "Belkin",
        "manu_id_s": "belkin",
        "cat": [
          "electronics",
          "connector"
        ],
        "features": [
          "car power adapter, white"
        ],
        "weight": 4,
        "price": 19.95,
        "price_c": "19.95,USD",
        "popularity": 1,
        "inStock": false,
        "store": "45.18014,-93.87741",
        "manufacturedate_dt": "2005-08-01T16:30:25Z",
        "_version_": 1464027600588374000
      },
      {
        "id": "IW-02",
        "name": "iPod & iPod Mini USB 2.0 Cable",
        "manu": "Belkin",
        "manu_id_s": "belkin",
        "cat": [
          "electronics",
          "connector"
        ],
        "features": [
          "car power adapter for iPod, white"
        ],
        "weight": 2,
        "price": 11.5,
        "price_c": "11.50,USD",
        "popularity": 1,
        "inStock": false,
        "store": "37.7752,-122.4232",
        "manufacturedate_dt": "2006-02-14T23:55:59Z",
        "_version_": 1464027600592568300
      },
      {
        "id": "MA147LL/A",
        "name": "Apple 60 GB iPod with Video Playback Black",
        "manu": "Apple Computer Inc.",
        "manu_id_s": "apple",
        "cat": [
          "electronics",
          "music"
        ],
        "features": [
          "iTunes, Podcasts, Audiobooks",
          "Stores up to 15,000 songs, 25,000 photos, or 150 hours of video",
          "2.5-inch, 320x240 color TFT LCD display with LED backlight",
          "Up to 20 hours of battery life",
          "Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video",
          "Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication"
        ],
        "includes": "earbud headphones, USB cable",
        "weight": 5.5,
        "price": 399,
        "price_c": "399.00,USD",
        "popularity": 10,
        "inStock": true,
        "store": "37.7752,-100.0232",
        "manufacturedate_dt": "2005-10-12T08:00:00Z",
        "_version_": 1464027600599908400
      },
      {
        "id": "adata",
        "compName_s": "A-Data Technology",
        "address_s": "46221 Landing Parkway Fremont, CA 94538",
        "_version_": 1464027600616685600
      },
      {
        "id": "apple",
        "compName_s": "Apple",
        "address_s": "1 Infinite Way, Cupertino CA",
        "_version_": 1464027600618782700
      },
      {
        "id": "asus",
        "compName_s": "ASUS Computer",
        "address_s": "800 Corporate Way Fremont, CA 94539",
        "_version_": 1464027600619831300
      },
      {
        "id": "ati",
        "compName_s": "ATI Technologies",
        "address_s": "33 Commerce Valley Drive East Thornhill, ON L3T 7N6 Canada",
        "_version_": 1464027600620880000
      }
    ]
  }
}

Above, we sent a query that matches all documents.

Let us try a more specific query.

In the edit box named "q", enter the word ipod and click the "Execute Query" button. You should see the following JSON response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 8,
    "params": {
      "indent": "true",
      "q": "ipod",
      "_": "1396206251386",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 3,
    "start": 0,
    "docs": [
      {
        "id": "IW-02",
        "name": "iPod & iPod Mini USB 2.0 Cable",
        "manu": "Belkin",
        "manu_id_s": "belkin",
        "cat": [
          "electronics",
          "connector"
        ],
        "features": [
          "car power adapter for iPod, white"
        ],
        "weight": 2,
        "price": 11.5,
        "price_c": "11.50,USD",
        "popularity": 1,
        "inStock": false,
        "store": "37.7752,-122.4232",
        "manufacturedate_dt": "2006-02-14T23:55:59Z",
        "_version_": 1464027600592568300
      },
      {
        "id": "F8V7067-APL-KIT",
        "name": "Belkin Mobile Power Cord for iPod w/ Dock",
        "manu": "Belkin",
        "manu_id_s": "belkin",
        "cat": [
          "electronics",
          "connector"
        ],
        "features": [
          "car power adapter, white"
        ],
        "weight": 4,
        "price": 19.95,
        "price_c": "19.95,USD",
        "popularity": 1,
        "inStock": false,
        "store": "45.18014,-93.87741",
        "manufacturedate_dt": "2005-08-01T16:30:25Z",
        "_version_": 1464027600588374000
      },
      {
        "id": "MA147LL/A",
        "name": "Apple 60 GB iPod with Video Playback Black",
        "manu": "Apple Computer Inc.",
        "manu_id_s": "apple",
        "cat": [
          "electronics",
          "music"
        ],
        "features": [
          "iTunes, Podcasts, Audiobooks",
          "Stores up to 15,000 songs, 25,000 photos, or 150 hours of video",
          "2.5-inch, 320x240 color TFT LCD display with LED backlight",
          "Up to 20 hours of battery life",
          "Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video",
          "Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication"
        ],
        "includes": "earbud headphones, USB cable",
        "weight": 5.5,
        "price": 399,
        "price_c": "399.00,USD",
        "popularity": 10,
        "inStock": true,
        "store": "37.7752,-100.0232",
        "manufacturedate_dt": "2005-10-12T08:00:00Z",
        "_version_": 1464027600599908400
      }
    ]
  }
}

We are now returned only the documents containing the word "ipod".


Tips

1. By default, Solr returns 10 results per query. If you want all matching documents back, append "&rows=100000" (or another suitably high value) to the request.
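
2. The same query can be issued from Java via SolrJ, Solr's Java client. Below is a minimal sketch, assuming the solr-solrj 4.7.0 jar and its dependencies are on the classpath; the setRows call plays the role of the "&rows" parameter from the previous tip.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class QueryExample {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("ipod"); // same q we typed into the admin screen
        query.setRows(100000);                   // override the default of 10 rows

        QueryResponse response = server.query(query);
        System.out.println("numFound: " + response.getResults().getNumFound());
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + " : " + doc.getFieldValue("name"));
        }
    }
}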

Sunday, September 1, 2013

GeoFencing : trend in big data

Geofencing, in my opinion, will be an exciting trend in the world of Big Data, particularly in retail and customer loyalty. From many online articles, I see that retailers are already putting it into place.

A typical usage of geofencing would be: you enter a store, and based on the permission you granted earlier to be tagged, the store determines that you are in the vicinity. The store's app or an SMS then delivers the latest coupons or deals to increase your loyalty.

Android developers have important guidance for geofencing at http://developer.android.com/training/location/geofencing.html

Geofencing involves the use of GPS or some other tracking device, and it does involve specialized software. The important paradigms in play are customer loyalty and context-based discounts.
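
For the curious, here is a sketch of how a store might define such a fence using the Google Play Services location API covered in that Android guide. The request id, coordinates, and radius are made-up illustration values, and the surrounding plumbing (registering the fence with the location client and a PendingIntent to receive transitions) is omitted.

import com.google.android.gms.location.Geofence;

public class StoreGeofence {
    // Builds a circular geofence around a hypothetical store location.
    public static Geofence buildStoreFence() {
        return new Geofence.Builder()
                .setRequestId("store-1234")                     // our own key for this fence
                .setCircularRegion(41.8781, -87.6298, 100.0f)   // lat, lng, radius in meters
                .setExpirationDuration(Geofence.NEVER_EXPIRE)   // keep until explicitly removed
                .setTransitionTypes(Geofence.GEOFENCE_TRANSITION_ENTER
                        | Geofence.GEOFENCE_TRANSITION_EXIT)    // fire on entry and exit
                .build();
    }
}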

Sunday, January 27, 2013

Apache HBase Performance Considerations

As you know, Apache HBase is a column-oriented database in the Hadoop ecosystem. Since a column family can store different types of data, it is very important to understand the various performance options you have, such as compression, at the column-family level.

Please refer to http://jimbojw.com/wiki/index.php?title=Understanding_HBase_column-family_performance_options  for an excellent writeup on the various column family performance options.
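
To make this concrete, here is a hedged Java sketch (HBase 0.94 client API) that creates a table whose two column families get different compression settings. The table name blog2 and the compression choices are illustrative assumptions, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateCompressedTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // One table, two column families with different compression choices.
        HTableDescriptor table = new HTableDescriptor("blog2");

        HColumnDescriptor posts = new HColumnDescriptor("posts");
        posts.setCompressionType(Compression.Algorithm.GZ);     // text compresses well

        HColumnDescriptor images = new HColumnDescriptor("images");
        images.setCompressionType(Compression.Algorithm.NONE);  // JPEGs are already compressed

        table.addFamily(posts);
        table.addFamily(images);
        admin.createTable(table);
        admin.close();
    }
}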

Apache HBase - a simple tutorial

Apache HBase is a Column Database in the Hadoop ecosystem.  You can take a look at Apache HBase from its website at http://hbase.apache.org/


HBase Operations

Step 1: Download HBase

I downloaded hbase-0.94.4, the latest release at the time. You may get a later version.

Step 2: Unzip HBase

$> mkdir hbase
$> gunzip hbase-0.94.4.tar.gz
$> ls
hbase-0.94.4.tar

$> tar xvf hbase-0.94.4.tar

Now you should have a directory called hbase-0.94.4

$> cd hbase-0.94.4

Step 3:  Start HBase Daemon

$> cd bin
$> ./hbase-daemon.sh start master
starting master, logging to  .../hbase-0.94.4/bin/../logs/hbase-anil-master-2.local.out
$

Step 4:  Enter HBase Shell

$> ./hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.4, r1428173, Thu Jan  3 06:29:56 UTC 2013

hbase(main):001:0>


Step 5:  Create an HBase Table 


The table will be called "blog", with a column family called "posts" and another column family called "images".


hbase(main):007:0> create 'blog', 'posts', 'images'
0 row(s) in 1.0610 seconds


Step 6: Populate the HBase Table

hbase(main):009:0> put 'blog','firstpost','posts:title','My HBase Post'
0 row(s) in 0.0220 seconds

hbase(main):010:0> put 'blog','firstpost','posts:author','Anil'
0 row(s) in 0.0050 seconds

hbase(main):011:0> put 'blog','firstpost','posts:location','Chicago'
0 row(s) in 0.0070 seconds

hbase(main):012:0> put 'blog','firstpost','posts:content','HBase is cool'
0 row(s) in 0.0050 seconds

hbase(main):014:0> put 'blog','firstpost','images:header', 'first.jpg'
0 row(s) in 0.0060 seconds

hbase(main):015:0> put 'blog','firstpost','images:bodyimage', 'second.jpg'
0 row(s) in 0.0040 seconds



INFO ON HBASE CELL INSERTION FORMAT
NOTE:  Put a cell 'value' at specified table/row/column and optionally
timestamp coordinates.  To put a cell value into table 't1' at
row 'r1' under column 'c1' marked with the time 'ts1', do:
        hbase> put 't1', 'r1', 'c1', 'value', ts1



Step 7:  Verify the HBase Table Contents

hbase(main):016:0> get 'blog','firstpost'
COLUMN                CELL                                                    
 images:bodyimage     timestamp=1359347351382, value=second.jpg              
 images:header        timestamp=1359347324836, value=first.jpg                
 posts:author         timestamp=1359347197336, value=Anil                    
 posts:content        timestamp=1359347230734, value=HBase is cool            
 posts:location       timestamp=1359347210258, value=Chicago                  
 posts:title          timestamp=1359347161523, value=My HBase Post            
6 row(s) in 0.0350 seconds

hbase(main):017:0>
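
The same put and get operations can be done from Java. Here is a minimal sketch using the 0.94 client API against the blog table we just created; the row key secondpost and its value are made-up examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BlogTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "blog");

        // Write one cell: row 'secondpost', column family 'posts', qualifier 'title'.
        Put put = new Put(Bytes.toBytes("secondpost"));
        put.add(Bytes.toBytes("posts"), Bytes.toBytes("title"),
                Bytes.toBytes("My second HBase post"));
        table.put(put);

        // Read the row back and print the title cell.
        Get get = new Get(Bytes.toBytes("secondpost"));
        Result result = table.get(get);
        byte[] title = result.getValue(Bytes.toBytes("posts"), Bytes.toBytes("title"));
        System.out.println("title = " + Bytes.toString(title));

        table.close();
    }
}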


Cleaning Up

To delete the HBase table you created above, you need to first disable it and then drop it:


hbase(main):005:0> disable 'blog'
0 row(s) in 2.0560 seconds

hbase(main):006:0> drop 'blog'
0 row(s) in 1.0560 seconds

Troubleshooting

If you make a mistake in the column name, you may see an error like this:

ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family image does not exist in region blog,,1359346963541.261ada3f5ada71f241759e6a062dc523. in table {NAME => 'blog', FAMILIES => [{NAME => 'images', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'posts', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}



HBase REST Server

If you are interested in exposing HBase through a REST server, do the following.

Start the RegionServer

$ ./hbase-daemon.sh start regionserver
starting regionserver, logging to hbase-0.94.4/bin/../logs/hbase-anil-regionserver-2.local.out
$

Start HBase REST Server

$ ./hbase-daemon.sh start rest -p 50000

NOTE: You can use any port. I use 50000 for the REST server.

When I go to http://localhost:50000, I see my HBase tables.

When I go to http://localhost:50000/version, I get some version metadata.
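
If you want to consume the REST interface programmatically, a minimal Java sketch using only the JDK looks like this (it assumes the REST server is running on port 50000 as above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HBaseRestVersion {
    public static void main(String[] args) throws Exception {
        // Port 50000 matches the -p flag we passed to hbase-daemon.sh above.
        URL url = new URL("http://localhost:50000/version");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "text/plain");

        // Print whatever the REST server returns for the version resource.
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        conn.disconnect();
    }
}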

Stop HBase REST Server

$ ./hbase-daemon.sh stop rest -p 50000
stopping rest..

Stop HBase Master

$ ./hbase-daemon.sh stop master
stopping master.


Monday, December 3, 2012

Impressions on Cloudera Impala

Today I attended a meetup arranged by the Chicago Big Data group at 6 PM, titled "Cloudera Impala". We were fortunate to have Marcel Kornacker, Lead Architect of Cloudera Impala, as the presenter. Marcel must have been pleasantly surprised to experience 70 degrees Fahrenheit weather in Chicago in December. It was one of those beautiful days, courtesy of "Global Warming".

Speaker Impressions:

My first impressions of Marcel were as follows: he was unlike those speakers who do pre-talk theatrics such as going around the room shaking hands or speaking loudly. He moved quietly closer to the presentation area and had quiet conversations. So I deduced him to be a geeky dude who does not seek conversations in the presentation room, and probably not a marketing/technical-evangelist type who would be shallow on the technical details of the presentation.

The other concern, if he was too geeky, was whether he had a European accent that might be difficult to grasp. Then Marcel started speaking. As they say, do not judge a book by its cover: he drove away the accent worry and gave me the feeling that I would at least be able to follow what he was going to say. He speaks well, and convincingly so, on a topic where he is the subject matter expert.

Jonathan Seidman, organizer of the Chicago Big Data group, introduced Marcel as an ex-Googler who has worked on the F1 database project. I did not know what F1 was at Google, but it sounded important. That was a good introduction to set the stage for Marcel: if he was employed at Google in a core database tech field, he should definitely know things well. As a presenter, Marcel did a good job discussing the objectives, intricacies, target areas, and limitations of Impala. Kudos!

Impala Impressions:

Let me get back to Impala. Marcel said that the code was written in C++. Bummer. As you know, the Hadoop ecosystem is primarily Java (even though there are bits and pieces and tools that are non-Java, such as Hadoop Streaming). I guess Marcel knows C++ well; that is why he chose to write Impala in C++. He mentioned that the interface to Impala for applications will be via ODBC. OK, there is the first roadblock. I write Java code, so if I want to be excited about Impala, I will need to look at some form of JDBC-to-ODBC bridge or wait for Marcel's team to code up client utilities. People tinkering with the Hadoop ecosystem may have the same questions/impressions as me.

While Hive exists for Java programmers to do SQL in the Hadoop ecosystem, Marcel is bringing C++ into the equation. Here is the catch, though. Impala, according to Marcel, performs three times better than Hive in certain situations. Wow, this can be a big thing. But alas, we cannot use Impala via Java interfaces, so we are stuck with Hive if we want SQL-like interfaces into Hadoop (just remember, hives is a bad allergy and not fun :); we are talking about Apache Hive).

I am sure there will be takers for Impala. I am not going to be doing any experimentation with it because I do not intend to a) use C++ or ODBC or b) use CDH4. My experiments are with Apache Hadoop community version and there are enough goodies to get excited about there. :)

Unlike Hive, Impala does not use Map/Reduce underneath. It builds query plans that are fragmented and distributed among the nodes in a cluster, and a component gathers the results of the plan execution.

After the talk, on my way back, I googled Marcel to learn more about him. I hit the following article, which gives very good background on Marcel:
http://www.wired.com/wiredenterprise/2012/10/kornacker-cloudera-google/
Basically, Marcel is a details guy, with a PhD from the University of California, Berkeley, and he is an excellent cook.

Cloudera Impala is in the hands of an excellent Chef.  Good Luck Marcel!

Other people, such as http://java.sys-con.com/node/2461455, are getting excited about Impala too. Mention of "near real time" without the use of Wind River or an RTOS. :)


Friday, November 23, 2012

Market Basket Analysis : Importance

Market Basket Analysis according to Wikipedia is defined as follows:
"The term market basket analysis in the retail business refers to research that provides the retailer with information to understand the purchase behaviour of a buyer. This information will enable the retailer to understand the buyer's needs and rewrite the store's layout accordingly, develop cross-promotional programs, or even capture new buyers (much like the cross-selling concept)." 

It is important to understand the role of Market Basket Analysis in the Big Data arena.

MBA answers questions such as:

  • What products are typically bought together?
  • In what quantities are products bought together?

This probably looks simple to you. 

But recall Amazon generating recommendations such as: if you buy Book A (costing $25) and Book B (costing $15) together, you can get both for $35. As a consumer, you got a discount of $5. Amazon used MBA to determine which books go together based on customer purchases, so they increased their sales while giving up $5 (relative to the books being bought separately). Amazon would not have increased sales if they had arbitrarily clubbed two books together.

Think along the following lines:
a) A person buying a pack of cigarettes is also probably going to buy a book of matches or a lighter.
b) A person buying a stapler is also probably going to buy staples.
c) A person buying a school bag is also probably going to buy books/pens/pencils.

Even though a), b) and c) are common sense, MBA can help retailers figure this out by processing Point of Sale (POS) transactional data, as the toy sketch below illustrates.
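
Here is a toy Java sketch of the co-occurrence counting at the heart of MBA. The baskets are made-up data; a real retailer would run an association-rule algorithm such as Apriori over Map/Reduce instead of a single in-memory loop.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairCounter {
    public static void main(String[] args) {
        // Toy POS transactions; each inner list is one basket.
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("cigarettes", "matches", "gum"),
                Arrays.asList("stapler", "staples"),
                Arrays.asList("cigarettes", "lighter"),
                Arrays.asList("cigarettes", "matches"));

        // Count how often each unordered pair of items is bought together.
        Map<String, Integer> pairCounts = new HashMap<String, Integer>();
        for (List<String> basket : baskets) {
            for (int i = 0; i < basket.size(); i++) {
                for (int j = i + 1; j < basket.size(); j++) {
                    String a = basket.get(i), b = basket.get(j);
                    String pair = a.compareTo(b) < 0 ? a + "+" + b : b + "+" + a;
                    Integer c = pairCounts.get(pair);
                    pairCounts.put(pair, c == null ? 1 : c + 1);
                }
            }
        }
        System.out.println(pairCounts); // e.g. cigarettes+matches=2, ...
    }
}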

How is this relevant to Big Data?

Well, POS transactional data is not small in quantity for a typical retailer. There are hundreds of thousands of transactions to analyze.

Monday, November 19, 2012

Apache Hadoop Security

Apache Hadoop is synonymous with Big Data. The majority of Big Data processing happens via the Hadoop ecosystem. If you have a Big Data project, the chances that you will use elements of the Hadoop ecosystem are very high.

One of the biggest challenges with the Hadoop ecosystem is security. The Map/Reduce processing framework in Hadoop does not really have major security support. HDFS uses a Unix-style file permission model, which may work perfectly well for storing data in a distributed file system.

Kerberos-based authentication is used as the primary security mechanism in the Hadoop Map/Reduce framework, but no real data confidentiality/privacy mechanisms are supported.

Data can be in motion (passing through programs or network elements) or at rest in data stores. The HDFS security controls seem adequate for data at rest, but for data being processed via the Map/Reduce mechanism, it is up to developers to apply encryption mechanisms themselves.
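
For example, a job could encrypt sensitive fields at the application level before they ever leave the mapper. Here is a minimal sketch using only the JDK's javax.crypto; in production the key would come from a key management system, and you would specify the cipher mode and IV explicitly rather than relying on defaults.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class RecordEncryption {
    public static void main(String[] args) throws Exception {
        // In a real deployment the key would come from a key management
        // system, not a fresh KeyGenerator on every run.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        // For brevity this uses the default AES transformation; production
        // code should specify the mode and IV explicitly.
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal("sensitive record".getBytes("UTF-8"));

        // A map task could emit the encrypted bytes instead of the raw value;
        // a downstream consumer holding the key decrypts them.
        cipher.init(Cipher.DECRYPT_MODE, key);
        System.out.println(new String(cipher.doFinal(encrypted), "UTF-8"));
    }
}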

If you need guidance, please do email me at anil  AT  apache  DOT org.  I will be happy to suggest approaches to achieve Big Data Security.
