Friday, November 23, 2012

Market Basket Analysis : Importance

Wikipedia defines Market Basket Analysis (MBA) as follows:
"The term market basket analysis in the retail business refers to research that provides the retailer with information to understand the purchase behaviour of a buyer. This information will enable the retailer to understand the buyer's needs and rewrite the store's layout accordingly, develop cross-promotional programs, or even capture new buyers (much like the cross-selling concept)." 

It is very important to understand the role of Market Basket Analysis in the Big Data arena.

Market Basket Analysis helps answer questions such as:

  • What products are typically bought together?
  • In what quantities are those products bought together?

These questions probably look simple to you.

But recall Amazon generating recommendations such as: if you buy Book A (costing $25) and Book B (costing $15) together, you can get both for $35. As a consumer, you got a discount of $5. But Amazon used MBA to determine which books go together based on customer purchases. They gave up $5 compared with selling the two books separately, but they increased their sales. Amazon would not have increased sales if they had arbitrarily clubbed two books together.

Think along the following lines:
a) A person buying a pack of cigarettes is also probably going to buy a book of matches or a lighter.
b) A person buying a stapler is also probably going to buy staples.
c) A person buying a school bag is also probably going to buy books/pens/pencils.

Even though a), b) and c) are common sense, MBA can help retailers discover such associations by processing Point of Sale (POS) transactional data.
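As a rough illustration of how such associations could be pulled out of POS data, here is a minimal Python sketch that counts how often pairs of items appear in the same basket. The transactions, the function names and the two measures (support and confidence, as commonly used in association analysis) are assumptions made for this example, not a description of what any particular retailer runs.

```python
from collections import Counter
from itertools import combinations


def pair_stats(transactions):
    """Count how often each item and each unordered item pair appears across baskets."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicate items within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


def support(pair_counts, n_transactions, a, b):
    """Fraction of all baskets that contain both a and b."""
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / n_transactions


def confidence(item_counts, pair_counts, a, b):
    """Fraction of baskets containing a that also contain b."""
    if item_counts[a] == 0:
        return 0.0
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / item_counts[a]


# Hypothetical POS transactions, invented for illustration only.
transactions = [
    ["cigarettes", "lighter"],
    ["stapler", "staples", "pens"],
    ["cigarettes", "matches"],
    ["stapler", "staples"],
    ["school bag", "books", "pens"],
]

item_counts, pair_counts = pair_stats(transactions)
print(pair_counts.most_common(3))
print(support(pair_counts, len(transactions), "stapler", "staples"))
print(confidence(item_counts, pair_counts, "cigarettes", "lighter"))
```

In this toy data every basket with a stapler also contains staples, while only half of the cigarette baskets contain a lighter. On real data the same counting would have to run over a far larger number of baskets.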

How is this relevant to Big Data?

Well, POS transactional data is not small in quantity for a typical retailer. There can be hundreds of thousands of transactions to analyze.

Monday, November 19, 2012

Apache Hadoop Security

Apache Hadoop is synonymous with Big Data. The majority of Big Data processing happens via the Hadoop ecosystem. If you have a Big Data project, the chances that you are using elements of the Hadoop ecosystem are very high.

One of the biggest challenges with the Hadoop ecosystem is security. The Map/Reduce processing framework in Hadoop does not really have major security support. HDFS uses a security model based on Unix file system permissions. The HDFS security model may work well for storing data in a distributed file system.
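To make the Unix-style model concrete, here is a simplified Python sketch of an owner/group/other permission check. It illustrates the general idea only and is not HDFS code: the function name, users and groups are hypothetical, and details such as superuser handling are left out.

```python
import stat


def can_access(mode, owner, group, user, user_groups, want):
    """Return True if `user` may perform `want` ('r', 'w' or 'x') under Unix-style mode bits."""
    owner_bit, group_bit, other_bit = {
        "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
        "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
        "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
    }[want]
    if user == owner:
        return bool(mode & owner_bit)   # owner class applies
    if group in user_groups:
        return bool(mode & group_bit)   # group class applies
    return bool(mode & other_bit)       # everyone else


# Hypothetical file: mode rw-r----- owned by alice, group analysts.
mode = 0o640
print(can_access(mode, "alice", "analysts", "alice", ["analysts"], "w"))
print(can_access(mode, "alice", "analysts", "bob", ["analysts"], "r"))
print(can_access(mode, "alice", "analysts", "carol", ["interns"], "r"))
```

A check like this controls who may read or write a stored file. It says nothing about whether the contents are encrypted.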

Kerberos-based authentication is used as the primary security mechanism in the Hadoop Map/Reduce framework. But no real data confidentiality/privacy mechanisms are supported.

Data can be in motion (passing through programs or network elements). Data can be at rest in data stores. The HDFS security controls seem to be adequate for data at rest. But for data that is being processed via the Map/Reduce mechanism, it is up to the developers/programmers to utilize encryption mechanisms.

If you need guidance, please do email me at anil  AT  apache  DOT org.  I will be happy to suggest approaches to achieve Big Data Security.
