Showing posts with label big data. Show all posts

Sunday, September 1, 2013

GeoFencing : trend in big data

GeoFencing, in my opinion, will be an exciting trend in the world of Big Data, particularly in retail and customer loyalty. From the many online articles I have read, retailers are already putting it into place.

A typical usage of geofencing would be: you enter a store and, based on the permission you granted earlier to be tagged, the store determines that you are in the vicinity. The store's app or an SMS then delivers the latest coupons or deals to increase your loyalty.

The Android developer site has important guidance on geofencing at http://developer.android.com/training/location/geofencing.html

Geofencing involves the use of GPS or some other tracking device, along with specialized software. The important paradigms in play are customer loyalty and context-based discounts.
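At its core, the check is simple: is the device within some radius of the store? Here is a minimal, hypothetical sketch in Java. The class name, method names and coordinates are all illustrative; a real Android app would use the Geofencing API from the guidance linked above rather than rolling its own distance math.

```java
// Hypothetical sketch: deciding whether a shopper's coordinates fall inside
// a circular geofence around a store. Not part of any Android API.
class GeofenceCheck {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Great-circle (haversine) distance in meters between two lat/lng points.
    static double distanceMeters(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return EARTH_RADIUS_M * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    // True if the device is within radiusMeters of the fence center.
    static boolean insideFence(double fenceLat, double fenceLng, double radiusMeters,
                               double deviceLat, double deviceLng) {
        return distanceMeters(fenceLat, fenceLng, deviceLat, deviceLng) <= radiusMeters;
    }

    public static void main(String[] args) {
        // Fence centered on a (made-up) store in Chicago with a 100 m radius.
        System.out.println(insideFence(41.8781, -87.6298, 100, 41.8782, -87.6299)); // ~14 m away: true
        System.out.println(insideFence(41.8781, -87.6298, 100, 41.9000, -87.6298)); // ~2.4 km away: false
    }
}
```

When the device crosses into the fence, the app would fire off the coupon delivery described above.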

Monday, December 3, 2012

Impressions on Cloudera Impala

Today I attended a 6 pm meetup of the Chicago Big Data group titled "Cloudera Impala". We were fortunate to have Marcel Kornacker, Lead Architect of Cloudera Impala, as the presenter. Marcel must have been pleasantly surprised to experience 70-degree Fahrenheit weather in Chicago in December. It was one of those beautiful days, courtesy of "Global Warming".

Speaker Impressions:-

My first impressions of Marcel were as follows: he was unlike those speakers who go in for pre-talk theatrics such as walking around the room shaking hands or speaking loudly. He moved quietly (closer to the presentation area) or had silent conversations. So I deduced him to be a geeky dude who does not seek conversations in the presentation room, and probably not a marketing/technical-evangelist type who would be shallow on the technical details of the presentation.

The other concern was whether, being so geeky, he would have a European accent that might be difficult to grasp. Then Marcel started speaking. As they say, do not judge a book by its cover: he put the accent worry to rest and gave me the feeling that I would at least be able to follow what he was going to say. He speaks well, and convincingly so, on a topic where he is the subject matter expert.

Jonathan Seidman, organizer of the Chicago Big Data group, introduced Marcel as an ex-Googler who has worked on the F1 database project. I did not know what F1 was at Google, but it sounded important. That was a good introduction to set the stage for Marcel: if he was employed at Google in a core database field, he should definitely know things well. As a presenter, Marcel did a good job discussing the objectives, intricacies, target areas and limitations of Impala. Kudos!

Impala Impressions:-

Let me get back to Impala. Marcel said that the code is written in C++. Bummer. As you know, the Hadoop ecosystem is primarily Java (even though there are bits and pieces and tools that are non-Java, such as Hadoop Streaming). I guess Marcel knows C++ well; that is why he chose to write Impala in C++. He mentioned that the interface to Impala for applications will be via ODBC. OK, there is the first roadblock. I write Java code. If I want to be excited about Impala, I will need to look at some form of JDBC-to-ODBC bridge or wait for Marcel's team to code up some client utilities. People tinkering with the Hadoop ecosystem may have the same questions/impressions as me.

While Hive exists for Java programmers to do SQL on the Hadoop ecosystem, Marcel is bringing C++ into the equation. Here is the catch though: Impala, according to Marcel, performs 3 times better than Hive in certain situations. Wow, this can be a big thing. But alas, we cannot use Impala via Java interfaces. So we are stuck with Hive if we want SQL-like interfaces into Hadoop (just remember, hives is a bad allergy and not fun :). We are talking about Apache Hive).

I am sure there will be takers for Impala. I am not going to be doing any experimentation with it because I do not intend to a) use C++ or ODBC or b) use CDH4. My experiments are with Apache Hadoop community version and there are enough goodies to get excited about there. :)

Unlike Hive, Impala does not use Map Reduce underneath. It has Query Plans that get fragmented and distributed among the nodes in a cluster. There is a component that gathers the results of the plan execution. 
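To make the scatter-gather idea concrete, here is a loose toy illustration in Java. This is emphatically not Impala code; it just shows the general pattern Marcel described: a query is split into fragments, each fragment runs over its own slice of the data (in parallel, like nodes in a cluster), and a coordinator merges the partial results. All names are made up.

```java
import java.util.stream.IntStream;

// Toy scatter-gather: a "query" (here just a sum) is split into plan
// fragments; each fragment processes a slice of the data; a coordinator
// gathers and merges the partial results. Illustration only, not Impala.
class ScatterGather {
    // "Executes" one plan fragment: sum a slice of the data.
    static long runFragment(int[] data, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) sum += data[i];
        return sum;
    }

    // Coordinator: scatter the work across fragments, gather partial sums.
    static long scatterGatherSum(int[] data, int fragments) {
        int chunk = (data.length + fragments - 1) / fragments;
        return IntStream.range(0, fragments)
                .parallel() // fragments run concurrently, like cluster nodes
                .mapToLong(f -> runFragment(data,
                        Math.min(f * chunk, data.length),
                        Math.min((f + 1) * chunk, data.length)))
                .sum();
    }

    public static void main(String[] args) {
        int[] data = IntStream.rangeClosed(1, 1000).toArray();
        System.out.println(scatterGatherSum(data, 4)); // prints 500500
    }
}
```

The appeal over Map/Reduce is that nothing here spins up heavyweight jobs or writes intermediate results to disk; the fragments and the merge happen directly.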

After the talk, on my way back, I googled Marcel to learn more about him. I hit upon the following article, which gives a very good background on Marcel:
http://www.wired.com/wiredenterprise/2012/10/kornacker-cloudera-google/
Basically, Marcel is a details guy with a PhD from the University of California, Berkeley, and he is an excellent cook.

Cloudera Impala is in the hands of an excellent Chef.  Good Luck Marcel!

Other people, such as the author of http://java.sys-con.com/node/2461455, are getting excited about Impala. Mention of "near real time" without the use of Wind River or an RTOS. :)


Friday, November 23, 2012

Market Basket Analysis : Importance

Market Basket Analysis according to Wikipedia is defined as follows:
"The term market basket analysis in the retail business refers to research that provides the retailer with information to understand the purchase behaviour of a buyer. This information will enable the retailer to understand the buyer's needs and rewrite the store's layout accordingly, develop cross-promotional programs, or even capture new buyers (much like the cross-selling concept)." 

It is very important to understand the role of Market Basket Analysis (MBA) in the Big Data arena.

MBA answers questions such as:

  • What products are typically bought together?
  • In what quantities are products bought together?

This probably looks simple to you. 

But recall Amazon generating recommendations such as: if you buy Book A (costing $25) and Book B (costing $15) together, you can get both for $35. As a consumer, you got a discount of $5. But Amazon used MBA to determine which books go together based on customer purchases, so they just increased their sales while giving up $5 (relative to the books being bought separately). Amazon would not have increased sales if they had arbitrarily clubbed two books together.

Think along the following lines:
a) A person buying a pack of cigarettes is also probably going to buy a book of matches or a lighter.
b) A person buying a stapler is also probably going to buy staples.
c) A person buying a school bag is also probably going to buy books/pens/pencils.

Even though a), b) and c) are common sense, MBA can help retailers figure this out by processing Point of Sale (POS) transaction data.

How is this relevant to Big Data?

Well, POS transaction data is not small in quantity for a typical retailer. There are hundreds of thousands of transactions to analyze.
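A first cut at answering "what is bought together" from POS data can be sketched as a pair co-occurrence count over baskets. This is a hypothetical toy in Java, not production MBA code; real systems use algorithms like Apriori or FP-Growth to avoid enumerating every pair at scale.

```java
import java.util.*;

// Toy market basket analysis: count how often each pair of products
// appears in the same basket. Names and data are illustrative.
class BasketPairs {
    // Returns a map from "itemA|itemB" (alphabetical order) to co-occurrence count.
    static Map<String, Integer> pairCounts(List<Set<String>> baskets) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> basket : baskets) {
            List<String> items = new ArrayList<>(basket);
            Collections.sort(items); // canonical order so A|B equals B|A
            for (int i = 0; i < items.size(); i++)
                for (int j = i + 1; j < items.size(); j++)
                    counts.merge(items.get(i) + "|" + items.get(j), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = List.of(
                Set.of("cigarettes", "matches"),
                Set.of("cigarettes", "lighter"),
                Set.of("cigarettes", "matches", "gum"));
        // "cigarettes|matches" co-occurs in 2 of the 3 baskets.
        System.out.println(pairCounts(baskets).get("cigarettes|matches")); // prints 2
    }
}
```

Divide a pair's count by the number of baskets and you get its support, which is exactly the kind of signal behind "customers who bought A also bought B" bundles.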

Monday, November 19, 2012

Apache Hadoop Security

Apache Hadoop is synonymous with Big Data. The majority of Big Data processing happens via the Hadoop ecosystem. If you have a Big Data project, the chances that you are using elements of the Hadoop ecosystem are very high.

One of the biggest challenges with the Hadoop ecosystem is security. The Map/Reduce processing framework in Hadoop does not really have major security support. HDFS uses the Unix file system security model, which may work perfectly well for storing data in a distributed file system.

Kerberos-based authentication is used as the primary security mechanism in the Hadoop Map/Reduce framework. But there are no real data confidentiality/privacy mechanisms supported.

Data can be in motion (passing through programs or network elements) or at rest in data stores. The HDFS security controls seem adequate for data at rest. But for data being processed via the Map/Reduce mechanism, it is up to the developers/programmers to utilize encryption mechanisms.
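Since the framework gives you nothing here, one approach is for your own mapper code to encrypt sensitive values before emitting them. A minimal, hypothetical sketch using the standard JCA APIs follows. The default "AES" transformation resolves to ECB mode, which is used here only to keep the sketch short; a real job should use an authenticated mode such as AES/GCM and proper key management, and the class and field names are made up.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: application-level encryption of a sensitive record,
// the kind of thing a mapper would have to do itself since Map/Reduce
// provides no built-in confidentiality. ECB mode for brevity only.
class RecordCrypto {
    static SecretKey newKey() {
        try {
            return KeyGenerator.getInstance("AES").generateKey();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static byte[] encrypt(SecretKey key, byte[] plaintext) {
        try {
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.ENCRYPT_MODE, key);
            return c.doFinal(plaintext);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static byte[] decrypt(SecretKey key, byte[] ciphertext) {
        try {
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.DECRYPT_MODE, key);
            return c.doFinal(ciphertext);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        byte[] record = "ssn=123-45-6789".getBytes(StandardCharsets.UTF_8);
        byte[] sealed = encrypt(key, record); // what the mapper would emit
        System.out.println(Arrays.equals(record, decrypt(key, sealed))); // round-trips: true
    }
}
```

The hard part in practice is not the cipher call but distributing the key securely to every task in the cluster.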

If you need guidance, please do email me at anil  AT  apache  DOT org.  I will be happy to suggest approaches to achieve Big Data Security.
