Friday, March 30, 2012

Pig: Deriving meaningful data from SSO logs

If you use the CAS SSO solution, you can generate login logs.  I have a text file called "CASLoginLog.txt", which is basically a snippet of the login trail for 3 days in March. I have changed the usernames and many other details, so your file may look a bit different. :)

Step: Generate a CAS SSO Login Trail


===================================
Date    Action  Username        Service         Ticket
28.3.2012 2:28:01       SERVICE_TICKET_CREATED  user1   https://myurl   ST-13133--org-sso
28.3.2012 2:27:30       SERVICE_TICKET_CREATED  user2   https://myurl/url ST-13046--j-sso
28.3.2012 2:27:17       TICKET_GRANTING_TICKET_DESTROYED                        TGT-3380--j-sso
28.3.2012 2:27:17       SERVICE_TICKET_CREATED  user3   https://c/thread/197282?tstart=0        ST-13045-j-sso
28.3.2012 2:27:16       TICKET_GRANTING_TICKET_CREATED  firstlion               TGT-3567--j-sso
28.3.2012 2:26:30       SERVICE_TICKET_CREATED  user4   https://issues.j.org/secure/D.jspa      ST-13044--j-sso
27.3.2012 23:12:37      SERVICE_TICKET_CREATED  user2   https://c/thread/151832?start=15&tstart=0       ST-13048--j-sso
27.3.2012 22:51:51      SERVICE_TICKET_CREATED  user5   https://c/login.jspa    ST-13038--j-sso
27.3.2012 22:51:50      TICKET_GRANTING_TICKET_CREATED  user5           TGT-3527--j-sso
27.3.2012 22:51:49      TICKET_GRANTING_TICKET_CREATED  user5           TGT-3526--j-sso
26.3.2012 14:17:27      SERVICE_TICKET_CREATED  user1   https://c/message/725882?tstart=0       ST-11709--j-sso
26.3.2012 13:02:51      TICKET_GRANTING_TICKET_CREATED  user1           TGT-3223--j-sso
=======================================
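A note on the format: the columns above are padded with extra whitespace just to make the snippet readable. The script below assumes the fields in the actual file are separated by a single space (so the date and time end up as two separate fields). If your CAS log happens to be tab-delimited instead, the load would look more like this sketch, with the timestamp as one field:

===================================
file = LOAD 'CASLoginLog.txt' USING PigStorage('\t') AS (timestamp: chararray, action: chararray, username: chararray, service: chararray, ticket: chararray);
===================================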

So let us try to figure out how many times, over these 3 days, each user triggered the "SERVICE_TICKET_CREATED" action.

I am going to use Apache Pig to generate the output.


Step: Code a Pig Script



My Pig script is called CASLog.pig.

=====================================
-- Load the log file; fields are assumed to be separated by a single space
file = LOAD 'CASLoginLog.txt' USING PigStorage(' ') AS (ticketDate: chararray, ticketTime: chararray, action: chararray, username: chararray, service: chararray, ticket: chararray);

-- Trim stray whitespace from the columns we care about
trimmedfile = FOREACH file GENERATE TRIM(ticketDate) AS ticketDate, TRIM(action) AS action, TRIM(username) AS username, TRIM(ticket) AS ticket;

-- Keep only the service ticket creation events
selectedrows = FILTER trimmedfile BY action == 'SERVICE_TICKET_CREATED';

-- Group those events by username and count them per user
usersgroup = GROUP selectedrows BY username;
counts = FOREACH usersgroup GENERATE group AS username, COUNT(selectedrows) AS num_tickets;

-- Write the result out as username=count pairs into the 'result' directory
STORE counts INTO 'result' USING PigStorage('=');
==========================================
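When a script like this is not working yet, one way to debug it is to inspect the intermediate relations interactively. A minimal sketch of that kind of check, assuming the statements above have been pasted into the grunt shell (started with "pig -x local"); DESCRIBE and DUMP are standard Pig operators:

=====================================
grunt> DESCRIBE counts;
grunt> DUMP selectedrows;
grunt> DUMP counts;
=====================================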




Step:  Execute Apache Pig



Now let me run Pig on this.

===========================================
$ sh ../pig-0.9.2/bin/pig -x local CASLog.pig

....
Input(s):
Successfully read records from: "file:///hadoop/pig/anilpig/CASLoginLog.txt"

Output(s):
Successfully stored records in: "file:///hadoop/pig/anilpig/result"

Job DAG:
job_local_0001


2012-03-30 16:56:09,762 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
============================================

Pig does the MapReduce magic under the covers and stores the end result in a directory called "result", as directed by the STORE statement at the end of the Pig script.
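If you just want to see the output on the console instead of writing it to files, you could replace the STORE statement with a DUMP, which is another standard Pig operator:

===========================================
-- Print the per-user counts to the console instead of storing them
DUMP counts;
===========================================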

Step: View the results


========================
$ vi result/part-r-00000

user1=2
user2=2
user3=1
user4=1
user5=1
========================
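These counts agree with a quick manual tally of the sample: user1 and user2 each have two SERVICE_TICKET_CREATED entries, and user3, user4 and user5 have one each. If the list of users were long, the counts could also be sorted before storing; a minimal sketch using the standard ORDER operator (num_tickets is the alias from the script above, and 'result_sorted' is just a hypothetical output directory):

========================
-- Sort the per-user counts, busiest users first
sorted = ORDER counts BY num_tickets DESC;
STORE sorted INTO 'result_sorted' USING PigStorage('=');
========================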


It took me a couple of hours of trial and error to get the script correct and working.  But I had to write zero lines of Apache Hadoop MapReduce Java code.