Sunday, July 22, 2012

R: Introduction to statistics and RStudio

This post describes some common operations of the R statistical package.

It assumes you have already installed R on your Linux machine (see the Fedora installation notes later in this post).

Step 1: Create a work directory.

 $ mkdir R_Work

$ cd R_Work/


Step 2: Let us invoke "R".


$ R

R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> q()


To quit R, type q() at the prompt.


Step 3: Install RStudio for your operating system

http://www.rstudio.org/

Once installed, launch it:

$ rstudio


Step 4: Install rJava

It is very important that you get the following command right. Run it as root:

$ sudo R CMD javareconf JAVA=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/bin/java JAVA_HOME=/usr/java/latest JAVAC=/usr/java/latest/bin/javac JAR=/usr/java/latest/bin/jar JAVAH=/usr/java/latest/bin/javah

Here we are telling R where the OpenJDK java executable is, and also where the javac, javah and jar executables live.


Now we are ready to install rJava
# R
> install.packages("rJava")
Installing package(s) into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
Loading Tcl/Tk interface ... done
.....
* DONE (rJava)
Making packages.html  ... done

The downloaded source packages are in
    ‘/tmp/RtmpwsCz85/downloaded_packages’
Updating HTML index of packages in '.Library'
Making packages.html  ... done
>

This was very painful for me; it took me about 20 minutes to get right. The error I kept getting was:

Make sure R is configured with full Java support (including JDK). Run
R CMD javareconf
as root to add Java support to R.

Reference

http://cran.r-project.org/doc/manuals/R-intro.html

http://svn.r-project.org/R/trunk/src/scripts/javareconf


Installing the R statistical package on Fedora

If you are on Fedora (or most other flavors of Linux), chances are that a pre-built R package is available.

On Fedora 16, I did
$ sudo yum info R
Installed Packages
Name        : R
Arch        : x86_64
Version     : 2.15.1
Release     : 1.fc16
Size        : 0.0 
Repo        : installed
From repo   : updates
Summary     : A language for data analysis and graphics
URL         : http://www.r-project.org
License     : GPLv2+
Description : This is a metapackage that provides both core R userspace and
            : all R development components.
            :
            : R is a language and environment for statistical computing and
            : graphics. R is similar to the award-winning S system, which was
            : developed at Bell Laboratories by John Chambers et al. It provides
            : a wide variety of statistical and graphical techniques (linear and
            : nonlinear modelling, statistical tests, time series analysis,
            : classification, clustering, ...).
            :
            : R is designed as a true computer language with control-flow
            : constructions for iteration and alternation, and it allows users
            : to add additional functionality by defining new functions. For
            : computationally intensive tasks, C, C++ and Fortran code can be
            : linked and called at run time.

To install R on Fedora:
$ sudo yum install R

Now that you have installed R,  go read up on it at http://cran.r-project.org/doc/manuals/R-intro.html

Embedding Pig Scripts in Java Applications

You are familiar with running Pig scripts via the command line. But what if you want to run Apache Pig as part of your Java applications?

Pig Modes


Pig runs in two modes, local and mapreduce, as described at http://pig.apache.org/docs/r0.7.0/setup.html#Run+Modes


PigServer

This is the main class for embedding Apache Pig in your Java applications.


import java.io.InputStream;
import java.util.List;
import java.util.Map;

import org.apache.pig.PigServer;

=================
PigServer pigServer = null;
String mode = "local"; // or "mapreduce"
String pigScriptName = null; // set to your script's name on the classpath

Map<String,String> params = null;
List<String> paramFiles = null;

try {
    pigServer = new PigServer(mode);
    pigServer.setBatchOn();   // defer execution until executeBatch()
    pigServer.debugOn();
    InputStream is = getClass().getClassLoader().getResourceAsStream(pigScriptName);
    if (params != null) {
        pigServer.registerScript(is, params);
    } else if (paramFiles != null) {
        pigServer.registerScript(is, paramFiles);
    } else {
        pigServer.registerScript(is);
    }
    pigServer.executeBatch();
} catch (Exception e) {
    throw new RuntimeException(e);
} finally {
    if (pigServer != null) {
        pigServer.shutdown();
    }
}
==========================
Note: the variable mode can be "local" or "mapreduce".
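
If you prefer compile-time safety over a raw string, PigServer also accepts the ExecType enum. A minimal sketch (note the MAPREDUCE variant expects a reachable Hadoop cluster):

=================
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigModes {
    public static void main(String[] args) throws Exception {
        // Equivalent to new PigServer("local").
        PigServer local = new PigServer(ExecType.LOCAL);
        local.shutdown();

        // Equivalent to new PigServer("mapreduce"); needs a configured cluster.
        PigServer cluster = new PigServer(ExecType.MAPREDUCE);
        cluster.shutdown();
    }
}
=================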

PigServer can take two additional arguments while registering your Pig script.
  • Params: a key/value map whose entries can be referenced in your Pig script as $key.
  • ParamFiles: a list of names of files that contain the parameters.
You can also register the script with the PigServer without providing any params.
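
For example, here is a minimal sketch of the params variant. The script name count.pig and the $INPUT/$OUTPUT parameter names are hypothetical; substitute whatever your script actually declares.

=================
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.PigServer;

public class PigParamsExample {
    public static void main(String[] args) throws Exception {
        PigServer pigServer = new PigServer("local");
        pigServer.setBatchOn();

        // Referenced in the script as $INPUT and $OUTPUT,
        // e.g. LOAD '$INPUT' ... and STORE counts INTO '$OUTPUT';
        Map<String, String> params = new HashMap<String, String>();
        params.put("INPUT", "input/logins.txt");
        params.put("OUTPUT", "output/counts");

        // count.pig is a hypothetical script on the classpath.
        InputStream is = PigParamsExample.class.getClassLoader()
                .getResourceAsStream("count.pig");
        pigServer.registerScript(is, params);
        pigServer.executeBatch();
        pigServer.shutdown();
    }
}
=================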

Do not forget to bookmark this blog. :)

All the best!

Reference: http://everythingbigdata.blogspot.com/2012/07/apache-pig-tips.html

Saturday, July 14, 2012

Hadoop with Drools, Infinispan, PicketLink etc

Here are the slides that I used at JUDCON 2012 in Boston.
http://www.jboss.org/dms/judcon/2012boston/presentations/judcon2012boston_day1track3session2.pdf

In your Map/Reduce programs you can use any Java library of your choice. For example:

  • Infinispan data grids, to send cache events from your Map/Reduce programs.
  • The Drools rules engine, to apply rules in your M/R programs (see the sketch after this list).
  • PicketLink, to bring security aspects into your M/R programs.
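
As an illustration of the Drools case, here is a hedged sketch of a Mapper that builds a knowledge base in setup() and fires rules per record, using the standard Drools 5.x API. The rules.drl resource name is hypothetical, and real code would insert parsed domain objects rather than the raw line.

=================
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.io.ResourceFactory;
import org.drools.runtime.StatefulKnowledgeSession;

public class RulesMapper extends Mapper<LongWritable, Text, Text, Text> {

    private KnowledgeBase kbase;

    @Override
    protected void setup(Context context) {
        // Compile the (hypothetical) rules.drl from the classpath once per mapper.
        KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
        kbuilder.add(ResourceFactory.newClassPathResource("rules.drl"), ResourceType.DRL);
        kbase = KnowledgeBaseFactory.newKnowledgeBase();
        kbase.addKnowledgePackages(kbuilder.getKnowledgePackages());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StatefulKnowledgeSession ksession = kbase.newStatefulKnowledgeSession();
        try {
            // Insert the record as a fact and let the rules act on it.
            ksession.insert(value.toString());
            ksession.fireAllRules();
        } finally {
            ksession.dispose();
        }
        // Pass the record through unchanged; the rules fire for their side effects.
        context.write(new Text(key.toString()), value);
    }
}
=================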

Thursday, July 12, 2012

Apache Pig Tips

1) A STORE operation fails if its output directory already exists, so remove the directory first:

rmf output/somedir
STORE xyz INTO 'output/somedir' USING PigStorage();
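
If you are driving Pig (or plain MapReduce) from Java rather than the Grunt shell, the same cleanup can be done with the Hadoop FileSystem API. A minimal sketch, reusing the output/somedir path from above:

=================
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Recursive delete; like rmf, it does not fail if the path is
        // absent (it simply returns false).
        fs.delete(new Path("output/somedir"), true);
    }
}
=================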

2) How do I use Pig via Java?
Look at the PigServer usage, covered in detail in the July 22 post above.
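
A minimal sketch of that usage (CASLog.pig stands in for your script file):

=================
import org.apache.pig.PigServer;

public class RunPigScript {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer("local"); // or "mapreduce"
        pig.setBatchOn();
        pig.registerScript("CASLog.pig");
        pig.executeBatch(); // runs the script's STORE statements
        pig.shutdown();
    }
}
=================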

Friday, May 25, 2012

Apache Hadoop Map Reduce - Advanced Example


Map Reduce Example

The latest, properly updated version of this work is at http://blog.hampisoftware.com/?p=20

Solve the same problem using Apache Spark: http://blog.hampisoftware.com/?p=41

Use the links above; everything below is dated material.

DATED MATERIAL


Continuing with my experiments, I attempted the patent citation example from the book "Hadoop in Action" by Chuck Lam.


Data Set

Visit http://nber.org/patents/
Choose acite75_99.zip; unzipping it yields cite75_99.txt.

Each line of this file records one patent citing another. In this map reduce example, we are going to invert that relationship.

For a given patent, we want to know which patents cite it and how many such citations exist.
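
To make the inversion concrete: the file is comma separated with a header row, one citing/cited pair per line. For illustration (the patent numbers here are invented):

==================
"CITING","CITED"
3858241,956203
3858242,956203
==================

After the inversion, the desired output keys each cited patent to the list of patents citing it:

==================
956203  3858241,3858242
==================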

What do you need?

Download Hadoop v1.0.3 from http://hadoop.apache.org
I downloaded hadoop-1.0.3.tar.gz (60MB)

Map Reduce Program

Note: the first program below is busted. Commentary following the program explains why, and a working program is given at the end.

==================
package mapreduce;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PatentCitation extends Configured implements Tool{

    public static class PatentCitationMapper extends Mapper<Text,Text,Text,Text> {
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, key);
        }
    }
   
    public static class PatentCitationReducer extends Reducer<Text,Text,Text,Text>{
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String csv = "";
            Iterator<Text> iterator = values.iterator();
            while(iterator.hasNext()){
                if(csv.length() > 0 ) csv += ",";
                csv += iterator.next().toString();
            }
            context.write(key, new Text(csv));
        }
    }
   
    private  void deleteFilesInDirectory(File f) throws IOException {
        if (f.isDirectory()) {
            for (File c : f.listFiles())
                deleteFilesInDirectory(c);
        }
        if (!f.delete())
            throw new FileNotFoundException("Failed to delete file: " + f);
    }
   
    @Override
    public int run(String[] args) throws Exception {
        if(args.length < 2)
            throw new IllegalArgumentException("Please provide input and output paths");
       
        Path inputPath = new Path(args[0]);
        File outputDir = new File(args[1]);
        deleteFilesInDirectory(outputDir);
        Path outputPath = new Path(args[1]);

        Job job = new Job(getConf(), "Hadoop Patent Citation Example");
        job.setJarByClass(PatentCitation.class);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
       
        job.setMapperClass(PatentCitationMapper.class);
        job.setReducerClass(PatentCitationReducer.class);
       
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        return job.waitForCompletion(false) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new PatentCitation(), args));
    }
}


I created a directory called "input" and placed cite75_99.txt in it, plus an empty directory called "output". These form the input and output directories for the M/R program.

Execution

First Attempt:

I executed the program as is. It choked: the temporary files filled /tmp and exhausted the root filesystem's disk space.

Second Attempt:

Now I explicitly configure the Hadoop tmp directory so I can place the temporary files wherever I want:
-Dhadoop.tmp.dir=/home/anil/judcon12/tmp

Ok, now the program did not choke. It just ran and ran for 2+ hours, so I killed it. The data set seems too large to finish in a short duration, and the culprit was the reduce phase: it simply never finished.

Third Attempt:

Now I tried to configure the reducers to zero so I could view the output of the map phase.

I tried the property -Dmapred.reduce.tasks=0. It made no difference (most likely because a -D flag passed as a plain JVM system property never reaches the Hadoop Configuration; it is only honored when ToolRunner's GenericOptionsParser sees it among the program arguments).

I then set the reducer count directly on the Job. That worked.

        job.setNumReduceTasks(0);

Ok, now the program just undertook the map phase.

=======================
May 25, 2012 4:40:42 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
May 25, 2012 4:40:42 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
May 25, 2012 4:40:42 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
May 25, 2012 4:40:42 PM org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
WARNING: Snappy native library not loaded
May 25, 2012 4:40:42 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
INFO: setsid exited with exit code 0
May 25, 2012 4:40:42 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@420f9c40
May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000000_0 is allowed to commit now
May 25, 2012 4:40:44 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000000_0' to /home/anil/judcon12/output
May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@527736bd
May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000001_0 is allowed to commit now
May 25, 2012 4:40:46 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000001_0' to /home/anil/judcon12/output
May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000001_0' done.
May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5749b290
May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting
May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000002_0 is allowed to commit now
May 25, 2012 4:40:49 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000002_0' to /home/anil/judcon12/output
May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000002_0' done.
May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2a8ceeea
May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000003_0 is allowed to commit now
May 25, 2012 4:40:52 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000003_0' to /home/anil/judcon12/output
May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000003_0' done.
May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@46238a47
May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000004_0 is done. And is in the process of commiting
May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000004_0 is allowed to commit now
May 25, 2012 4:40:55 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000004_0' to /home/anil/judcon12/output
May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000004_0' done.
May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@559113f8
May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000005_0 is done. And is in the process of commiting
May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000005_0 is allowed to commit now
May 25, 2012 4:40:58 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000005_0' to /home/anil/judcon12/output
May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000005_0' done.
May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@76a9b9c
May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000006_0 is done. And is in the process of commiting
May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000006_0 is allowed to commit now
May 25, 2012 4:41:01 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000006_0' to /home/anil/judcon12/output
May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000006_0' done.
May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@560c3014
May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000007_0 is done. And is in the process of commiting
May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000007_0 is allowed to commit now
May 25, 2012 4:41:04 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000007_0' to /home/anil/judcon12/output
May 25, 2012 4:41:06 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:06 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000007_0' done.
=======================

So the map phase takes between 1 and 2 minutes. It is the reduce phase that never ends for me. I will evaluate why that is the case. :)

Let us see what the output of the map phase looks like:

=============================
anil@sadbhav:~/judcon12/output$ ls
part-m-00000  part-m-00002  part-m-00004  part-m-00006  _SUCCESS
part-m-00001  part-m-00003  part-m-00005  part-m-00007
==============================

You can see the end result of the map phase in these files.

Now on to figuring out the reduce side: how many reduce tasks this needs, and why the phase never ends.

Let me look at the input size:

===================
 anil@sadbhav:~/judcon12/input$ wc -l cite75_99.txt
16522439 cite75_99.txt
===================

OMG! That is roughly 16.5 million entries. Too much for a laptop to handle.


Let us try another experiment: take 20000 patent citations from this file (for example, the first 20000 lines via head) and save them to 20000cite.txt.

When I run the program now, the M/R execution takes all of 10 seconds to finish.

Let us view the results of the M/R execution.

=============================
anil@sadbhav:~/judcon12/output$ ls
part-r-00000  _SUCCESS
============================

When I look inside part-r-00000, I see one long line of CSV patent citations. Yeah, you are right: the job is busted. The real culprit is likely the input handling rather than the reducer logic. KeyValueTextInputFormat splits each input line on a tab character, and cite75_99.txt contains no tabs, so the entire comma-separated line arrives as the key with an empty value. The mapper then swaps key and value, every record lands under the same empty key, and the single reduce call concatenates the whole input into one giant CSV line. That would also explain why the reduce phase never finished on the full data set. Fixing this is the next step. This exercise was a failure, but there was a lesson here: if you mess up, you will wait. :)




Ok, here is the updated map reduce program that works. The mapper now splits the incoming line (which arrives as the key) on commas and emits (cited, citing) pairs:
======================
package mapreduce;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PatentCitation extends Configured implements Tool{

    public static class PatentCitationMapper extends Mapper<Text,Text,Text,Text> {
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {

            String[] citation = key.toString().split(",");
            context.write(new Text(citation[1]), new Text(citation[0]));
        }
    }

    public static class PatentCitationReducer extends Reducer<Text,Text,Text,Text>{
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String csv = "";
            for(Text val:values){
                if(csv.length() > 0 ) csv += ",";
                csv += val.toString();
            }
            context.write(key, new Text(csv));
        }
    }

    private  void deleteFilesInDirectory(File f) throws IOException {
        if (f.isDirectory()) {
            for (File c : f.listFiles())
                deleteFilesInDirectory(c);
        }
        if (!f.delete())
            throw new FileNotFoundException("Failed to delete file: " + f);
    }

    @Override
    public int run(String[] args) throws Exception {
        if(args.length < 2)
            throw new IllegalArgumentException("Please provide input and output paths");

        Path inputPath = new Path(args[0]);
        File outputDir = new File(args[1]);
        deleteFilesInDirectory(outputDir);
        Path outputPath = new Path(args[1]);

        Job job = new Job(getConf(), "Hadoop Patent Citation Example");
        job.setJarByClass(PatentCitation.class);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(PatentCitationMapper.class);
        job.setReducerClass(PatentCitationReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        //job.setNumReduceTasks(10000);

        return job.waitForCompletion(false) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new PatentCitation(), args));
    }
}
=======================================


Running the updated program

================================
May 25, 2012 6:14:26 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
May 25, 2012 6:14:26 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
May 25, 2012 6:14:26 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
May 25, 2012 6:14:26 PM org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
WARNING: Snappy native library not loaded
May 25, 2012 6:14:27 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
INFO: setsid exited with exit code 0
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@420f9c40
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3a56860b
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 359270 bytes
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_r_000000_0 is allowed to commit now
May 25, 2012 6:14:30 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_r_000000_0' to /home/anil/judcon12/output
May 25, 2012 6:14:33 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
May 25, 2012 6:14:33 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_r_000000_0' done.
=================================


Let us look at the output:
===============
anil@sadbhav:~/judcon12/output$ ls
part-r-00000  _SUCCESS
================

If you look inside the part-r-00000 file, you will see the results:

======================
 "CITED" "CITING"
1000715 3861270
1001069 3858600
1001170 3861317
1001597 3861811
1004288 3861154
1006393 3861066
1006952 3860293
.......
1429311 3861187
1429835 3860154
1429968 3860060
1430491 3859976
1431444 3861601
1431718 3859022
1432243 3861774
1432467 3862478
1433649 3861223,3861222
1433923 3861232
1434088 3862293
1435134 3861526
1435144 3858398
....
======================


Woot!


Additional Information


If you see the following error, it means you have disk space issues:

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out

How do I set my own tmp directory for Hadoop?
-Dhadoop.tmp.dir=<dir>
or
-Dmapred.local.dir=<dir>    (for intermediate map/reduce files)
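
The same settings can also be made programmatically before the Job is created. A minimal sketch, reusing the tmp path from earlier in this post (the mapred.local.dir value is illustrative):

=================
import org.apache.hadoop.conf.Configuration;

public class TmpDirConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Equivalent to -Dhadoop.tmp.dir=... on the command line.
        conf.set("hadoop.tmp.dir", "/home/anil/judcon12/tmp");
        conf.set("mapred.local.dir", "/home/anil/judcon12/tmp/mapred");
        // Hand conf to new Job(conf, "...") or ToolRunner.run(conf, tool, args).
    }
}
=================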


References

http://blog.hampisoftware.com

Friday, March 30, 2012

Pig: Deriving meaningful data from SSO logs

If you use the CAS SSO solution, then you can generate logs. I have a text file called "CASLoginLog.txt"; it is basically a snippet of the login trail for 3 days in March. I have changed usernames and many other things, so your file may look a bit different. :)

Step: Generate a CAS SSO Login Trail


===================================
Date    Action  Username        Service         Ticket
28.3.2012 2:28:01       SERVICE_TICKET_CREATED  user1   https://myurl   ST-13133--org-sso
28.3.2012 2:27:30       SERVICE_TICKET_CREATED  user2   https://myurl/url ST-13046--j-sso
28.3.2012 2:27:17       TICKET_GRANTING_TICKET_DESTROYED                        TGT-3380--j-sso
28.3.2012 2:27:17       SERVICE_TICKET_CREATED  user3   https://c/thread/197282?tstart=0        ST-13045-j-sso
28.3.2012 2:27:16       TICKET_GRANTING_TICKET_CREATED  firstlion               TGT-3567--j-sso
28.3.2012 2:26:30       SERVICE_TICKET_CREATED  user4   https://issues.j.org/secure/D.jspa      ST-13044--j-sso
27.3.2012 23:12:37      SERVICE_TICKET_CREATED  user2   https://c/thread/151832?start=15&tstart=0       ST-13048--j-sso
27.3.2012 22:51:51      SERVICE_TICKET_CREATED  user5   https://c/login.jspa    ST-13038--j-sso
27.3.2012 22:51:50      TICKET_GRANTING_TICKET_CREATED  user5           TGT-3527--j-sso
27.3.2012 22:51:49      TICKET_GRANTING_TICKET_CREATED  user5           TGT-3526--j-sso
26.3.2012 14:17:27      SERVICE_TICKET_CREATED  user1   https://c/message/725882?tstart=0       ST-11709--j-sso
26.3.2012 13:02:51      TICKET_GRANTING_TICKET_CREATED  user1           TGT-3223--j-sso
=======================================

So let us try to figure out how many times, in these 3 days, each user was granted a "SERVICE_TICKET_CREATED" action.

I am going to use Apache Pig to generate the output.


Step: Code a Pig Script



My pig script is called CASLog.pig

=====================================
-- Load the log; each row becomes a tuple of six chararray fields.
file = LOAD 'CASLoginLog.txt' USING PigStorage(' ') AS (ticketDate: chararray,ticketTime: chararray,action: chararray,username: chararray,service: chararray,ticket: chararray);

-- Trim whitespace from the fields we care about.
trimmedfile = FOREACH file GENERATE TRIM(ticketDate) as ticketDate, TRIM(action) AS action, TRIM(username) AS username ,TRIM(ticket) AS ticket ;

-- Keep only the service-ticket-creation events.
selectedrows = FILTER trimmedfile BY action == 'SERVICE_TICKET_CREATED';

-- Group those events by user, then count each group.
usersgroup = GROUP selectedrows BY username;

counts = FOREACH usersgroup GENERATE group AS username, COUNT(selectedrows) AS num_digits;

-- Write username=count pairs into the 'result' directory.
STORE counts INTO 'result' USING PigStorage('=');
==========================================




Step:  Execute Apache Pig



Now let me run Pig on this.

===========================================
$ sh ../pig-0.9.2/bin/pig -x local CASLog.pig

....
Input(s):
Successfully read records from: "file:///hadoop/pig/anilpig/CASLoginLog.txt"

Output(s):
Successfully stored records in: "file:///hadoop/pig/anilpig/result"

Job DAG:
job_local_0001


2012-03-30 16:56:09,762 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
============================================

Pig does the Map Reduce magic under the covers and stores the end result in a directory called "result", based on the last statement in the Pig script.

Step: View the results.


========================
$ vi result/part-r-00000

user1=2
user2=2
user3=1
user4=1
user5=1
========================


It took me a couple of hours of trial and error to get the script correct and working. But I had to write zero lines of Apache Hadoop Map Reduce Java code.