Sunday, July 22, 2012

R: Introduction to statistics and R-Studio

This post will describe some of the common operations of the R statistical package.

This post assumes you have already installed R on your Linux machine.

Step 1: Create a work directory.

$ mkdir R_Work

$ cd R_Work/


Step 2: Let us invoke "R".


$ R

R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> q()


To quit R, type q() at the prompt.


Step 3:  Install R-Studio for your Operating System

http://www.rstudio.org/

$>rstudio


Step 4:  Install rJava

It is very important that you get the following command right.

Run it as root:
 # sudo R CMD javareconf JAVA=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/bin/java JAVA_HOME=/usr/java/latest JAVAC=/usr/java/latest/bin/javac JAR=/usr/java/latest/bin/jar JAVAH=/usr/java/latest/bin/javah

Here, JAVA tells R where the OpenJDK java executable is. JAVA_HOME, JAVAC, JAR and JAVAH must likewise point to your JDK and to its javac, jar and javah executables.
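
Since a single wrong path is enough to make javareconf fail, it can help to check each path before running it. The following is only a minimal sketch, assuming the same paths as in the javareconf command above; check_tool is a hypothetical helper name, not something that ships with R or the JDK.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a path
# points to an executable file.
check_tool() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Paths copied from the javareconf command above; adjust them for your system.
for tool in \
    /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/bin/java \
    /usr/java/latest/bin/javac \
    /usr/java/latest/bin/jar \
    /usr/java/latest/bin/javah
do
    check_tool "$tool" || true
done
```

Any line that starts with "missing:" is a path to fix before running javareconf.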


Now we are ready to install rJava. Start R as root and install the package:
# R
> install.packages("rJava")
Installing package(s) into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
Loading Tcl/Tk interface ... done
.....
* DONE (rJava)
Making packages.html  ... done

The downloaded source packages are in
    ‘/tmp/RtmpwsCz85/downloaded_packages’
Updating HTML index of packages in '.Library'
Making packages.html  ... done
>

This was very painful for me; it took me about 20 minutes to get it right.
By the way, there is a thread that may be useful to you: 

The error I was getting was:

Make sure R is configured with full Java support (including JDK). Run
R CMD javareconf
as root to add Java support to R. 
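
Once the install goes through without that error, one way to confirm that R can actually start a JVM is to load rJava and ask Java for its version. This is a minimal sketch, assuming Rscript is on your PATH; verify_rjava is a hypothetical helper name, while .jinit() and .jcall() are functions provided by rJava.

```shell
#!/bin/sh
# verify_rjava is a hypothetical helper name, not part of R or rJava.
verify_rjava() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH"
        return 1
    fi
    # Load rJava, start the JVM, and print the Java version R is using.
    Rscript -e 'library(rJava); .jinit(); cat(.jcall("java/lang/System", "S", "getProperty", "java.version"), "\n")'
}

verify_rjava || echo "rJava check failed"
```

If the library() call fails here, R is most likely still not configured with the right Java paths.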

Reference

http://cran.r-project.org/doc/manuals/R-intro.html

http://svn.r-project.org/R/trunk/src/scripts/javareconf


Installing R statistical package on Fedora

If you are running Fedora (or another flavor of Linux), chances are that a pre-built R package is available.

On Fedora 16, I ran:
$> sudo yum info R
Installed Packages
Name        : R
Arch        : x86_64
Version     : 2.15.1
Release     : 1.fc16
Size        : 0.0 
Repo        : installed
From repo   : updates
Summary     : A language for data analysis and graphics
URL         : http://www.r-project.org
License     : GPLv2+
Description : This is a metapackage that provides both core R userspace and
            : all R development components.
            :
            : R is a language and environment for statistical computing and
            : graphics. R is similar to the award-winning S system, which was
            : developed at Bell Laboratories by John Chambers et al. It provides
            : a wide variety of statistical and graphical techniques (linear and
            : nonlinear modelling, statistical tests, time series analysis,
            : classification, clustering, ...).
            :
            : R is designed as a true computer language with control-flow
            : constructions for iteration and alternation, and it allows users
            : to add additional functionality by defining new functions. For
            : computationally intensive tasks, C, C++ and Fortran code can be
            : linked and called at run time.

To install R on Fedora:
$> sudo yum install R

Now that you have installed R, go read up on it at http://cran.r-project.org/doc/manuals/R-intro.html

Embedding Pig Scripts in Java Applications

You are familiar with running pig scripts via the command line. Now what if you intend to run Apache Pig as part of your Java applications?

Pig Modes


There are two run modes, local and mapreduce, as described at http://pig.apache.org/docs/r0.7.0/setup.html#Run+Modes


PigServer

This is the main class for embedding Apache Pig in your Java applications.


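import java.io.InputStream;
import java.util.List;
import java.util.Map;

import org.apache.pig.PigServer;

=================
// Execution mode: "local" or "mapreduce".
String mode = "local";
// Classpath resource name of the Pig script; set this before running.
String pigScriptName = null;
// Optional key/value parameters, referenced in the script as $key.
Map<String,String> params = null;
// Optional names of files that contain parameters.
List<String> paramFiles = null;

PigServer pigServer = null;
try {
    pigServer = new PigServer(mode);
    pigServer.setBatchOn();
    pigServer.debugOn();
    // Load the script from the classpath.
    InputStream is = getClass().getClassLoader().getResourceAsStream(pigScriptName);
    if (params != null) {
        pigServer.registerScript(is, params);
    } else if (paramFiles != null) {
        pigServer.registerScript(is, paramFiles);
    } else {
        pigServer.registerScript(is);
    }
    // Run everything registered since setBatchOn().
    pigServer.executeBatch();
} catch (Exception e) {
    throw new RuntimeException(e);
} finally {
    if (pigServer != null) {
        pigServer.shutdown();
    }
}
==========================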
Note: the variable mode can be "local" or "mapreduce".

PigServer can take two additional parameters when registering your Pig script.
  • Params: a key/value map whose entries can be referenced in your Pig script as $key.
  • ParamFiles: a list of filenames that contain the parameters.
You can also register the script with the PigServer without providing any params.
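
To make the two options a bit more concrete, here is a small sketch that writes a Pig script using $key-style placeholders and a matching parameter file. The file names (copy.pig, copy.params) and the parameter names (input, output) are made up for illustration; the name=value lines follow Pig's parameter-file format.

```shell
#!/bin/sh
# Hypothetical file and parameter names, for illustration only.
workdir=$(mktemp -d)

# A Pig script that refers to its parameters as $input and $output.
# The quoted 'EOF' stops the shell from expanding them.
cat > "$workdir/copy.pig" <<'EOF'
lines = LOAD '$input' AS (line:chararray);
STORE lines INTO '$output';
EOF

# A parameter file: one name=value pair per line, '#' starts a comment.
cat > "$workdir/copy.params" <<'EOF'
# parameters for copy.pig
input=data/input.txt
output=output/somedir
EOF

echo "$workdir"
```

A file like copy.params is what you would list in paramFiles; the same two pairs put into a Map would be passed as params instead.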

Do not forget to bookmark this blog. :)

All the best!

Reference: http://everythingbigdata.blogspot.com/2012/07/apache-pig-tips.html

Saturday, July 14, 2012

Hadoop with Drools, Infinispan, PicketLink etc

Here are the slides that I used at JUDCON 2012 in Boston.
http://www.jboss.org/dms/judcon/2012boston/presentations/judcon2012boston_day1track3session2.pdf

In your Map/Reduce programs, you should be able to use any Java library of your choice. For example:

  • Infinispan data grids, to send cache events from your Map/Reduce programs.
  • The Drools rules engine, to apply rules in your M/R programs.
  • PicketLink, to bring security aspects to your M/R programs.

Thursday, July 12, 2012

Apache Pig Tips

1) If the directory you are going to STORE into may already exist, remove it first; otherwise the STORE fails because the output location already exists. rmf does not complain if the directory is missing.

rmf output/somedir
STORE xyz INTO 'output/somedir' USING PigStorage();

2) How do I use Pig via Java?
Look at the PigServer usage in the post "Embedding Pig Scripts in Java Applications".