
Friday, May 25, 2012

Apache Hadoop Map Reduce - Advanced Example


Map Reduce Example

The latest, updated version of this work is at http://blog.hampisoftware.com/?p=20

Solve the same problem using Apache Spark: http://blog.hampisoftware.com/?p=41


Use the links above; the material below is dated.


Continuing with my experiments, I attempted the patent citation example from the book "Hadoop In Action" by Chuck Lam.


Data Set

Visit http://nber.org/patents/
Choose acite75_99.zip; unzipping it yields cite75_99.txt.

Each line of this file lists a citing patent followed by the patent it cites.  In this MapReduce example, we attempt the reverse: for a given patent, find all the patents that cite it.
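To make the inversion concrete, here are a few sample lines in the file's citing,cited format together with the output we want the job to produce (the patent numbers are invented for the illustration):

==================
Input lines (citing,cited):
3858241,956203
3858242,956203
3858241,1324234

Desired output (cited, then the list of citing patents):
956203    3858241,3858242
1324234   3858241
==================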

What do you need?

Download Hadoop v1.0.3 from http://hadoop.apache.org
I downloaded hadoop-1.0.3.tar.gz (60MB)
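For reference, a minimal sketch of one way to compile and run the program below in Hadoop's local (standalone) mode straight from the extracted tarball (the classes directory and exact classpath here are assumptions, not taken from the original run):

==================
tar zxf hadoop-1.0.3.tar.gz
mkdir classes
javac -cp hadoop-1.0.3/hadoop-core-1.0.3.jar -d classes PatentCitation.java
java -cp classes:hadoop-1.0.3/hadoop-core-1.0.3.jar:hadoop-1.0.3/lib/* \
    mapreduce.PatentCitation input output
==================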

Map Reduce Program

Note: the program below is broken.  The commentary following the listing explains why, and a working version is given at the end.

==================
package mapreduce;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PatentCitation extends Configured implements Tool{

    public static class PatentCitationMapper extends Mapper<Text,Text,Text,Text> {
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // Intended to emit (cited, citing) by swapping key and value. However,
            // KeyValueTextInputFormat splits each line on a tab, and these lines have
            // none: the whole "citing,cited" record arrives as the key and the value
            // is empty, so every output record ends up under a single empty key.
            context.write(value, key);
        }
    }
   
    public static class PatentCitationReducer extends Reducer<Text,Text,Text,Text>{
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String csv = "";
            Iterator<Text> iterator = values.iterator();
            while(iterator.hasNext()){
                if(csv.length() > 0 ) csv += ",";
                csv += iterator.next().toString();
            }
            context.write(key, new Text(csv));
        }
    }
   
    private  void deleteFilesInDirectory(File f) throws IOException {
        if (f.isDirectory()) {
            for (File c : f.listFiles())
                deleteFilesInDirectory(c);
        }
        if (!f.delete())
            throw new FileNotFoundException("Failed to delete file: " + f);
    }
   
    @Override
    public int run(String[] args) throws Exception {
        if(args.length < 2)
            throw new IllegalArgumentException("Please provide input and output paths");
       
        Path inputPath = new Path(args[0]);
        File outputDir = new File(args[1]);
        deleteFilesInDirectory(outputDir);
        Path outputPath = new Path(args[1]);

        Job job = new Job(getConf(), "Hadoop Patent Citation Example");
        job.setJarByClass(PatentCitation.class);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
       
        job.setMapperClass(PatentCitationMapper.class);
        job.setReducerClass(PatentCitationReducer.class);
       
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        return job.waitForCompletion(false) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new PatentCitation(), args));
    }
}


I created a directory called "input" where cite75_99.txt was placed.  I also created an empty directory called "output".  These directories form the input and output for the M/R program.

Execution

First Attempt:

I executed the program as is. It choked because the job's temporary files filled /tmp and exhausted the disk space on my root filesystem.

Second Attempt:

Now I explicitly configure the Hadoop tmp directory so I can place the tmp files wherever I want:
-Dhadoop.tmp.dir=/home/anil/judcon12/tmp
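Since the job runs through ToolRunner, such -D options can be passed as program arguments ahead of the input/output paths, where Hadoop's GenericOptionsParser picks them up. A sketch, reusing the assumed classpath from earlier:

==================
java -cp classes:hadoop-1.0.3/hadoop-core-1.0.3.jar:hadoop-1.0.3/lib/* \
    mapreduce.PatentCitation -Dhadoop.tmp.dir=/home/anil/judcon12/tmp input output
==================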

OK, now the program did not choke.  Instead it just ran and ran; after 2+ hours I killed it. It seems the data set is too large to finish in a short duration.
The culprit was the reduce phase: it simply did not finish.

Third Attempt:

Now I tried to configure the number of reducers to zero so I could view the output of the map phase.

I tried the property -Dmapred.reduce.tasks=0.  It made no difference (most likely because the option was not passed where Hadoop's GenericOptionsParser could pick it up).

I then set the reducer count directly on the Job class. That worked.

        job.setNumReduceTasks(0);

OK, now the program ran the map phase only.

=======================
May 25, 2012 4:40:42 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
May 25, 2012 4:40:42 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
May 25, 2012 4:40:42 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
May 25, 2012 4:40:42 PM org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
WARNING: Snappy native library not loaded
May 25, 2012 4:40:42 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
INFO: setsid exited with exit code 0
May 25, 2012 4:40:42 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@420f9c40
May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000000_0 is allowed to commit now
May 25, 2012 4:40:44 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000000_0' to /home/anil/judcon12/output
May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@527736bd
May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000001_0 is allowed to commit now
May 25, 2012 4:40:46 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000001_0' to /home/anil/judcon12/output
May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000001_0' done.
May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5749b290
May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting
May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000002_0 is allowed to commit now
May 25, 2012 4:40:49 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000002_0' to /home/anil/judcon12/output
May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000002_0' done.
May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2a8ceeea
May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000003_0 is allowed to commit now
May 25, 2012 4:40:52 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000003_0' to /home/anil/judcon12/output
May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000003_0' done.
May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@46238a47
May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000004_0 is done. And is in the process of commiting
May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000004_0 is allowed to commit now
May 25, 2012 4:40:55 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000004_0' to /home/anil/judcon12/output
May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000004_0' done.
May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@559113f8
May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000005_0 is done. And is in the process of commiting
May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000005_0 is allowed to commit now
May 25, 2012 4:40:58 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000005_0' to /home/anil/judcon12/output
May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000005_0' done.
May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@76a9b9c
May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000006_0 is done. And is in the process of commiting
May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000006_0 is allowed to commit now
May 25, 2012 4:41:01 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000006_0' to /home/anil/judcon12/output
May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000006_0' done.
May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@560c3014
May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000007_0 is done. And is in the process of commiting
May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000007_0 is allowed to commit now
May 25, 2012 4:41:04 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000007_0' to /home/anil/judcon12/output
May 25, 2012 4:41:06 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 4:41:06 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000007_0' done.
=======================

So the map phase takes between 1 and 2 minutes.  It is the reduce phase that does not end for me.  I will evaluate why that is the case. :)

Let us see what the output of the map phase looks like:

=============================
anil@sadbhav:~/judcon12/output$ ls
part-m-00000  part-m-00002  part-m-00004  part-m-00006  _SUCCESS
part-m-00001  part-m-00003  part-m-00005  part-m-00007
==============================

You can see the end result of the map phase in these files.

Now, on to figuring out why the reduce phase would not finish.

Let me look at the file size:

===================
 anil@sadbhav:~/judcon12/input$ wc -l cite75_99.txt
16522439 cite75_99.txt
===================

OMG!  That is roughly 16.5 million entries.  Too much for a laptop to handle.


Let us try another experiment: choose 20000 patent citations from this file and save them to 20000cite.txt.
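One quick way to carve out that sample, taking the first 20000 lines (a sketch with standard Unix tools):

==================
head -n 20000 cite75_99.txt > 20000cite.txt
==================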

When I run the program now, the M/R execution takes all of 10 seconds to finish.

Let us view the results of the M/R execution.

=============================
anil@sadbhav:~/judcon12/output$ ls
part-r-00000  _SUCCESS
============================

When I look inside part-r-00000, I see one long line of CSV patent citations.  Yeah, you are right.  The job is busted.  It is not working, and I need to fix it. (As the working version below shows, the real culprit is the mapper: with no tab on the line, every record is emitted under a single empty key, so the lone reducer tries to concatenate all 16.5 million citations into one value.)
That is the next step.  This exercise was a failure, but there was a lesson here: if you mess up, you will wait. :)




Ok,  here is the updated map reduce program that works:
======================
package mapreduce;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PatentCitation extends Configured implements Tool{

    public static class PatentCitationMapper extends Mapper<Text,Text,Text,Text> {
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {

            // No tab on the line, so the whole "citing,cited" record arrives as
            // the key; split it and emit (cited, citing)
            String[] citation = key.toString().split(",");
            context.write(new Text(citation[1]), new Text(citation[0]));
        }
    }

    public static class PatentCitationReducer extends Reducer<Text,Text,Text,Text>{
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String csv = "";
            for(Text val:values){
                if(csv.length() > 0 ) csv += ",";
                csv += val.toString();
            }
            context.write(key, new Text(csv));
        }
    }

    private  void deleteFilesInDirectory(File f) throws IOException {
        if (f.isDirectory()) {
            for (File c : f.listFiles())
                deleteFilesInDirectory(c);
        }
        if (!f.delete())
            throw new FileNotFoundException("Failed to delete file: " + f);
    }

    @Override
    public int run(String[] args) throws Exception {
        if(args.length < 2)
            throw new IllegalArgumentException("Please provide input and output paths");

        Path inputPath = new Path(args[0]);
        File outputDir = new File(args[1]);
        deleteFilesInDirectory(outputDir);
        Path outputPath = new Path(args[1]);

        Job job = new Job(getConf(), "Hadoop Patent Citation Example");
        job.setJarByClass(PatentCitation.class);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(PatentCitationMapper.class);
        job.setReducerClass(PatentCitationReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        //job.setNumReduceTasks(10000);

        return job.waitForCompletion(false) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new PatentCitation(), args));
    }
}
=======================================
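The only functional change is in the mapper, which now splits the comma-separated key and emits (cited, citing). One further hardening worth considering: building the CSV with repeated String concatenation is quadratic in the number of values per key, which likely contributed to the endless reduce when every record shared one key. A StringBuilder variant of the reducer (a sketch; behavior is otherwise identical):

======================
    public static class PatentCitationReducer extends Reducer<Text,Text,Text,Text>{
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // StringBuilder avoids the O(n^2) cost of repeated String concatenation
            StringBuilder csv = new StringBuilder();
            for (Text val : values) {
                if (csv.length() > 0) csv.append(',');
                csv.append(val.toString());
            }
            context.write(key, new Text(csv.toString()));
        }
    }
======================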


Running updated program

================================
May 25, 2012 6:14:26 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
May 25, 2012 6:14:26 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
May 25, 2012 6:14:26 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
May 25, 2012 6:14:26 PM org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
WARNING: Snappy native library not loaded
May 25, 2012 6:14:27 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
INFO: setsid exited with exit code 0
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@420f9c40
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3a56860b
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 359270 bytes
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_r_000000_0 is allowed to commit now
May 25, 2012 6:14:30 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_r_000000_0' to /home/anil/judcon12/output
May 25, 2012 6:14:33 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
May 25, 2012 6:14:33 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_r_000000_0' done.
=================================


Let us look at the output:
===============
anil@sadbhav:~/judcon12/output$ ls
part-r-00000  _SUCCESS
================

If you look inside the part-r-00000 file, you will see the results (the first line is the CSV header row of the input file, carried through as data):

======================
 "CITED" "CITING"
1000715 3861270
1001069 3858600
1001170 3861317
1001597 3861811
1004288 3861154
1006393 3861066
1006952 3860293
.......
1429311 3861187
1429835 3860154
1429968 3860060
1430491 3859976
1431444 3861601
1431718 3859022
1432243 3861774
1432467 3862478
1433649 3861223,3861222
1433923 3861232
1434088 3862293
1435134 3861526
1435144 3858398
....
======================


Woot!


Additional Information


If you see the following error, it means you have disk space issues.

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out


How do I set my own tmp directory for Hadoop?
-Dhadoop.tmp.dir=
or
-Dmapred.local.dir=    (for map/reduce)


References

http://blog.hampisoftware.com

Sunday, February 19, 2012

Trying out Apache Pig

I was going through Alex P's blog posts, and one post that attracted my attention was related to Apache Pig.  I have been thinking of playing with Pig Latin scripts to express MapReduce functionality.

I tried to use Pig to run Alex's script.  He gives the input values, the Pig script, and the output from Pig, but there is no information on how to use Pig. That is fine. He just wants the reader to go through the Pig manual. :)

Here is what I tried out:

1)  Downloaded Apache Pig 0.9.2 (the latest version at the time).
2) The script from Alex uses PiggyBank, which lives in the Pig contrib directory. Looks like I will have to build Pig.

=====================
pig_directory $>  ant

...

[javacc] Java Compiler Compiler Version 4.2 (Parser Generator)
   [javacc] (type "javacc" with no arguments for help)
...
   [javacc] File "SimpleCharStream.java" is being rebuilt.
   [javacc] Parser generated successfully.

prepare:
    [mkdir] Created dir: xxx/pig-0.9.2/src-gen/org/apache/pig/parser

genLexer:

genParser:

genTreeParser:

gen:

compile:
     [echo] *** Building Main Sources ***
     [echo] *** To compile with all warnings enabled, supply -Dall.warnings=1 on command line ***
     [echo] *** If all.warnings property is supplied, compile-sources-all-warnings target will be executed ***
     [echo] *** Else, compile-sources (which only warns about deprecations) target will be executed ***
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

compile-sources:
    [javac] xxx/pig-0.9.2/build.xml:429: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 667 source files to /home/anil/hadoop/pig/pig-0.9.2/build/classes
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
     [copy] Copying 1 file to xxx/pig-0.9.2/build/classes/org/apache/pig/tools/grunt
     [copy] Copying 1 file to xxx/pig-0.9.2/build/classes/org/apache/pig/tools/grunt
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

compile-sources-all-warnings:

jar:
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

jarWithSvn:
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

ivy-download:
      [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
      [get] To: /home/anil/pig/pig-0.9.2/ivy/ivy-2.2.0.jar
      [get] Not modified - so not downloaded

ivy-init-dirs:

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:

ivy-buildJar:
[ivy:resolve] :: resolving dependencies :: org.apache.pig#Pig;0.9.3-SNAPSHOT
[ivy:resolve]     confs: [buildJar]
[ivy:resolve]     found com.sun.jersey#jersey-core;1.8 in maven2
[ivy:resolve]     found org.apache.hadoop#hadoop-core;1.0.0 in maven2
[ivy:resolve]     found commons-cli#commons-cli;1.2 in maven2
[ivy:resolve]     found xmlenc#xmlenc;0.52 in maven2
[ivy:resolve]     found commons-httpclient#commons-httpclient;3.0.1 in maven2
[ivy:resolve]     found commons-codec#commons-codec;1.4 in maven2
[ivy:resolve]     found org.apache.commons#commons-math;2.1 in maven2
[ivy:resolve]     found commons-configuration#commons-configuration;1.6 in maven2
[ivy:resolve]     found commons-collections#commons-collections;3.2.1 in maven2
[ivy:resolve]     found commons-lang#commons-lang;2.4 in maven2
[ivy:resolve]     found commons-logging#commons-logging;1.1.1 in maven2
[ivy:resolve]     found commons-digester#commons-digester;1.8 in maven2
[ivy:resolve]     found commons-beanutils#commons-beanutils;1.7.0 in maven2
[ivy:resolve]     found commons-beanutils#commons-beanutils-core;1.8.0 in maven2
[ivy:resolve]     found commons-net#commons-net;1.4.1 in maven2
[ivy:resolve]     found oro#oro;2.0.8 in maven2
[ivy:resolve]     found org.mortbay.jetty#jetty;6.1.26 in maven2
[ivy:resolve]     found org.mortbay.jetty#jetty-util;6.1.26 in maven2
[ivy:resolve]     found org.mortbay.jetty#servlet-api;2.5-20081211 in maven2
[ivy:resolve]     found tomcat#jasper-runtime;5.5.12 in maven2
[ivy:resolve]     found tomcat#jasper-compiler;5.5.12 in maven2
[ivy:resolve]     found org.mortbay.jetty#jsp-api-2.1;6.1.14 in maven2
[ivy:resolve]     found org.mortbay.jetty#servlet-api-2.5;6.1.14 in maven2
[ivy:resolve]     found org.mortbay.jetty#jsp-2.1;6.1.14 in maven2
[ivy:resolve]     found org.eclipse.jdt#core;3.1.1 in maven2
[ivy:resolve]     found ant#ant;1.6.5 in maven2
[ivy:resolve]     found commons-el#commons-el;1.0 in maven2
[ivy:resolve]     found net.java.dev.jets3t#jets3t;0.7.1 in maven2
[ivy:resolve]     found net.sf.kosmosfs#kfs;0.3 in maven2
[ivy:resolve]     found hsqldb#hsqldb;1.8.0.10 in maven2
[ivy:resolve]     found org.apache.hadoop#hadoop-test;1.0.0 in maven2
[ivy:resolve]     found org.apache.ftpserver#ftplet-api;1.0.0 in maven2
[ivy:resolve]     found org.apache.mina#mina-core;2.0.0-M5 in maven2
[ivy:resolve]     found org.slf4j#slf4j-api;1.5.2 in maven2
[ivy:resolve]     found org.apache.ftpserver#ftpserver-core;1.0.0 in maven2
[ivy:resolve]     found org.apache.ftpserver#ftpserver-deprecated;1.0.0-M2 in maven2
[ivy:resolve]     found log4j#log4j;1.2.16 in maven2
[ivy:resolve]     found org.slf4j#slf4j-log4j12;1.6.1 in maven2
[ivy:resolve]     found org.slf4j#slf4j-api;1.6.1 in maven2
[ivy:resolve]     found org.apache.avro#avro;1.5.3 in maven2
[ivy:resolve]     found com.googlecode.json-simple#json-simple;1.1 in maven2
[ivy:resolve]     found com.jcraft#jsch;0.1.38 in maven2
[ivy:resolve]     found jline#jline;0.9.94 in maven2
[ivy:resolve]     found net.java.dev.javacc#javacc;4.2 in maven2
[ivy:resolve]     found org.codehaus.jackson#jackson-mapper-asl;1.7.3 in maven2
[ivy:resolve]     found org.codehaus.jackson#jackson-core-asl;1.7.3 in maven2
[ivy:resolve]     found joda-time#joda-time;1.6 in maven2
[ivy:resolve]     found com.google.guava#guava;11.0 in maven2
[ivy:resolve]     found org.python#jython;2.5.0 in maven2
[ivy:resolve]     found rhino#js;1.7R2 in maven2
[ivy:resolve]     found org.antlr#antlr;3.4 in maven2
[ivy:resolve]     found org.antlr#antlr-runtime;3.4 in maven2
[ivy:resolve]     found org.antlr#stringtemplate;3.2.1 in maven2
[ivy:resolve]     found antlr#antlr;2.7.7 in maven2
[ivy:resolve]     found org.antlr#ST4;4.0.4 in maven2
[ivy:resolve]     found org.apache.zookeeper#zookeeper;3.3.3 in maven2
[ivy:resolve]     found org.jboss.netty#netty;3.2.2.Final in maven2
[ivy:resolve]     found org.apache.hbase#hbase;0.90.0 in maven2
[ivy:resolve]     found org.vafer#jdeb;0.8 in maven2
[ivy:resolve]     found junit#junit;4.5 in maven2
[ivy:resolve]     found org.apache.hive#hive-exec;0.8.0 in maven2
[ivy:resolve] downloading http://repo2.maven.org/maven2/junit/junit/4.5/junit-4.5.jar ...
[ivy:resolve] .......... (194kB)
[ivy:resolve] ... (0kB)
[ivy:resolve]     [SUCCESSFUL ] junit#junit;4.5!junit.jar (822ms)
[ivy:resolve] downloading http://repo2.maven.org/maven2/org/apache/hive/hive-exec/0.8.0/hive-exec-0.8.0.jar ...
[ivy:resolve] .......... (3372kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]     [SUCCESSFUL ] org.apache.hive#hive-exec;0.8.0!hive-exec.jar (2262ms)
[ivy:resolve] :: resolution report :: resolve 9172ms :: artifacts dl 3114ms
[ivy:resolve]     :: evicted modules:
[ivy:resolve]     junit#junit;3.8.1 by [junit#junit;4.5] in [buildJar]
[ivy:resolve]     commons-logging#commons-logging;1.0.3 by [commons-logging#commons-logging;1.1.1] in [buildJar]
[ivy:resolve]     commons-codec#commons-codec;1.2 by [commons-codec#commons-codec;1.4] in [buildJar]
[ivy:resolve]     commons-logging#commons-logging;1.1 by [commons-logging#commons-logging;1.1.1] in [buildJar]
[ivy:resolve]     commons-codec#commons-codec;1.3 by [commons-codec#commons-codec;1.4] in [buildJar]
[ivy:resolve]     commons-httpclient#commons-httpclient;3.1 by [commons-httpclient#commons-httpclient;3.0.1] in [buildJar]
[ivy:resolve]     org.codehaus.jackson#jackson-mapper-asl;1.0.1 by [org.codehaus.jackson#jackson-mapper-asl;1.7.3] in [buildJar]
[ivy:resolve]     org.slf4j#slf4j-api;1.5.2 by [org.slf4j#slf4j-api;1.6.1] in [buildJar]
[ivy:resolve]     org.apache.mina#mina-core;2.0.0-M4 by [org.apache.mina#mina-core;2.0.0-M5] in [buildJar]
[ivy:resolve]     org.apache.ftpserver#ftplet-api;1.0.0-M2 by [org.apache.ftpserver#ftplet-api;1.0.0] in [buildJar]
[ivy:resolve]     org.apache.ftpserver#ftpserver-core;1.0.0-M2 by [org.apache.ftpserver#ftpserver-core;1.0.0] in [buildJar]
[ivy:resolve]     org.apache.mina#mina-core;2.0.0-M2 by [org.apache.mina#mina-core;2.0.0-M5] in [buildJar]
[ivy:resolve]     commons-cli#commons-cli;1.0 by [commons-cli#commons-cli;1.2] in [buildJar]
[ivy:resolve]     org.antlr#antlr-runtime;3.3 by [org.antlr#antlr-runtime;3.4] in [buildJar]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |     buildJar     |   74  |   2   |   2   |   14  ||   61  |   2   |
    ---------------------------------------------------------------------
[ivy:retrieve] :: retrieving :: org.apache.pig#Pig
[ivy:retrieve]     confs: [buildJar]
[ivy:retrieve]     3 artifacts copied, 58 already retrieved (3855kB/20ms)

buildJar:
     [echo] svnString exported
      [jar] Building jar: /home/anil/hadoop/pig/pig-0.9.2/build/pig-0.9.3-SNAPSHOT-core.jar
      [jar] Building jar: /home/anil/hadoop/pig/pig-0.9.2/build/pig-0.9.3-SNAPSHOT.jar
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

include-meta:
     [copy] Copying 1 file to /home/anil/hadoop/pig/pig-0.9.2
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

jarWithOutSvn:

jar-withouthadoop:
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

jar-withouthadoopWithSvn:
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

ivy-download:
      [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
      [get] To: /home/anil/hadoop/pig/pig-0.9.2/ivy/ivy-2.2.0.jar
      [get] Not modified - so not downloaded

ivy-init-dirs:

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:

ivy-buildJar:
[ivy:resolve] :: resolving dependencies :: org.apache.pig#Pig;0.9.3-SNAPSHOT
[ivy:resolve]     confs: [buildJar]
[ivy:resolve]     found com.sun.jersey#jersey-core;1.8 in maven2
[ivy:resolve]     found org.apache.hadoop#hadoop-core;1.0.0 in maven2
[ivy:resolve]     found commons-cli#commons-cli;1.2 in maven2
[ivy:resolve]     found xmlenc#xmlenc;0.52 in maven2
[ivy:resolve]     found commons-httpclient#commons-httpclient;3.0.1 in maven2
[ivy:resolve]     found commons-codec#commons-codec;1.4 in maven2
[ivy:resolve]     found org.apache.commons#commons-math;2.1 in maven2
[ivy:resolve]     found commons-configuration#commons-configuration;1.6 in maven2
[ivy:resolve]     found commons-collections#commons-collections;3.2.1 in maven2
[ivy:resolve]     found commons-lang#commons-lang;2.4 in maven2
[ivy:resolve]     found commons-logging#commons-logging;1.1.1 in maven2
[ivy:resolve]     found commons-digester#commons-digester;1.8 in maven2
[ivy:resolve]     found commons-beanutils#commons-beanutils;1.7.0 in maven2
[ivy:resolve]     found commons-beanutils#commons-beanutils-core;1.8.0 in maven2
[ivy:resolve]     found commons-net#commons-net;1.4.1 in maven2
[ivy:resolve]     found oro#oro;2.0.8 in maven2
[ivy:resolve]     found org.mortbay.jetty#jetty;6.1.26 in maven2
[ivy:resolve]     found org.mortbay.jetty#jetty-util;6.1.26 in maven2
[ivy:resolve]     found org.mortbay.jetty#servlet-api;2.5-20081211 in maven2
[ivy:resolve]     found tomcat#jasper-runtime;5.5.12 in maven2
[ivy:resolve]     found tomcat#jasper-compiler;5.5.12 in maven2
[ivy:resolve]     found org.mortbay.jetty#jsp-api-2.1;6.1.14 in maven2
[ivy:resolve]     found org.mortbay.jetty#servlet-api-2.5;6.1.14 in maven2
[ivy:resolve]     found org.mortbay.jetty#jsp-2.1;6.1.14 in maven2
[ivy:resolve]     found org.eclipse.jdt#core;3.1.1 in maven2
[ivy:resolve]     found ant#ant;1.6.5 in maven2
[ivy:resolve]     found commons-el#commons-el;1.0 in maven2
[ivy:resolve]     found net.java.dev.jets3t#jets3t;0.7.1 in maven2
[ivy:resolve]     found net.sf.kosmosfs#kfs;0.3 in maven2
[ivy:resolve]     found hsqldb#hsqldb;1.8.0.10 in maven2
[ivy:resolve]     found org.apache.hadoop#hadoop-test;1.0.0 in maven2
[ivy:resolve]     found org.apache.ftpserver#ftplet-api;1.0.0 in maven2
[ivy:resolve]     found org.apache.mina#mina-core;2.0.0-M5 in maven2
[ivy:resolve]     found org.slf4j#slf4j-api;1.5.2 in maven2
[ivy:resolve]     found org.apache.ftpserver#ftpserver-core;1.0.0 in maven2
[ivy:resolve]     found org.apache.ftpserver#ftpserver-deprecated;1.0.0-M2 in maven2
[ivy:resolve]     found log4j#log4j;1.2.16 in maven2
[ivy:resolve]     found org.slf4j#slf4j-log4j12;1.6.1 in maven2
[ivy:resolve]     found org.slf4j#slf4j-api;1.6.1 in maven2
[ivy:resolve]     found org.apache.avro#avro;1.5.3 in maven2
[ivy:resolve]     found com.googlecode.json-simple#json-simple;1.1 in maven2
[ivy:resolve]     found com.jcraft#jsch;0.1.38 in maven2
[ivy:resolve]     found jline#jline;0.9.94 in maven2
[ivy:resolve]     found net.java.dev.javacc#javacc;4.2 in maven2
[ivy:resolve]     found org.codehaus.jackson#jackson-mapper-asl;1.7.3 in maven2
[ivy:resolve]     found org.codehaus.jackson#jackson-core-asl;1.7.3 in maven2
[ivy:resolve]     found joda-time#joda-time;1.6 in maven2
[ivy:resolve]     found com.google.guava#guava;11.0 in maven2
[ivy:resolve]     found org.python#jython;2.5.0 in maven2
[ivy:resolve]     found rhino#js;1.7R2 in maven2
[ivy:resolve]     found org.antlr#antlr;3.4 in maven2
[ivy:resolve]     found org.antlr#antlr-runtime;3.4 in maven2
[ivy:resolve]     found org.antlr#stringtemplate;3.2.1 in maven2
[ivy:resolve]     found antlr#antlr;2.7.7 in maven2
[ivy:resolve]     found org.antlr#ST4;4.0.4 in maven2
[ivy:resolve]     found org.apache.zookeeper#zookeeper;3.3.3 in maven2
[ivy:resolve]     found org.jboss.netty#netty;3.2.2.Final in maven2
[ivy:resolve]     found org.apache.hbase#hbase;0.90.0 in maven2
[ivy:resolve]     found org.vafer#jdeb;0.8 in maven2
[ivy:resolve]     found junit#junit;4.5 in maven2
[ivy:resolve]     found org.apache.hive#hive-exec;0.8.0 in maven2
[ivy:resolve] :: resolution report :: resolve 168ms :: artifacts dl 15ms
[ivy:resolve]     :: evicted modules:
[ivy:resolve]     junit#junit;3.8.1 by [junit#junit;4.5] in [buildJar]
[ivy:resolve]     commons-logging#commons-logging;1.0.3 by [commons-logging#commons-logging;1.1.1] in [buildJar]
[ivy:resolve]     commons-codec#commons-codec;1.2 by [commons-codec#commons-codec;1.4] in [buildJar]
[ivy:resolve]     commons-logging#commons-logging;1.1 by [commons-logging#commons-logging;1.1.1] in [buildJar]
[ivy:resolve]     commons-codec#commons-codec;1.3 by [commons-codec#commons-codec;1.4] in [buildJar]
[ivy:resolve]     commons-httpclient#commons-httpclient;3.1 by [commons-httpclient#commons-httpclient;3.0.1] in [buildJar]
[ivy:resolve]     org.codehaus.jackson#jackson-mapper-asl;1.0.1 by [org.codehaus.jackson#jackson-mapper-asl;1.7.3] in [buildJar]
[ivy:resolve]     org.slf4j#slf4j-api;1.5.2 by [org.slf4j#slf4j-api;1.6.1] in [buildJar]
[ivy:resolve]     org.apache.mina#mina-core;2.0.0-M4 by [org.apache.mina#mina-core;2.0.0-M5] in [buildJar]
[ivy:resolve]     org.apache.ftpserver#ftplet-api;1.0.0-M2 by [org.apache.ftpserver#ftplet-api;1.0.0] in [buildJar]
[ivy:resolve]     org.apache.ftpserver#ftpserver-core;1.0.0-M2 by [org.apache.ftpserver#ftpserver-core;1.0.0] in [buildJar]
[ivy:resolve]     org.apache.mina#mina-core;2.0.0-M2 by [org.apache.mina#mina-core;2.0.0-M5] in [buildJar]
[ivy:resolve]     commons-cli#commons-cli;1.0 by [commons-cli#commons-cli;1.2] in [buildJar]
[ivy:resolve]     org.antlr#antlr-runtime;3.3 by [org.antlr#antlr-runtime;3.4] in [buildJar]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |     buildJar     |   74  |   0   |   0   |   14  ||   61  |   0   |
    ---------------------------------------------------------------------
[ivy:retrieve] :: retrieving :: org.apache.pig#Pig
[ivy:retrieve]     confs: [buildJar]
[ivy:retrieve]     0 artifacts copied, 61 already retrieved (0kB/9ms)

buildJar-withouthadoop:
     [echo] svnString exported
      [jar] Building jar: /home/anil/hadoop/pig/pig-0.9.2/build/pig-0.9.3-SNAPSHOT-withouthadoop.jar
     [copy] Copying 1 file to /home/anil/hadoop/pig/pig-0.9.2
  [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

jar-withouthadoopWithOutSvn:

jar-all:

BUILD SUCCESSFUL
Total time: 5 minutes 38 seconds
==========================

Looks like Pig was built successfully.  This step was needed to build PiggyBank.

Now go to the directory where PiggyBank resides:

=====================
anil@sadbhav:~/hadoop/pig/pig-0.9.2/contrib/piggybank/java$ ant
Buildfile: /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/build.xml

init:

compile:
     [echo]  *** Compiling Pig UDFs ***
    [javac] /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/build.xml:92: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 153 source files to /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/build/classes
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.

jar:
     [echo]  *** Creating pigudf.jar ***
      [jar] Building jar: /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/piggybank.jar

BUILD SUCCESSFUL
Total time: 3 seconds
======================================


3)  Now I have a directory to test my Pig scripts.
Let us call it "anilpig".

I create the following Pig script (distance.pig), which is a direct copy of what Alex has:

======================================
REGISTER /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/piggybank.jar;

define radians org.apache.pig.piggybank.evaluation.math.toRadians();
define sin org.apache.pig.piggybank.evaluation.math.SIN();
define cos org.apache.pig.piggybank.evaluation.math.COS();
define sqrt org.apache.pig.piggybank.evaluation.math.SQRT();
define atan2 org.apache.pig.piggybank.evaluation.math.ATAN2();

geo = load 'haversine.csv' using PigStorage(';') as (id1: long, lat1: double, lon1: double);
geo2 = load 'haversine.csv' using PigStorage(';') as (id2: long, lat2: double, lon2: double);

geoCross = CROSS geo, geo2;

geoDist = FOREACH geoCross GENERATE id1, id2, 6371 * 2 * atan2(sqrt(sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)), sqrt(1 - (sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)))) as dist;

dump geoDist;
======================================

Please do not forget to update the path to piggybank.jar.
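As a sanity check on the numbers, here is a minimal Java sketch of the same haversine computation the script performs (6371 is the Earth's mean radius in kilometers, as in the script; the sample coordinates are the two rows from haversine.csv below):

======================================
public class Haversine {

    // Haversine great-circle distance in kilometers
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        // The two rows from haversine.csv; prints roughly 1.7239
        System.out.println(distanceKm(48.8583, 2.2945, 48.8738, 2.295));
    }
}
======================================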

I also create the following haversine.csv file (format: id;latitude;longitude):
===============
1;48.8583;2.2945
2;48.8738;2.295
================

4)  Let us run pig to see if the values match what Alex quotes in his blog post.

==================
~/hadoop/pig/anilpig$ ../pig-0.9.2/bin/pig -x local distance.pig

which: no hadoop in (/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/bin:/usr/sbin:/usr/java/jdk1.6.0_30/bin:/opt/apache-maven-3.0.2/bin:/home/anil/.local/bin:/home/anil/bin:/usr/bin:/usr/sbin:/usr/java/jdk1.6.0_30/bin:/opt/apache-maven-3.0.2/bin)
2012-02-19 12:05:13,316 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/anil/hadoop/pig/anilpig/pig_1329674713314.log
2012-02-19 12:05:13,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2012-02-19 12:05:13,911 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 10 time(s).
2012-02-19 12:05:13,916 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: CROSS
2012-02-19 12:05:14,051 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-02-19 12:05:14,083 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POJoinPackage
2012-02-19 12:05:14,090 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-02-19 12:05:14,090 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-02-19 12:05:14,108 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-02-19 12:05:14,114 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-02-19 12:05:14,133 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-02-19 12:05:14,148 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=66
2012-02-19 12:05:14,148 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
2012-02-19 12:05:14,224 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-02-19 12:05:14,234 [Thread-2] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-02-19 12:05:14,239 [Thread-2] WARN  org.apache.hadoop.mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2012-02-19 12:05:14,315 [Thread-2] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-19 12:05:14,315 [Thread-2] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-02-19 12:05:14,323 [Thread-2] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-02-19 12:05:14,329 [Thread-2] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-19 12:05:14,329 [Thread-2] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-02-19 12:05:14,329 [Thread-2] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-02-19 12:05:14,562 [Thread-3] INFO  org.apache.hadoop.util.ProcessTree - setsid exited with exit code 0
2012-02-19 12:05:14,564 [Thread-3] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@313816e0
2012-02-19 12:05:14,578 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2012-02-19 12:05:14,600 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2012-02-19 12:05:14,600 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2012-02-19 12:05:14,636 [Thread-3] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Created input record counter: Input records from _0_haversine.csv
2012-02-19 12:05:14,638 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2012-02-19 12:05:14,643 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - Finished spill 0
2012-02-19 12:05:14,645 [Thread-3] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2012-02-19 12:05:14,725 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
2012-02-19 12:05:14,725 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-02-19 12:05:17,545 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2012-02-19 12:05:17,546 [Thread-3] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_000000_0' done.
2012-02-19 12:05:17,549 [Thread-3] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@36d83365
2012-02-19 12:05:17,551 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2012-02-19 12:05:17,572 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2012-02-19 12:05:17,572 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2012-02-19 12:05:17,591 [Thread-3] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Created input record counter: Input records from _1_haversine.csv
2012-02-19 12:05:17,592 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2012-02-19 12:05:17,593 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask - Finished spill 0
2012-02-19 12:05:17,594 [Thread-3] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
2012-02-19 12:05:20,547 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2012-02-19 12:05:20,548 [Thread-3] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_000001_0' done.
2012-02-19 12:05:20,560 [Thread-3] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2f2e43f1
2012-02-19 12:05:20,560 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2012-02-19 12:05:20,564 [Thread-3] INFO  org.apache.hadoop.mapred.Merger - Merging 2 sorted segments
2012-02-19 12:05:20,568 [Thread-3] INFO  org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 2 segments left of total size: 160 bytes
2012-02-19 12:05:20,568 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2012-02-19 12:05:20,623 [Thread-3] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
2012-02-19 12:05:20,623 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2012-02-19 12:05:20,624 [Thread-3] INFO  org.apache.hadoop.mapred.Task - Task attempt_local_0001_r_000000_0 is allowed to commit now
2012-02-19 12:05:20,625 [Thread-3] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local_0001_r_000000_0' to file:/tmp/temp371866094/tmp-1622554263
2012-02-19 12:05:23,558 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2012-02-19 12:05:23,558 [Thread-3] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_r_000000_0' done.
2012-02-19 12:05:24,730 [main] WARN  org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for job job_local_0001
2012-02-19 12:05:24,732 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-02-19 12:05:24,732 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2012-02-19 12:05:24,734 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
1.0.0    0.9.3-SNAPSHOT    anil    2012-02-19 12:05:14    2012-02-19 12:05:24    CROSS

Success!

Job Stats (time in seconds):
JobId    Alias    Feature    Outputs
job_local_0001    geo,geo2,geoCross,geoDist        file:/tmp/temp371866094/tmp-1622554263,

Input(s):
Successfully read records from: "file:///home/anil/hadoop/pig/anilpig/haversine.csv"
Successfully read records from: "file:///home/anil/hadoop/pig/anilpig/haversine.csv"

Output(s):
Successfully stored records in: "file:/tmp/temp371866094/tmp-1622554263"

Job DAG:
job_local_0001


2012-02-19 12:05:24,736 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2012-02-19 12:05:24,739 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-19 12:05:24,739 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,1,0.0)
(1,2,1.7239093620868347)
(2,1,1.7239093620868347)
(2,2,0.0)
===================

Pig has kicked off a MapReduce job (job_local_0001) in the background.

How much time did this script take?
Let us look at the first log entry and the last one.
-------------------------
2012-02-19 12:05:13,316 [main] INFO  org.apache.pig.Main - Logging error

 2012-02-19 12:05:24,739 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
-------------------------
About 11 seconds end to end (12:05:13,316 to 12:05:24,739).

The run does show some stats:
---------------
HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
1.0.0    0.9.3-SNAPSHOT    anil    2012-02-19 12:05:14    2012-02-19 12:05:24    CROSS
---------------------
About 10 seconds for the MapReduce job itself (StartedAt 12:05:14, FinishedAt 12:05:24).



As you can see, the values (1.7239) match what Alex quotes, so I have been successful in testing the Haversine script from AlexP. The next step is to play with the script further to try out Pig's extended functionality.
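
As a sanity check outside Pig, the same kind of distance can be computed in plain Java. This is only a minimal sketch of the standard haversine formula - the actual coordinates live in haversine.csv (not reproduced here) and the radius argument depends on what unit Alex's script assumes - so treat it as illustration, not a re-derivation of the 1.7239 value.

=============
// Minimal haversine sketch (standard great-circle formula). The sphere
// radius sets the output unit: pass 6371 for kilometers on Earth, or 1
// for unit-sphere distances. The inputs from haversine.csv are not
// reproduced in this post.
public class Haversine {
    static double haversine(double lat1, double lon1,
                            double lat2, double lon2, double radius) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * radius * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Example only: distance between two arbitrary points, in km
        System.out.println(haversine(12.97, 77.59, 13.08, 80.27, 6371));
    }
}
=============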

Additional Details:
CROSS, described in the Pig Latin reference manual, computes the cross product of two or more relations. Here, with two relations of two records each, the cross product yields the four tuples shown above - every point paired with every point, including itself, hence the two 0.0 self-distances.
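
Conceptually, CROSS is just a nested loop that emits every pairing of records from its inputs. A hedged Java analogue of the semantics (illustration only; this is not how Pig implements it):

=============
import java.util.Arrays;
import java.util.List;

public class CrossDemo {
    public static void main(String[] args) {
        // Two toy "relations" of two records each, like geo and geo2 above
        List<Integer> geo = Arrays.asList(1, 2);
        List<Integer> geo2 = Arrays.asList(1, 2);
        // CROSS pairs every record of one with every record of the other:
        // 2 x 2 = 4 output tuples, matching the four results shown earlier
        for (Integer a : geo) {
            for (Integer b : geo2) {
                System.out.println("(" + a + "," + b + ")");
            }
        }
    }
}
=============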

References
http://fierydata.com/2012/05/11/hadoop-fundamentals-an-introduction-to-pig-2/



Wednesday, February 15, 2012

Map Reduce Programming with Hadoop

I understand the Map Reduce paradigm, but it has taken me some time to grasp the finer details of Hadoop Map Reduce programming.

I liked the following blog post on M/R programming with Hadoop:
Thinking Map Reduce with Hadoop. It gives a good description of what M/R programming with Hadoop involves.


Word Count Program

I copied the classic word count program from the Hadoop examples to try out M/R. Nothing fancy about it.

I have two versions of the program - one in the Older Hadoop Programming Style (the org.apache.hadoop.mapred API) and the other in the Newer Hadoop Programming Style (the org.apache.hadoop.mapreduce API). If you are starting fresh, always go with the Newer Hadoop Programming Style.


Older Hadoop Programming Style

================
package mapreduce;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class Processing {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>{
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>{

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
   
    private static void delete(File f) throws IOException {
        // Nothing to delete if the output directory is absent (e.g., first run)
        if (!f.exists())
            return;
        if (f.isDirectory()) {
            for (File c : f.listFiles())
                delete(c);
        }
        if (!f.delete())
            throw new FileNotFoundException("Failed to delete file: " + f);
    }



    public static void main(String[] args) throws IOException {
        if (args.length < 2)
            throw new IllegalArgumentException("Please provide input and output paths");

        JobConf jobConf = new JobConf(Processing.class);
        jobConf.setJobName("TestingHadoop");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        jobConf.setMapperClass(Map.class);
        jobConf.setCombinerClass(Reduce.class);
        jobConf.setReducerClass(Reduce.class);
       
        // Hadoop fails the job if the output directory already exists,
        // so remove it up front (this program runs on the local filesystem)
        File outputDir = new File(args[1]);
        delete(outputDir);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        JobClient.runJob(jobConf);
    }
}
================

Now the program arguments are as follows:
/home/anil/hadoopinput /home/anil/output

Basically, I have a directory in my home directory called hadoopinput. This contains the input files. The program will create an output directory called "output"; the delete() helper clears it first, because Hadoop refuses to start a job whose output directory already exists.

I have 2 files in the hadoopinput directory, namely a.txt and b.txt

a.txt  contains:
Java is a cool language .
Hadoop is written in Java
HDFS is part of Hadoop

b.txt contains:
Are there Hadoop users ?
Do you like Hadoop ?

Now the program trace for running this class is:
==========================================
Feb 15, 2012 9:38:35 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Feb 15, 2012 9:38:35 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Feb 15, 2012 9:38:35 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Feb 15, 2012 9:38:35 PM org.apache.hadoop.mapred.FileInputFormat listStatus
INFO: Total input paths to process : 2
Feb 15, 2012 9:38:35 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Feb 15, 2012 9:38:35 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
INFO: setsid exited with exit code 0
Feb 15, 2012 9:38:35 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3794d372
Feb 15, 2012 9:38:35 PM org.apache.hadoop.mapred.MapTask runOldMapper
INFO: numReduceTasks: 1
Feb 15, 2012 9:38:36 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Feb 15, 2012 9:38:36 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Feb 15, 2012 9:38:36 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Feb 15, 2012 9:38:36 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Feb 15, 2012 9:38:36 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
Feb 15, 2012 9:38:36 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
Feb 15, 2012 9:38:36 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Feb 15, 2012 9:38:38 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: file:/home/anil/hadoopinput/a.txt:0+75
Feb 15, 2012 9:38:38 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
Feb 15, 2012 9:38:38 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@73da669c
Feb 15, 2012 9:38:38 PM org.apache.hadoop.mapred.MapTask runOldMapper
INFO: numReduceTasks: 1
Feb 15, 2012 9:38:38 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Feb 15, 2012 9:38:38 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Feb 15, 2012 9:38:38 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Feb 15, 2012 9:38:38 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Feb 15, 2012 9:38:39 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
Feb 15, 2012 9:38:39 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
Feb 15, 2012 9:38:39 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: file:/home/anil/hadoopinput/b.txt:0+46
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000001_0' done.
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1d329642
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 2 sorted segments
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 2 segments left of total size: 218 bytes
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_r_000000_0 is allowed to commit now
Feb 15, 2012 9:38:41 PM org.apache.hadoop.mapred.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_r_000000_0' to file:/home/anil/output
Feb 15, 2012 9:38:44 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
Feb 15, 2012 9:38:44 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_r_000000_0' done.
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 100%
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO: Counters: 21
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:   File Input Format Counters
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Bytes Read=121
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:   File Output Format Counters
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Bytes Written=137
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:   FileSystemCounters
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_READ=1642
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_WRITTEN=96728
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:   Map-Reduce Framework
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Map output materialized bytes=226
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Map input records=5
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Reduce shuffle bytes=0
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Spilled Records=40
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Map output bytes=225
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Total committed heap usage (bytes)=869203968
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     CPU time spent (ms)=0
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Map input bytes=121
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     SPLIT_RAW_BYTES=172
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Combine input records=26
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Reduce input records=20
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Reduce input groups=19
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Combine output records=20
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Physical memory (bytes) snapshot=0
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Reduce output records=19
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Virtual memory (bytes) snapshot=0
Feb 15, 2012 9:38:45 PM org.apache.hadoop.mapred.Counters log
INFO:     Map output records=26
=========================

When I go to the directory called "output" in my home directory, I see:

~/output$ ls
part-00000  _SUCCESS

Let me peek inside the part-00000 file:
~/output$ vi part-00000

================
.       1
?       2
Are     1
Do      1
HDFS    1
Hadoop  4
Java    2
a       1
cool    1
in      1
is      3
language        1
like    1
of      1
part    1
there   1
users   1
written 1
you     1
===================
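
A side note on that output: "." and "?" are counted as words because StringTokenizer splits only on whitespace, and the input lines happen to have spaces around the punctuation. If you wanted punctuation dropped, one hedged alternative (not what the program above does) is to split on runs of non-word characters instead:

=============
import java.util.ArrayList;
import java.util.List;

public class Tokenize {
    // Hypothetical helper, not part of the programs in this post:
    // "\\W+" splits on runs of non-word characters, so punctuation vanishes.
    static List<String> words(String line) {
        List<String> out = new ArrayList<String>();
        for (String token : line.split("\\W+")) {
            if (!token.isEmpty()) {  // a leading separator yields an empty token
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Java is a cool language ."));
        // prints: [Java, is, a, cool, language]
    }
}
=============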



Newer Hadoop Programming Style

The program is

=============
package mapreduce;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Processing extends Configured implements Tool {

    public static class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable,Text,Text,IntWritable> {

        private final static IntWritable counter = new IntWritable(1);
        private Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, counter);
            }
        }
    }

    /*public static class Combiner extends org.apache.hadoop.mapreduce.Reducer<Text,IntWritable,Text,IntWritable>{

        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            Set<IntWritable> uniques = new HashSet<IntWritable>();
            for (IntWritable value : values) {
                if (uniques.add(value)) {
                    context.write(key, value);
                }
            }
        }
    }*/

    public static class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {

        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

            Iterator<IntWritable> iter = values.iterator();
            int sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
            }

            context.write(key, new IntWritable(sum));
        }
    }


    private void delete(File f) throws IOException {
        // Nothing to delete if the output directory is absent (e.g., first run)
        if (!f.exists())
            return;
        if (f.isDirectory()) {
            for (File c : f.listFiles())
                delete(c);
        }
        if (!f.delete())
            throw new FileNotFoundException("Failed to delete file: " + f);
    }



    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 2)
            throw new IllegalArgumentException("Please provide input and output paths");

        Path inputPath = new Path(args[0]);

        // Clear the output directory first; Hadoop fails if it already exists
        File outputDir = new File(args[1]);
        delete(outputDir);
        Path outputPath = new Path(args[1]);


        Job job = new Job(getConf(), "TestingHadoop");

        job.setJarByClass(Processing.class);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setMapperClass(Mapper.class);
        //job.setCombinerClass(Combiner.class);
        job.setReducerClass(Reducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Declare the reducer's output types explicitly as well
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(false) ? 0 : -1;
    }


    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Processing(), args));
    }
}
=============

Note: I have commented out the Combiner class because we do not use a combiner in this example (see the note just below on what a valid combiner would look like here).
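
If you did want a combiner, the commented-out Combiner above would not be correct for word count: de-duplicating the 1s with a HashSet undercounts repeated words. The summing Reducer itself, however, is a valid combiner, because addition is associative and commutative and a combiner may run zero or more times on partial map output. A minimal sketch of the one-line change in run():

=============
// Hedged sketch: reuse the summing Reducer (Processing.Reducer above) as
// the combiner. Safe for word count because partial sums can themselves
// be summed; a combiner may be applied zero or more times.
job.setCombinerClass(Reducer.class);
=============

Incidentally, you can see a combiner's effect in the two traces: the old-style run (which set setCombinerClass(Reduce.class)) merged reduce-side segments totaling 218 bytes, while this run, with no combiner, merged 281 bytes for the same input.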

The program arguments are the same as shown in the "old hadoop programming style".

Log of the running program:
===================
Feb 15, 2012 10:42:25 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 2
Feb 15, 2012 10:42:25 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
INFO: setsid exited with exit code 0
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5b976011
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
Feb 15, 2012 10:42:25 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@388530b8
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
Feb 15, 2012 10:42:28 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000001_0' done.
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@39e4853f
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 2 sorted segments
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 2 segments left of total size: 281 bytes
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_r_000000_0 is allowed to commit now
Feb 15, 2012 10:42:31 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_r_000000_0' to /home/anil/output
Feb 15, 2012 10:42:34 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
Feb 15, 2012 10:42:34 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_r_000000_0' done.
===================


Now let us look at the output directory. Note that the reducer output file is now named part-r-00000 - the new API tags reducer output files with an "r".

~/output$ ls
part-r-00000  _SUCCESS

~/output$ vi part-r-00000


=================
.       1
?       2
Are     1
Do      1
HDFS    1
Hadoop  4
Java    2
a       1
cool    1
in      1
is      3
language        1
like    1
of      1
part    1
there   1
users   1
written 1
you     1
=================


Magical, isn't it?