Thursday, May 22, 2008

Pig redux

I rewrote my original Pig script to use a nested FOREACH block.

It now looks like:

A = LOAD 'myInputFile' USING PigStorage(',') AS (seqid, cl, local, domain, fn, ln, zip, email);
B = FOREACH A GENERATE FLATTEN(SortChars(local)) AS sorted_local, cl;
C = GROUP B BY sorted_local PARALLEL 10;
D = FOREACH C {
    CLS = B.cl;
    DCLS = DISTINCT CLS;
    GENERATE group, COUNT(DCLS) AS clCount;
}
E = FILTER D BY clCount > 1;
F = FOREACH E GENERATE group;
STORE F INTO 'myDir' USING PigStorage();

A, B, and C are the same as before. The change is that the DISTINCT now sits inside the FOREACH block for D. This lets Pig use a DistinctBag to dedup the cl values within each sorted_local group. Before, the DISTINCT needed a separate map-reduce step; now the whole script runs as a single map-reduce job. This effectively halves the time the script takes.
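
To make the per-group logic concrete, here is a rough model of what the script computes, written in plain Python rather than Pig. It is only a sketch: sort_chars is a hypothetical stand-in for the SortChars UDF (I am assuming it returns the characters of its input in sorted order), the sample rows are made up, and a Python set plays the role of the per-group distinct bag.

```python
from collections import defaultdict


def sort_chars(s):
    # Hypothetical stand-in for the SortChars UDF: assumed to return
    # the characters of s in sorted order.
    return "".join(sorted(s))


def shared_sorted_locals(rows):
    """rows is an iterable of (local, cl) pairs.

    Returns the sorted_local keys that occur with more than one
    distinct cl value, mirroring relations B through F.
    """
    # B + C + DCLS: group by sorted_local, keeping a set of distinct cl per group.
    distinct_cls = defaultdict(set)
    for local, cl in rows:
        distinct_cls[sort_chars(local)].add(cl)
    # D + E + F: count the distinct cl values and keep keys with a count above 1.
    return sorted(key for key, cls in distinct_cls.items() if len(cls) > 1)


# Made-up sample data, for illustration only.
rows = [("abc", "x"), ("cba", "y"), ("abc", "x"), ("dog", "x"), ("god", "x")]
print(shared_sorted_locals(rows))
```

The dedup happens while each group is being built, so no second pass over the data is needed. That is the same idea as doing the DISTINCT inside the FOREACH block.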

For more on how things get executed, you can use the explain command. For instance, explain F; will show the plan Pig uses to compute F. More information can also be found on the Pig wiki.
