I rewrote my original Pig script to use a nested FOREACH block.
It now looks like:
A = LOAD 'myInputFile' USING PigStorage(',') AS (seqid, cl, local, domain, fn, ln, zip, email);
B = FOREACH A GENERATE FLATTEN(SortChars(local)) AS sorted_local, cl;
C = GROUP B BY sorted_local PARALLEL 10;
D = FOREACH C {
    CLS = B.cl;
    DCLS = DISTINCT CLS;
    GENERATE group, COUNT(DCLS) AS clCount;
}
E = FILTER D BY clCount > 1;
F = FOREACH E GENERATE group;
STORE F INTO 'myDir' USING PigStorage();
A, B, and C are the same as before. The difference is that the DISTINCT now sits inside the FOREACH block for D. This lets Pig use a DistinctBag to dedup the cl values within each sorted_local group. Before, the DISTINCT needed its own map-reduce step; now the whole thing runs in a single map-reduce job. This effectively halves the time the script takes.
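To make the per-group logic concrete, here is a rough Python sketch of what D, E, and F compute for each sorted_local group. It is only an illustration, not how Pig implements it: the set stands in for the DistinctBag, sort_chars is a hypothetical stand-in for the SortChars UDF (I am assuming it returns the characters of its input in sorted order), and the sample rows are made up.

```python
from collections import defaultdict


def sort_chars(s):
    # Hypothetical stand-in for the SortChars UDF.
    # Assumption: it returns the characters of s in sorted order.
    return "".join(sorted(s))


def groups_with_multiple_cls(rows):
    """Take (local, cl) pairs and return the sorted_local keys
    that have more than one distinct cl value."""
    # One set per group; the set plays the role of DISTINCT
    # inside the nested FOREACH block.
    distinct_cls = defaultdict(set)
    for local, cl in rows:
        distinct_cls[sort_chars(local)].add(cl)
    # Equivalent of FILTER D BY clCount > 1, keeping only the group key.
    return sorted(key for key, cls in distinct_cls.items() if len(cls) > 1)


# Made-up sample data: "abc" and "cab" share a sorted key and have two cls.
rows = [("abc", "x"), ("cab", "y"), ("abc", "x"), ("dog", "z"), ("god", "z")]
result = groups_with_multiple_cls(rows)
print(result)
```

The point is that the dedup and the count both happen while the group is already in hand, so no second pass over the data is needed.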
For more on how things get executed, you can use the explain command. For instance, explain F; will show the plan Pig uses to execute F, including the map-reduce jobs it generates. More information can also be found on the Pig wiki.
Wobble
1 day ago