Now that I have the sorted local parts of the email addresses, I want to know which emails map to each of those sorted local parts. For example, I know that bbgist has more than one CL associated with it, so now I want to find out that tgibbs and tgbbis both sort to bbgist. I only want the emails whose sorted local part has more than one CL, so I don't want tgobbs because (let's pretend) it is only associated with one CL.
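To make the idea concrete outside of Pig, here is a minimal Python sketch. It is only an illustration: sort_chars is a hypothetical stand-in for the SortChars UDF, and it assumes the UDF simply sorts the characters of the local part.

```python
def sort_chars(local):
    """Stand-in for the SortChars UDF (assumed to sort the characters)."""
    return "".join(sorted(local))


# Two of these local parts collapse to the same sorted key, one does not.
for local in ["tgibbs", "tgbbis", "tgobbs"]:
    print(local, "->", sort_chars(local))
```

Here tgibbs and tgbbis both come out as bbgist, while tgobbs gets its own key, which is why it would drop out if that key has only one CL.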
Here is the Pig script to do that:
A = LOAD 'myFile' USING PigStorage(',') AS (seqid, cl, local, domain, fn, ln, email);
B = LOAD 'mySortedLocalFile' USING PigStorage() AS (sorted_local);
C = FOREACH A GENERATE local;
D = DISTINCT C PARALLEL 10;
E = FOREACH D GENERATE local, FLATTEN(SortChars(local)) AS sorted_local;
F = JOIN E BY sorted_local, B BY sorted_local PARALLEL 10;
G = FOREACH F GENERATE local;
STORE G INTO '/user/tgibbs/eproducts-local-merged-out' USING PigStorage();
On line 1 (the REGISTER statement, which is not shown in the listing above), once again, I register the jar file that contains my SortChars function.
Line 2 loads the original email file.
Line 3 loads the file of sorted local parts that I created last blog entry.
Line 4 reduces the original email file down to just the local part.
Line 5 (the D line) gets rid of any duplicate local parts. I don't care that tgibbs appears 10 times, and dropping the duplicates reduces the amount of data I have to deal with later.
Line 6 (the E line) generates the sorted version of the local part and keeps the original local part as well.
Line 7 (F) joins the two files by the sorted local part. Its schema looks something like (E::local, E::sorted_local, B::sorted_local).
As an example, it would have (tgibbs, bbgist, bbgist) and (tgbbis, bbgist, bbgist).
Line 8 (G) keeps just the local part, which is what I want, and the last line stores it.
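The whole flow can be modeled in plain Python as a sanity check. This is a sketch under assumptions, not the Pig job itself: sort_chars stands in for the SortChars UDF (assumed to sort the characters), the sample rows are made up, and the sorted-local file is assumed to list each sorted local part only once, so a set lookup behaves like the join.

```python
def sort_chars(local):
    """Stand-in for the SortChars UDF (assumed to sort the characters)."""
    return "".join(sorted(local))


def merged_locals(email_rows, sorted_locals):
    """Model of relations C through G in the Pig script."""
    # C and D: keep only the local part and drop duplicates.
    distinct_locals = {row["local"] for row in email_rows}
    # E: pair each local part with its sorted form.
    pairs = [(local, sort_chars(local)) for local in distinct_locals]
    # F and G: inner join on the sorted form, then keep only the local part.
    wanted = set(sorted_locals)
    return sorted(local for local, key in pairs if key in wanted)


# Made-up sample data: tgibbs appears twice, tgobbs has no matching key.
rows = [
    {"cl": "cl1", "local": "tgibbs"},
    {"cl": "cl2", "local": "tgbbis"},
    {"cl": "cl1", "local": "tgibbs"},
    {"cl": "cl3", "local": "tgobbs"},
]
result = merged_locals(rows, ["bbgist"])
print(result)
```

With this sample, only tgbbis and tgibbs survive the join, each exactly once, and tgobbs is dropped because bbgost is not in the sorted-local list.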
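Next up, I plan to try out the ILLUSTRATE command, which should show a step-by-step run of the script on a small sample of the data (EXPLAIN is the command that displays the execution plan).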