I don't believe there is a decent description of how to create a min-hash signature for bigrams on the web, so I'm going to try and provide one. Of course, my description will probably be flawed, but I hope that it will be better than what is out there currently.
First, what is a min-hash signature?
The idea is that, given two records, you can provide signatures such that the similarity of the signatures is approximately equal to the similarity of the records.
Here is a slide show based around using min-hash signatures to provide keys for indexing into a database. Here is the relevant paper.
So, basically it is a way of generating a fuzzy key such that if the two keys match there is a high probability that the two records will match.
We're going to examine a way of doing this using bigrams.
Let's assume we have the local part of an email address. In our base file, we have the following entries:
tgibbs
tanton_gibbs
bbarker
tkieth
bspears
If we look at the bigrams generated for each local part, we have the following:
tgibbs = { tg, gi, ib, bb, bs }
tanton_gibbs = {ta, an, nt, to, on, n_, _g, gi, ib, bb, bs}
bbarker = {bb, ba, ar, rk, ke, er}
tkieth = {tk, ki, ie, et, th}
bspears = {bs, sp, pe, ea, ar, rs}
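As a minimal sketch of this step in Python (the function name bigrams is just an illustrative choice), the bigrams of a string are its adjacent character pairs, collected into a set so that repeats count once:

```python
def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


base_records = ["tgibbs", "tanton_gibbs", "bbarker", "tkieth", "bspears"]
for record in base_records:
    # sorted() is only there to make the printed order stable
    print(record, "=", sorted(bigrams(record)))
```

Using a set means a bigram that occurs twice in one local part is only recorded once.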
If we order all the bigrams alphabetically (with the underscore sorting before the letters, as it does in ASCII), then we have the following universe of bigrams: {_g, an, ar, ba, bb, ... to}
Now, a min hash signature requires an input f that tells how many hash functions to use. Let's set f to 3. That means we'll hash each record in 3 different ways.
To generate a hash function, we randomly permute the bigram universe.
So, our first permutation may look like:
{tk, ib, an, pe, ... rs}
our second permutation may look like:
{rk, gi, bs, th, ... pe}
and our third permutation may look like:
{bb, ba, er, ke, ... an}
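Here is a rough Python sketch of building the universe and permuting it f times. The helper name make_permutations and the fixed seed are assumptions made for the example; the seed is only there so that the same permutations can be reproduced later when hashing other records.

```python
import random


def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def make_permutations(universe, f, seed=0):
    """Return f random orderings of the bigram universe."""
    rng = random.Random(seed)   # assumption: a fixed seed for repeatability
    ordered = sorted(universe)  # start from a stable order before shuffling
    permutations = []
    for _ in range(f):
        perm = ordered[:]
        rng.shuffle(perm)
        permutations.append(perm)
    return permutations


base_records = ["tgibbs", "tanton_gibbs", "bbarker", "tkieth", "bspears"]
universe = set().union(*(bigrams(r) for r in base_records))
permutations = make_permutations(universe, f=3)
for perm in permutations:
    print(perm[:4], "...")
```

Each permutation acts as one hash function: a bigram's position in the permutation is its hash value.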
Now, the next thing we need is a similarity threshold, t. Let's assume t is 0.8.
So, for each record, we will use 80% of the bigrams to produce the hash.
So, if our input record is tgbbis, then we have the following bigrams to choose from:
{tg, gb, bb, bi, is}
We would choose 3 or 4 bigrams (probably both) to produce the hash. But, which 3 or 4 do we choose?
For each permutation of our bigram universe that we produced above, we will pick the bigrams that appear the earliest. So, for permutation 1, we might end up choosing {gb, bi, is}. For permutation 2, we might end up with {bi, tg, bb}. For permutation 3, we might have {bb, is, gb}. We would do the same thing for combinations of 4 bigrams.
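The selection step might be sketched like this in Python. The permutation below is a made-up, shortened ordering chosen so that the result matches the permutation 1 example; the real permutations are elided above, so treat it purely as an illustration. The sketch also assumes every bigram of the record appears somewhere in the permutation, and it takes the number of bigrams k directly rather than deriving it from t.

```python
def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def earliest_bigrams(record_bigrams, permutation, k):
    """Return the k bigrams of the record that come first in the permutation."""
    # Position of each bigram in the permutation acts as its hash value.
    rank = {bg: i for i, bg in enumerate(permutation)}
    # Assumption: every bigram of the record is present in the permutation.
    in_order = sorted(record_bigrams, key=lambda bg: rank[bg])
    return in_order[:k]


# Hypothetical, shortened permutation used only for this illustration.
toy_permutation = ["tk", "gb", "ib", "bi", "an", "is", "pe", "tg", "bb"]

chosen = earliest_bigrams(bigrams("tgbbis"), toy_permutation, k=3)
print(chosen)
```

Calling the function again with k=4 gives the 4-bigram selection for the same permutation.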
Now, we can create a string from each set of chosen bigrams by concatenating them in the order they were picked: {gbbiis, bitgbb, bbisgb}. These strings become our prospecting keys. We perform the same key generation routine on the base records, using the same permutations. Then we can join our input keys to our base record keys to find our match candidates.
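To show how the pieces might fit together, here is a small end-to-end sketch in Python. Everything in it is an assumption made for illustration: the function names, the fixed seed, k=3, the use of a dictionary as the "join", and in particular the choice to build the bigram universe from both the base records and the input record so that every bigram has a position in every permutation.

```python
import random
from collections import defaultdict


def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def prospecting_keys(record, permutations, k):
    """One key per permutation: the k earliest bigrams, concatenated."""
    keys = set()
    for perm in permutations:
        rank = {bg: i for i, bg in enumerate(perm)}
        earliest = sorted(bigrams(record), key=lambda bg: rank[bg])[:k]
        keys.add("".join(earliest))
    return keys


base_records = ["tgibbs", "tanton_gibbs", "bbarker", "tkieth", "bspears"]
input_record = "tgbbis"

# Assumption: the universe covers the bigrams of base and input records alike.
all_records = base_records + [input_record]
universe = sorted(set().union(*(bigrams(r) for r in all_records)))

rng = random.Random(42)  # fixed seed so both sides share the same permutations
permutations = [rng.sample(universe, len(universe)) for _ in range(3)]

# Index the base records by key, then probe with the input record's keys.
index = defaultdict(set)
for record in base_records:
    for key in prospecting_keys(record, permutations, k=3):
        index[key].add(record)

candidates = set()
for key in prospecting_keys(input_record, permutations, k=3):
    candidates |= index.get(key, set())
print(candidates)
```

Whether tgbbis actually finds tgibbs depends on the permutations drawn; in general, more permutations give more chances for a key to match, at the cost of more candidates to check.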
If we want a more approximate min-hash signature, we could sort the bigrams before creating the key so that we would have the keys {bigbis, bbbitg, bbgbis}. Any duplicate keys that result from the sorting could be thrown away. This has the effect of handling more transpositions in the input at the cost of bringing back more candidates.
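A short Python sketch of both key styles, using the three example selections from above (the make_key helper and its sort_bigrams flag are names invented for this example):

```python
def make_key(selected, sort_bigrams=False):
    """Concatenate the selected bigrams, optionally sorting them first."""
    parts = sorted(selected) if sort_bigrams else list(selected)
    return "".join(parts)


selections = [["gb", "bi", "is"], ["bi", "tg", "bb"], ["bb", "is", "gb"]]

ordered_keys = [make_key(s) for s in selections]
# A set drops any duplicate keys that sorting produces.
sorted_keys = sorted({make_key(s, sort_bigrams=True) for s in selections})

print(ordered_keys)  # ['gbbiis', 'bitgbb', 'bbisgb']
print(sorted_keys)   # ['bbbitg', 'bbgbis', 'bigbis']
```

With sorting turned on, two selections that contain the same bigrams in a different order produce the same key, which is where the extra candidates come from.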
Hopefully this illuminates min-hash signatures a bit more so that the references above make sense.
2 comments:
Thanks for the very nice explanation. Yes, this is the only webpage where I found min hash <-> bigrams.
"These strings become our prospecting keys. We perform the same key generation routine on the base records. Then we can join our input keys to our base record keys to find our match candidates."
I am a little confused by this statement. What are the base records: the local part of the e-mail, the full e-mail in the DB, or something else? Similarly, what are the input keys?
Any comments to the last question?