|
|
-
MinHash implementation thoughts
sam wu 2012-06-16, 00:38
I am playing with MinHashMapper, and have the following 2 penny thoughts
1. Efficiency
in the map(..), around line 85 ---------- for (int i = 0; i < numHashFunctions; i++) { for (Vector.Element ele : featureVector) { ----* int value = (int) ele.get(); ----** bytesToHash[0] = (byte) (value >> 24); ..... } } } ----------- the featureVector (which can be random sparse vector) iterator ---* goes through all elements (including nonzero). When I test ASF email program (from Grant's ...). Each email has ~30000 elements. By using iteratorNonZero(), I reduce from 30000 to around 50 operations (both for converting from int to byte[], and hash computation) for some email.???
2. which field key to hash we read from sequence file, so we have key and value. If we read from tfidf doc sequenceFile, then key will be termId, value will be tfidf value. What to hash for MinHash ? I think we need to hash on termId, instead of tfidf (meaningless for doc MinHash ?) so we need to get value = ele.index() ---**
When I test the current code, I get almost all value = 0
For general purpose, sometime we might need to hash on value, so we might need an extra parameter , hashOnKey ?
3. How to compute custerID ?
I thought LSH normally has r(rows in a band) and b (band). numberHashFunction = r *b
#clusterId = b
so we can easily compute probability 1- exp( (1-exp(s,r)),b)
for (int i = 0; i < b; i++) { for (int j = 0; j < r; j++) { clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); } .... }
the current implementation works in some way, but ?? This is Friday late afternoon,
Hope I am still making sense. BR
Sam
-
Re: MinHash implementation thoughts
chsz wu 2012-06-18, 15:10
Sorry, forget about my previous comment about 2. ----- For general purpose, sometime we might need to hash on value, so we might need an extra parameter , hashOnKey ?
------ This is not correct, the hash key selection is based on mappers input values. if the mapper input value is vector, then we need to select by vector key or value (using value() or index()), not by mapper key or mapper value.
the extra parameter might look like hashKeyOnMapperKey (or hashkeyOnMapperValue)
Sam ________________________________ From: sam wu <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Friday, June 15, 2012 5:38 PM Subject: MinHash implementation thoughts I am playing with MinHashMapper, and have the following 2 penny thoughts
1. Efficiency
in the map(..), around line 85 ---------- for (int i = 0; i < numHashFunctions; i++) { for (Vector.Element ele : featureVector) { ----* int value = (int) ele.get(); ----** bytesToHash[0] = (byte) (value >> 24); ..... } } } ----------- the featureVector (which can be random sparse vector) iterator ---* goes through all elements (including nonzero). When I test ASF email program (from Grant's ...). Each email has ~30000 elements. By using iteratorNonZero(), I reduce from 30000 to around 50 operations (both for converting from int to byte[], and hash computation) for some email.???
2. which field key to hash we read from sequence file, so we have key and value. If we read from tfidf doc sequenceFile, then key will be termId, value will be tfidf value. What to hash for MinHash ? I think we need to hash on termId, instead of tfidf (meaningless for doc MinHash ?) so we need to get value = ele.index() ---**
When I test the current code, I get almost all value = 0
For general purpose, sometime we might need to hash on value, so we might need an extra parameter , hashOnKey ?
3. How to compute custerID ?
I thought LSH normally has r(rows in a band) and b (band). numberHashFunction = r *b
#clusterId = b
so we can easily compute probability 1- exp( (1-exp(s,r)),b)
for (int i = 0; i < b; i++) { for (int j = 0; j < r; j++) { clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); } .... }
the current implementation works in some way, but ?? This is Friday late afternoon,
Hope I am still making sense. BR
Sam
-
Re: MinHash implementation thoughts
Lance Norskog 2012-06-19, 09:05
Some Mahout people are not sure that the MinHash implementation is right. You should try your versions of how the algorithms should work.
On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[EMAIL PROTECTED]> wrote: > Sorry, forget about my previous comment about 2. > ----- > For general purpose, sometime we might need to hash on value, so we might > need an extra parameter , hashOnKey ? > > ------ > This is not correct, the hash key selection is based on mappers input values. > if the mapper input value is vector, then we need to select by vector key or value (using value() or index()), not > by mapper key or mapper value. > > the extra parameter might look like hashKeyOnMapperKey (or hashkeyOnMapperValue) > > > > Sam > > > ________________________________ > From: sam wu <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Friday, June 15, 2012 5:38 PM > Subject: MinHash implementation thoughts > > I am playing with MinHashMapper, and have the following 2 penny thoughts > > 1. Efficiency > > in the map(..), around line 85 > ---------- > for (int i = 0; i < numHashFunctions; i++) { > for (Vector.Element ele : featureVector) { ----* > int value = (int) ele.get(); ----** > bytesToHash[0] = (byte) (value >> 24); > ..... > } > } > } > ----------- > the featureVector (which can be random sparse vector) iterator ---* goes > through all elements (including nonzero). > When I test ASF email program (from Grant's ...). Each email has ~30000 > elements. > By using iteratorNonZero(), > I reduce from 30000 to around 50 operations (both for converting from int > to byte[], and hash computation) for some email.??? > > 2. which field key to hash > we read from sequence file, so we have key and value. > If we read from tfidf doc sequenceFile, then > key will be termId, value will be tfidf value. > What to hash for MinHash ? > I think we need to hash on termId, instead of tfidf (meaningless for doc > MinHash ?) > so we need to get > value = ele.index() ---** > > When I test the current code, I get almost all value = 0 > > For general purpose, sometime we might need to hash on value, so we might > need an extra parameter , hashOnKey ? > > 3. How to compute custerID ? > > I thought LSH normally has r(rows in a band) and b (band). > numberHashFunction = r *b > > #clusterId = b > > so we can easily compute probability 1- exp( (1-exp(s,r)),b) > > for (int i = 0; i < b; i++) { > for (int j = 0; j < r; j++) { > clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); > } > .... > } > > the current implementation works in some way, but ?? > > > This is Friday late afternoon, > > Hope I am still making sense. > > > BR > > Sam
-- Lance Norskog [EMAIL PROTECTED]
-
Re: MinHash implementation thoughts
Sean Owen 2012-06-19, 09:20
Lance, what are you referring to here? Sam if you have some patch to suggest that would be good, open a JIRA issue.
On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > Some Mahout people are not sure that the MinHash implementation is > right. You should try your versions of how the algorithms should work. > > On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[EMAIL PROTECTED]> wrote: >> Sorry, forget about my previous comment about 2. >> ----- >> For general purpose, sometime we might need to hash on value, so we might >> need an extra parameter , hashOnKey ? >> >> ------ >> This is not correct, the hash key selection is based on mappers input values. >> if the mapper input value is vector, then we need to select by vector key or value (using value() or index()), not >> by mapper key or mapper value. >> >> the extra parameter might look like hashKeyOnMapperKey (or hashkeyOnMapperValue) >> >> >> >> Sam >> >> >> ________________________________ >> From: sam wu <[EMAIL PROTECTED]> >> To: [EMAIL PROTECTED] >> Sent: Friday, June 15, 2012 5:38 PM >> Subject: MinHash implementation thoughts >> >> I am playing with MinHashMapper, and have the following 2 penny thoughts >> >> 1. Efficiency >> >> in the map(..), around line 85 >> ---------- >> for (int i = 0; i < numHashFunctions; i++) { >> for (Vector.Element ele : featureVector) { ----* >> int value = (int) ele.get(); ----** >> bytesToHash[0] = (byte) (value >> 24); >> ..... >> } >> } >> } >> ----------- >> the featureVector (which can be random sparse vector) iterator ---* goes >> through all elements (including nonzero). >> When I test ASF email program (from Grant's ...). Each email has ~30000 >> elements. >> By using iteratorNonZero(), >> I reduce from 30000 to around 50 operations (both for converting from int >> to byte[], and hash computation) for some email.??? >> >> 2. which field key to hash >> we read from sequence file, so we have key and value. >> If we read from tfidf doc sequenceFile, then >> key will be termId, value will be tfidf value. >> What to hash for MinHash ? >> I think we need to hash on termId, instead of tfidf (meaningless for doc >> MinHash ?) >> so we need to get >> value = ele.index() ---** >> >> When I test the current code, I get almost all value = 0 >> >> For general purpose, sometime we might need to hash on value, so we might >> need an extra parameter , hashOnKey ? >> >> 3. How to compute custerID ? >> >> I thought LSH normally has r(rows in a band) and b (band). >> numberHashFunction = r *b >> >> #clusterId = b >> >> so we can easily compute probability 1- exp( (1-exp(s,r)),b) >> >> for (int i = 0; i < b; i++) { >> for (int j = 0; j < r; j++) { >> clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); >> } >> .... >> } >> >> the current implementation works in some way, but ?? >> >> >> This is Friday late afternoon, >> >> Hope I am still making sense. >> >> >> BR >> >> Sam > > > > -- > Lance Norskog > [EMAIL PROTECTED]
-
Re: MinHash implementation thoughts
chsz wu 2012-06-19, 15:54
I think for minhash to be generic and efficient, 3 things needed to be done at least, 1. Prepare term-doc matrix to minimize hash computation (one more MR job). Quite often we have doc-term data, for each term we have to averagely hash average_doc_freq times, With term-doc matrix, we reduce per term hash to one. 2. Where is the hash key and docId With Hadoop, mapper has input key and value. We need to map out Id and hash_key. To be general, we need to pass in a hashKeyMap function to extract id and hashkey. the combined clusterId(**_**_...) can be rehashed one more time to reduce size 3. Post processing, sort by similarity probability (numbers of hash_match), kind of sorting common friends by the number of friends in common. Another MR job.
I'll try to look at how to make a patch, and do my best.
Sam
________________________________ From: Sean Owen <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, June 19, 2012 2:20 AM Subject: Re: MinHash implementation thoughts Lance, what are you referring to here? Sam if you have some patch to suggest that would be good, open a JIRA issue.
On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > Some Mahout people are not sure that the MinHash implementation is > right. You should try your versions of how the algorithms should work. > > On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[EMAIL PROTECTED]> wrote: >> Sorry, forget about my previous comment about 2. >> ----- >> For general purpose, sometime we might need to hash on value, so we might >> need an extra parameter , hashOnKey ? >> >> ------ >> This is not correct, the hash key selection is based on mappers input values. >> if the mapper input value is vector, then we need to select by vector key or value (using value() or index()), not >> by mapper key or mapper value. >> >> the extra parameter might look like hashKeyOnMapperKey (or hashkeyOnMapperValue) >> >> >> >> Sam >> >> >> ________________________________ >> From: sam wu <[EMAIL PROTECTED]> >> To: [EMAIL PROTECTED] >> Sent: Friday, June 15, 2012 5:38 PM >> Subject: MinHash implementation thoughts >> >> I am playing with MinHashMapper, and have the following 2 penny thoughts >> >> 1. Efficiency >> >> in the map(..), around line 85 >> ---------- >> for (int i = 0; i < numHashFunctions; i++) { >> for (Vector.Element ele : featureVector) { ----* >> int value = (int) ele.get(); ----** >> bytesToHash[0] = (byte) (value >> 24); >> ..... >> } >> } >> } >> ----------- >> the featureVector (which can be random sparse vector) iterator ---* goes >> through all elements (including nonzero). >> When I test ASF email program (from Grant's ...). Each email has ~30000 >> elements. >> By using iteratorNonZero(), >> I reduce from 30000 to around 50 operations (both for converting from int >> to byte[], and hash computation) for some email.??? >> >> 2. which field key to hash >> we read from sequence file, so we have key and value. >> If we read from tfidf doc sequenceFile, then >> key will be termId, value will be tfidf value. >> What to hash for MinHash ? >> I think we need to hash on termId, instead of tfidf (meaningless for doc >> MinHash ?) >> so we need to get >> value = ele.index() ---** >> >> When I test the current code, I get almost all value = 0 >> >> For general purpose, sometime we might need to hash on value, so we might >> need an extra parameter , hashOnKey ? >> >> 3. How to compute custerID ? >> >> I thought LSH normally has r(rows in a band) and b (band). >> numberHashFunction = r *b >> >> #clusterId = b >> >> so we can easily compute probability 1- exp( (1-exp(s,r)),b) >> >> for (int i = 0; i < b; i++) { >> for (int j = 0; j < r; j++) { >> clusterIdBuilder.append(minHashValues[i * b + j ]).append('-'); >> } >> .... >> } >> >> the current implementation works in some way, but ?? >> >> >> This is Friday late afternoon,
-
Re: MinHash implementation thoughts
Lance Norskog 2012-06-20, 03:18
Sean- http://www.lucidimagination.com/search/document/1d24eb86fcf3b8ab#b90e74b43cfb4c42http://www.lucidimagination.com/search/document/ffc7e312b87088dcOn Tue, Jun 19, 2012 at 8:54 AM, chsz wu <[EMAIL PROTECTED]> wrote: > I think for minhash to be generic and efficient, 3 things needed to be done at least, > 1. Prepare term-doc matrix to minimize hash computation (one more MR job). > Quite often we have doc-term data, for each term we have to averagely hash average_doc_freq times, > With term-doc matrix, we reduce per term hash to one. > 2. Where is the hash key and docId > With Hadoop, mapper has input key and value. We need to map out Id and hash_key. > To be general, we need to pass in a hashKeyMap function to extract id and hashkey. > the combined clusterId(**_**_...) can be rehashed one more time to reduce size > 3. Post processing, sort by similarity probability (numbers of hash_match), kind of sorting common friends by the number of friends in common. > Another MR job. > > I'll try to look at how to make a patch, and do my best. > > Sam > > > > > > ________________________________ > From: Sean Owen <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Tuesday, June 19, 2012 2:20 AM > Subject: Re: MinHash implementation thoughts > > Lance, what are you referring to here? > Sam if you have some patch to suggest that would be good, open a JIRA issue. > > On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: >> Some Mahout people are not sure that the MinHash implementation is >> right. You should try your versions of how the algorithms should work. >> >> On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[EMAIL PROTECTED]> wrote: >>> Sorry, forget about my previous comment about 2. >>> ----- >>> For general purpose, sometime we might need to hash on value, so we might >>> need an extra parameter , hashOnKey ? >>> >>> ------ >>> This is not correct, the hash key selection is based on mappers input values. >>> if the mapper input value is vector, then we need to select by vector key or value (using value() or index()), not >>> by mapper key or mapper value. >>> >>> the extra parameter might look like hashKeyOnMapperKey (or hashkeyOnMapperValue) >>> >>> >>> >>> Sam >>> >>> >>> ________________________________ >>> From: sam wu <[EMAIL PROTECTED]> >>> To: [EMAIL PROTECTED] >>> Sent: Friday, June 15, 2012 5:38 PM >>> Subject: MinHash implementation thoughts >>> >>> I am playing with MinHashMapper, and have the following 2 penny thoughts >>> >>> 1. Efficiency >>> >>> in the map(..), around line 85 >>> ---------- >>> for (int i = 0; i < numHashFunctions; i++) { >>> for (Vector.Element ele : featureVector) { ----* >>> int value = (int) ele.get(); ----** >>> bytesToHash[0] = (byte) (value >> 24); >>> ..... >>> } >>> } >>> } >>> ----------- >>> the featureVector (which can be random sparse vector) iterator ---* goes >>> through all elements (including nonzero). >>> When I test ASF email program (from Grant's ...). Each email has ~30000 >>> elements. >>> By using iteratorNonZero(), >>> I reduce from 30000 to around 50 operations (both for converting from int >>> to byte[], and hash computation) for some email.??? >>> >>> 2. which field key to hash >>> we read from sequence file, so we have key and value. >>> If we read from tfidf doc sequenceFile, then >>> key will be termId, value will be tfidf value. >>> What to hash for MinHash ? >>> I think we need to hash on termId, instead of tfidf (meaningless for doc >>> MinHash ?) >>> so we need to get >>> value = ele.index() ---** >>> >>> When I test the current code, I get almost all value = 0 >>> >>> For general purpose, sometime we might need to hash on value, so we might >>> need an extra parameter , hashOnKey ? >>> >>> 3. How to compute custerID ? >>> >>> I thought LSH normally has r(rows in a band) and b (band). >>> numberHashFunction = r *b >>> >>> #clusterId = b Lance Norskog [EMAIL PROTECTED]
-
Re: MinHash implementation thoughts
chsz wu 2012-06-20, 04:59
Not surprised with the non-promising result. As I said in my first email, if the mapper input sequence file is key (docId), value (vector of term-tfIdf). then the hash key should be ~~ value->iteratorNonZero()->getNext()->getIndex(), getIndex() return term(Id) , right hash key getValue() return tfIdf , not meaningful for hash. Sam ________________________________ From: Lance Norskog <[EMAIL PROTECTED]> To: [EMAIL PROTECTED]; chsz wu <[EMAIL PROTECTED]> Sent: Tuesday, June 19, 2012 8:18 PM Subject: Re: MinHash implementation thoughts Sean- http://www.lucidimagination.com/search/document/1d24eb86fcf3b8ab#b90e74b43cfb4c42http://www.lucidimagination.com/search/document/ffc7e312b87088dcOn Tue, Jun 19, 2012 at 8:54 AM, chsz wu <[EMAIL PROTECTED]> wrote: > I think for minhash to be generic and efficient, 3 things needed to be done at least, > 1. Prepare term-doc matrix to minimize hash computation (one more MR job). > Quite often we have doc-term data, for each term we have to averagely hash average_doc_freq times, > With term-doc matrix, we reduce per term hash to one. > 2. Where is the hash key and docId > With Hadoop, mapper has input key and value. We need to map out Id and hash_key. > To be general, we need to pass in a hashKeyMap function to extract id and hashkey. > the combined clusterId(**_**_...) can be rehashed one more time to reduce size > 3. Post processing, sort by similarity probability (numbers of hash_match), kind of sorting common friends by the number of friends in common. > Another MR job. > > I'll try to look at how to make a patch, and do my best. > > Sam > > > > > > ________________________________ > From: Sean Owen <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Tuesday, June 19, 2012 2:20 AM > Subject: Re: MinHash implementation thoughts > > Lance, what are you referring to here? > Sam if you have some patch to suggest that would be good, open a JIRA issue. > > On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: >> Some Mahout people are not sure that the MinHash implementation is >> right. You should try your versions of how the algorithms should work. >> >> On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[EMAIL PROTECTED]> wrote: >>> Sorry, forget about my previous comment about 2. >>> ----- >>> For general purpose, sometime we might need to hash on value, so we might >>> need an extra parameter , hashOnKey ? >>> >>> ------ >>> This is not correct, the hash key selection is based on mappers input values. >>> if the mapper input value is vector, then we need to select by vector key or value (using value() or index()), not >>> by mapper key or mapper value. >>> >>> the extra parameter might look like hashKeyOnMapperKey (or hashkeyOnMapperValue) >>> >>> >>> >>> Sam >>> >>> >>> ________________________________ >>> From: sam wu <[EMAIL PROTECTED]> >>> To: [EMAIL PROTECTED] >>> Sent: Friday, June 15, 2012 5:38 PM >>> Subject: MinHash implementation thoughts >>> >>> I am playing with MinHashMapper, and have the following 2 penny thoughts >>> >>> 1. Efficiency >>> >>> in the map(..), around line 85 >>> ---------- >>> for (int i = 0; i < numHashFunctions; i++) { >>> for (Vector.Element ele : featureVector) { ----* >>> int value = (int) ele.get(); ----** >>> bytesToHash[0] = (byte) (value >> 24); >>> ..... >>> } >>> } >>> } >>> ----------- >>> the featureVector (which can be random sparse vector) iterator ---* goes >>> through all elements (including nonzero). >>> When I test ASF email program (from Grant's ...). Each email has ~30000 >>> elements. >>> By using iteratorNonZero(), >>> I reduce from 30000 to around 50 operations (both for converting from int >>> to byte[], and hash computation) for some email.??? >>> >>> 2. which field key to hash >>> we read from sequence file, so we have key and value. >>> If we read from tfidf doc sequenceFile, then >>> key will be termId, value will be tfidf value. Lance Norskog [EMAIL PROTECTED]
|
|