Home | About | Sematext search-lucene.com search-hadoop.com
 Search Lucene and all its subprojects:

Switch to Threaded View
Mahout, mail # user - MinHash implementation thoughts


Copy link to this message
-
Re: MinHash implementation thoughts
Lance Norskog 2012-06-20, 03:18
Sean-

http://www.lucidimagination.com/search/document/1d24eb86fcf3b8ab#b90e74b43cfb4c42
http://www.lucidimagination.com/search/document/ffc7e312b87088dc

On Tue, Jun 19, 2012 at 8:54 AM, chsz wu <[EMAIL PROTECTED]> wrote:
> I think for minhash to be generic and efficient, 3 things needed to be done at least,
> 1. Prepare term-doc matrix to minimize hash computation (one more MR job).
>    Quite often we have doc-term data, for each term we have to averagely hash average_doc_freq times,
>    With term-doc matrix, we reduce  per term hash to one.
> 2. Where is the hash key and docId
>    With Hadoop, mapper has input key and value. We need to map out Id and hash_key.
>    To be general, we need to pass in a hashKeyMap function to extract id and hashkey.
>    the combined clusterId(**_**_...) can be rehashed one more time to reduce size
> 3. Post processing, sort by similarity probability (numbers of hash_match), kind of sorting common friends by the number of friends in common.
>    Another MR job.
>
> I'll try to look at how to make a patch, and do my best.
>
> Sam
>
>
>
>
>
> ________________________________
>  From: Sean Owen <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, June 19, 2012 2:20 AM
> Subject: Re: MinHash implementation thoughts
>
> Lance, what are you referring to here?
> Sam if you have some patch to suggest that would be good, open a JIRA issue.
>
> On Tue, Jun 19, 2012 at 10:05 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
>> Some Mahout people are not sure that the MinHash implementation is
>> right. You should try your versions of how the algorithms should work.
>>
>> On Mon, Jun 18, 2012 at 8:10 AM, chsz wu <[EMAIL PROTECTED]> wrote:
>>> Sorry, forget about my previous comment about 2.
>>> -----
>>> For general purpose, sometime we might need to hash on value, so we might
>>> need an extra parameter , hashOnKey ?
>>>
>>> ------
>>> This is not correct, the hash key selection is based on mappers input values.
>>> if the mapper input value is vector, then we need to select by vector key or value (using value() or index()), not
>>> by mapper key or mapper value.
>>>
>>> the extra parameter might look like  hashKeyOnMapperKey (or hashkeyOnMapperValue)
>>>
>>>
>>>
>>> Sam
>>>
>>>
>>> ________________________________
>>>  From: sam wu <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]
>>> Sent: Friday, June 15, 2012 5:38 PM
>>> Subject: MinHash implementation thoughts
>>>
>>> I am playing with MinHashMapper, and have the following 2 penny thoughts
>>>
>>> 1. Efficiency
>>>
>>> in the map(..), around line 85
>>> ----------
>>> for (int i = 0; i < numHashFunctions; i++) {
>>>       for (Vector.Element ele : featureVector) {         ----*
>>>         int value = (int) ele.get();                                  ----**
>>>          bytesToHash[0] = (byte) (value >> 24);
>>>         .....
>>>         }
>>>       }
>>>     }
>>> -----------
>>> the featureVector (which can be random sparse vector) iterator ---* goes
>>> through all elements (including nonzero).
>>> When I test ASF email program (from Grant's ...). Each email has ~30000
>>> elements.
>>> By using iteratorNonZero(),
>>> I reduce from 30000 to around 50 operations (both for converting from int
>>> to byte[], and hash computation) for some email.???
>>>
>>> 2. which field key to hash
>>> we read from sequence file, so we have key and value.
>>> If we read from tfidf doc sequenceFile, then
>>> key will be termId, value will be tfidf value.
>>> What to hash for MinHash ?
>>> I think we need to hash on termId, instead of tfidf (meaningless for doc
>>> MinHash ?)
>>> so we need to get
>>> value = ele.index() ---**
>>>
>>> When I test the current code, I get almost all value = 0
>>>
>>> For general purpose, sometime we might need to hash on value, so we might
>>> need an extra parameter , hashOnKey ?
>>>
>>> 3. How to compute custerID ?
>>>
>>> I thought LSH normally has r(rows in a band) and b (band).
>>> numberHashFunction = r *b
>>>
>>> #clusterId = b

Lance Norskog
[EMAIL PROTECTED]