-
Possible issue in MinHashMapper
Elena Smirnova 2012-07-30, 08:35
Hello,

It seems to me that there is an issue in the MinHashMapper class. In the map method, the loop goes over the elements of the vector. In many cases the instance of the Vector abstract class is a SparseVector, and iteration is meant to cover only the non-zero values (e.g., documents as sparse vectors of words). However, the current implementation iterates over all the elements, including the zero-valued ones, because it uses the vector's default iterator. This can produce meaningless clustering. In addition, in this case I think we should hash the index of the element rather than its value.

Can somebody confirm or disprove this?

Thanks,

Best regards,
Elena Smirnova.
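To make the iteration difference concrete, here is a minimal sketch. It does not use Mahout classes: the `SparseIterationSketch` class, its method names, and the map-based vector are hypothetical stand-ins chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sparse vector: only non-zero entries are stored
// in the map, and `cardinality` is the total number of dimensions.
final class SparseIterationSketch {

    // Visits every position, the way a default all-elements iterator would.
    // Positions that are not stored come back as 0.0.
    static List<Double> valuesSeenByFullIteration(Map<Integer, Double> nonZeros, int cardinality) {
        List<Double> seen = new ArrayList<>();
        for (int i = 0; i < cardinality; i++) {
            seen.add(nonZeros.getOrDefault(i, 0.0));
        }
        return seen;
    }

    // Visits only the stored entries, the way a non-zero iterator would.
    static List<Double> valuesSeenByNonZeroIteration(Map<Integer, Double> nonZeros) {
        return new ArrayList<>(nonZeros.values());
    }

    public static void main(String[] args) {
        Map<Integer, Double> doc = new TreeMap<>();
        doc.put(3, 1.0);
        doc.put(17, 2.0);
        doc.put(42, 1.0);
        int cardinality = 100_000;

        List<Double> full = valuesSeenByFullIteration(doc, cardinality);
        List<Double> sparse = valuesSeenByNonZeroIteration(doc);
        System.out.println("full iteration visits " + full.size() + " elements");
        System.out.println("non-zero iteration visits " + sparse.size() + " elements");
        System.out.println("full iteration sees a zero: " + full.contains(0.0));
    }
}
```

With a vocabulary-sized cardinality, the full iteration does work proportional to the vocabulary for every document, and it always feeds a zero into whatever is hashed afterwards.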
-
Re: Possible issue in MinHashMapper
Sean Owen 2012-07-30, 09:58
I think that's right, though I think the effect on correctness is quite small, but the effect on performance is large. This will always hash zero even if zero were not really present in the vector. That is not likely to produce the smallest hash value though.

Hashing all those zeroes is wasteful and I'm not clear whether it is demanded by the semantics. I suppose I'd also like the author to confirm whether this can be changed to look at only non-default values.
-
Re: Possible issue in MinHashMapper
Elena Smirnova 2012-07-30, 11:51
I agree about the performance effect of iterating over zeros. But the correctness effect comes from hashing the value of the element and not its index (at least in the documents-and-words example).

Do you agree?
-
Re: Possible issue in MinHashMapper
Sean Owen 2012-07-30, 12:02
Yes I know what you mean. In my understanding you typically apply minhash to a large sparse vector that acts like a bit set, where the index is really the set member. There you want to hash the index, and doing so by considering all indices would be completely wrong.

Here I think the set elements are the values, and the vectors seem to be treated as a list, really. So I'm not surprised they're treated as dense. I still think it's a good idea to iterate over non-default items, since I'm not clear whether the implementation is guaranteed to accept only dense input vectors, where all dimensions have a value -- in which case it doesn't matter and the current implementation is OK.

Ankur, are you still around to answer? I think that's a good guess as to the original intent.
-
Re: Possible issue in MinHashMapper
Elena Smirnova 2012-07-30, 12:11
If vectors are treated as dense, then we have to modify the example given for this class, which clearly talks about documents and words:
https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
-
Re: Possible issue in MinHashMapper
Sean Owen 2012-07-30, 12:19
Hmm yeah, what's the output format of that job? At first glance it looks like it makes a dictionary and fills in sequentially with term frequencies. If so, applying min-hash this way isn't wildly wrong but I still don't think it works.

Can anyone who knows this code at all weigh in? It's not my area...
-
Re: Possible issue in MinHashMapper
Elena Smirnova 2012-07-30, 12:25
The input to the MinHashDriver is tf-vectors, e.g.:

hdfs -text /..../tf-vectors/part-r-00000
10020926 {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
10020927 {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
10020928 {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
10020929 {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}

So the vectors are sparse and indexed by wordId. Clearly, this example doesn't fit the current implementation.
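The following sketch illustrates why tf-vectors like these are a poor fit when the element value is what gets hashed. It is not Mahout code: the `ValueVsIndexSketch` class, the linear hash and its constants, and the truncation of the value to an integer before hashing are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ValueVsIndexSketch {

    // Hypothetical linear hash (a * x + b) mod p with p prime; the constants
    // are arbitrary and are not taken from Mahout.
    static long hash(long x) {
        final long a = 1103515245L;
        final long b = 12345L;
        final long p = 2147483647L;
        return Math.floorMod(a * x + b, p);
    }

    // Min-hash over the stored values (assumed truncated to an integer).
    static long minHashOfValues(Map<Integer, Double> tf) {
        long min = Long.MAX_VALUE;
        for (double value : tf.values()) {
            min = Math.min(min, hash((long) value));
        }
        return min;
    }

    // Min-hash over the indices of the stored entries (the word IDs).
    static long minHashOfIndices(Map<Integer, Double> tf) {
        long min = Long.MAX_VALUE;
        for (int index : tf.keySet()) {
            min = Math.min(min, hash(index));
        }
        return min;
    }

    // Builds a tf-vector in which every listed word ID has frequency 1.0.
    static Map<Integer, Double> doc(int... wordIds) {
        Map<Integer, Double> tf = new LinkedHashMap<>();
        for (int id : wordIds) {
            tf.put(id, 1.0);
        }
        return tf;
    }

    public static void main(String[] args) {
        Map<Integer, Double> d1 = doc(96037, 114292, 119745, 64091, 84488, 116350, 135598, 53382);
        Map<Integer, Double> d2 = doc(53382, 1076, 116350, 36446, 135598, 96037, 84488);
        // Every value is 1.0, so both documents get the same value-based hash.
        System.out.println(minHashOfValues(d1) == minHashOfValues(d2));
        System.out.println(minHashOfIndices(d1) + " vs " + minHashOfIndices(d2));
    }
}
```

When every term frequency is 1.0, the value-based min-hash is the same for any two documents regardless of their words, whereas the index-based one depends on which word IDs are present.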
-
Re: Possible issue in MinHashMapperSean Owen 2012-07-30, 12:27
>From my reading of the code and the unit test, that is totally different
from the expected input of MinHashDriver. So I think the example should just be deleted. You would have to convert this to something like: {0:96037,1:114292,2:119745,...} to make it work correctly. On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <[EMAIL PROTECTED]>wrote: > The input to the MinHashDriver are tf-vectors, e.g.: > hdfs -text /..../tf-vectors/part-r-00000 > 10020926 > > {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0} > 10020927 > {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0} > 10020928 > {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0} > 10020929 > {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0} > > So the vectors are sparse and indexed by wordId. Clearly, this example > doesn't fit into current implementation. > > On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > Hmm yeah what's the output format of that job? At first glance it looks > > like it makes a dictionary and fills in sequentially with term > frequencies. > > If so, applying min-hash this way isn't wildly wrong but I still don't > > think it works. > > > > Can anyone who knows this code at all weigh in? it's not my area... > > > > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <[EMAIL PROTECTED] > > >wrote: > > > > > If vectors are treated as dense, then we have to modify the example > given > > > for this class, which clearly talks about documents and words: > > > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering > > > > > > On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > > > > > Yes I know what you mean. In my understanding you typically apply > > minhash > > > > to a large sparse vector that acts like a bit set, where the index is > > > > really the set member. There you want to hash the index, and doing so > > by > > > > considering all indices would be completely wrong. 
> > > > > > > > Here I think the set elements are the values. and the vectors seem to > > be > > > > treated as a list, really. So I'm not surprised they're treated as > > > dense. I > > > > still think it's a good idea to iterate over non-default items, since > > I'm > > > > not clear whether the implementation is guaranteed to accept only > dense > > > > input vectors, where all dimensions have a value -- in which case it > > > > doesn't matter and the current implementation is OK. > > > > > > > > Ankur are you still around to answer? I think that's a good guess as > to > > > the > > > > original intent. > > > > > > > > On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova < > [EMAIL PROTECTED] > > > > >wrote: > > > > > > > > > I agree about performance effect of iterating over zeros. But the > > > > > correctness effect comes due to hashing values of the element and > not > > > its > > > > > index (at least in documents and words example). > > > > > > > > > > Do you agree? > > > > > > > > > >
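The conversion described here, from a vector indexed by wordId to one whose values are the word IDs, could look roughly like the sketch below. The `PositionalConversionSketch` class and its methods are hypothetical, and the sketch simply drops the term frequencies, which is one possible reading of "something like".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PositionalConversionSketch {

    // Turns {wordId: tf} into a dense list whose i-th entry is the i-th
    // word ID that has a non-zero term frequency. Frequencies are discarded.
    static List<Integer> toPositional(Map<Integer, Double> tf) {
        List<Integer> positional = new ArrayList<>();
        for (Map.Entry<Integer, Double> entry : tf.entrySet()) {
            if (entry.getValue() != 0.0) {
                positional.add(entry.getKey());
            }
        }
        return positional;
    }

    // Renders the list in the {position:value,...} notation used in the thread.
    static String format(List<Integer> positional) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < positional.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(i).append(':').append(positional.get(i));
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> tf = new LinkedHashMap<>();
        tf.put(96037, 1.0);
        tf.put(114292, 1.0);
        tf.put(119745, 1.0);
        System.out.println(format(toPositional(tf))); // {0:96037,1:114292,2:119745}
    }
}
```

After this conversion every dimension holds a value, so an implementation that walks all elements and hashes the values sees exactly the set of word IDs.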
-
Re: Possible issue in MinHashMapper
Elena Smirnova 2012-07-30, 12:41
OK. But I preferred to modify a couple of lines of MinHashMapper rather than writing my own sparse vectors generator:

    ...................
    for (int i = 0; i < numHashFunctions; i++) {
      Iterator it = featureVector.iterateNonZero();
      while (it.hasNext()) {
        Vector.Element next = (Vector.Element) it.next();
        int value = next.index();
    ..................

It works fine.
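For readers without the Mahout source at hand, the idea behind this change (take the minimum hash over the indices of the non-zero entries, once per hash function) can be sketched in a self-contained way. The `IndexMinHashSketch` class, the linear hash family, and the seed-based coefficient generation below are assumptions for illustration and are not MinHashMapper's actual code.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

final class IndexMinHashSketch {

    // Prime modulus for the hypothetical hash family (a * x + b) mod P.
    static final long P = 2147483647L;

    // Returns {a[], b[]} with a[i] in [1, P-1] and b[i] in [0, P-1].
    static long[][] randomCoefficients(int numHashFunctions, long seed) {
        Random rnd = new Random(seed);
        long[][] ab = new long[2][numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            ab[0][i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            ab[1][i] = rnd.nextInt(Integer.MAX_VALUE);
        }
        return ab;
    }

    // One min-hash per hash function, taken over the non-zero indices only.
    // Indices are assumed to be non-negative.
    static long[] signature(Set<Integer> nonZeroIndices, long[][] ab) {
        long[] sig = new long[ab[0].length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int i = 0; i < sig.length; i++) {
            for (int index : nonZeroIndices) {
                long h = Math.floorMod(ab[0][i] * index + ab[1][i], P);
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of signature positions on which two documents agree.
    static double estimatedJaccard(long[] s1, long[] s2) {
        int agree = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                agree++;
            }
        }
        return (double) agree / s1.length;
    }

    public static void main(String[] args) {
        long[][] ab = randomCoefficients(32, 42L);
        Set<Integer> d1 = new TreeSet<>(Arrays.asList(53382, 13358, 116350, 126929, 135598, 136284, 58740));
        Set<Integer> d2 = new TreeSet<>(Arrays.asList(53382, 13358, 116350, 126929, 135598, 114292, 76732));
        System.out.println(estimatedJaccard(signature(d1, ab), signature(d2, ab)));
    }
}
```

The agreement rate between two signatures is the usual min-hash estimate of the Jaccard similarity of the two index sets, so documents that share most of their word IDs tend to agree on most positions.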
-
Re: Possible issue in MinHashMapper
Sean Owen 2012-07-30, 13:06
Definitely, that works for the example, though it then fails for the unit test.
-
Re: Possible issue in MinHashMapper
Sean Owen 2012-07-30, 21:18
(Does anyone have edit rights to delete this wiki please? It appears to be just wrong.)

https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
-
Re: Possible issue in MinHashMapper
Suneel Marthi 2012-07-30, 22:44
Sean,

I'll take care of this. I added this sometime last year but was never convinced that it ever worked right.
-
Re: Possible issue in MinHashMapper
Suneel Marthi 2012-07-30, 22:46
Done.
-
Re: Possible issue in MinHashMapper
Elena Smirnova 2012-07-31, 07:19
Thanks Suneel,

I think it would also be interesting for Mahout users to have both use cases. From my experience (Information Retrieval), the use case with hashing indexes is quite popular. For example, we can define a parameter in the MinHashDriver job that says whether we want to hash the values or the indexes of the vectors. That way we can keep the unit tests functional and add a similar one for hashing indexes. We could also add back the example.

What do you think?
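One way such a parameter might look is sketched below. Nothing here exists in MinHashDriver: the `HashTargetSketch` class, the `HashTarget` enum, and the linear hash with its constants are hypothetical names and choices used only to show the two modes side by side.

```java
import java.util.Map;
import java.util.TreeMap;

final class HashTargetSketch {

    // Hypothetical switch for what gets hashed; not an existing Mahout option.
    enum HashTarget { VALUE, INDEX }

    // Hypothetical linear hash (a * x + b) mod p with arbitrary constants.
    static long hash(long x) {
        return Math.floorMod(1103515245L * x + 12345L, 2147483647L);
    }

    // Min-hash over the non-zero entries, hashing either the element value
    // (truncated to an integer) or the element index, depending on `target`.
    static long minHash(Map<Integer, Double> nonZeros, HashTarget target) {
        long min = Long.MAX_VALUE;
        for (Map.Entry<Integer, Double> entry : nonZeros.entrySet()) {
            long input = (target == HashTarget.INDEX)
                    ? entry.getKey()
                    : (long) entry.getValue().doubleValue();
            min = Math.min(min, hash(input));
        }
        return min;
    }

    public static void main(String[] args) {
        // Positional form: position -> wordId, as the unit-test-style input.
        Map<Integer, Double> positional = new TreeMap<>();
        positional.put(0, 96037.0);
        positional.put(1, 114292.0);
        // tf form: wordId -> frequency, as in the documents-and-words example.
        Map<Integer, Double> tf = new TreeMap<>();
        tf.put(96037, 1.0);
        tf.put(114292, 1.0);
        System.out.println(minHash(positional, HashTarget.VALUE) == minHash(tf, HashTarget.INDEX));
    }
}
```

In this sketch the two input conventions give the same min-hash once the matching mode is selected, which is the property that would let both the existing unit tests and the documents-and-words example coexist.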