Grant Ingersoll 2012-01-13, 13:53
Lance Norskog 2012-01-17, 01:46
Suneel Marthi 2012-01-17, 01:51
Frank Scholten 2012-03-05, 14:13
Suneel Marthi 2012-03-08, 07:22
Frank Scholten 2012-03-08, 09:17
Re: Minhash review
Suneel Marthi 2012-03-08, 12:44
That's correct.
________________________________
From: Frank Scholten <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, March 8, 2012 4:17 AM
Subject: Re: Minhash review

I agree with Grant that it's good to first get a working implementation that matches the paper. Later on we can work on other approaches.

So if I understand correctly, the vectorization step can be skipped and we can run

SequenceFilesFromDirectory -> CollocDriver -> MinHashDriver

Correct?

On Thu, Mar 8, 2012 at 8:22 AM, Suneel Marthi <[EMAIL PROTECTED]> wrote:
> Frank,
>
> I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.
>
> I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).
>
> I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?
>
> Regards.
>
> ________________________________
> From: Frank Scholten <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, March 5, 2012 9:13 AM
> Subject: Re: Minhash review
>
> I am also curious about the current MinHash implementation. In the current implementation the vector TF or TF-IDF weights are hashed via Vector.Element.get(). Jeff Hansen pointed out in a previous thread on the mailing list that this is incorrect and the index should be hashed because the index identifies an N-gram in the dictionary.
>
> However in this blog
>
> http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html
>
> hashing is done directly on the N-gram itself.
>
> How is this algorithm supposed to work? Thoughts?
>
> On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <[EMAIL PROTECTED]> wrote:
>> Lance,
>>
>> I don't think this problem is confined to DisplayMinHash alone; even the regular MinHash clustering doesn't seem right when run on the Reuters dataset (using cluster-reuters.sh) and a few other data sets I had tried. I am playing with the keyGroups values to determine if that improves the quality of clustering.
>>
>> ________________________________
>> From: Lance Norskog <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> Sent: Monday, January 16, 2012 8:46 PM
>> Subject: Re: Minhash review
>>
>> Minhash works better and better with the more dimensions you throw at it, right? All of the Display classes use two dimensions. Would MinHash make more sense if it uses a few hundred dimensions and then collapses down to two? Maybe with SVD?
>>
>> Are there other clustering algorithms that have this problem?
>>
>> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right. Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong. I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong. It's just that both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality. My gut says it has to do with the group stuff whereby we create the signatures.
>>>
>>> I think before we do 0.6 it could use a few eyeballs.
>>
>> --
>> Lance Norskog
>> [EMAIL PROTECTED]
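For reference, the scheme discussed in this thread (hash the identity of each feature, i.e. the N-gram itself or its dictionary index, rather than its TF or TF-IDF weight, and keep the minimum value per hash function) can be sketched roughly as below. This is a minimal illustration in Python, not Mahout's MinHashDriver code. The function names, the seeded-MD5 hash family, and the signature length of 128 are all assumptions made for the example, and it leaves out the keyGroups step that Mahout uses to build cluster keys.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word N-grams (shingles) in text."""
    words = text.lower().split()
    count = max(len(words) - n + 1, 1)
    return {" ".join(words[i:i + n]) for i in range(count)}


def hash_feature(feature, seed):
    """Hash a feature's identity (N-gram or dictionary index), never its weight.

    Illustrative hash family: MD5 over "seed:feature", truncated to 64 bits.
    """
    data = f"{seed}:{feature}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(features, num_hashes=128):
    """One minimum per hash function; features must be a non-empty set."""
    return [
        min(hash_feature(f, seed) for f in features)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree between two signatures."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)
```

Under these assumptions, two documents with the same set of N-grams get identical signatures regardless of their term weights, and the fraction of matching positions approximates the Jaccard similarity of the two N-gram sets. Hashing the weights instead ties the signature to values that do not identify which N-grams are present, which is the objection raised above.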
Frank Scholten 2012-03-25, 19:28