-
Canopies and RowSimilarity
Pat Ferrel 2012-05-06, 21:08
I have calculated similarity for all my docs. It has been suggested that this might be a good way to pick distances to use for canopies. When I look at distances for similar docs I see them all over the map, of course. And some that seem far away look pretty good. Is this just a matter of eyeballing or is there some better way of picking canopy distances from similarity distances?
BTW Could I vote for a better description of using RowSimilarity? Shouldn't it have a -ow parameter? It would also be nice if it calculated the number of columns from the input "matrix". These things make it hard to automate in scripts.
-
Re: Canopies and RowSimilarity
Sebastian Schelter 2012-05-07, 04:51
On 06.05.2012 23:08, Pat Ferrel wrote:
> BTW Could I vote for a better description of using RowSimilarity?
> Shouldn't it have a -ow parameter? It would also be nice if it
> calculated the number of columns from the input "matrix". These things
> make it hard to automate in scripts.
Could you open a JIRA ticket for that? Sounds like good feature requests. Would you like to tackle these things yourself?
--sebastian
-
Re: Canopies and RowSimilarity
Suneel Marthi 2012-05-07, 12:02
1. Please take a look at MAHOUT-834 for the -ow option; a patch is available and pending review.
2. Please take a look at MAHOUT-979 for calculating the number of columns from input matrix, I can work on this and upload a patch sometime this week.
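[Editor's sketch] To illustrate what "calculating the number of columns from the input matrix" could mean in practice, here is a minimal Python sketch. It is not the MAHOUT-979 patch and does not use Mahout's API. It assumes, purely for illustration, that each row is a sparse mapping from column index to value, and the function name count_columns is hypothetical.

```python
def count_columns(rows):
    """Return the smallest column count that fits every sparse row.

    Assumption (not from the thread): each row is a dict mapping a
    0-based column index to a value. Empty rows contribute nothing.
    """
    max_index = -1
    for row in rows:
        if row:
            # max() over a dict iterates its keys, i.e. the column indices.
            max_index = max(max_index, max(row))
    return max_index + 1


if __name__ == "__main__":
    matrix = [{0: 1.0, 3: 2.0}, {}, {7: 0.5}]
    print(count_columns(matrix))
```

A script could run a scan like this once and pass the result on, instead of requiring the caller to supply the column count by hand.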
-
Re: Canopies and RowSimilarity
Sebastian Schelter 2012-05-07, 12:18
The problem with the patch in MAHOUT-834 is that it always cleans the temp dir, which we don't want as standard behavior, as Sean pointed out in the comments. Sometimes other jobs rely on the temp output, so we should retain it.
We could however include the temp dir cleaning when -ow is provided.
-
Re: Canopies and RowSimilarity
Suneel Marthi 2012-05-07, 12:39
Uploaded a patch that only deletes the temp output if -ow has been specified.
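[Editor's sketch] The behavior discussed above (keep the temp output by default, clean it only when -ow is given) can be sketched in Python as follows. This is not the uploaded patch and not Mahout code. The function name prepare_dirs, the overwrite flag, and the choice to clean up before the job starts are all assumptions made for illustration.

```python
import os
import shutil


def prepare_dirs(output_dir, temp_dir, overwrite=False):
    """Prepare output and temp directories for a job run.

    Hypothetical helper: with overwrite=True both directories are
    removed first; otherwise the temp directory is left untouched so
    that other jobs relying on it keep working.
    """
    if overwrite:
        for path in (output_dir, temp_dir):
            shutil.rmtree(path, ignore_errors=True)
    elif os.path.exists(output_dir):
        raise FileExistsError(
            "output exists, pass overwrite=True to replace it: " + output_dir
        )
    os.makedirs(output_dir)
    os.makedirs(temp_dir, exist_ok=True)
```

Without the flag, a second run fails fast on an existing output directory rather than silently deleting intermediate data that another job may still need.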
-
Re: Canopies and RowSimilarity
Pat Ferrel 2012-05-07, 14:46
As to my first question, what was your idea for using RowSimilarity to estimate canopy sizes? My corpus size changes often, so it would be interesting to find a way to generate the canopy parameters automatically.
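[Editor's sketch] The thread does not answer this question. One possible heuristic, offered only as an assumption, is to convert the pairwise similarities to distances and take two quantiles of that distribution as the loose (T1) and tight (T2) canopy thresholds, keeping T1 > T2 as canopy clustering requires. The sketch below assumes similarities in [0, 1], so that distance = 1 - similarity. The function name and the default quantiles are hypothetical and would need tuning per corpus.

```python
def canopy_thresholds(similarities, t2_quantile=0.25, t1_quantile=0.75):
    """Suggest (t1, t2) canopy distances from pairwise similarities.

    Assumptions (not from the thread): similarities lie in [0, 1],
    distance = 1 - similarity, and the quantile defaults are arbitrary
    starting points rather than recommended values.
    """
    if not similarities:
        raise ValueError("need at least one similarity value")
    if not 0.0 <= t2_quantile < t1_quantile <= 1.0:
        raise ValueError("require 0 <= t2_quantile < t1_quantile <= 1")

    distances = sorted(1.0 - s for s in similarities)

    def quantile(q):
        # Nearest-rank style lookup into the sorted distances.
        index = int(round(q * (len(distances) - 1)))
        return distances[index]

    t1 = quantile(t1_quantile)  # loose threshold
    t2 = quantile(t2_quantile)  # tight threshold
    return t1, t2


if __name__ == "__main__":
    t1, t2 = canopy_thresholds([0.9, 0.8, 0.6, 0.4, 0.1])
    print("t1 =", t1, "t2 =", t2)
```

Because the thresholds are recomputed from the current similarity output, a script could rerun this whenever the corpus changes. Whether quantiles yield good canopies for a given corpus still has to be checked against the resulting clusters.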