Mahout, mail # user - choosing appropriate t1,t2 for canopy clustering


Re: choosing appropriate t1,t2 for canopy clustering
Paritosh Ranjan 2012-05-16, 08:16
"calculated the mean distance between all the pairs of vectors"
This can be a very costly operation if the dataset is reasonably large,
since the number of pairs grows quadratically with the number of vectors.
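
For what it's worth, here is a rough Python sketch of the heuristic quoted below (T2 = mean pairwise distance, T1 = 2 * mean). It is not Mahout code. The function names, the `max_pairs` cap, and the idea of sampling random pairs instead of visiting all of them are my own assumptions, added only to keep the cost bounded on large datasets. The cosine helper assumes the usual definition, 1 minus cosine similarity, which ranges from 0.0 to 2.0 in general and stays within 0.0 to 1.0 only when all vector components are non-negative (as with plain term-frequency vectors).

```python
import math
import random
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed definition); range is [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # assumption: treat a zero vector as orthogonal to everything
    return 1.0 - dot / norm


def estimate_thresholds(vectors, distance=euclidean, max_pairs=10000, seed=0):
    """Return (t1, t2) with t2 = mean pairwise distance and t1 = 2 * t2.

    Hypothetical helper: uses every pair when there are at most max_pairs
    of them, otherwise averages over max_pairs randomly sampled pairs.
    """
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two vectors")
    total_pairs = n * (n - 1) // 2
    if total_pairs <= max_pairs:
        pairs = combinations(range(n), 2)
        count = total_pairs
    else:
        rng = random.Random(seed)
        # rng.sample picks two distinct indices, so no vector is paired with itself
        pairs = (tuple(rng.sample(range(n), 2)) for _ in range(max_pairs))
        count = max_pairs
    mean = sum(distance(vectors[i], vectors[j]) for i, j in pairs) / count
    return 2 * mean, mean


if __name__ == "__main__":
    docs = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
    t1, t2 = estimate_thresholds(docs)
    print("t1 =", t1, "t2 =", t2)
```

Passing `distance=cosine_distance` gives the same heuristic for cosine distance. With sampling the result is only an estimate of the true mean, so it is worth checking that the resulting number of canopies looks sensible before relying on it.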

On 16-05-2012 13:34, ivan obeso wrote:
> In my text clustering project I used the Euclidean distance as the
> distance measure. I wrote a method that calculated the mean distance
> between all the pairs of vectors (documents) and used this mean as T2, and
> for T1 I used mean*2. This approach worked really well for me, giving
> a reasonable number of clusters across various corpora.
>
> On Tue, May 15, 2012 at 10:45 AM, Robert Stewart <[EMAIL PROTECTED]> wrote:
>
>> I am trying to run canopy clustering on vectors extracted from a Lucene
>> index.  I want to use CosineDistanceMeasure.  How do I know what
>> values are appropriate for the t1 and t2 distance thresholds?  I would
>> assume that the cosine distance measure returns "distances" in the range
>> 0.0 to 1.0, but that does not seem to be the case, so how do I know what
>> the potential distance ranges are in order to pick t1 and t2 (other than
>> by a lot of trial and error)?
>>
>> Thanks
>> Bob