Mahout, mail # user - Re: number of clusters (Canopy Clustering)


RE: number of clusters (Canopy Clustering)
Paritosh Ranjan 2012-01-07, 12:08
"Is there a way for me to determine the distance from command line? "

I am not aware of any. If anyone else is, then please suggest.
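
The earlier suggestion of a small test program could look roughly like the sketch below. It is plain Python rather than Mahout code, the vectors are made-up stand-ins for real TF-IDF values, and `euclidean` and `pairwise_distances` are illustrative helper names, not Mahout APIs. It only shows the idea: compute every pairwise distance with the same measure used for clustering and compare the results against t2.

```python
import itertools
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(vectors):
    """Return the distance for every unordered pair of vectors."""
    return [euclidean(a, b) for a, b in itertools.combinations(vectors, 2)]


# Hypothetical stand-ins for TF-IDF vectors; substitute values from your data.
vectors = [
    [0.0, 1.2, 0.0, 3.1],
    [0.1, 1.0, 0.0, 2.9],
    [4.0, 0.0, 2.5, 0.0],
]

t2 = 0.2
distances = pairwise_distances(vectors)
print("min distance:", min(distances))
print("max distance:", max(distances))
# If this prints True, every vector lies within t2 of every other one,
# and a single cluster is the expected outcome.
print("all within t2:", all(d < t2 for d in distances))
```

Looking at the minimum and maximum distances first gives a sensible range in which to pick t1 and t2, instead of guessing values between 0 and 1.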
________________________________________
From: Periya.Data [[EMAIL PROTECTED]]
Sent: Saturday, January 07, 2012 6:31 AM
To: [EMAIL PROTECTED]
Subject: Re: number of clusters (Canopy Clustering)

I agree that if all the distances are < t2, I will get only one cluster. I
was just "hoping" that they would not fall within that range and was basically
shooting in the dark when twiddling with various t1 and t2 values.

Is there an easy way to determine the distance between vectors? In the
CanopyCluster shell script, I use EuclideanDistanceMeasure. The TFIDF
vectors are in binary and I have no idea how to proceed.

Is there a way for me to determine the distance from command line? So far,
I am not using any Java program to do my experiments. As a beginner, I am
running shell scripts and learning.

$MAHOUT_HOME/bin/mahout canopy \
    -i /input/mahout/vectorized/tfidf-vectors \
    -o $HDFS_OUTPUT_DIR/bigdata-canopy-centroids \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -t1 0.9 \
    -t2 0.2 \
    --overwrite
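
To see how t1 and t2 affect the number of canopies, here is a simplified single-pass sketch in plain Python. It is not Mahout's implementation: canopy centers stay fixed at the point that created them, the points are invented, and `canopy_centers` is a hypothetical helper name. In this sketch a point starts a new canopy unless it lies within t2 of an existing center, so t2 controls the count while t1 only controls membership.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2):
    """Simplified canopy generation; returns (centers, members).

    Assumes t1 > t2. Centers are never re-averaged, unlike a real
    implementation that would typically use canopy centroids.
    """
    assert t1 > t2
    centers = []
    members = []
    for p in points:
        strongly_bound = False
        for i, c in enumerate(centers):
            d = euclidean(p, c)
            if d < t1:
                members[i].append(p)   # loosely bound: joins this canopy
            if d < t2:
                strongly_bound = True  # too close to start a new canopy
        if not strongly_bound:
            centers.append(p)
            members.append([p])
    return centers, members


# Invented 2-D points forming three loose groups.
points = [[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.4, 5.0], [10.0, 0.0]]

for t1, t2 in [(30.0, 20.0), (3.0, 1.0), (0.9, 0.2)]:
    centers, _ = canopy_centers(points, t1, t2)
    print(f"t1={t1}, t2={t2}: {len(centers)} canopies")

# A single input vector gives one canopy for any thresholds.
single, _ = canopy_centers([[1.0, 2.0]], 0.005, 0.00001)
print("single vector:", len(single), "canopy")
```

With these points the count goes from 1 to 3 to 5 as t2 shrinks, which is the behaviour one would expect from the thresholds. The last call is worth noting: if the input contains only one vector (which may be the case when a single text file is vectorized as one document, though that is an assumption about this setup), no choice of t1 and t2 can produce more than one canopy.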

Thanks for your suggestions,
PD.
On Thu, Jan 5, 2012 at 8:47 PM, Paritosh Ranjan <[EMAIL PROTECTED]> wrote:

> What is the distance between vectors with the Distance measure you are
> using?
> If all the vectors lie within the range of t2, then you will get only 1
> cluster.
>
> Write a small piece of test code that creates vectors from the data you are
> using, and then find the distance between the vectors (using the same
> distance measure you are using while clustering). If all distances are
> within t2, then you will get only one cluster.
>
>
> On 05-01-2012 10:14, Periya.Data wrote:
>
>> Hi Paritosh,
>>     Thanks for your suggestions. I am currently trying to use Canopy
>> Clustering to guess the number of clusters. I have tried various values
>> (between 0 and 1) for t1 and t2 (t1 > t2). Still I get only one cluster. I
>> tried (0.9, 0.2), (0.05, 0.001), (0.005, 0.00001) etc. I thought if I made
>> t2 very close to 0, I would get a lot of clusters...but, it is very
>> strange...I am getting only one cluster for a vast set of t1/t2 values.
>>
>> Is this because I am using just one text file for my analysis?
>>
>> I have only one large text file and want to cluster the words and see how
>> they are clustered. I thought this would be a simple way to begin
>> exploring clustering/mahout.
>>
>> Your suggestions are appreciated,
>> PD.
>>
>> On Sat, Dec 31, 2011 at 2:48 AM, Paritosh Ranjan <[EMAIL PROTECTED]> wrote:
>>
>>> There can be two reasons for only one cluster being found.
>>>
>>> 1) The vectors are really close to each other and the clusters converge.
>>> 2) The distance measure you are using is not appropriate with your vector
>>> values.
>>>
>>> Try to
>>> 1) Analyze the vectors and the distance between them. Are they good
>>> candidates to be inside different clusters?
>>> 2) Try to use CanopyClustering first to guess the number of clusters
>>> (experiment a bit by changing the values of t1 and t2).
>>> 3) Then provide the clusters returned by CanopyClustering to KMeans.
>>> 4) Use EuclideanDistance instead of Squared...
>>>
>>> Paritosh
>>>
>>> ________________________________________
>>> From: Periya.Data [[EMAIL PROTECTED]]
>>> Sent: Saturday, December 31, 2011 1:07 AM
>>> To: [EMAIL PROTECTED]
>>> Subject: number of clusters
>>>
>>> Hi all,
>>>    I am a newbie to Mahout. I am running a basic k-means clustering on a
>>> sample txt file. No matter what number I give to the --numClusters
>>> parameter, I always get only one cluster (VL-0). Can someone please point
>>> out any mistake and suggest what I should do to see a decent number of
>>> clusters?
>>>
>>> I successfully convert the txt file into seq-file and then to vectorized