Mahout, mail # user - How to present mahout cluster in combination with Solr results


Vikas Pandya 2012-01-19, 03:18
Ioan Eugen Stan 2012-01-19, 08:45
Frank Scholten 2012-01-19, 09:24
Ioan Eugen Stan 2012-01-19, 09:50
Vikas Pandya 2012-01-19, 16:05
Vikas Pandya 2012-01-20, 15:01
Frank Scholten 2012-01-20, 17:48
Frank Scholten 2012-02-01, 08:28
Vikas Pandya 2012-02-02, 11:38
Re: How to present mahout cluster in combination with Solr results
Frank Scholten 2012-02-02, 13:19
Check out the recent mailing list post 'Clustering user profiles'.

Jeff (Eastman) sums it up clearly.

> Mahout clustering (unsupervised classification) can only deal with continuous, homogeneous vector representations of the input data, where each vector element is weighted the same as the other elements. Mahout (supervised) classification can deal with continuous, categorical, word-like and text-like features such as in your problem space.

> To address your problem with Mahout clustering, you would need to develop a mapping for each of your features to continuous vector elements and use a WeightedDistanceMeasure to account for the different element types and their relative impacts on the overall distance computation. This would be an iterative process which might or might not produce useful results.

> An alternative approach would be to train a Mahout classifier with the various features using marked training data which classifies similar users into a finite number of "clusters" that seem natural to you. With such a model, you could then classify new users into those "clusters". This approach would not be very useful for discovering new "clusters" in your data, but it would leverage the classifier training mechanisms to develop the models as more of a black box than above.
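
To make the first suggestion concrete, here is a minimal sketch in plain Python (not the Mahout API) of mapping categorical risk labels to continuous vector elements and comparing them with a weighted Euclidean distance. The numeric encoding in LEVELS and the values in weights are assumptions made up for illustration; nothing in this thread specifies them.

```python
import math

# Assumed ordinal encoding; the thread does not say how labels map to numbers.
LEVELS = {"Low": 0.0, "Medium": 0.5, "High": 1.0}


def encode(record):
    """Map a tuple of risk labels to a continuous vector, one element per field."""
    return [LEVELS[label] for label in record]


def weighted_euclidean(a, b, weights):
    """Euclidean distance with each squared difference scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# Hypothetical weights: RiskLevel1 counts twice as much as the other two fields.
weights = [2.0, 1.0, 1.0]

a = encode(("High", "High", "Low"))
b = encode(("High", "High", "High"))
c = encode(("Low", "High", "High"))

d_ab = weighted_euclidean(a, b, weights)  # records differ only in RiskLevel3
d_bc = weighted_euclidean(b, c, weights)  # records differ only in RiskLevel1
print(d_ab, d_bc)
```

With these assumed weights, a difference in RiskLevel1 pushes two records further apart than the same difference in RiskLevel3. Tuning the encoding and the weights is the iterative part Jeff mentions.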

A question also for other people reading this: I looked into this and saw
that there are clustering algorithms for categorical data, such as
K-modes. Are these effective for solving this kind of problem? If so,
would they be interesting to add to Mahout?
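
For readers unfamiliar with K-modes, the idea can be sketched as follows: it works like k-means, but each cluster is represented by the per-column mode instead of the mean, and distance is the number of mismatching fields. This is a simplified illustration in plain Python, not a Mahout implementation. The initialisation (first k distinct records) and the sample records are assumptions chosen to keep the example short.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent value in each column (ties go to the value seen first)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, max_iter=20):
    """Cluster tuples of categorical values; returns (modes, assignment)."""
    # Naive initialisation: the first k distinct records become the starting modes.
    modes = list(dict.fromkeys(records))[:k]
    assignment = None
    for _ in range(max_iter):
        # Assign each record to the nearest mode by mismatch count.
        new_assignment = [
            min(range(len(modes)), key=lambda i: hamming(r, modes[i]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # converged: no record changed cluster
        assignment = new_assignment
        # Recompute each mode from the records currently assigned to it.
        for i in range(len(modes)):
            members = [r for r, a in zip(records, assignment) if a == i]
            if members:
                modes[i] = column_modes(members)
    return modes, assignment


# Made-up sample data in the (RiskLevel1, RiskLevel2, RiskLevel3) layout.
records = [
    ("High", "High", "Low"),
    ("Low", "Low", "High"),
    ("High", "High", "Medium"),
    ("Low", "Medium", "High"),
    ("High", "High", "Low"),
]

modes, assignment = k_modes(records, 2)
print(modes)
print(assignment)
```

Because the distance counts mismatches per position, the order of the fields is preserved, which a bag-of-words vector does not guarantee.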

Cheers,

Frank

On Thu, Feb 2, 2012 at 12:38 PM, Vikas Pandya <[EMAIL PROTECTED]> wrote:
> Frank, thanks.
>>> In your case you want to cluster items that have several risk levels
>>> as well as other properties. You have to use your original numerical
>>> data, (I assume probabilities) in a clustering algorithm, not the
>>> labels like low, medium, high. How were these labels assigned?
>
>
> RiskLevel1, RiskLevel2 and RiskLevel3 all hold actual lookup values (High, Medium, Low, etc.) in the Solr index (the index is stored flattened).
>
> -Vikas
>
>
> ________________________________
> From: Frank Scholten <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, February 1, 2012 3:28 AM
> Subject: Re: How to present mahout cluster in combination with Solr results
>
> Vikas,
>
> Please send messages to the mailing list so everyone can benefit.
>
>> Frank,
>>
>> To give further details about the use case:
>>
>> 1) The user searches for free text; this search is served from Solr.
>> 2) The user selects a record from the search results. We then need to display all items whose risk levels match those of the selected item (and put them under "Similar items" in the UI).
>>
>> Upon indexing I am copying RiskLevel1, RiskLevel2 and RiskLevel3 into a single field (Solr copyField). A vector is created from that field for Mahout to cluster on. The issue is (understandably) that when clusters are created, the distance is computed between words, and it's very much possible that the following three records end up in a single cluster:
>> RiskLevel1  RiskLevel2  RiskLevel3
>> High        High        Low
>> High        High        High
>> High        High        Medium
>
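
The loss of ordering described above can be illustrated with a small sketch in plain Python. It assumes the copied field is vectorized as simple term counts (a bag of words), which may differ from the actual vectorizer configuration. The item ids and the "High Low High" record are hypothetical. The second half shows that exact matching on the ordered tuple of risk levels needs only a lookup, with no clustering.

```python
from collections import Counter, defaultdict

# Hypothetical items; each value is (RiskLevel1, RiskLevel2, RiskLevel3).
records = {
    "item1": ("High", "High", "Low"),
    "item2": ("High", "Low", "High"),
    "item3": ("High", "High", "Low"),
    "item4": ("High", "High", "Medium"),
}


def bag_of_words(levels):
    """Term counts for the merged field; positions are discarded."""
    return Counter(levels)


# item1 and item2 differ field by field, yet their term counts are identical.
same_bag = bag_of_words(records["item1"]) == bag_of_words(records["item2"])
print(same_bag)

# Grouping on the ordered tuple keeps the sequence of values significant.
groups = defaultdict(list)
for item_id, levels in records.items():
    groups[levels].append(item_id)


def similar_items(item_id):
    """Other items whose three risk levels match exactly, in order."""
    return [other for other in groups[records[item_id]] if other != item_id]


print(similar_items("item1"))
```

If exact, position-sensitive matches are all that is required, a lookup like this (or an equivalent filter on the three separate fields) may be enough on its own.
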
> Just to make sure, in my presentation I talk about using text
> clustering for document tagging. The documents are vectorized and
> weighted with TF/IDF and are fed into a Mahout clustering algorithm.
>
> In your case you want to cluster items that have several risk levels
> as well as other properties. You have to use your original numerical
> data, (I assume probabilities) in a clustering algorithm, not the
> labels like low, medium, high. How were these labels assigned?
>
>>
>> But when clustering on these metadata columns, the requirement is to cluster as below (the sequence of the values DOES matter):
>>
>> Cluster1:
>> RiskLevel1  RiskLevel2  RiskLevel3
>> High        High        Low