-
How to present mahout cluster in combination with Solr results
Vikas Pandya 2012-01-19, 03:18
Hello,
I have successfully created vectors by reading my existing Solr index, and then created a sequence file and Mahout clusters from them. As I understand it, Solr and Mahout clustering are not currently integrated. Mine is a search application that renders results by querying a Solr index, and I now need to incorporate the Mahout-created clusters into those results. Until Solr-Mahout integration exists, what is the best alternative way to present the clusters to the user?
Thanks,
-
Re: How to present mahout cluster in combination with Solr results
Ioan Eugen Stan 2012-01-19, 08:45
The only thing I can think of is to render the results yourself by reading the clusters. It depends very much on what information you are trying to extract and present to the user.

I can think of two things that you can find out:
- documents similar to the ones returned by a Solr search (by looking up the cluster each hit belongs to and fetching that cluster's documents);
- documents whose top terms match the search query.

You can find good examples of how to do this by looking at the ClusterDumper utility, which reads and dumps clusters.

Hope this helps,
--
Ioan Eugen Stan
http://ieugen.blogspot.com
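The first idea (similar documents via cluster membership) could look roughly like the sketch below. It is plain Python, not Mahout or Solr code, and it assumes the cluster assignments have already been exported into a simple document-id to cluster-id mapping. The ids and the names doc_to_cluster, build_cluster_index and similar_documents are made up for illustration.

```python
from collections import defaultdict

# Hypothetical cluster assignments: document id -> cluster id.
# In practice these would be parsed from the clustering output.
doc_to_cluster = {"doc1": 0, "doc2": 0, "doc3": 1, "doc4": 1, "doc5": 0}


def build_cluster_index(assignments):
    """Invert doc -> cluster into cluster -> sorted list of docs."""
    index = defaultdict(list)
    for doc_id, cluster_id in assignments.items():
        index[cluster_id].append(doc_id)
    return {cid: sorted(docs) for cid, docs in index.items()}


def similar_documents(hit_ids, assignments, cluster_index):
    """For each search hit, list the other documents in the same cluster."""
    similar = {}
    for doc_id in hit_ids:
        cluster_id = assignments.get(doc_id)
        if cluster_id is None:
            # The hit was not part of the clustering run.
            similar[doc_id] = []
            continue
        similar[doc_id] = [d for d in cluster_index[cluster_id] if d != doc_id]
    return similar


cluster_index = build_cluster_index(doc_to_cluster)
# Pretend "doc1" and "doc3" came back from a Solr search.
result = similar_documents(["doc1", "doc3"], doc_to_cluster, cluster_index)
print(result)
```

The search application would call similar_documents with the ids returned by Solr and render the extra list next to each hit.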
-
Re: How to present mahout cluster in combination with Solr results
Frank Scholten 2012-01-19, 09:24
Hi Vikas,

I suggest indexing the cluster label, cluster size and cluster-document mappings so you can use that information to build a tag cloud of your data. Check out this presentation: http://java.dzone.com/videos/configuring-mahout-clustering

Cheers,
Frank
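One possible reading of this suggestion, sketched in plain Python: store the cluster id, label and size as extra fields on each document before it is sent to Solr, and derive tag-cloud font sizes from the cluster sizes. The field names (cluster_id, cluster_label, cluster_size), the sample labels and the scaling rule are assumptions for illustration; the presentation may structure the index differently.

```python
import json
from collections import Counter

# Hypothetical inputs: cluster assignments and one label per cluster.
doc_to_cluster = {"doc1": "c1", "doc2": "c1", "doc3": "c2", "doc4": "c1"}
cluster_labels = {"c1": "high risk", "c2": "low risk"}

# Number of documents in each cluster.
cluster_sizes = Counter(doc_to_cluster.values())


def cluster_fields(doc_id):
    """Extra fields to store on the Solr document (field names are made up)."""
    cid = doc_to_cluster[doc_id]
    return {
        "id": doc_id,
        "cluster_id": cid,
        "cluster_label": cluster_labels[cid],
        "cluster_size": cluster_sizes[cid],
    }


def tag_cloud(min_pt=10, max_pt=30):
    """Scale each cluster label to a font size proportional to its size."""
    largest = max(cluster_sizes.values())
    return {
        cluster_labels[cid]: min_pt + (max_pt - min_pt) * size // largest
        for cid, size in cluster_sizes.items()
    }


docs = [cluster_fields(d) for d in sorted(doc_to_cluster)]
print(json.dumps(docs, indent=2))  # payload that could be posted to Solr
cloud = tag_cloud()
print(cloud)
```

With the fields stored on every document, a facet on the label field would give the same counts at query time, so the tag cloud can follow the current search results.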
-
Re: How to present mahout cluster in combination with Solr results
Ioan Eugen Stan 2012-01-19, 09:50
On 19.01.2012 11:24, Frank Scholten wrote:
> I suggest indexing the cluster label, cluster size and cluster-document mappings so you can use that information to build a tag cloud of your data.

Wow Frank, this is good. I hadn't thought of it this way.

--
Ioan Eugen Stan
http://ieugen.blogspot.com
-
Re: How to present mahout cluster in combination with Solr results
Vikas Pandya 2012-01-19, 16:05
Hi Frank,

Thanks for the link. That was useful. It is still a bit unclear how he built his index. Are we saying that we index clusterId, clusterSize and clusterLabel in the same index where the other data is indexed? So one index would hold two sets of Solr documents, one of them containing the cluster info?

My requirement again: I have a bunch of DB columns which are being indexed, e.g.

Title   RiskLevel1  RiskLevel2  RiskLevel3  etc.
Title1  High        Medium      Low

The current requirement is to cluster documents based on their risk levels and NOT the title.

Thanks,
-
Re: How to present mahout cluster in combination with Solr results
Vikas Pandya 2012-01-20, 15:01
Following the example in my previous message, the Solr search results should be clustered along the following lines, listing together the items which have matching RiskLevels, e.g.

Cluster 1:
Title  RiskLevel1  RiskLevel2  RiskLevel3
abc    High        Medium      Low
xyz    High        Medium      High
def    Low         Medium      High

Cluster 2:
Title  RiskLevel1  RiskLevel2  RiskLevel3
omn    Low         Medium      Low
yui    Low         Medium      High
bnm    Medium      Medium      High

Though I have a feeling I don't need Mahout clustering for this, I am still trying to hook Mahout in, since we have more clustering requirements in the pipeline to cluster on other features (attributes of objects).

Any thoughts?
-
Re: How to present mahout cluster in combination with Solr results
Frank Scholten 2012-01-20, 17:48
On Fri, Jan 20, 2012 at 4:01 PM, Vikas Pandya <[EMAIL PROTECTED]> wrote:
> Though I have a feeling I don't need to use Mahout clustering for this, I am
> still trying to hook in mahout for this since we have more clustering
> requirements in the pipeline to cluster based on other features (attributes
> of objects).

You only have 27 unique risk-level combinations (3 values for each of the 3 risk levels). You could just sort by one or more risk levels to get a sense of the data.

If you have more attributes then you could indeed look into clustering.

Cheers,
Frank
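The count of 27 and the "just sort" suggestion can be checked with a few lines of plain Python. This is only a sketch: it assumes exactly three risk-level fields with the three values High, Medium and Low, and reuses the six sample rows from the tables earlier in the thread.

```python
from itertools import groupby, product

LEVELS = ("High", "Medium", "Low")

# 3 values for each of 3 fields gives 3 ** 3 = 27 combinations.
all_combinations = list(product(LEVELS, repeat=3))
print(len(all_combinations))

# Sample rows from the thread: (title, RiskLevel1, RiskLevel2, RiskLevel3).
rows = [
    ("abc", "High", "Medium", "Low"),
    ("xyz", "High", "Medium", "High"),
    ("def", "Low", "Medium", "High"),
    ("omn", "Low", "Medium", "Low"),
    ("yui", "Low", "Medium", "High"),
    ("bnm", "Medium", "Medium", "High"),
]


def risk_key(row):
    """The three risk levels of a row, used as sort and group key."""
    return row[1:]


# Sorting puts identical combinations next to each other; groupby then
# collects the titles that share a combination.
groups = {
    key: [row[0] for row in members]
    for key, members in groupby(sorted(rows, key=risk_key), key=risk_key)
}
for key, titles in groups.items():
    print(key, titles)
```

In this sample only "def" and "yui" share all three risk levels, so they are the only two rows that end up in the same group.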
-
Re: How to present mahout cluster in combination with Solr results
Frank Scholten 2012-02-01, 08:28
Vikas,

Please send messages to the mailing list so everyone can benefit.

> Frank,
>
> To give further details about the use case:
>
> 1) The user searches for free text; this search is served from Solr.
> 2) The user selects a record from the search results. We then need to display all the items whose risk levels match the risk levels of the selected item (and put them under "Similar items" in the UI).
>
> Upon indexing I am copying RiskLevel1, RiskLevel2 and RiskLevel3 into a single field (Solr copyField). A vector is created from that field for Mahout to create clusters on. Now the issue is (understandably) that when clusters are created, distances are computed between words, and it is very possible that the following three records get clustered into a single cluster:
>
> RiskLevel1  RiskLevel2  RiskLevel3
> High        High        Low
> High        High        High
> High        High        Medium

Just to make sure: in my presentation I talk about using text clustering for document tagging. The documents are vectorized, weighted with TF/IDF and fed into a Mahout clustering algorithm.

In your case you want to cluster items that have several risk levels as well as other properties. You have to use your original numerical data (I assume probabilities) in a clustering algorithm, not labels like low, medium, high. How were these labels assigned?

> But when clustering on these metadata columns, the requirement is to cluster as below (the sequence of the values DOES matter):
>
> Cluster1:
> RiskLevel1  RiskLevel2  RiskLevel3
> High        High        Low
> High        High        Low
>
> Cluster2:
> RiskLevel1  RiskLevel2  RiskLevel3
> High        High        High
> High        High        High
>
> Cluster3:
> RiskLevel1  RiskLevel2  RiskLevel3
> High        High        Medium
> High        High        Medium
>
> I started thinking about using classification instead of clustering. But while playing with Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ), a Swing-based GUI tool where one can easily try different algorithms directly from the UI, I found that DBScan clustering did cluster the results correctly per my requirements; to be precise, it created three different clusters (for the example above).
>
> Can clustering be done the way I need it to work in Mahout? Or are there any other ideas that can be explored further?
>
> Thanks,
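A small plain-Python illustration of why the merged copyField loses the "sequence matters" requirement. It assumes the merged field is vectorized as a bag of words (simple term counts; TF/IDF weighting would not restore the order either). The positional_tokens helper, which prefixes each value with its field name, is a hypothetical workaround and not something proposed in the thread.

```python
from collections import Counter


def bag_of_words(record):
    """Term counts for the merged field; position information is lost."""
    return Counter(record)


def positional_tokens(record):
    """Prefix each value with its field name so the position is preserved."""
    return Counter(
        f"RiskLevel{i}={value}" for i, value in enumerate(record, start=1)
    )


# Two rows from the thread with the same values in a different order.
a = ("High", "Medium", "Low")
b = ("Low", "Medium", "High")

same_bag = bag_of_words(a) == bag_of_words(b)
same_positional = positional_tokens(a) == positional_tokens(b)
print(same_bag)         # True: the two records are indistinguishable
print(same_positional)  # False: field-qualified tokens keep them apart
```

Records that differ only in the order of their values produce identical term vectors, so no distance measure on those vectors can separate them.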
-
Re: How to present mahout cluster in combination with Solr results
Vikas Pandya 2012-02-02, 11:38
Frank, thanks.

>> In your case you want to cluster items that have several risk levels
>> as well as other properties. You have to use your original numerical
>> data, (I assume probabilities) in a clustering algorithm, not the
>> labels like low, medium, high. How were these labels assigned?

RiskLevel1, RiskLevel2 and RiskLevel3 all hold actual lookup values (High, Medium, Low, etc.) in the Solr index (the index is stored flattened).

-Vikas
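Since the risk levels are stored as plain lookup values in the Solr index, the "Similar items" requirement (all items whose three risk levels equal those of the selected item) might also be served directly by Solr filter queries, without clustering. The sketch below only builds the request parameters. It assumes the three fields are indexed as untokenized string fields and that the unique key field is called id; neither is stated in the thread.

```python
from urllib.parse import urlencode

RISK_FIELDS = ("RiskLevel1", "RiskLevel2", "RiskLevel3")


def similar_items_params(selected, exclude_id=None):
    """Build Solr request parameters matching the selected item's risk levels."""
    params = [("q", "*:*")]
    for field in RISK_FIELDS:
        # One filter query per field; all of them must match.
        params.append(("fq", f'{field}:"{selected[field]}"'))
    if exclude_id is not None:
        # Leave the selected item itself out of the list (assumes an "id" key).
        params.append(("fq", f'-id:"{exclude_id}"'))
    return params


selected = {
    "id": "abc",
    "RiskLevel1": "High",
    "RiskLevel2": "High",
    "RiskLevel3": "Low",
}
params = similar_items_params(selected, exclude_id=selected["id"])
query_string = urlencode(params)
print(query_string)
```

The resulting query string would be appended to the select handler URL of the Solr core in question.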
-
Re: How to present mahout cluster in combination with Solr results
Frank Scholten 2012-02-02, 13:19
Check out the recent mailing list post 'Clustering user profiles'.
Jeff (Eastman) sums it up clearly:
> Mahout clustering (unsupervised classification) can only deal with continuous, homogeneous vector representations of the input data, where each vector element is weighted the same as the other elements. Mahout (supervised) classification can deal with continuous, categorical, word-like and text-like features such as in your problem space.
> To address your problem with Mahout clustering, you would need to develop a mapping for each of your features to continuous vector elements and use a WeightedDistanceMeasure to account for the different element types and their relative impacts on the overall distance computation. This would be an iterative process which might or might not produce useful results.
> An alternative approach would be to train a Mahout classifier with the various features using marked training data which classifies similar users into a finite number of "clusters" that seem natural to you. With such a model, you could then classify new users into those "clusters". This approach would not be very useful for discovering new "clusters" in your data, but it would leverage the classifier training mechanisms to develop the models as more of a black box than above.
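The mapping idea from the quoted post can be sketched in plain Python (this is not Mahout's WeightedDistanceMeasure API). It assumes the labels are ordered, so Low/Medium/High can be encoded as 0/1/2; the ORDINAL table, the equal weights and the function names are illustrative choices only.

```python
import math

# Assumption: the labels are ordered, so an ordinal encoding is meaningful.
ORDINAL = {"Low": 0.0, "Medium": 1.0, "High": 2.0}


def encode(record):
    """Map a tuple of risk-level labels to a continuous vector."""
    return [ORDINAL[value] for value in record]


def weighted_euclidean(x, y, weights):
    """Euclidean distance with a per-element weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


weights = [1.0, 1.0, 1.0]  # raise a weight to make that risk level count more
p = encode(("High", "High", "Low"))
q = encode(("High", "High", "High"))
r = encode(("High", "High", "Medium"))

d_pq = weighted_euclidean(p, q, weights)
d_pr = weighted_euclidean(p, r, weights)
d_pp = weighted_euclidean(p, p, weights)
print(d_pq, d_pr, d_pp)
```

The distance is zero only for identical records, so exact matches can be separated, but a centroid-based algorithm may still merge neighbouring combinations depending on the number of clusters chosen.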
Question also to other people reading this: I looked into this and saw that there are clustering algorithms for categorical data, such as K-modes. Are these effective for solving these kinds of problems? If so, would they be interesting to add to Mahout?
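For readers unfamiliar with K-modes, here is a minimal plain-Python sketch of the idea: it is k-means with the Hamming (simple matching) distance in place of Euclidean distance and the per-column mode in place of the mean. It is not Mahout code, the initial modes are picked by hand, and the result of a real run would depend on initialization.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(records):
    """Most frequent value in each column (ties go to the value seen first)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, max_iter=10):
    """Assign records to the nearest mode, then recompute modes, until stable."""
    modes = list(initial_modes)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(modes)), key=lambda k: hamming(rec, modes[k]))
            for rec in records
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(modes)):
            members = [r for r, a in zip(records, assignment) if a == k]
            if members:  # keep the old mode if a cluster ends up empty
                modes[k] = mode_of(members)
    return assignment, modes


# The six records from the requirement described earlier in the thread.
records = [
    ("High", "High", "Low"),
    ("High", "High", "Low"),
    ("High", "High", "High"),
    ("High", "High", "High"),
    ("High", "High", "Medium"),
    ("High", "High", "Medium"),
]
initial_modes = [records[0], records[2], records[4]]
assignment, modes = k_modes(records, initial_modes)
print(assignment)
print(modes)
```

With one initial mode per distinct record, identical records end up together and the three combinations stay in separate clusters, which matches the grouping Vikas asked for.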
Cheers,
Frank