|
Bogdan Vatkov
2010-01-01, 06:41
Ted Dunning
2010-01-01, 10:00
Grant Ingersoll
2010-01-01, 13:24
Bogdan Vatkov
2010-01-01, 15:45
Ted Dunning
2010-01-01, 17:15
Ted Dunning
2010-01-01, 17:17
Ted Dunning
2010-01-01, 17:18
Shashikant Kore
2010-01-02, 07:15
Grant Ingersoll
2010-01-02, 13:45
Grant Ingersoll
2010-01-02, 14:45
Grant Ingersoll
2010-01-02, 16:36
Grant Ingersoll
2010-01-03, 16:10
Ted Dunning
2010-01-03, 18:26
Ian Holsman
2010-01-04, 01:59
Palleti, Pallavi
2010-01-04, 12:03
Ted Dunning
2010-01-04, 19:44
Dawid Weiss
2010-01-04, 19:55
Rao, Vaijanath
2010-01-05, 03:32
Pallavi Palleti
2010-01-05, 04:21
Ted Dunning
2010-01-05, 06:31
Rao, Vaijanath
2010-01-05, 07:12
Bogdan Vatkov
2010-01-06, 02:12
Ted Dunning
2010-01-06, 02:31
Bogdan Vatkov
2010-01-06, 02:36
Drew Farris
2010-01-06, 02:43
Bogdan Vatkov
2010-01-06, 02:45
Drew Farris
2010-01-06, 02:46
Bogdan Vatkov
2010-01-06, 02:51
Pallavi Palleti
2010-01-06, 02:51
Drew Farris
2010-01-06, 02:58
Bogdan Vatkov
2010-01-06, 03:08
Grant Ingersoll
2010-01-06, 12:04
Abbas
2012-03-07, 09:24
|
-
Re: Clustering techniques, tips and tricksBogdan Vatkov 2010-01-01, 06:41
Hi guys,
First of all - Happy New Year - wish you healthy and successful 2010 year! I would like to give some feedback. And ask some questions as well :). I find the classification above very useful (I actually found some details on the behavior of the k-means and Dirichlet algorithms that I did not see before in the docs around Mahout). I played around with Carrot2 for 2 weeks and it really has great level of usability and simplicity but ...I had to give up on it since my very first practical clustering task required to cluster 23K+ documents. And while I was able to run the Lingo algorithm on 5 000 docs with 6 GB of heap memory for approx. 2 min. it "failed" when I tried to cluster 8 000 docs...(it just ran for 30 mins and I had to kill the app since it was definitely not practical - compared to 5 000 docs clustered for 2 minutes). That was the point where I really had to give up on Carrot2 and started learning Mahout/Hadoop. I have managed to do some clustering on my 23 000+ docs with Mahout/k-means for something like 10 min (in standalone mode - no parallel processing at all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) but I am still learning and still trying to analyze if the result clusters are really meaningful for my docs. One thing I can tell already now is that I definitely, desperately, need word-stopping (which worked like a charm in Carrot2) but I am struggling a little bit with Mahout code to come up with the right way to apply stopwords. As far as I know there is no such feature already in the code. I could use any piece of advice from you guys on where and how to apply stopwords in Mahout clustering algorithms. Or maybe I should do it already during the Text-to-Vector phase? But it would be valuable for me to be able to come back later to the complete context of a document (i.e. with the stopwords inside) - maybe it is a question on its own - how can I easily go back from clusters->original docs (an not just vectors), I do not know maybe some kind of mapper which maps vectors to the original documents somehow (e.g. sort of URL for a document based on the vector id/index or something?). So, I definitely need to apply stopwords - without it I can't make any use of my clusters - at least the ones I received after running (one iteration) k-means over my original docs (Lucene derived vector) were not really practical as I ended up with lots of clusters based on words that should have been stopped. Another question I have (maybe I simply overlooked something in the docs) but as far as I got it one should really run the Canopy algorithm first to get some "good" seeds for initial clusters and only then go for k-means clustering but I could not really find the example doing so, as far as I got it the examples are showing how to run the different algorithms (Canopy, k-means) but only separately and not like chained one ofter the other. Did I miss something? Can I get more guidance on how to chain Canopy and k-means? Yet another question - I read in some docs around that k-means actually has to be run multiple times somehow but I could not find it in the very examples. How can I do that? One thing is for sure - I cannot go back to Carrot2 - it is very useful for small volumes of docs but won't work on what I am heading for: My initial attempt is to cluster 23K+ docs but then I would like to move to my complete set of 100K+ docs. I would definitely like to be in control of the number (parametrized) of clusters. I think I will get better results if I can also apply stemming. What would be you recommendation when using mahout? Should I do the stemming again somewhere in the input vector forming? Or do you expect this to be part of the clustering algorithms? It is also really essential for me to have "updateable" algorithms as I am adding new documents on daily basis, and I definitely like to have them clustered immediately (incrementally) - I do not know if this is what is called "classification" in Mahout and I did not reach these examples yet (I wanted to really get acquainted with the clustering first). And that is not all - I do not only want to have new documents clustered against existing clusters but what I want in addition is that clusters could actually change with new docs coming. Of course one could not observe new clusters popping up after a single new doc is added to the analysis but clusters should really be adaptable/updateable with new docs. Would that be possible with k-means for example? I could give more feedback and possibly more questions :) when get a little bit further on this. Best regards, Bogdan On Thu, Dec 31, 2009 at 10:39 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: Bogdan Vatkov email: [EMAIL PROTECTED] phone: +359 889 197 756
-
Re: Clustering techniques, tips and tricksTed Dunning 2010-01-01, 10:00
On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <[EMAIL PROTECTED]>wrote:
> > I would like to give some feedback. And ask some questions as well :). > Thank you! Very helpful feedback. > ... Carrot2 for 2 weeks ... has great level of > usability and simplicity but ...I had to give up on it since my very first > practical clustering task required to cluster 23K+ documents. Not too surprising. > ... > I have managed to do some clustering on my 23 000+ docs with Mahout/k-means > for something like 10 min (in standalone mode - no parallel processing at > all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) but > I > am still learning and still trying to analyze if the result clusters are > really meaningful for my docs. > I have seen this effect before where a map-reduce program run sequentially is much faster than an all-in-memory implementation. > One thing I can tell already now is that I definitely, desperately, need > word-stopping You should be able to do this in the document -> vector conversion. You could also do this at the vector level by multiplying the coordinates of all stop words by zero, but that is not as nice a solution. > ... But it would be valuable for me to be able > to come back later to the complete context of a document (i.e. with the > stopwords inside) - maybe it is a question on its own - how can I easily go > back from clusters->original docs (an not just vectors), I do not know > maybe > some kind of mapper which maps vectors to the original documents somehow > (e.g. sort of URL for a document based on the vector id/index or > something?). > To do this, you should use the document ID and just return the original content from some other content store. Lucene or especially SOLR can help with this. > ... > I think I will get better results if I can also apply stemming. What would > be you recommendation when using mahout? Should I do the stemming again > somewhere in the input vector forming? Yes. That is exactly correct. It is also really essential for me to have "updateable" algorithms as I am > adding new documents on daily basis, and I definitely like to have them > clustered immediately (incrementally) - I do not know if this is what is > called "classification" in Mahout and I did not reach these examples yet (I > wanted to really get acquainted with the clustering first). > I can't comment on exactly how this should be done, but we definitely need to support this use case. > And that is not all - I do not only want to have new documents clustered > against existing clusters but what I want in addition is that clusters > could > actually change with new docs coming. > Exactly. This is easy algorithmically with k-means. It just needs to be supported by the software. > Of course one could not observe new clusters popping up after a single new > doc is added to the analysis but clusters should really be > adaptable/updateable with new docs. > Yes. It is eminently doable. Occasionally you should run back through all of the document vectors so you can look at old documents in light of new data but that should be very, very fast in your case.
-
Re: Clustering techniques, tips and tricksGrant Ingersoll 2010-01-01, 13:24
On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote: > On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <[EMAIL PROTECTED]>wrote: > >> >> I would like to give some feedback. And ask some questions as well :). >> > > Thank you! > > Very helpful feedback. > > >> ... Carrot2 for 2 weeks ... has great level of >> usability and simplicity but ...I had to give up on it since my very first >> practical clustering task required to cluster 23K+ documents. > > > Not too surprising. Right, Carrot2 is designed for clustering search results, and of that mainly the title and snippet. While it can do larger docs, they are specifically not the target. Plus, C2 is an in-memory tool designed to be very fast for search results. > > >> ... >> I have managed to do some clustering on my 23 000+ docs with Mahout/k-means >> for something like 10 min (in standalone mode - no parallel processing at >> all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) but >> I >> am still learning and still trying to analyze if the result clusters are >> really meaningful for my docs. >> > > I have seen this effect before where a map-reduce program run sequentially > is much faster than an all-in-memory implementation. > > >> One thing I can tell already now is that I definitely, desperately, need >> word-stopping > > > You should be able to do this in the document -> vector conversion. You > could also do this at the vector level by multiplying the coordinates of all > stop words by zero, but that is not as nice a solution. Right, or if you are using the Lucene extraction method, at Lucene indexing time. > > >> ... But it would be valuable for me to be able >> to come back later to the complete context of a document (i.e. with the >> stopwords inside) - maybe it is a question on its own - how can I easily go >> back from clusters->original docs (an not just vectors), I do not know >> maybe >> some kind of mapper which maps vectors to the original documents somehow >> (e.g. sort of URL for a document based on the vector id/index or >> something?). >> > > To do this, you should use the document ID and just return the original > content from some other content store. Lucene or especially SOLR can help > with this. Right, Mahout's vector can take labels. > > >> ... >> I think I will get better results if I can also apply stemming. What would >> be you recommendation when using mahout? Should I do the stemming again >> somewhere in the input vector forming? > > > Yes. That is exactly correct. Again, really easy to do if you use the Lucene method for creating vectors. > > It is also really essential for me to have "updateable" algorithms as I am >> adding new documents on daily basis, and I definitely like to have them >> clustered immediately (incrementally) - I do not know if this is what is >> called "classification" in Mahout and I did not reach these examples yet (I >> wanted to really get acquainted with the clustering first). >> > > I can't comment on exactly how this should be done, but we definitely need > to support this use case. Don't people usually see if the new docs fit into an existing cluster and if they are a good fit, add them there, otherwise, maybe put them in the best match and kick off a new job. > > >> And that is not all - I do not only want to have new documents clustered >> against existing clusters but what I want in addition is that clusters >> could >> actually change with new docs coming. >> > > Exactly. This is easy algorithmically with k-means. It just needs to be > supported by the software. Makes sense and shouldn't be that hard to do. I'd imagine we just need to be able to use the centroids from the previous run as the seeds for the new run. > > >> Of course one could not observe new clusters popping up after a single new >> doc is added to the analysis but clusters should really be >> adaptable/updateable with new docs. >> > > Yes. It is eminently doable. Occasionally you should run back through all
-
Re: Clustering techniques, tips and tricksBogdan Vatkov 2010-01-01, 15:45
On Fri, Jan 1, 2010 at 3:24 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote: > > > On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <[EMAIL PROTECTED] > >wrote: > > > >> > >> I would like to give some feedback. And ask some questions as well :). > >> > > > > Thank you! > > > > Very helpful feedback. > > > > > >> ... Carrot2 for 2 weeks ... has great level of > >> usability and simplicity but ...I had to give up on it since my very > first > >> practical clustering task required to cluster 23K+ documents. > > > > > > Not too surprising. > > Right, Carrot2 is designed for clustering search results, and of that > mainly the title and snippet. While it can do larger docs, they are > specifically not the target. Plus, C2 is an in-memory tool designed to be > very fast for search results. > > > > > > > >> ... > >> I have managed to do some clustering on my 23 000+ docs with > Mahout/k-means > >> for something like 10 min (in standalone mode - no parallel processing > at > >> all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) > but > >> I > >> am still learning and still trying to analyze if the result clusters are > >> really meaningful for my docs. > >> > > > > I have seen this effect before where a map-reduce program run > sequentially > > is much faster than an all-in-memory implementation. > > > > > >> One thing I can tell already now is that I definitely, desperately, need > >> word-stopping > > > > > > You should be able to do this in the document -> vector conversion. You > > could also do this at the vector level by multiplying the coordinates of > all > > stop words by zero, but that is not as nice a solution. > > Right, or if you are using the Lucene extraction method, at Lucene indexing > time. > > Ok, so it seems I have to use the stop wording feature of Lucene itself, right? I just saw there is something about stop words in Lucene but I am yet to find out how to use that capability. > > > > > >> ... But it would be valuable for me to be able > >> to come back later to the complete context of a document (i.e. with the > >> stopwords inside) - maybe it is a question on its own - how can I easily > go > >> back from clusters->original docs (an not just vectors), I do not know > >> maybe > >> some kind of mapper which maps vectors to the original documents somehow > >> (e.g. sort of URL for a document based on the vector id/index or > >> something?). > >> > > > > To do this, you should use the document ID and just return the original > > content from some other content store. Lucene or especially SOLR can > help > > with this. > > Right, Mahout's vector can take labels. > > > > > > >> ... > >> I think I will get better results if I can also apply stemming. What > would > >> be you recommendation when using mahout? Should I do the stemming again > >> somewhere in the input vector forming? > > > > > > Yes. That is exactly correct. > > Again, really easy to do if you use the Lucene method for creating vectors. > > Do you mean I have to apply stemming during the vector creation or already in Lucene indexing? Maybe from clustering POV it is the same but what would you recommend? > > > > It is also really essential for me to have "updateable" algorithms as I > am > >> adding new documents on daily basis, and I definitely like to have them > >> clustered immediately (incrementally) - I do not know if this is what is > >> called "classification" in Mahout and I did not reach these examples yet > (I > >> wanted to really get acquainted with the clustering first). > >> > > > > I can't comment on exactly how this should be done, but we definitely > need > > to support this use case. > > Don't people usually see if the new docs fit into an existing cluster and > if they are a good fit, add them there, otherwise, maybe put them in the > best match and kick off a new job. > > Actually this question goes back to the original attempt - to analyze documents automatically by the machine, and not by people :). One of my goals is to not read the new document but rather the system to tell me if I should read it ;) - e.g. if it gets clustered/classified against given cluster/topic which I am interested (not interested) in I could then take more informed decision whether to read it (not to read it). I do not know how this updatable clustering works (using previous results as centroids for new clusterings), is there an example I could see in action? Additionally I would like to see an example of how could one combine Canopy and k-means, I just saw this described in theory somehow but could not find an example of it. Best regards, Bogdan
-
Re: Clustering techniques, tips and tricksTed Dunning 2010-01-01, 17:15
Either would work. Spilling vectors from a Lucene index is more developed
at this point in time and thus would probably be easier. On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <[EMAIL PROTECTED]>wrote: > Do you mean I have to apply stemming during the vector creation or already > in Lucene indexing? Maybe from clustering POV it is the same but what would > you recommend? > -- Ted Dunning, CTO DeepDyve
-
Re: Clustering techniques, tips and tricksTed Dunning 2010-01-01, 17:17
If you start gathering data about whether a particular user would like a
document then you might also be interested in the on-line learning algorithms that are currently being developed. These should be able to let the user say whether they wanted to read a document or not. The clustering could be used initially, but as interest information is collected, you can build a per user profile of interest. On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <[EMAIL PROTECTED]>wrote: > > Don't people usually see if the new docs fit into an existing cluster and > > if they are a good fit, add them there, otherwise, maybe put them in the > > best match and kick off a new job. > > > > Actually this question goes back to the original attempt - to analyze > documents automatically by the machine, and not by people :). One of my > goals is to not read the new document but rather the system to tell me if I > should read it ;) - e.g. if it gets clustered/classified against given > cluster/topic which I am interested (not interested) in I could then take > more informed decision whether to read it (not to read it). -- Ted Dunning, CTO DeepDyve
-
Re: Clustering techniques, tips and tricksTed Dunning 2010-01-01, 17:18
I don't think that this code exists yet. It is possible in theory, but not
yet reduced to practice. On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <[EMAIL PROTECTED]>wrote: > I do not know how this updatable clustering works (using previous results > as > centroids for new clusterings), is there an example I could see in action? > -- Ted Dunning, CTO DeepDyve
-
Re: Clustering techniques, tips and tricksShashikant Kore 2010-01-02, 07:15
On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems. > For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)? What algorithms worked for you? > What didn't work? Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of > the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable) > Let me start by saying Mahout works great for us. We can run k-means on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a single host. Using vector normalization like L2 norm helped quite a bit. Thanks to Ted for this suggestion. In text clustering, you have lots of small documents. This results into very sparse vectors (total of 100K features with each vector having 200 features.) Using vanilla TFIDF weights doesn't work as nicely. Even if we don't do explicit stop word removal, the threshold values for document count does that in a better fashion. If you exclude the features which are extremely common (say more than 40% documents) or extremely rare (say in less than 50 documents in a corpus of 100K docs), you have a meaningful set of features. The current K-Means already accepts these parameters. Stemming can be used for feature reduction, but it has a minor issue. When you want to find out prominent features of the resulting cluster centroid, the feature may not be meaningful. For example, if a prominent feature is "beautiful", when you get it back, you will get "beauti." Ouch. I tried fuzzy K-Means for soft clustering, but I didn't get good results. May be the corpus had the issue. One observation about the clustering process is that it is geared, by accident or by design, towards batch processing. There is no support for real-time clustering. There needs to be glue which ties all the components together to make the process seamless. I suppose, someone in need of this feature will contribute it to Mahout. Grant, If I recall more, I will mail it to you. --shashi
-
Re: Clustering techniques, tips and tricksGrant Ingersoll 2010-01-02, 13:45
On Jan 1, 2010, at 10:45 AM, Bogdan Vatkov wrote: > I do not know how this updatable clustering works (using previous results as > centroids for new clusterings), is there an example I could see in action? > Additionally I would like to see an example of how could one combine Canopy > and k-means, I just saw this described in theory somehow but could not find > an example of it. You should just be able to run Canopy and then use the output of that as the value for --clusters on the KMeansDriver.
-
Re: Clustering techniques, tips and tricksGrant Ingersoll 2010-01-02, 14:45
On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote: > On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: >> >> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems. >> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)? What algorithms worked for you? >> What didn't work? Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of >> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable) >> > > Let me start by saying Mahout works great for us. We can run k-means > on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a > single host. > > Using vector normalization like L2 norm helped quite a bit. Thanks to > Ted for this suggestion. In text clustering, you have lots of small > documents. This results into very sparse vectors (total of 100K > features with each vector having 200 features.) Using vanilla TFIDF > weights doesn't work as nicely. > > Even if we don't do explicit stop word removal, the threshold values > for document count does that in a better fashion. If you exclude the > features which are extremely common (say more than 40% documents) or > extremely rare (say in less than 50 documents in a corpus of 100K > docs), you have a meaningful set of features. The current K-Means > already accepts these parameters. You mean the Lucene Driver that creates the vectors, right? > > Stemming can be used for feature reduction, but it has a minor issue. > When you want to find out prominent features of the resulting cluster > centroid, the feature may not be meaningful. For example, if a > prominent feature is "beautiful", when you get it back, you will get > "beauti." Ouch. Right, but this is easily handled via something like Lucene's highlighter functionality. I bet it could be made to work on Mahout's vectors (+ a dictionary) fairly easily. > > I tried fuzzy K-Means for soft clustering, but I didn't get good > results. May be the corpus had the issue. > > One observation about the clustering process is that it is geared, by > accident or by design, towards batch processing. There is no > support for real-time clustering. There needs to be glue which ties > all the components together to make the process seamless. I suppose, > someone in need of this feature will contribute it to Mahout. Right. This should be pretty easy to remedy, though. One could simply use the previous results as the --clusters option, right? > > Grant, If I recall more, I will mail it to you. Great! Thank you. > > --shashi -------------------------- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
-
Re: Clustering techniques, tips and tricksGrant Ingersoll 2010-01-02, 16:36
On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote: > On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: >> >> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems. >> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)? What algorithms worked for you? >> What didn't work? Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of >> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable) >> > > > Using vector normalization like L2 norm helped quite a bit. As I recall, it is important that the choice of norms aligns with the choice of distance measures, as well as data source (http://www.lucidimagination.com/search/document/34ffc2a83a71a055/centroid_calculations_with_sparse_vectors and http://www.lucidimagination.com/search/document/34ffc2a83a71a055/centroid_calculations_with_sparse_vectors#3d8310376b6cdf6b)
-
Re: Clustering techniques, tips and tricksGrant Ingersoll 2010-01-03, 16:10
On Dec 31, 2009, at 3:39 PM, Ted Dunning wrote: > - can the clustering algorithm be viewed in a probabilistic framework > (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest > neighbors = not so much) > > - is the definition of a cluster abstract enough to be flexible with regard > to whether a cluster is a model or does it require stronger limits. > (k-means = symmetric Gaussian with equal variance, Dirichlet = almost any > probabilistic model) Can you elaborate a bit more on these two? I can see a bit on the probability side, as those approaches play a factor in how similarity is determined, but I don't get the significance of "cluster as a model". Is it just a simplification that then makes it easier to ask: does this document fit into the model? -Grant
-
Re: Clustering techniques, tips and tricksTed Dunning 2010-01-03, 18:26
Clustering in the k-means style can be viewed as fitting a mixture model to
the data. A mixture model is a model of the generation of random data where you first pick one of n sub-models and then generate a point from the sub-model. There are parameters that express the probability of each sub-model and then the parameters for each sub-model. With k-means, you don't explicitly compute the mixing parameters and you assume that the sub-models are symmetric Gaussians with fixed variance. One of the simplest algorithms for this is the so-called E-M algorithm. For mixture distributions, this algorithm breaks into two passes. The first assigns data points to models, the second estimates each sub-model from the data assigned to that model. This algorithm is easy to implement and works well in some cases, particularly for k-means. Unfortunately, the E-M algorithm as it stands is doing what is called a maximum likelihood fit. If you take k-means and try to use a more elaborate model than identical Gaussians, the use of maximum likelihood instantly causes horrible problems because if you allow the variance to vary, the algorithm will position some clusters on a single point and drive the variance to zero. The result is some point-like clusters that each describe a single observation, and a few diffuse clusters to handle the rest of the data. Generalization goes out the window because the point-like clusters won't describe new data at all. So... back to your question. k-means, LDA and the Dirichlet clustering are all implementations of E-M algorithms for fitting mixture distributions (or first cousins to avoid the max-likelihood traps). This allows very nice mathematical frameworks to be brought to bear on the problem of how to make these algorithms work well. For the second question, the distinction is based on the fact that k-means is pretty inflexible with regard to the sub-models because of the maximum likelihood problems. The Dirichlet clustering tackles this problem head-on and thus allows much more general models. This transition from purely distance based metaphors to probabilistic mixture models as a fundamental metaphor for clustering is a difficult one to make. You can work to interpret the mixture models in terms of the distance based metaphor, but I find that counter-productive because the analogies used lead to some very dangerous algorithms without obvious repair. Viewing the problem as a statistical learning problems not only makes the extension of the algorithms to new domains easier, but it provides nice fixes to the problems encountered. If you want to stick with the distance (similarity) based metaphor, you can view the probability p(X | m_i) of an observed data point X given the i-th model m_i as the similarity of the data X to model m_i. Then the estimation of the new version of the model from a number of data points can be considered to be the analog of the centroid computation in k-means. On Sun, Jan 3, 2010 at 8:10 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Dec 31, 2009, at 3:39 PM, Ted Dunning wrote: > > > > - can the clustering algorithm be viewed in a probabilistic framework > > (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest > > neighbors = not so much) > > > > - is the definition of a cluster abstract enough to be flexible with > regard > > to whether a cluster is a model or does it require stronger limits. > > (k-means = symmetric Gaussian with equal variance, Dirichlet = almost any > > probabilistic model) > > Can you elaborate a bit more on these two? I can see a bit on the > probability side, as those approaches play a factor in how similarity is > determined, but I don't get the significance of "cluster as a model". Is it > just a simplification that then makes it easier to ask: does this document > fit into the model? > > -Grant -- Ted Dunning, CTO DeepDyve
-
Re: Clustering techniques, tips and tricksIan Holsman 2010-01-04, 01:59
On 1/2/10 6:15 PM, Shashikant Kore wrote:
> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll<[EMAIL PROTECTED]> wrote: > >> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems. >> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)? What algorithms worked for you? >> What didn't work? Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of >> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable) >> >> > Let me start by saying Mahout works great for us. We can run k-means > on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a > single host. > > Using vector normalization like L2 norm helped quite a bit. Thanks to > Ted for this suggestion. In text clustering, you have lots of small > documents. This results into very sparse vectors (total of 100K > features with each vector having 200 features.) Using vanilla TFIDF > weights doesn't work as nicely. > I'm not sure what L2 norm is, but wouldn't the frequent pattern mining feature help here? (from mahout-157) I was hoping to use it for feature reduction. > Even if we don't do explicit stop word removal, the threshold values > for document count does that in a better fashion. If you exclude the > features which are extremely common (say more than 40% documents) or > extremely rare (say in less than 50 documents in a corpus of 100K > docs), you have a meaningful set of features. The current K-Means > already accepts these parameters. > > Stemming can be used for feature reduction, but it has a minor issue. > When you want to find out prominent features of the resulting cluster > centroid, the feature may not be meaningful. For example, if a > prominent feature is "beautiful", when you get it back, you will get > "beauti." Ouch. > > I tried fuzzy K-Means for soft clustering, but I didn't get good > results. May be the corpus had the issue. > > One observation about the clustering process is that it is geared, by > accident or by design, towards batch processing. There is no > support for real-time clustering. There needs to be glue which ties > all the components together to make the process seamless. I suppose, > someone in need of this feature will contribute it to Mahout. > > Grant, If I recall more, I will mail it to you. > > --shashi > >
-
RE: Clustering techniques, tips and tricksPalleti, Pallavi 2010-01-04, 12:03
Here are my two cents.
I experimented with both kmeans and fuzzy kmeans on a data similar to movielens. And, what I observed is that the initial cluster seeds are more important. Initially, I used canopy clustering seeds as initial seeds but the results weren't good and the number of clusters depends on the distance thresholds we give as input. Later, I have considered randomly selecting some points from the input dataset and consider them as initial seeds. Again, the results were not good. Now, I have chosen initial seeds from input set in such a way that the points are far from each other and I have observed better clustering using Fuzzy Kmeans. I have not implemented a map-reducable version for this seed selection. I will soon implement a map-reducable version and submit a patch. Also, regarding canopy, I found that what actually canopy is defined in theory is different from what we have in mahout. However, some one can clarify on that. Thanks Pallavi -----Original Message----- From: Shashikant Kore [mailto:[EMAIL PROTECTED]] Sent: Saturday, January 02, 2010 12:45 PM To: [EMAIL PROTECTED] Subject: Re: Clustering techniques, tips and tricks On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems. > For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)? What algorithms worked for you? > What didn't work? Any and all insight is welcome and I don't > particularly care if it is Mahout specific (for instance, part of the > chapter is about search result clustering using Carrot2 and so Mahout > isn't applicable) > Let me start by saying Mahout works great for us. We can run k-means on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a single host. Using vector normalization like L2 norm helped quite a bit. Thanks to Ted for this suggestion. In text clustering, you have lots of small documents. This results into very sparse vectors (total of 100K features with each vector having 200 features.) Using vanilla TFIDF weights doesn't work as nicely. Even if we don't do explicit stop word removal, the threshold values for document count does that in a better fashion. If you exclude the features which are extremely common (say more than 40% documents) or extremely rare (say in less than 50 documents in a corpus of 100K docs), you have a meaningful set of features. The current K-Means already accepts these parameters. Stemming can be used for feature reduction, but it has a minor issue. When you want to find out prominent features of the resulting cluster centroid, the feature may not be meaningful. For example, if a prominent feature is "beautiful", when you get it back, you will get "beauti." Ouch. I tried fuzzy K-Means for soft clustering, but I didn't get good results. May be the corpus had the issue. One observation about the clustering process is that it is geared, by accident or by design, towards batch processing. There is no support for real-time clustering. There needs to be glue which ties all the components together to make the process seamless. I suppose, someone in need of this feature will contribute it to Mahout. Grant, If I recall more, I will mail it to you. --shashi
-
Re: Clustering techniques, tips and tricksTed Dunning 2010-01-04, 19:44
Pallavi,
This is very useful feedback. What you have done is very similar to the k-means++ algorithm and it is clearly a very good thing. There is already an issue for tracking a k-means++ implementation: http://issues.apache.org/jira/browse/MAHOUT-153 Could you post your patch there? On Mon, Jan 4, 2010 at 4:03 AM, Palleti, Pallavi < [EMAIL PROTECTED]> wrote: > Initially, I used canopy clustering seeds as initial seeds but the results > weren't good and the number of clusters depends on the distance thresholds > we give as input. Later, I have considered randomly selecting some points > from the input dataset and consider them as initial seeds. Again, the > results were not good. Now, I have chosen initial seeds from input set in > such a way that the points are far from each other and I have observed > better clustering using Fuzzy Kmeans. I have not implemented a map-reducable > version for this seed selection. I will soon implement a map-reducable > version and submit a patch. > -- Ted Dunning, CTO DeepDyve
-
Re: Clustering techniques, tips and tricksDawid Weiss 2010-01-04, 19:55
> The carrot2 implementation is among the best that I have used in terms of
> the quality of clustering small batches of results such as search result > lists. (blushing). Thanks, it's a big compliment coming from you, Ted.
-
RE: Clustering techniques, tips and tricksRao, Vaijanath 2010-01-05, 03:32
Hi,
I have another suggestion that I am working at, it's not yet complete but I should be able to submit some patch in coming weeks. The idea is to have two distances one each for A) Across the cluster B) Within the cluster These distances help in setting up the inclusion and exclusion of the input data point in the cluster. In the iter-1 I allow inifite distance for both of these distance and run the regular logic of clustering. At the end of it calculate the min and max for both of these distances. For the remaining iterations do the following A) find the Lowest distance cluster for the input point. If the distance of the input point increases the distance within the cluster then the previously assigned then ignore it and create a new cluster with this point. If it does not voilate the older distance keep it with following probability 1. 1 if the distance of Input point is less then min distance. 2. in the ratio of (max-distance of the input from center)/max_distance B) If there are two clusters which have cluster distance less then the minimum cluster distance then merge them and calculate the within cluster distance and across cluster distance. I have observed that this has helped me in getting good clusters. Currently I have a single version code which has the above calculation and I am wokring on getting this into Map-red and should be able to submit a patch very soon. Let me know If you think this will generally work. --Thanks and Regards Vaijanath -----Original Message----- From: Palleti, Pallavi [mailto:[EMAIL PROTECTED]] Sent: Monday, January 04, 2010 5:34 PM To: [EMAIL PROTECTED] Subject: RE: Clustering techniques, tips and tricks Here are my two cents. I experimented with both kmeans and fuzzy kmeans on a data similar to movielens. And, what I observed is that the initial cluster seeds are more important. Initially, I used canopy clustering seeds as initial seeds but the results weren't good and the number of clusters depends on the distance thresholds we give as input. Later, I have considered randomly selecting some points from the input dataset and consider them as initial seeds. Again, the results were not good. Now, I have chosen initial seeds from input set in such a way that the points are far from each other and I have observed better clustering using Fuzzy Kmeans. I have not implemented a map-reducable version for this seed selection. I will soon implement a map-reducable version and submit a patch. Also, regarding canopy, I found that what actually canopy is defined in theory is different from what we have in mahout. However, some one can clarify on that. Thanks Pallavi -----Original Message----- From: Shashikant Kore [mailto:[EMAIL PROTECTED]] Sent: Saturday, January 02, 2010 12:45 PM To: [EMAIL PROTECTED] Subject: Re: Clustering techniques, tips and tricks On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems. > For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)? What algorithms worked for you? > What didn't work? Any and all insight is welcome and I don't > particularly care if it is Mahout specific (for instance, part of the > chapter is about search result clustering using Carrot2 and so Mahout > isn't applicable) > Let me start by saying Mahout works great for us. We can run k-means on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a single host. Using vector normalization like L2 norm helped quite a bit. Thanks to Ted for this suggestion. In text clustering, you have lots of small documents. This results into very sparse vectors (total of 100K features with each vector having 200 features.) Using vanilla TFIDF weights doesn't work as nicely. Even if we don't do explicit stop word removal, the threshold values for document count does that in a better fashion. If you exclude the features which are extremely common (say more than 40% documents) or extremely rare (say in less than 50 documents in a corpus of 100K docs), you have a meaningful set of features. The current K-Means already accepts these parameters. Stemming can be used for feature reduction, but it has a minor issue. When you want to find out prominent features of the resulting cluster centroid, the feature may not be meaningful. For example, if a prominent feature is "beautiful", when you get it back, you will get "beauti." Ouch. I tried fuzzy K-Means for soft clustering, but I didn't get good results. May be the corpus had the issue. One observation about the clustering process is that it is geared, by accident or by design, towards batch processing. There is no support for real-time clustering. There needs to be glue which ties all the components together to make the process seamless. I suppose, someone in need of this feature will contribute it to Mahout. Grant, If I recall more, I will mail it to you.
-
Re: Clustering techniques, tips and tricksPallavi Palleti 2010-01-05, 04:21
Sure Ted. I will do that.
Thanks Pallavi Ted Dunning wrote: > Pallavi, > > This is very useful feedback. > > What you have done is very similar to the k-means++ algorithm and it is > clearly a very good thing. > > There is already an issue for tracking a k-means++ implementation: > http://issues.apache.org/jira/browse/MAHOUT-153 > > Could you post your patch there? > > On Mon, Jan 4, 2010 at 4:03 AM, Palleti, Pallavi < > [EMAIL PROTECTED]> wrote: > > >> Initially, I used canopy clustering seeds as initial seeds but the results >> weren't good and the number of clusters depends on the distance thresholds >> we give as input. Later, I have considered randomly selecting some points >> from the input dataset and consider them as initial seeds. Again, the >> results were not good. Now, I have chosen initial seeds from input set in >> such a way that the points are far from each other and I have observed >> better clustering using Fuzzy Kmeans. I have not implemented a map-reducable >> version for this seed selection. I will soon implement a map-reducable >> version and submit a patch. >> >> > > > >
-
Re: Clustering techniques, tips and tricksTed Dunning 2010-01-05, 06:31
This definition isn't quite solid. Probably the problem is just writing it
down in English without mathematical notation. On Mon, Jan 4, 2010 at 7:32 PM, Rao, Vaijanath <[EMAIL PROTECTED]>wrote: > The idea is to have two distances one each for > A) Across the cluster > B) Within the cluster > What do you mean be these, exactly? Are these distances from a point to a cluster? Between clusters? The maximum or minimum of all distances from a point to members of the cluster? In the iter-1 I allow inifite distance for both of these distance and run > the regular logic of clustering. At the end of it calculate the min and max > for both of these distances. > Regular logic? Do you mean k-means? Min and max across all points against the clusters they are in? Or against all clusters? > For the remaining iterations do the following > To each data point? > > A) find the Lowest distance cluster for the input point. If the distance of > the input point increases the distance within the cluster then the > previously assigned then ignore it and create a new cluster with this point. > > What do you mean by increases the distance within the cluster? The sum of all distances from points in a cluster to the centroid? Or do you mean to say that the new assignment is farther than any previous member of the cluster from the centroid? What are the new min and max distances for the new cluster? If it does not voilate the older distance keep it with following probability > 1. 1 if the distance of Input point is less then min distance. > Min over what? for just this cluster? > 2. in the ratio of (max-distance of the input from > center)/max_distance > These two probabilities to not have the same limit for distances just larger and just smaller than the min distance. Do you mean that the second should be (max-distance to center) / (max - min) ? What do you mean be "keep it"? Keep the assignment? What if you don't keep the assignment? Do you then form a new cluster with this point? If you do that, what is the min and max distance for the new cluster? B) If there are two clusters which have cluster distance less then the > minimum cluster distance then merge them and calculate the within cluster > distance and across cluster distance. > What is "cluster distance"? What is "minimum cluster distance" ? How can any particular cluster distance be smaller than the smallest cluster distance? > I have observed that this has helped me in getting good clusters. Currently > I have a single version code which has the above calculation and I am > wokring on getting this into Map-red and should be able to submit a patch > very soon. > Does single version code mean sequential code? If so, don't wait until you have a parallel version! File a JIRA and post what you have! > Let me know If you think this will generally work. > I think that algorithms like this heuristic have a good chance of working. I don't see that this particular algorithm is going to handle high dimensions very well, nor do I think it will necessarily converge because it looks like there is always a good chance that cluster members will be assigned to a new cluster. In order to improve convergence, you might consider annealing this new cluster behavior, possibly taking 1 - Math.pow(1 - ratio-as-you-defined, iteration). This will cause the clusters to become more and more stable. You may not be interested in a stable clustering. The Dirichlet clustering that we have is already a non-deterministic clustering algorithm that produces a distribution over possible numbers of clusters and cluster assignments. Perhaps that is what your algorithm should be creating. I am also curious about whether your algorithm is a variant of something like neural gas clustering.
-
RE: Clustering techniques, tips and tricksRao, Vaijanath 2010-01-05, 07:12
Hi Ted,
My replies are inside -----Original Message----- From: Ted Dunning [mailto:[EMAIL PROTECTED]] Sent: Tuesday, January 05, 2010 12:01 PM To: [EMAIL PROTECTED] Subject: Re: Clustering techniques, tips and tricks This definition isn't quite solid. Probably the problem is just writing it down in English without mathematical notation. On Mon, Jan 4, 2010 at 7:32 PM, Rao, Vaijanath <[EMAIL PROTECTED]>wrote: > The idea is to have two distances one each for > A) Across the cluster > B) Within the cluster > Q? What do you mean be these, exactly? Are these distances from a point to a cluster? Between clusters? The maximum or minimum of all distances from a point to members of the cluster? --For Within the cluster distance A cluster will have some points which are near to centroid and some may be farther from the centroid. Distance within cluster implies the distance of the point and the centroid of the cluster in which the point belongs. The min distance represent the minimum distance a point has with the centroid ( which is near to centroid), whereas the max distance represents the maximum distance a point has with the centroid of same cluster. For across the cluster distance This is the distance between the cetroids of the clusters. The min distance here represents the minimum distance between two such cluster centroids and max distance represents the maximum distance between the centroids of two clusters. Ideally we want any clustering algorithm to have smaller intra-cluster distance ( distance between centroid and the point within the cluster ) and larger inter-cluster distance ( distance between two cluster centroids ) In the iter-1 I allow inifite distance for both of these distance and run > the regular logic of clustering. At the end of it calculate the min > and max for both of these distances. > Q? Regular logic? Do you mean k-means? --Regular logic here implies the selection of the cluster based on data point. In case od K-Means it is minimum distance. For Some other algorithm it will be within this distance or threashold and so on. So the assignment of data point to cluster remains as per the Choice of Clustering algorithm > For the remaining iterations do the following > Q? To each data point? --Yes for every data point > > A) find the Lowest distance cluster for the input point. If the > distance of the input point increases the distance within the cluster > then the previously assigned then ignore it and create a new cluster with this point. > > Q? What do you mean by increases the distance within the cluster? The sum of all distances from points in a cluster to the centroid? Or do you mean to say that the new assignment is farther than any previous member of the cluster from the centroid? --Here I meant that when the data point is assigned to this cluster it incrreased the within cluster ( intra-cluster distance ) Q? What are the new min and max distances for the new cluster? For new clusters that are created it will be set to infinite but if the cluster already exisit the new min and max will be claculated based on the points that are present in that cluster. If it does not voilate the older distance keep it with following probability > 1. 1 if the distance of Input point is less then min distance. > Q? Min over what? for just this cluster? Yes, if new point assigned to this cluster is smaller then the prev min, then it's added as is to centroid else this datapoint will carry a smaller weight when added to centroid > 2. in the ratio of (max-distance of the input from > center)/max_distance > Q? These two probabilities to not have the same limit for distances just larger and just smaller than the min distance. Do you mean that the second should be (max-distance to center) / (max - min) ? I haven't played with the weightage, but your eq should give the values in the range of 0-1 which defines the wt of the datapoint carries when the centroid is calculated. I will use this Q? What do you mean be "keep it"? Keep the assignment? What if you don't keep the assignment? Do you then form a new cluster with this point? If you do that, what is the min and max distance for the new cluster? cluster with min and max distance set to infinite. Which means when B) If there are two clusters which have cluster distance less then the Q? What is "cluster distance"? What is "minimum cluster distance" ? How can any particular cluster distance be smaller than the smallest cluster distance? there are two cluster centroids which have distance less then the last iteration minimum inter-cluster distance, then we can merge these two clusters to get one big cluster. At this point we need to recalculate the inter-cluster distances and intra-cluster distances. Q? Does single version code mean sequential code? Q?If so, don't wait until you have a parallel version! File a JIRA and post what you have! Q? I think that algorithms like this heuristic have a good chance of working. I don't see that this particular algorithm is going to handle high dimensions very well, nor do I think it will necessarily converge because it looks like there is always a good chance that cluster members will be assigned to a new cluster. In order to improve convergence, you might consider annealing this new cluster behavior, possibly taking 1 - Math.pow(1 - ratio-as-you-defined, iteration). This will cause the clusters to become more and more stable. try to add annealing as this will help it to be more stable. Q? You may not be interested in a stable clustering. The Dirichlet clustering that we have is already a non-deterministic clustering algorithm that produces a distribution over possible numbers of clusters and cluster assignments. Perhaps that is what your algorithm should be creating. sure If I am going in the right direction to get those. Q? I am also curious about whether your algorithm is a variant of something l
-
Re: Clustering techniques, tips and tricksBogdan Vatkov 2010-01-06, 02:12
>>> stopwords inside) - maybe it is a question on its own - how can I easily
go >>> back from clusters->original docs (an not just vectors), I do not know >>> maybe >>> some kind of mapper which maps vectors to the original documents somehow >>> (e.g. sort of URL for a document based on the vector id/index or >>> something?). >>> >> >> To do this, you should use the document ID and just return the original >> content from some other content store. Lucene or especially SOLR can help >> with this. > Right, Mahout's vector can take labels. what do you mean by using the document ID and that vectors can take labels? is it something I could use right away from the current cluster vectors of should I change some Mahout code to get to the documents ID?
-
Re: Clustering techniques, tips and tricksTed Dunning 2010-01-06, 02:31
On Tue, Jan 5, 2010 at 6:12 PM, Bogdan Vatkov <[EMAIL PROTECTED]>wrote:
> > what do you mean by using the document ID and that vectors can take labels? > is it something I could use right away from the current cluster vectors of > should I change some Mahout code to get to the documents ID? > This means that a vector can have a name. That name can be the ID of your original document. I think, but am not sure, that you can use this now. Grant would know better. -- Ted Dunning, CTO DeepDyve
-
Re: Clustering techniques, tips and tricksBogdan Vatkov 2010-01-06, 02:36
Is there some description of the content of the cluster vector?
I also noticed that I end up with some folders clusters-0 and clusters-1, but sometimes it is only clusters-0, when do we get the different folders and which should be used as end result - e.g. by the ClusterDumper? On Wed, Jan 6, 2010 at 4:31 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: > On Tue, Jan 5, 2010 at 6:12 PM, Bogdan Vatkov <[EMAIL PROTECTED] > >wrote: > > > > > what do you mean by using the document ID and that vectors can take > labels? > > is it something I could use right away from the current cluster vectors > of > > should I change some Mahout code to get to the documents ID? > > > > This means that a vector can have a name. That name can be the ID of your > original document. > > I think, but am not sure, that you can use this now. Grant would know > better. > > -- > Ted Dunning, CTO > DeepDyve > -- Best regards, Bogdan
-
Re: Clustering techniques, tips and tricksDrew Farris 2010-01-06, 02:43
Each iteration of kmeans procuses a cluster-X folder, with X starting
at 0. You would get clusters-0 in cases where the clusters converge after the first run. Whether your clusters will retain document id's is based on how you create the vectors. For example, the lucene vector dumper can be told to extract the value from a specific field in the index to use for the vector labels. These are carried through to the points file produced at the end of the k-means run. On Tue, Jan 5, 2010 at 9:36 PM, Bogdan Vatkov <[EMAIL PROTECTED]> wrote: > Is there some description of the content of the cluster vector? > I also noticed that I end up with some folders clusters-0 and clusters-1, > but sometimes it is only clusters-0, when do we get the different folders > and which should be used as end result - e.g. by the ClusterDumper?
-
Re: Clustering techniques, tips and tricksBogdan Vatkov 2010-01-06, 02:45
Is there a description of the output structure of the results, I see also
some folders like points which is used by the ClusterDumper but I do not know the technical details. I would be interested what kind of data is available as a result of the clustering. Is it different when different algorithm is used (kmeans, canopy, dirichlet)? I also have one more theoretical question: I get for the cluster with the highest "points" a term - the third by weight which is at the same time with word freq = 9 - according to Solr Dictionary (and according to my knowledge of the corpora too) - this is for 23 000+ input docs. Is it something with the kmeans algorithm? the rest of the terms, clusters seem to be somehow ok, but that one really astonished me, I am almost sure it is not a problem with the (index - dictionary mapping) like I had before ;) (but that was general problem then - I was using the wrong dictionary file). I am running with convergence 0.5 is that ok? Best regards, Bogdan
-
Re: Clustering techniques, tips and tricksDrew Farris 2010-01-06, 02:46
On Tue, Jan 5, 2010 at 9:43 PM, Drew Farris <[EMAIL PROTECTED]> wrote:
> Each iteration of kmeans procuses a cluster-X folder, with X starting > at 0. You would get clusters-0 in cases where the clusters converge > after the first run. To correct/clarify: you get clusters-0 for every run -- you would get clusters-0 and no other output directories for runs that converged in a single iteration. I believe you always want to use the results in the highest numbered cluster-X directory.
-
Re: Clustering techniques, tips and tricksBogdan Vatkov 2010-01-06, 02:51
I customized the lucene index-to-vector dumper already quite a lot (e.g.
applied stop-words (from file), stop-regex) but I am wondering how the input vectors are later reachable if I start from cluster vectors, you say points are somehow doing that, where can I read more or can you tell me more, or is there a piece of code which would best guide me through the points format? On Wed, Jan 6, 2010 at 4:43 AM, Drew Farris <[EMAIL PROTECTED]> wrote: > Each iteration of kmeans procuses a cluster-X folder, with X starting > at 0. You would get clusters-0 in cases where the clusters converge > after the first run. > > Whether your clusters will retain document id's is based on how you > create the vectors. For example, the lucene vector dumper can be told > to extract the value from a specific field in the index to use for the > vector labels. These are carried through to the points file produced > at the end of the k-means run. > > On Tue, Jan 5, 2010 at 9:36 PM, Bogdan Vatkov <[EMAIL PROTECTED]> > wrote: > > Is there some description of the content of the cluster vector? > > I also noticed that I end up with some folders clusters-0 and clusters-1, > > but sometimes it is only clusters-0, when do we get the different folders > > and which should be used as end result - e.g. by the ClusterDumper? > -- Best regards, Bogdan
-
Re: Clustering techniques, tips and tricksPallavi Palleti 2010-01-06, 02:51
Clusters-i directory is for each iteration and points is the folder
where you have the final output data in consumable format. For example, in FuzzyKMeans, the clusters-0 directory contains a format like clustersid\tclusterVector as key value pair. This will be consumed by next iteration to read the centriods. Where as, the points directory contains data as itemVector\tclusterProbabilities. This gives you the item and the cluster probabilities (p(cluster/item) for this item. Thanks Pallavi Bogdan Vatkov wrote: > Is there a description of the output structure of the results, I see also > some folders like points which is used by the ClusterDumper but I do not > know the technical details. > I would be interested what kind of data is available as a result of the > clustering. Is it different when different algorithm is used (kmeans, > canopy, dirichlet)? > > I also have one more theoretical question: I get for the cluster with the > highest "points" a term - the third by weight which is at the same time with > word freq = 9 - according to Solr Dictionary (and according to my knowledge > of the corpora too) - this is for 23 000+ input docs. Is it something with > the kmeans algorithm? the rest of the terms, clusters seem to be somehow ok, > but that one really astonished me, I am almost sure it is not a problem with > the (index - dictionary mapping) like I had before ;) (but that was general > problem then - I was using the wrong dictionary file). > I am running with convergence 0.5 is that ok? > > Best regards, > Bogdan > >
-
Re: Clustering techniques, tips and tricksDrew Farris 2010-01-06, 02:58
Take a look a o.a.m.clustering.ClusterDumper in mahout-utils. The
points file is a SequenceFile<Text,Text> where the key is the vector id and the value is a cluster id. On Tue, Jan 5, 2010 at 9:51 PM, Bogdan Vatkov <[EMAIL PROTECTED]> wrote: > I customized the lucene index-to-vector dumper already quite a lot (e.g. > applied stop-words (from file), stop-regex) but I am wondering how the input > vectors are later reachable if I start from cluster vectors, you say points > are somehow doing that, where can I read more or can you tell me more, or is > there a piece of code which would best guide me through the points format?
-
Re: Clustering techniques, tips and tricksBogdan Vatkov 2010-01-06, 03:08
10x, that was what I needed :)
On Wed, Jan 6, 2010 at 4:58 AM, Drew Farris <[EMAIL PROTECTED]> wrote: > Take a look a o.a.m.clustering.ClusterDumper in mahout-utils. The > points file is a SequenceFile<Text,Text> where the key is the vector > id and the value is a cluster id. > > On Tue, Jan 5, 2010 at 9:51 PM, Bogdan Vatkov <[EMAIL PROTECTED]> > wrote: > > I customized the lucene index-to-vector dumper already quite a lot (e.g. > > applied stop-words (from file), stop-regex) but I am wondering how the > input > > vectors are later reachable if I start from cluster vectors, you say > points > > are somehow doing that, where can I read more or can you tell me more, or > is > > there a piece of code which would best guide me through the points > format? > -- Best regards, Bogdan
-
Re: Clustering techniques, tips and tricksGrant Ingersoll 2010-01-06, 12:04
Sounds like we need some more wiki docs.
-Grant On Jan 5, 2010, at 10:08 PM, Bogdan Vatkov wrote: > 10x, that was what I needed :) > > On Wed, Jan 6, 2010 at 4:58 AM, Drew Farris <[EMAIL PROTECTED]> wrote: > >> Take a look a o.a.m.clustering.ClusterDumper in mahout-utils. The >> points file is a SequenceFile<Text,Text> where the key is the vector >> id and the value is a cluster id. >> >> On Tue, Jan 5, 2010 at 9:51 PM, Bogdan Vatkov <[EMAIL PROTECTED]> >> wrote: >>> I customized the lucene index-to-vector dumper already quite a lot (e.g. >>> applied stop-words (from file), stop-regex) but I am wondering how the >> input >>> vectors are later reachable if I start from cluster vectors, you say >> points >>> are somehow doing that, where can I read more or can you tell me more, or >> is >>> there a piece of code which would best guide me through the points >> format? >> > > > > -- > Best regards, > Bogdan
-
Re: Clustering techniques, tips and tricksAbbas 2012-03-07, 09:24
Hi Bogdan,
This is in reply to your previous post where you asked about having word- stoppers in Mahout. Well, recently I was fighting with the same thing and found a solution, which worked perfectly fine. What you should do is - 1. Create your own (customized) Lucene Analyzer by extending Analyzer class and overriding tokenStream method 2. Create a jar file containing your custom analyzer. Make sure to have your lucene jar file in the MANIFEST.mf. 3. Place the jar in mahout/examples/target/dependency. In case you get ClassNotFoundException in the next step, you may like to put the two jar files in hadoop/lib/ as well. Also you can try making entries of the jar files in HADOOP_CLASSPATH and CLASSPATH environment variable. 4. Then run your seq2sparse command by mentioning your custom analyzer in -a parameter 5. Run your k-means command as you would otherwise do. Hope this helps If you need the complete code for custom analyzer, let me know. Thanks Abbas |