-
Canopy estimator
Pat Ferrel 2012-05-10, 00:36
Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563

Did anything ever get done with this? Ted mentions limited usefulness. That may be true, but the cases he mentions as counterexamples are also not very good candidates for using canopy ahead of k-means, no? That information would itself be a useful result. To use canopies I find myself running it over and over, trying to see some inflection in the number of clusters. Why not automate this? Even if the data shows nothing, that is itself an answer of value, and it would save a lot of hand work to find out the same thing.
-
Re: Canopy estimator
Jeff Eastman 2012-05-10, 13:12
No, the issue was discussed but never reached critical mass. I typically do a binary search to find the best value, setting T1==T2, and then tweak T1 up a bit. For feeding k-means, this latter step is not so important.

If you could figure out a way to automate this we would be interested. Conceptually, using the RandomSeedGenerator to sample a few vectors and comparing them with your chosen DistanceMeasure would give you a hint at the T-value to begin the search. A utility to do that would be a useful contribution.
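The sampling idea in the message above can be sketched roughly as follows. This is only an illustration in plain Python, not Mahout code: `sample_distance_hint` and `euclidean` are hypothetical names standing in for a random sample (the role RandomSeedGenerator plays in the message) and for whichever DistanceMeasure is chosen. It reports the smallest, median and largest pairwise distance in the sample, which could serve as a bracket for a search over T.

```python
import math
import random
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_distance_hint(vectors, sample_size, distance=euclidean, seed=0):
    """Return (min, median, max) pairwise distance over a random sample.

    Hypothetical helper: the sample plays the role of a few randomly
    chosen seed vectors, and `distance` the role of the distance measure.
    """
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    if len(sample) < 2:
        raise ValueError("need at least two vectors to compare")
    dists = [
        distance(sample[i], sample[j])
        for i in range(len(sample))
        for j in range(i + 1, len(sample))
    ]
    return min(dists), statistics.median(dists), max(dists)
```

A search over T would then plausibly start somewhere between the minimum and the median, since a T near the maximum would cover the whole sample.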
-
Re: Canopy estimator
Pat Ferrel 2012-05-10, 16:20
Naively I imagine giving a range, dividing it into equal increments, and calculating the number of clusters for each (t1, t2) pair. It would take on the order of (# of increments)**2 runs, but it seems to me that for a given corpus you wouldn't need to do this very often (actually you only need 1/2 of this data). You would get a 3-d surface/histogram with magnitude = # of clusters, x and y = t1 and t2. Then search this data for local maxes, mins and inflection points. I'm not sure what this data would look like -- hence the "naively" disclaimer at the start. It is certainly a large landscape to search by hand.

Your method only looks at the diagonal (t1==t2), and maybe that is the most interesting part, in which case the calculations are much quicker.

Ultimately I'm interested in finding a better way to do hierarchical clustering. Information very often has a natural hierarchy but the usual methods produce spotty results. If we had a reasonable canopy estimator we could employ it at each level on the subset of the corpus being clustered. Doing this by hand quickly becomes prohibitive, given that the number of times you have to estimate canopy values increases exponentially with each level of hierarchy.

Even a mediocre estimator would likely be better than picking k out of the air. And the times it would fail to produce would also tell you something about your data.
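For the diagonal case (t1==t2), the search for a bend in the cluster-count curve could look something like the sketch below. It assumes the counts were already collected at equally spaced T values, and it uses the largest absolute second difference as a rough stand-in for the "inflection" described above; both the function names and that choice of criterion are assumptions made for illustration.

```python
def second_differences(counts):
    """Discrete second derivative of a sequence sampled at equal steps."""
    return [
        counts[i - 1] - 2 * counts[i] + counts[i + 1]
        for i in range(1, len(counts) - 1)
    ]


def sharpest_bend(t_values, counts):
    """Return the T value where the cluster-count curve bends most sharply.

    Assumes t_values are equally spaced and counts[i] is the number of
    clusters observed at t_values[i].
    """
    if len(t_values) != len(counts) or len(counts) < 3:
        raise ValueError("need at least three matching (T, count) pairs")
    diffs = second_differences(counts)
    best = max(range(len(diffs)), key=lambda i: abs(diffs[i]))
    # diffs[i] is centred on counts[i + 1], hence the offset.
    return t_values[best + 1]
```

If every second difference is close to zero, the curve has no pronounced bend, which is the "the data shows nothing" outcome mentioned earlier in the thread.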
-
Re: Canopy estimator
Jeff Eastman 2012-05-11, 14:58
The reason I use T1==T2 is that T2 is the only threshold that determines the number of clusters. T1 affects how many adjacent points are considered in the centroid calculations. So you could simplify your histogram analysis to 2-d without affecting #clusters.

Hierarchical clustering is one way to think about the clustering of information that we have just recently added to Mahout. Any experiences you can share with its application would be valuable.
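The claim that only T2 determines the number of canopies can be checked on a toy example. The following is a simplified single-pass canopy sketch written for this purpose, not Mahout's implementation; the function name and the tuple layout are assumptions. A point joins every canopy whose center is within T1, and it seeds a new canopy only if no existing center is within T2.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopies(points, t1, t2, distance=euclidean):
    """Single-pass canopy clustering; returns a list of (center, members)."""
    if t1 < t2:
        raise ValueError("expected t1 >= t2")
    result = []
    for p in points:
        strongly_bound = False
        for center, members in result:
            d = distance(center, p)
            if d < t1:
                members.append(p)      # loose membership, set by T1
            if d < t2:
                strongly_bound = True  # close enough that p seeds nothing
        if not strongly_bound:
            result.append((p, [p]))    # new canopy, decided by T2 alone
    return result
```

Raising T1 while holding T2 fixed changes which points each canopy collects, and therefore its centroid, but the decision to open a new canopy never looks at T1.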
-
Re: Canopy estimator
Ted Dunning 2012-05-11, 15:26
The streaming k-means stuff might be an interesting alternative to setting parameters manually. In that work, the algorithm adaptively sets a parameter that has a similar function to T1 and T2.

More importantly, the output of the main pass is a large number of weighted centroids that can be used as a small surrogate for the entire data set in subsequent clustering. Since these centroids should fit in memory, you could do an adaptive search for propitious values of T1 and T2.

My github repo has a description of this algorithm with an analysis of scaling properties. See

https://github.com/tdunning/knn

As soon as we finish the cleanup release, I will start folding in this code.
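To make the "weighted centroids as a surrogate" idea concrete, here is a deliberately simplified sketch. It is not the algorithm from the repository above: it is a deterministic toy in which a point merges into its nearest centroid when it lies within a threshold, and the threshold grows whenever the number of centroids exceeds a cap. All names and the growth rule are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def add_point(centroids, vec, weight, threshold):
    """Merge (vec, weight) into the nearest centroid if close, else add it."""
    if centroids:
        i = min(range(len(centroids)),
                key=lambda k: euclidean(centroids[k][0], vec))
        c, w = centroids[i]
        if euclidean(c, vec) <= threshold:
            total = w + weight
            merged = tuple((ci * w + vi * weight) / total
                           for ci, vi in zip(c, vec))
            centroids[i] = (merged, total)
            return
    centroids.append((tuple(vec), weight))


def weighted_sketch(points, max_centroids, threshold, growth=1.5):
    """One pass over points; returns (list of (centroid, weight), threshold)."""
    if max_centroids < 1 or threshold <= 0 or growth <= 1:
        raise ValueError("need max_centroids >= 1, threshold > 0, growth > 1")
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, threshold)
        # Too many centroids: loosen the threshold and re-collapse them.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for vec, w in old:
                add_point(centroids, vec, w, threshold)
    return centroids, threshold
```

The weights always sum to the number of input points, so any later in-memory clustering of the centroids can treat each one as standing in for that many original vectors.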
-
Re: Canopy estimator
Pat Ferrel 2012-05-12, 16:51
As I said in another reply, my data doesn't seem to cluster well, contrary to my intuition. I may have a data problem or need to do some dimensionality reduction.

I've skimmed your docs and would love to give it a try. 10x or 100x faster results alone would be a big help. It seems to promise doing away with the canopy step altogether, no? But I need to read your docs more carefully.

Ultimately I'm trying to find a good way to automatically generate several levels of clustering with varying "specificity". Each level might be completely independent of the others, but the clusters should be more and more specific. They could be hierarchical, but it might be better to just find the nearest clusters from less specific to more specific, that is, from a high number of doc members to a low number.

This is why canopy has been frustrating: by varying t I had hoped to generate these levels of specificity, then replace hierarchical clustering with a similarity measure. In other words, L1 has 1000 docs per cluster and L2 has 100 docs per cluster. I'd find the 100 docs closest to L1 clusters (that's all the user wants to see in my case) and reference the 10 L2 clusters nearest by centroid similarity, using rowsimilarity to calculate. I'm hoping that this is a useful way to browse the information space.

Naively speaking, your streaming k seems to have elements of this built in.
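The lookup of "L2 clusters nearest by centroid similarity" might be sketched like this. It is a small in-memory illustration using cosine similarity, not Mahout's rowsimilarity job; the function names and the dictionary of named centroids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_centroids(query, centroids, n):
    """Return the names of the n centroids most similar to `query`.

    `centroids` maps a cluster name to its centroid vector.
    """
    ranked = sorted(
        centroids,
        key=lambda name: cosine_similarity(query, centroids[name]),
        reverse=True,
    )
    return ranked[:n]
```

Here `query` would be an L1 centroid and `centroids` the set of L2 centroids, so the result is the handful of more specific clusters to show alongside a coarse one.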
-
Re: Canopy estimator
Ted Dunning 2012-05-12, 17:34
Roughly.

But it also gives you a small-ish surrogate for your data that would let you use all kinds of different clustering methods, since the surrogate fits in memory.

On Sat, May 12, 2012 at 9:51 AM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
> This is why canopy has been frustrating because by varying t I would have
> hoped to generate these levels of specificity, then replace hierarchical
> clustering with a similarity measure. In other words L1 has 1000 docs per
> cluster, L2 has 100 docs per cluster. I'd find the 100 docs closest to L1
> clusters (that's all the user wants to see in my case) and reference the 10
> L2 clusters nearest by centroid similarity using rowsimilarity to
> calculate. I'm hoping that this is a useful way to browse the information
> space.
>
> Naively speaking your streaming k seems to have elements of this built in.
-
Re: Canopy estimator
gaurav redkar 2012-05-11, 15:39
I have tried out a naive method to estimate the values for t1 and t2 to be used for meanshift clustering. Any suggestions or comments about shortcomings of my approach are welcome.

I take a sample of the dataset and compute the pairwise distances between all the points in the sample, using the same distance measure that I will use for clustering (Euclidean distance in my case). For simplicity of explanation, assume I take 3 points from my dataset. Computing the distances among these points gives me a matrix as shown:

0     1.56  1.4
1.56  0     1.36
1.4   1.36  0

Looking at each column (or row, since the matrix is symmetric), you can see that the largest element in the column is the distance which, if used as t1, will cover all the points in the chosen sample (and probably a sizeable percentage of the entire dataset). That would result in all the points merging into one or a few clusters.

To choose t1: I take the mean of all the elements in each column (ignoring the 0's on the diagonal). For the matrix shown above, we get the values 1.48, 1.46, 1.38. Then I take the average of these values, i.e. t1 = 1.44.

To choose t2: I find the minimum element in each column (ignoring the 0's on the diagonal), which gives me 1.4, 1.36, 1.36. Then I take the mean of these values, i.e. t2 = 1.37.

Any comments on the approach?

Thanks
Gaurav
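The arithmetic in the message above can be reproduced with a few lines of Python. The function name `estimate_thresholds` is made up for this sketch; the procedure itself (mean of column means for t1, mean of column minimums for t2, diagonal ignored) follows the description in the message.

```python
def estimate_thresholds(matrix):
    """Return (t1, t2) from a square, symmetric pairwise-distance matrix.

    t1 is the mean of the per-column means and t2 the mean of the
    per-column minimums, both computed without the diagonal entries.
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least a 2x2 distance matrix")
    col_means, col_mins = [], []
    for j in range(n):
        off_diagonal = [matrix[i][j] for i in range(n) if i != j]
        col_means.append(sum(off_diagonal) / len(off_diagonal))
        col_mins.append(min(off_diagonal))
    return sum(col_means) / n, sum(col_mins) / n


distances = [
    [0.0, 1.56, 1.4],
    [1.56, 0.0, 1.36],
    [1.4, 1.36, 0.0],
]
t1, t2 = estimate_thresholds(distances)
print(round(t1, 2), round(t2, 2))  # prints: 1.44 1.37
```

With only three sample points the two thresholds end up close together (1.44 versus 1.37), which is consistent with the T1==T2 starting point used elsewhere in the thread.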
-
Re: Canopy estimator
Pat Ferrel 2012-05-12, 15:53
Wrote a shell script to do t1==t2 over a range and it does give useful information. Picking a few points outside of t1==t2 doesn't seem to affect things much, number-of-clusters-wise. Since there is really no way to talk about canopy quality AFAIK, the number is how I make a decision.

One problem I have is that virtually any value for T gives me a very large number of canopies--on the order of 2-5 docs per cluster. Whether I create clusters using random seeds or canopies, they are of poor quality to my eye. A few are good but many are silly. I've tried a wide range of vectorizing knobs including L2 norm, n-grams with a high ml, and a custom Lucene filter to filter out numbers and do stemming, to little avail. Using your method of t1==t2, I get 2 docs per cluster with t=0.3 (tanimoto or cosine) and 5 docs per cluster with t = 0.95. This is telling me that the docs are not really clusterable, contrary to intuition. Next stop SVD? Maybe a larger data set from fewer sources will help?

As to hierarchical clustering, in my case it makes little sense when canopy gives 2-5 docs per cluster. My experimental data set is web-crawled news, since it has a clear hierarchy; you can easily see it in categories like root:sports:baseball, soccer, basketball, etc.

As to hierarchical clustering using another tool set, where we had a proprietary patented algorithm for picking k, it worked pretty well. It was for email, though, so it was not very noisy data. What I was hoping to do is use canopy or another method to estimate cluster numbers automatically for each level, and if I can get a crude canopy estimator working I'll report back.
-
Re: Canopy estimator
Ted Dunning 2012-05-12, 17:33
One thing that may be happening here is that the scale of your data varies from place to place.

Have you tried the upcoming k-means stuff?

On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
> One problem I have is that virtually any value for T gives me a very large
> number of canopies--on the order of 2-5 docs per cluster. Whether I create
> clusters using random seeds or canopies they are of poor quality to my eye.
> A few are good but many are silly. I've tried a wide range of vectorizing
> knobs including L2 norm, n-grams with a high ml, and doing a custom lucene
> filter to filter out numbers and do stemming to little avail. Using your
> method of t1==t2 - get 2 docs per cluster with t=0.3 (tanimoto or cosine)
> and 5 docs per cluster with t = 0.95. This is telling me that the docs are
> not really clusterable contrary to intuition.
-
Re: Canopy estimator
Pat Ferrel 2012-05-12, 18:19
If I understand your comment correctly, this is why I hope that applying levels of specificity will help. On a particular subject L1 will give good quality, and on another L2 will be better. I may be able to use an estimate of quality here to prune out bad clusters, not sure. The nature of my problem gives me no control over the input data in production, so I have to come up with methods that are adaptive.

If you are asking about using your post-0.7 clustering, no, I haven't yet. Will it help with varying scale? I assume by scale you mean the density of docs in certain areas of the vector space? One thing I am trying now is limiting the subject matter crawled and getting a much larger sample, which should get me a denser distribution.

If you think it might help, do I build it inside the 0.7 snapshot? Is it a drop-in replacement for kmeans?
-
Re: Canopy estimator
Ted Dunning 2012-05-12, 19:11
Yes. It may help with variable scale.

The classic technique for dealing with that is to cluster with a small number of clusters at a gross level and then cluster each set of documents that belong to a single large cluster. This automatically adapts to different scales.

The new stuff would greatly facilitate your experimentation.

On Sat, May 12, 2012 at 11:19 AM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
> If you are asking about using your post 0.7 clustering, no I haven't yet.
> Will it help with varying scale? I assume by scale you mean the density of
> docs in certain areas of the vector space? One thing I am trying now is
> limiting the subject matter crawled and getting a much larger sample, which
> should get me a denser distribution.
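The coarse-then-fine technique described above can be illustrated with a toy sketch. To keep it short it uses a greedy distance-threshold clustering instead of k-means, and it derives each second-level threshold from the spread of the coarse cluster it is applied to; those choices, the `fraction` parameter and all function names are assumptions, not anything prescribed in the thread.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def threshold_clusters(points, t):
    """Greedy pass: join the first cluster whose seed is within t, else seed one."""
    clusters = []
    for p in points:
        for cluster in clusters:
            if euclidean(cluster[0], p) <= t:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters


def mean_pairwise_distance(points):
    """Average distance over all unordered pairs; 0.0 for fewer than two points."""
    dists = [euclidean(a, b)
             for i, a in enumerate(points) for b in points[i + 1:]]
    return sum(dists) / len(dists) if dists else 0.0


def two_level(points, coarse_t, fraction=0.5):
    """Cluster coarsely, then re-cluster each coarse cluster at its own scale."""
    result = []
    for coarse in threshold_clusters(points, coarse_t):
        local_t = fraction * mean_pairwise_distance(coarse)
        result.append(threshold_clusters(coarse, local_t))
    return result
```

On data with one dense region and one sparse region, for example points near 0 and 1 alongside points near 100 and 110, no single global threshold separates both pairs of groups, while the per-cluster threshold splits each region into two.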