-
RowSimilarity - Pat Ferrel, 2012-05-12, 23:29
I tried an experiment running RowSimilarity with 16 docs of short quotations on a similar subject. It looks to me that, using Tanimoto, the largest pair-wise distance allowed for the similar docs was 0.4. Though I asked for 10 similar docs, I got between 0 and 10. I see this same effect with larger data sets but haven't seen an obvious cut-off point.

I was expecting to be able to make the decision about cut-off distance myself. In other words, I was expecting to always get 20 similar docs when I asked for 20. It is useful to see what docs are at larger distances.

How is RowSimilarity deciding when to cut off the returned docs?
-
Re: RowSimilarity - Suneel Marthi, 2012-05-13, 02:06
The consider() method in the distance measure (Tanimoto in your scenario) is the one that does the cut-off. Almost all of the similarity measures have some implementation of consider() to cut off the returned results. Have a look at Sebastian's explanation in https://issues.apache.org/jira/browse/MAHOUT-803.
-
Re: RowSimilarity - Sebastian Schelter, 2012-05-13, 06:08
The option 'maxSimilaritiesPerRow' determines the maximum number of similar docs/items/rows per row. Whether there are enough similar rows per row depends on your data, so you can't always get 20 similar docs.

The option 'threshold' determines the minimum similarity value for a pair of docs (otherwise the pair is dropped). This option is not activated by default, however.

Best,
Sebastian
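To illustrate how the two options interact, here is a minimal Python sketch. It is not Mahout's implementation: the function name `top_similar`, the dictionary input, and the use of `>=` for the threshold comparison are assumptions made for the example. It only shows why a row can end up with fewer results than the requested maximum.

```python
import heapq

def top_similar(similarities, max_per_row, threshold=None):
    """Keep at most max_per_row (other_row, similarity) pairs for one row.

    similarities: dict mapping other-row id -> similarity, holding only the
                  pairs that have a non-zero similarity with this row.
    threshold:    optional minimum similarity; None means no threshold
                  (assumed here to mirror "not activated by default").
    """
    candidates = similarities.items()
    if threshold is not None:
        # Hypothetical comparison; whether the real job uses >= or > is not
        # stated in the thread.
        candidates = [(r, s) for r, s in candidates if s >= threshold]
    return heapq.nlargest(max_per_row, candidates, key=lambda pair: pair[1])

# Only three other rows have non-zero similarity with this row.
row = {"doc3": 0.42, "doc7": 0.10, "doc9": 0.05}

no_threshold = top_similar(row, max_per_row=10)
with_threshold = top_similar(row, max_per_row=10, threshold=0.08)
```

Asking for 10 here can return at most 3 pairs, because only 3 candidates exist; with the threshold set, the pair for `doc9` is dropped as well.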
-
Re: RowSimilarity - Suneel Marthi, 2012-05-13, 06:25
Pat's question was that he was seeing fewer documents than the number specified by 'maxSimilaritiesPerRow'. This could be happening due to the 'consider' functionality of the applied similarity measure.
-
Re: RowSimilarity - Sebastian Schelter, 2012-05-13, 06:38
This could simply be because there are fewer similar docs than the number specified in 'maxSimilaritiesPerRow'.

consider() is only invoked if a threshold was specified.

Best,
Sebastian
-
Re: RowSimilarity - Pat Ferrel, 2012-05-13, 15:33
To paraphrase:

There is some internal threshold to be considered 'similar'. Is this the one supplied with the 'threshold' option mentioned above, and do I need to do a special build to get this option activated? I assume it is not active because it has not been tested well?

So currently how is the threshold calculated? How can I determine its value? Can I vote that this be activated as an optional parameter in the future?

I ask this in part because I want to use RowSimilarity in an experiment to do something like a non-partitioning hierarchical clustering where I'll need to find close centroids in clusters calculated with different levels of specificity.
-
Re: RowSimilarity - Sebastian Schelter, 2012-05-13, 16:10
Hi Pat,

RowSimilarityJob allows the use of many different similarity measures (cosine, Jaccard coefficient, number of cooccurrences, etc.), all of which compute a single number for a pair of vectors that denotes how similar they are. All these measures share the characteristic that two vectors that do not have a non-zero value in at least one common dimension are considered not similar (similarity 0).

In general, an all-pairs comparison, as conducted by RowSimilarityJob, has quadratic complexity and is therefore not scalable.

If we have sparse data such as text or ratings, however, we can exploit the fact that we only need to compare pairs which share at least one non-zero value in a dimension. This is the basic idea RowSimilarityJob uses to avoid an all-pairs comparison.

In some real-world use cases you will furthermore encounter a lot of pairs with near-zero similarities that are of little value to you. To avoid computing these, RowSimilarityJob provides the option to specify a minimum threshold so that it ignores pairs with a similarity value below this threshold. This threshold is data-dependent and you have to find it experimentally.

--sebastian
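The idea of comparing only pairs that share a non-zero dimension can be sketched with an inverted index. This is an illustrative single-machine Python version, not the MapReduce code in RowSimilarityJob; the function name `candidate_pairs` and the dict-of-dicts input format are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows):
    """Return the set of row pairs that share at least one non-zero dimension.

    rows: dict mapping row id -> dict of {dimension: value}.
    """
    # dimension -> list of rows that have a non-zero entry in that dimension
    postings = defaultdict(list)
    for row_id, vector in rows.items():
        for dim, value in vector.items():
            if value != 0:
                postings[dim].append(row_id)

    pairs = set()
    for row_ids in postings.values():
        # Every two rows listed under the same dimension cooccur there.
        for a, b in combinations(sorted(row_ids), 2):
            pairs.add((a, b))
    return pairs

docs = {
    "d1": {"mahout": 1.0, "hadoop": 2.0},
    "d2": {"hadoop": 1.0, "lucene": 1.0},
    "d3": {"solr": 3.0},
}
pairs = candidate_pairs(docs)  # {("d1", "d2")}: d3 shares no term with anyone
```

Note that a dimension present in n rows contributes n(n-1)/2 pairs, so a handful of very frequent terms can bring back most of the quadratic cost even when the data is otherwise sparse.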
-
Re: RowSimilarity - Pat Ferrel, 2012-05-14, 17:30
Thanks, this is quite clear and reasonable. The cutoff is made based on lack of term cooccurrences, not the distance measure. The optional 'threshold' is based on the distance measure.

BTW I assume the 'distance' returned is expressed in the distance measure's units? So using cosine as a distance measure, a value near 0 is actually quite similar because the measure is 1-(cosine of the angle between the vectors)?
-
Re: RowSimilarity - Sebastian Schelter, 2012-05-14, 17:35
"The cutoff is made based on lack of term cooccurrences not the distance measure."

I'd rather use the term similarity measure, not distance measure, as a lot of the measures implemented are not metrics and the term 'distance' might be misleading.

A lack of (term) cooccurrences is equivalent to a similarity of 0 by definition, therefore the "default cutoff" is also based on the similarity measure.

--sebastian
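A small worked example may help with the similarity-versus-distance reading. The sketch below uses the standard set-based Tanimoto (Jaccard) coefficient, size of the intersection divided by size of the union; treating a document as a plain set of terms is an assumption for the illustration, and the function name is hypothetical rather than a Mahout API.

```python
def tanimoto_similarity(a, b):
    """Set-based Tanimoto coefficient: shared terms / all distinct terms."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: define the similarity as 0
    return len(a & b) / len(union)

identical = tanimoto_similarity({"x", "y"}, {"x", "y"})     # 2 shared of 2 distinct
partial = tanimoto_similarity({"x", "y", "z"}, {"x", "w"})  # 1 shared of 4 distinct
disjoint = tanimoto_similarity({"x"}, {"y"})                # no cooccurrence at all
```

Under this definition larger values mean more similar: identical sets score 1.0, sets with no shared term score 0.0, and a corresponding distance would be 1 minus the similarity.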
-
Re: RowSimilarity - Pat Ferrel, 2012-05-14, 22:22
Sorry, but I'm still confused. So the similarity magnitude has nothing to do with one of Mahout's distance measures? The similarity class is used only to specify the algorithm used to calculate this magnitude and does not imply a connection between distance and similarity?

I'm now a bit unsure about how to read my results.

* Using Tanimoto, for example, is a value of 0.0001 more similar than a value of 0.9? This seems to fit my results even though you say above "A lack of (term) cooccurrences is equivalent to a similarity of 0".
* Is there a description somewhere of what the similarity magnitude describes?

Thanks,
Pat
-
Re: RowSimilarity - Pat Ferrel, 2012-07-13, 19:47
I also do clustering, so that's an obvious optimization; I just haven't gotten to it yet (doing similarity only on docs clustered together). I'm also trying to decide how to downsample. However, the results from similarity are quite good, so understanding how to scale is #1.

Clustering gives docs closest to a centroid. RowSimilarity finds docs similar to each doc. What I really need is to calculate the k most similar docs to a short list, known ahead of time. I don't know of an algorithm to do this (other than brute force). It would take a relatively small set of docs and find similar docs in a much, much larger set. RowSimilarity finds all pair-wise similarities. Strictly speaking, I need only a tiny number of those. I think Lucene has a weighted vector-based search that I need to investigate further.

On 7/13/12 9:32 AM, Sebastian Schelter wrote:
> Pat,
>
> RowSimilarityJob compares all pairs of rows, which is by definition a quadratic and therefore non-scalable problem. The comparison is however done in a way that only rows that have at least one non-zero value in a common dimension are compared.
>
> Therefore if you have certain sparse types of input, such as ratings for example, you only have to look at a relatively small number of pairs and the comparison scales.
>
> RowSimilarityJob is mainly used for the collaborative filtering stuff in Mahout. We have a special job to prepare the data (PreparePreferenceMatrixJob) that will take care of sampling down entries in the rating matrix that might cause too many cooccurrences.
>
> If you directly use RowSimilarityJob, you have to ensure that your input data is of a shape suitable for the job. It seems to me that this is not the case: you created 76GB of intermediate output (cooccurring terms) from 35k documents, so it's clear that it takes Hadoop a long time to sort that in the shuffle phase.
>
> My advice would be that you either take a deeper look at your data and try to downsample highly frequent terms more, or that you take a look at other techniques such as clustering or locality sensitive hashing to find similar documents.
>
> Best,
> Sebastian
>
> On 13.07.2012 18:03, Pat Ferrel wrote:
>> I increased the timeout to 100 minutes and added another machine (does the new machine matter in this case?). The job completed successfully.
>>
>> You say the algorithm is non-scalable--did you mean it's not parallelizable? I assume I'll need to keep increasing this limit?
>>
>> I'm sure you know better than I that it is not really good for the efficiency of a cluster to increase the timeout so far, since it means jobs can take much longer in the case of transient task failures.
>>
>> On 7/12/12 8:26 AM, Pat Ferrel wrote:
>>> OK, thanks. I haven't checked for sparsity. However, I have many successful runs of rowsimilarity with up to 150,000 docs and 250,000 terms as I said below. This run has a much smaller matrix. I understand that sparsity is a different question, but since the data in all cases is a crawl of the same sites I'd expect the same sparsity in all the data sets whether they succeeded or timed out.
>>>
>>> My issue has nothing to do with the elapsed time, although I'll have to consider it in larger data sets (thanks for the heads up). Is it impossible to check in with the task tracker, avoiding a timeout? Or is there some other issue?
>>>
>>> On 7/12/12 8:06 AM, Sebastian Schelter wrote:
>>>> It's important to note that the performance of RowSimilarityJob heavily depends on the sparsity of the input data, because in general comparing all pairs of things is a quadratic (non-scalable) problem.
>>>>
>>>> 2012/7/12 Sebastian Schelter <[EMAIL PROTECTED]>:
>>>>> Sorry, I overlooked that it's more than one machine. Could you provide the values for the counters from RowSimilarityJob (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES)?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> 2012/7/12 Pat Ferrel <[EMAIL PROTECTED]>:
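Pat's requirement of finding the k most similar docs for a short, known list against a much larger set can be approached without computing all pairs. The Python sketch below is one possible single-machine approach and not something proposed in the thread: it builds an inverted index over the large set once and scores each query doc only against docs that share a term with it. All function names, the cosine scoring, and the dict-based vectors are assumptions chosen for the illustration.

```python
import heapq
import math
from collections import defaultdict

def build_index(corpus):
    """corpus: dict doc id -> {term: weight}. Returns (postings, norms)."""
    postings = defaultdict(list)  # term -> [(doc id, weight), ...]
    norms = {}                    # doc id -> Euclidean length of its vector
    for doc_id, vector in corpus.items():
        norms[doc_id] = math.sqrt(sum(w * w for w in vector.values()))
        for term, weight in vector.items():
            if weight != 0:
                postings[term].append((doc_id, weight))
    return postings, norms

def top_k_similar(query, postings, norms, k):
    """Cosine top-k for one query vector, scoring only docs sharing a term."""
    dots = defaultdict(float)
    for term, q_weight in query.items():
        if q_weight == 0:
            continue
        for doc_id, d_weight in postings.get(term, ()):
            dots[doc_id] += q_weight * d_weight
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    scored = [(doc_id, dot / (q_norm * norms[doc_id]))
              for doc_id, dot in dots.items()]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

corpus = {
    "a": {"x": 1.0, "y": 1.0},
    "b": {"x": 1.0},
    "c": {"z": 1.0},
}
postings, norms = build_index(corpus)
result = top_k_similar({"x": 1.0}, postings, norms, k=2)
```

In this toy corpus the query shares a term with "a" and "b" only, so "c" is never scored; "b" ranks first because it points in exactly the same direction as the query. The work per query is proportional to the posting lists of the query's terms, which is why downsampling very frequent terms matters here as well.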