-
Bug in similarity computation
Alejandro Bellogin Kouki 2011-04-06, 14:13

Hi all,

I've been using Mahout for many years now, mainly for my Master's thesis and now for my PhD thesis. So, first of all, I want to congratulate you on the effort of releasing such a library as open source.

At this point, my main concern is recommendation, and because of that I have been using the different recommenders, evaluators and similarities implemented in the library. However, today, after inspecting your code many times, I have found what is, IMHO, a relevant bug with further implications.

It is related to the computation of the similarity. Although it is not the only similarity implemented, Pearson's correlation is one of the most popular ones. This similarity requires normalising (or "centering") the data using the user's mean, in order to distinguish a user who usually rates items with 5's from a user who usually rates them with 3's, even though both gave a particular item a 5. The problem is that the users' means are being calculated using ONLY the items in common between the two users, leading to strange similarity computations (or worse, to no similarity at all!). It is not difficult to find small examples showing this behaviour; besides, seminal papers assume the overall mean rating is used [1, 2].

Since I am a newbie with this patch and bug/fix terminology, I would like to know what the best (or the only?) way of contributing this finding is. I should say that I have already fixed the code (it affects the AbstractSimilarity class, and therefore it would have an impact on other similarity functions too).

Best regards,
Alejandro

[1] M. J. Pazzani: "A framework for collaborative, content-based and demographic filtering". Artificial Intelligence Review 13, pp. 393-408. 1999
[2] C. Desrosiers, G. Karypis: "A comprehensive survey of neighborhood-based recommendation methods". Recommender Systems Handbook, chapter 4. 2010

--
Alejandro Bellogin Kouki
http://rincon.uam.es/dir?cw=435275268554687
-
Re: Bug in similarity computation
Grant Ingersoll 2011-04-06, 14:33

Hi Alejandro,

I won't comment on the issue itself (I am sure Sean and others will), since I haven't looked at the code, but https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute describes how to submit a patch. File a ticket in JIRA and provide the patch along with your test cases.

-Grant

--------------------------
Grant Ingersoll
Lucene Revolution -- Lucene and Solr User Conference
May 25-26 in San Francisco
www.lucenerevolution.org
-
Re: Bug in similarity computation
Sean Owen 2011-04-06, 15:33

It's a good question.

The Pearson correlation of two series does not change if the series means change. That is, subtracting the same value from all elements of one series (or scaling the values) doesn't change the correlation. In that sense, I would not say you must center the series to make either one's mean 0. It wouldn't make a difference, no matter what number you subtracted, even if it were the mean of all ratings by the user.

The code you see in the project *does* center the data, because *if* the means are 0, then the computation result is the same as the cosine measure, and that seems nice. (There's also an uncentered cosine measure version.)

What I think you're really getting at is, can't we expand the series to include all items that either one or the other user rated? Then the question is, what are the missing values you want to fill in? There's not a great answer to that, since any answer is artificial, but picking the user's mean rating is a decent choice. This is not quite the same as centering.

You can do that in Mahout -- use AveragingPreferenceInferrer to do exactly this with these similarity metrics. It will slow things down and anecdotally I don't think it's worth it, but it's certainly there.

I don't think the normal version, without a PreferenceInferrer, is "wrong". It is just implementing the Pearson correlation on all data available, and you have to add a setting to tell it to make up data.
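The invariance Sean describes can be checked numerically. The following is a minimal plain-Python sketch, not Mahout code: the `pearson` helper and the rating values are made up for illustration, and the scaling factor is assumed to be positive.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either series is constant
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings by two users on the same four items.
u = [5.0, 3.0, 4.0, 1.0]
v = [4.0, 2.0, 5.0, 2.0]

base = pearson(u, v)
shifted = pearson([x - 10000000 for x in u], v)  # subtract a constant from one series
scaled = pearson([2.0 * x for x in u], v)        # rescale one series by a positive factor

print(base, shifted, scaled)  # the three values agree up to floating-point rounding
```

Because the helper always subtracts the mean of the series it is given, any constant removed beforehand cancels out. The debate in this thread is therefore about which items enter the series, not about the shift itself.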
-
Re: Bug in similarity computation
Sebastian Schelter 2011-04-06, 15:43

IIRC Sarwar et al.'s "Item-Based Collaborative Filtering Recommendation Algorithms" explicitly says to use only the co-rated cases for Pearson correlation.

--sebastian
-
Re: Bug in similarity computation
Alejandro Bellogin Kouki 2011-04-06, 16:00

Hi,

maybe I didn't express myself correctly... I'm talking about the calculation of the user's or item's mean (R_i in Sarwar's paper), which should be computed using ALL the items of that user/item, BUT in Mahout it is computed using only the items co-rated by both users/items.

This causes strange effects. For instance, if we have two users with two items in common, and other unknown ratings:

        i1  i2  i3  i4
    u1   4   4  --   5
    u2   3   3   5  --

the current code in Mahout computes the mean of u1 as 4 and of u2 as 3, so every centered rating becomes 0, instead of using the overall means of 4.33 and 3.67, respectively.

I hope it is clearer now.

Alejandro
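The arithmetic in this example can be reproduced with a few lines of plain Python. This is only a sketch of the two ways of computing the mean, not Mahout's implementation; the dictionary layout and variable names are illustrative assumptions.

```python
# Ratings from the example above; a missing rating is simply absent.
ratings = {
    "u1": {"i1": 4, "i2": 4, "i4": 5},
    "u2": {"i1": 3, "i2": 3, "i3": 5},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Items rated by both users.
common = sorted(set(ratings["u1"]) & set(ratings["u2"]))

results = {}
for user, profile in ratings.items():
    co_rated_mean = mean(profile[i] for i in common)  # mean over shared items only
    overall_mean = mean(profile.values())             # mean over all of the user's ratings
    centered = [profile[i] - co_rated_mean for i in common]
    results[user] = {
        "co_rated_mean": co_rated_mean,
        "overall_mean": round(overall_mean, 2),
        "centered_on_co_rated": centered,
    }
    print(user, results[user])
```

With the co-rated means (4 and 3), both centered series are all zeros, so the variance is zero and a Pearson correlation over the two shared items is undefined (0/0). The overall means come out as 4.33 and 3.67.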
-
Re: Bug in similarity computation
Daniel McEnnis 2011-04-06, 16:11

Alejandro,

The difficulty lies in that values that are normally zero are in fact Double.NaN. Including these extra values to get a correct result means, invariably, ending up with Double.NaN as a result. To avoid this, Mahout uses non-standard implementations that only consider co-occurrence entries in the result. Whether these distance metrics should be called the same as their non-recommender cousins is a question for debate....

Daniel.
-
Re: Bug in similarity computation
Sean Owen 2011-04-06, 16:39

I am saying that no matter what mean or constant you subtract from the series, the resulting Pearson correlation is unchanged. So, there is no bug related to centering regarding the Pearson correlation -- centering is irrelevant. You could always subtract 10000000 from the values and it wouldn't matter.

What you are really saying is that you want to use i3 and i4 in this similarity computation. The problem is you have no value for (u1,i3) or (u2,i4). What are you proposing to use?

I think you are proposing "0" is the answer. That usually doesn't work, but, here it will -- you are trying to subtract a mean so that 0 is also the mean of the ratings. Then, 0 is no longer arbitrary; you're inserting the user's mean rating (which is now 0) as a filler value and that's a reasonable thing to do.

I'm saying Mahout already does this too. If you use AveragingPreferenceInferrer, it will insert the user's mean rating as missing values. It doesn't then matter whether centering happens; it doesn't affect a Pearson correlation.

I think the Mahout default version is the purer interpretation, since it doesn't involve making up ratings; I don't know that there's an "official" way to do it in CF. There are a hundred variants on anything that could be useful. In any event, the framework does both since both are valid. But no, I do not consider the default implementation in the project non-standard, let alone wrong.
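The padding idea described here (fill each missing rating with that user's mean, then correlate over every item either user rated) can be sketched in plain Python. This is an illustration under stated assumptions and not the code of Mahout's AveragingPreferenceInferrer: the helper names `pearson` and `mean_filled` are hypothetical, and the choice to pad over the union of the two users' items is an assumption about what "expanding the series" means.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either series is constant
    return cov / math.sqrt(var_x * var_y)

def mean_filled(profile, items):
    """Ratings over `items`, using the user's own mean where a rating is missing."""
    user_mean = sum(profile.values()) / len(profile)
    return [profile.get(item, user_mean) for item in items]

# The two users from Alejandro's example.
u1 = {"i1": 4, "i2": 4, "i4": 5}
u2 = {"i1": 3, "i2": 3, "i3": 5}

items = sorted(set(u1) | set(u2))  # every item rated by either user
padded_similarity = pearson(mean_filled(u1, items), mean_filled(u2, items))
print(round(padded_similarity, 4))
```

Filling with the user's own mean leaves that user's mean unchanged, so each filled entry contributes zero to the numerator. It still lengthens the series, and the other user's real rating on that item still counts toward the denominator. In this sketch the example works out to roughly 1/3.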
-
Re: Bug in similarity computation
Sean Owen 2011-04-06, 16:43

No, I don't think it's anything to do with NaN. The result and implementation are quite by design.

I really don't understand this talk of "non standard" Pearson correlation. On the contrary, the implementation is quite strictly a Pearson correlation. The request seems to be to "fix" the computation to, say, compute a Pearson correlation on series like (1,2) and (3,6,1,2). This isn't even well-formed -- the series aren't of the same size.

It makes sense if you want to pad the series to be of equal size. That's a good question. But it's not a question of how the Pearson correlation is defined or implemented, but of how the data fed into it is "defined".

And I'm saying it's a valid variant, one implemented already.
-
Re: Bug in similarity computation
Alejandro Bellogin Kouki 2011-04-06, 16:51

I agree with Sean that Mahout's current implementation is a Pearson correlation, since it only considers paired items (as you said, it does not make sense to correlate two series like that). However, the problem is that, in recommendation, when this correlation is used as a similarity measure, the mean of each variable (i.e., user or item) is not strictly the mean of the observed values in the series being correlated; it also needs to consider some extra values (those items not co-rated with the other user).

So, perhaps this is only a notation problem, and this distance should not be considered equivalent to the one cited in the references already mentioned.

Alejandro

--
Alejandro Bellogin Kouki
http://rincon.uam.es/dir?cw=435275268554687
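One possible reading of the variant described here is a correlation whose sums run over co-rated items only, while each user is centered on the mean of all of that user's ratings. The sketch below is an interpretation in plain Python, not code from Mahout or from the cited references, and the function name is hypothetical.

```python
import math

def user_mean_centered_similarity(a, b):
    """Pearson-style similarity between two rating dictionaries.

    Sums run over co-rated items only, but each user is centered on the
    mean of ALL of that user's ratings (an assumed reading of the proposal).
    """
    common = set(a) & set(b)
    mean_a = sum(a.values()) / len(a)
    mean_b = sum(b.values()) / len(b)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(
        sum((a[i] - mean_a) ** 2 for i in common)
        * sum((b[i] - mean_b) ** 2 for i in common)
    )
    return num / den if den else float("nan")  # undefined without variation

# The two users from the example earlier in the thread.
u1 = {"i1": 4, "i2": 4, "i4": 5}
u2 = {"i1": 3, "i2": 3, "i3": 5}

similarity = user_mean_centered_similarity(u1, u2)
print(round(similarity, 4))
```

On that example both users rate the shared items below their own overall mean, so this variant yields a similarity of about 1.0. Centering on the co-rated means instead leaves nothing but zeros and an undefined result.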
-
Re: Bug in similarity computation
Sean Owen 2011-04-06, 16:56

Yes, and I'm saying that Mahout does that too, already, with AveragingPreferenceInferrer. The result will be identical to what you are suggesting. Unless I did my math wrong.