RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Jordi Abad 2010-11-26, 18:26
Hi,
I'm running a RecommenderJob (mahout-0.4 version) over hadoop like this:
hadoop-0.20 jar /mahout-distribution-0.4/mahout-core-0.4-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input -Dmapred.output.dir=output -s SIMILARITY_TANIMOTO_COEFFICIENT -b true
The job works fine but when I examine the result I get things like:
12 [1:1.0,2:1.0,3:1.0,5:1.0,6:1.0,11:1.0,168:1.0,173:1.0,180:1.0,199:1.0]
14 [1:1.0,2:1.0,3:1.0,5:1.0,6:1.0,11:1.0,14:1.0,21:1.0,22:1.0,23:1.0]
...
I can't understand why each recommendation gets a score of 1.0. It doesn't matter which similarity class I set; I always get a score of 1.0.
My input file is a "boolean file" (1391374 rows) with values like:
1,6496241
1,4368916
1,4922226
1,4958662
...
If I run "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" job over the same file I get good results for items.
Any ideas?
Thanks in advance.
Re: RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Sean Owen 2010-11-26, 18:32
This is because all the ratings are implicitly 1.0 when there are no ratings.
But I actually think this is symptomatic of a problem, since I note that those recommendations are quite suspiciously in order by item ID. I am not sure the current state of the distributed recommender is compatible with boolean data, but I am not an expert here --
Sebastian, can we discuss what might be going on here? In the non-distributed code, items are given a "fake" estimated preference, which is not actually an estimated preference (because that would always be 1.0) but some other number that functions as a score -- average similarity to other items, for example. This is used for ranking and is also returned as an "estimated preference" even though it's not.
Can we do something like that here? Or is it already working this way if certain values / options are set?
Re: RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Sebastian Schelter 2010-11-26, 18:33
Hi Jordi,
That's because you compute recommendations on *boolean* data (-b true). No weight is involved in the preferences then: you either know that a user likes something or you don't. As a result, you cannot assign a weight to a computed recommendation either. That's where the 1.0s are coming from.
Things might be clearer if we take a look at the math:
u = a user
i = an item not yet rated by u
N = all items similar to i
Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all n from N: abs(similarity(i,n)))
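If all ratings have the value 1 (because we use boolean data), the numerator and the denominator are the same sum of similarities (as long as the similarities are non-negative), so the result of the prediction can only be 1.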
--sebastian
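A minimal Python sketch of the formula above, not Mahout's actual code. The similarity and rating values are made up for illustration, and `predict` is a hypothetical helper name. It shows the weighted average collapsing to 1.0 as soon as every rating is 1.0, while real-valued ratings still produce a distinct estimate.

```python
def predict(similarities, ratings):
    """Similarity-weighted average of the user's ratings (the formula above)."""
    numerator = sum(s * r for s, r in zip(similarities, ratings))
    denominator = sum(abs(s) for s in similarities)
    return numerator / denominator


# Made-up similarity(i, n) values for three items n similar to item i.
sims = [0.9, 0.4, 0.1]

# Boolean data: every known preference is implicitly 1.0.
boolean_ratings = [1.0, 1.0, 1.0]
# Real-valued ratings, for comparison.
weighted_ratings = [5.0, 3.0, 1.0]

boolean_prediction = predict(sims, boolean_ratings)
weighted_prediction = predict(sims, weighted_ratings)

print(boolean_prediction)   # numerator equals denominator here
print(weighted_prediction)
```

Changing the values in `sims` changes `weighted_prediction` but leaves `boolean_prediction` at 1.0, which is why the choice of similarity measure made no visible difference in the output.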
Re: RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Sean Owen 2010-11-26, 18:35
But is it then ranking the recommendations by the estimated pref? If it's always 1, then the ordering is not meaningful.
Maybe it is, I just haven't looked at your changes in much detail since you made them although it looked broadly correct and proper.
Re: RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Sebastian Schelter 2010-11-26, 18:45
Hi Sean,
the prediction computation for boolean data is done in AggregateAndRecommendReducer.reduceBooleanData()
It computes *all* possible items to recommend for the current user and then writes out only the first n, with n being the number specified in the --numRecommendations parameter given to RecommenderJob.
Can you point me to the code where the non-distributed code handles the problem of ranking them? We could certainly emulate that behaviour in the distributed code too.
--sebastian
Re: RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Sean Owen 2010-11-26, 18:50
The behavior difference is fairly simple. Instead of a weighted average of preferences (which will always equal 1.0), compute some other function of those weights -- for example, the average of the weights.
See GenericBooleanPrefItemBasedRecommender. It's actually just summing the weights. This is nearly the same thing since the number of items participating in the average is the same for all estimates. *Nearly* the same since some can be NaN.
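A rough Python sketch of that idea, not the code in GenericBooleanPrefItemBasedRecommender: each candidate item is scored by the sum of its similarities to the items the user already has, and candidates are ranked by that score. The item names, the similarity values, and the helper `score_by_similarity_sum` are all hypothetical.

```python
def score_by_similarity_sum(candidate_sims):
    """Map each candidate item to the sum of its similarities to the user's items."""
    return {item: sum(sims) for item, sims in candidate_sims.items()}


# Hypothetical similarities between each candidate item and the three
# items the user is already known to like (one value per known item).
candidate_sims = {
    "A": [0.9, 0.8, 0.7],
    "B": [0.2, 0.1, 0.3],
    "C": [0.5, 0.6, 0.4],
}

scores = score_by_similarity_sum(candidate_sims)
# Rank candidates by descending score instead of by a constant 1.0.
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking)
```

Because every candidate is compared against the same number of known items, ranking by the sum gives the same order as ranking by the average. NaN similarities are not handled in this sketch.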
It's an open question whether there aren't better functions of the weights to use, but this is a fine start, IMHO.
Re: RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Jordi Abad 2010-11-28, 10:28
Hi,
I applied the changes from MAHOUT-553 (thanks Sebastian!) against mahout-0.4. Everything makes sense now. I've tried it with different similarities (SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE) and it works fine (i.e. I got good recommendations with different scores), but when I tried SIMILARITY_PEARSON_CORRELATION, I got an empty part-00000 file. Is that normal?
Re: RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Sebastian Schelter 2010-11-28, 10:37
Pearson correlation and boolean data don't fit: all co-occurring ratings will have the value 1, so the compared vectors are identical and have zero variance, and no correlation can be computed.
--sebastian
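To illustrate the point about Pearson correlation, here is a small Python sketch under stated assumptions. It is a textbook Pearson implementation, not Mahout's, and returning NaN for the undefined case is a choice made for this example only. With boolean data both vectors are constant, so both variances are zero and the correlation is 0/0.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either vector has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # 0/0: the correlation is undefined for constant vectors.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Co-occurring boolean preferences: every value is 1.0.
boolean_result = pearson([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
# Real-valued ratings that vary, for comparison.
rated_result = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

print(math.isnan(boolean_result))
print(rated_result)
```

If no item pair yields a usable similarity, there is nothing to rank, which is consistent with the empty part-00000 file reported above.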
Re: RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation
Jordi Abad 2010-11-28, 19:03
Ok Sebastian, thanks for the explanation. I'll study each similarity in more detail.