|
Daniel Zohar
2011-11-30, 09:11
Sean Owen
2011-11-30, 13:47
Daniel Zohar
2011-11-30, 14:19
Sean Owen
2011-11-30, 14:24
Daniel Zohar
2011-11-30, 14:50
Daniel Zohar
2011-11-30, 15:18
Sean Owen
2011-11-30, 16:56
Dan Beaulieu
2011-11-30, 20:23
Sean Owen
2011-11-30, 20:28
Sebastian Schelter
2011-12-01, 08:37
Manuel Blechschmidt
2011-12-01, 08:52
Daniel Zohar
2011-12-01, 14:11
Ted Dunning
2011-12-01, 14:21
Ted Dunning
2011-12-01, 14:22
Ted Dunning
2011-12-01, 14:26
Daniel Zohar
2011-12-01, 14:32
Manuel Blechschmidt
2011-12-01, 14:56
Daniel Zohar
2011-12-01, 15:06
Sebastian Schelter
2011-12-01, 15:11
Sean Owen
2011-12-01, 15:12
Sebastian Schelter
2011-12-01, 15:16
Manuel Blechschmidt
2011-12-01, 15:18
Daniel Zohar
2011-12-01, 15:24
Manuel Blechschmidt
2011-12-01, 15:24
Sebastian Schelter
2011-12-01, 15:28
Sebastian Schelter
2011-12-01, 18:10
Daniel Zohar
2011-12-01, 22:35
Ted Dunning
2011-12-01, 22:46
Sean Owen
2011-12-01, 22:46
Sean Owen
2011-12-01, 22:46
Sean Owen
2011-12-01, 22:51
Daniel Zohar
2011-12-02, 11:02
Sean Owen
2011-12-02, 11:10
Manuel Blechschmidt
2011-12-02, 11:20
Daniel Zohar
2011-12-02, 11:28
Daniel Zohar
2011-12-02, 11:29
Sean Owen
2011-12-02, 11:53
Daniel Zohar
2011-12-02, 14:33
Daniel Zohar
2011-12-02, 15:26
Sean Owen
2011-12-02, 15:38
Daniel Zohar
2011-12-02, 15:58
Sean Owen
2011-12-02, 16:42
Sean Owen
2011-12-02, 16:46
Ted Dunning
2011-12-02, 17:56
Ted Dunning
2011-12-02, 18:00
Sean Owen
2011-12-02, 18:03
Sean Owen
2011-12-02, 18:05
Ted Dunning
2011-12-02, 18:07
Daniel Zohar
2011-12-02, 18:07
Daniel Zohar
2011-12-02, 18:09
Ted Dunning
2011-12-02, 18:10
Ted Dunning
2011-12-02, 18:13
Sean Owen
2011-12-02, 18:14
Ted Dunning
2011-12-02, 18:18
Sean Owen
2011-12-02, 18:19
Sean Owen
2011-12-02, 18:22
Daniel Zohar
2011-12-02, 18:43
Sean Owen
2011-12-02, 21:15
Sebastian Schelter
2011-12-04, 09:13
Daniel Zohar
2011-12-04, 11:43
Sean Owen
2011-12-04, 12:06
Daniel Zohar
2011-12-04, 12:12
Sean Owen
2011-12-04, 12:20
Sebastian Schelter
2011-12-04, 12:30
Daniel Zohar
2011-12-04, 12:59
Daniel Zohar
2011-12-04, 13:04
Sean Owen
2011-12-04, 13:42
Ted Dunning
2011-12-04, 21:01
Daniel Zohar
2011-12-05, 08:20
Ted Dunning
2011-12-05, 08:27
Daniel Zohar
2011-12-05, 08:34
Ted Dunning
2011-12-05, 08:58
Manuel Blechschmidt
2011-12-05, 12:46
Daniel Zohar
2011-12-05, 12:50
Sean Owen
2011-12-05, 13:04
Daniel Zohar
2011-12-05, 14:20
Sean Owen
2011-12-05, 17:46
Daniel Zohar
2011-12-08, 07:58
|
-
Mahout performance issuesDaniel Zohar 2011-11-30, 09:11
Hello all,
This email follows the correspondence in StackExchange between myself and Sean Owen. Please see http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues I'm building a boolean-based recommendation engine with the following data: - 12M users - 2M items - 18M user-item (boolean) choices The following code is used to build the recommender: DataModel dataModel = new FileDataModel(new File(dataFile)); ItemSimilarity itemSimilarity = new CachingItemSimilarity(new LogLikelihoodSimilarity(dataModel), dataModel); CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(20, 5); MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(20, 5); this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel, itemSimilarity, candidateItemsStrategy,mostSimilarItemsCandidateItemsStrategy); My app runs on a Tomcat with the following JVM arguments: *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC -XX:+UseParallelOldGC* Recommendations with the code above works very well for users who have made 1-2 choices in the past, but can take over to a minute when a user had made tens of choices, especially if one of these choices is a very popular item (i.e. was chosen by many other users). Even when using the *SamplingCandidateItemsStrategy* with (1,1) arguments, I still did not manage to achieve fast results. The only way I managed to get somewhat OK results (max recommendation time ~4 secs), was by rewriting the *SamplingCandidateItemsStrategy* in a way that *doGetCandidateItems* returns a limited amount of items. Following is the doGetCandidateItems method as I re-wrote it: http://pastebin.com/6n9C8Pw1 **I think a good response time for recommendations should be less than a second (preferably less than 500 milliseconds).** How can I make Mahout perform better? I have a feeling some optimization is needed both on the *CandidateItemsStrategy* and the *Recommender* itself. * * Thanks in advance! Daniel
-
Re: Mahout performance issuesSean Owen 2011-11-30, 13:47
I have a few more thoughts.
First, I was wrong about what the first parameter to SamplingCandidateStrategy means. It's effectively a minimum, rather than maximum; setting to 1 just means it will sample at least 1 pref. I think you figured that out. I think values like (5,1) are probably about right for you. I see that your change is to further impose a global cap on the number of candidate items returned. I understand the logic of that -- Sebastian what do you think? (PS you can probably make that run slightly faster by using LongPrimitiveIterator instead of Iterator<Long>.) Something still feels a bit off here, that's a very long time. Your JVM params are impeccable, you have a good amount of RAM and strong machine. Since you're getting speed-up by directly reducing the number of candidate items, I get the idea that your similarity computations are the bottleneck. Does any of your profiling confirm that? Are you using the latest code? I can think of one change in the last few months that I added (certainly since 0.5) that would speed up LogLikelihoodSimilarity a fair bit. I know you're 'boolean' data so this ought to be very fast. I'll also say that the computation here is not multi-threaded. I had always sort of thought that, at scale, you'd be getting parallelism from handling multiple concurrent requests. It would be possible to rewrite a lot of the internals to compute top recs using multiple threads. That might make individual requests return faster on a multi-core machine though wouldn't increase overall throughput. On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Hello all, > This email follows the correspondence in StackExchange between myself and > Sean Owen. Please see > http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues > > I'm building a boolean-based recommendation engine with the following data: > > - 12M users > - 2M items > - 18M user-item (boolean) choices > > The following code is used to build the recommender: > > DataModel dataModel = new FileDataModel(new File(dataFile)); > ItemSimilarity itemSimilarity = new CachingItemSimilarity(new > LogLikelihoodSimilarity(dataModel), dataModel); > CandidateItemsStrategy candidateItemsStrategy = new > SamplingCandidateItemsStrategy(20, 5); > MostSimilarItemsCandidateItemsStrategy > mostSimilarItemsCandidateItemsStrategy = new > SamplingCandidateItemsStrategy(20, 5); > > this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel, > itemSimilarity, > candidateItemsStrategy,mostSimilarItemsCandidateItemsStrategy); > > My app runs on a Tomcat with the following JVM arguments: > *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC > -XX:+UseParallelOldGC* > > Recommendations with the code above works very well for users who have made > 1-2 choices in the past, but can take over to a minute when a user had made > tens of choices, especially if one of these choices is a very popular item > (i.e. was chosen by many other users). > > Even when using the *SamplingCandidateItemsStrategy* with (1,1) arguments, > I still did not manage to achieve fast results. > > The only way I managed to get somewhat OK results (max recommendation time > ~4 secs), was by rewriting the *SamplingCandidateItemsStrategy* in a way > that *doGetCandidateItems* returns a limited amount of items. Following is > the doGetCandidateItems method as I re-wrote it: > http://pastebin.com/6n9C8Pw1 > > **I think a good response time for recommendations should be less than a > second (preferably less than 500 milliseconds).** > How can I make Mahout perform better? I have a feeling some optimization is > needed both on the *CandidateItemsStrategy* and the *Recommender* itself. > * > * > Thanks in advance! > Daniel >
-
Re: Mahout performance issuesDaniel Zohar 2011-11-30, 14:19
Hi Sean,
First of all let me thank you for all your help thus far :) I am using Mahout 0.5. At the moment the application is not live yet, so I assume multi-threading is not a problem at the moment. I definitely see that the bottleneck is in the similarities computations. Looking at TopItems:getTopItems, I can see that the method iterates over all the 'possible items' and evaluates them using the Estimator which in turn iterates over all the past user choices for every possible item. Now lets assume a user chose 50 items before and has 100 possible items, that's already 5k item-item similarities to calculate. If I wouldn't cap the possible items, it can wind up at much larger numbers. I would also like to add that although the solution I post before improves performance, it extremely damages the quality of the recommendations as it checks a smaller pool of possible items. Thanks! On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > I have a few more thoughts. > > First, I was wrong about what the first parameter to > SamplingCandidateStrategy means. It's effectively a minimum, rather than > maximum; setting to 1 just means it will sample at least 1 pref. I think > you figured that out. I think values like (5,1) are probably about right > for you. > > I see that your change is to further impose a global cap on the number of > candidate items returned. I understand the logic of that -- Sebastian what > do you think? (PS you can probably make that run slightly faster by using > LongPrimitiveIterator instead of Iterator<Long>.) > > > Something still feels a bit off here, that's a very long time. Your JVM > params are impeccable, you have a good amount of RAM and strong machine. > > Since you're getting speed-up by directly reducing the number of candidate > items, I get the idea that your similarity computations are the bottleneck. > Does any of your profiling confirm that? > > Are you using the latest code? I can think of one change in the last few > months that I added (certainly since 0.5) that would speed up > LogLikelihoodSimilarity a fair bit. I know you're 'boolean' data so this > ought to be very fast. > > > I'll also say that the computation here is not multi-threaded. I had always > sort of thought that, at scale, you'd be getting parallelism from handling > multiple concurrent requests. It would be possible to rewrite a lot of the > internals to compute top recs using multiple threads. That might make > individual requests return faster on a multi-core machine though wouldn't > increase overall throughput. > > > On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > Hello all, > > This email follows the correspondence in StackExchange between myself and > > Sean Owen. Please see > > > http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues > > > > I'm building a boolean-based recommendation engine with the following > data: > > > > - 12M users > > - 2M items > > - 18M user-item (boolean) choices > > > > The following code is used to build the recommender: > > > > DataModel dataModel = new FileDataModel(new File(dataFile)); > > ItemSimilarity itemSimilarity = new CachingItemSimilarity(new > > LogLikelihoodSimilarity(dataModel), dataModel); > > CandidateItemsStrategy candidateItemsStrategy = new > > SamplingCandidateItemsStrategy(20, 5); > > MostSimilarItemsCandidateItemsStrategy > > mostSimilarItemsCandidateItemsStrategy = new > > SamplingCandidateItemsStrategy(20, 5); > > > > this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel, > > itemSimilarity, > > candidateItemsStrategy,mostSimilarItemsCandidateItemsStrategy); > > > > My app runs on a Tomcat with the following JVM arguments: > > *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC > > -XX:+UseParallelOldGC* > > > > Recommendations with the code above works very well for users who have > made > > 1-2 choices in the past, but can take over to a minute when a user had
-
Re: Mahout performance issuesSean Owen 2011-11-30, 14:24
Yeah, I agree that using just a handful of candidates is far too few and
that's not a solution. It should not be so slow even with a reasonable number of prefs and users. Multi-threading *is* a problem insofar as there is no multi-threading helping speed up your request. But that's a side issue. Definitely use the version in Subversion instead of 0.5; I think it will help directly. 5000 item-item similarities shouldn't take that long to compute. You can try pre-computing similarities as you mentioned earlier, if you can find more RAM. Of course, I always also recommend looking at pruning; often 90% of your data says very little and 10% carries most of the information. Figuring out which is which is hard! But it's often possible to drastically simplify things by finding some clever way of deciding what's not useful. And of course you can always start to look at Hadoop-based recommenders. On Wed, Nov 30, 2011 at 2:19 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Hi Sean, > First of all let me thank you for all your help thus far :) > > I am using Mahout 0.5. > At the moment the application is not live yet, so I assume multi-threading > is not a problem at the moment. > > I definitely see that the bottleneck is in the similarities computations. > Looking at TopItems:getTopItems, I can see that the method iterates over > all the 'possible items' and evaluates them using the Estimator which in > turn iterates over all the past user choices for every possible item. > Now lets assume a user chose 50 items before and has 100 possible items, > that's already 5k item-item similarities to calculate. If I wouldn't cap > the possible items, it can wind up at much larger numbers. > > I would also like to add that although the solution I post before improves > performance, it extremely damages the quality of the recommendations as it > checks a smaller pool of possible items. > > Thanks! > > On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > I have a few more thoughts. > > > > First, I was wrong about what the first parameter to > > SamplingCandidateStrategy means. It's effectively a minimum, rather than > > maximum; setting to 1 just means it will sample at least 1 pref. I think > > you figured that out. I think values like (5,1) are probably about right > > for you. > > > > I see that your change is to further impose a global cap on the number of > > candidate items returned. I understand the logic of that -- Sebastian > what > > do you think? (PS you can probably make that run slightly faster by using > > LongPrimitiveIterator instead of Iterator<Long>.) > > > > > > Something still feels a bit off here, that's a very long time. Your JVM > > params are impeccable, you have a good amount of RAM and strong machine. > > > > Since you're getting speed-up by directly reducing the number of > candidate > > items, I get the idea that your similarity computations are the > bottleneck. > > Does any of your profiling confirm that? > > > > Are you using the latest code? I can think of one change in the last few > > months that I added (certainly since 0.5) that would speed up > > LogLikelihoodSimilarity a fair bit. I know you're 'boolean' data so this > > ought to be very fast. > > > > > > I'll also say that the computation here is not multi-threaded. I had > always > > sort of thought that, at scale, you'd be getting parallelism from > handling > > multiple concurrent requests. It would be possible to rewrite a lot of > the > > internals to compute top recs using multiple threads. That might make > > individual requests return faster on a multi-core machine though wouldn't > > increase overall throughput. > > > > > > On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > > > > > Hello all, > > > This email follows the correspondence in StackExchange between myself > and > > > Sean Owen. Please see > > > > > > http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues > > > > >
-
Re: Mahout performance issuesDaniel Zohar 2011-11-30, 14:50
I will now try using the latest snapshot from
http://svn.apache.org/repos/asf/mahout/trunk . I would really prefer to avoid pre-computing the item similarities at the moment. Do you believe I can achieve good performance without it? Is there any specific pruning method you would recommend? I guess this is only relevant in recommendation-time, as the all data is needed in order for all of my users to be able get recommendations. Lastly, as I wrote in before, if a user had chose in the past an item which many other users had chose as well (we have a few items with 100k-400k associated users) the recommendation is significantly slower (unless he chose only very few items). Maybe it's a hint for the bottleneck in the similarities computations? On Wed, Nov 30, 2011 at 4:24 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > Yeah, I agree that using just a handful of candidates is far too few and > that's not a solution. It should not be so slow even with a reasonable > number of prefs and users. > > Multi-threading *is* a problem insofar as there is no multi-threading > helping speed up your request. But that's a side issue. > > Definitely use the version in Subversion instead of 0.5; I think it will > help directly. > 5000 item-item similarities shouldn't take that long to compute. > > You can try pre-computing similarities as you mentioned earlier, if you can > find more RAM. > > Of course, I always also recommend looking at pruning; often 90% of your > data says very little and 10% carries most of the information. Figuring out > which is which is hard! But it's often possible to drastically simplify > things by finding some clever way of deciding what's not useful. > > And of course you can always start to look at Hadoop-based recommenders. > > > On Wed, Nov 30, 2011 at 2:19 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > Hi Sean, > > First of all let me thank you for all your help thus far :) > > > > I am using Mahout 0.5. > > At the moment the application is not live yet, so I assume > multi-threading > > is not a problem at the moment. > > > > I definitely see that the bottleneck is in the similarities computations. > > Looking at TopItems:getTopItems, I can see that the method iterates over > > all the 'possible items' and evaluates them using the Estimator which in > > turn iterates over all the past user choices for every possible item. > > Now lets assume a user chose 50 items before and has 100 possible items, > > that's already 5k item-item similarities to calculate. If I wouldn't cap > > the possible items, it can wind up at much larger numbers. > > > > I would also like to add that although the solution I post before > improves > > performance, it extremely damages the quality of the recommendations as > it > > checks a smaller pool of possible items. > > > > Thanks! > > > > On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > > > I have a few more thoughts. > > > > > > First, I was wrong about what the first parameter to > > > SamplingCandidateStrategy means. It's effectively a minimum, rather > than > > > maximum; setting to 1 just means it will sample at least 1 pref. I > think > > > you figured that out. I think values like (5,1) are probably about > right > > > for you. > > > > > > I see that your change is to further impose a global cap on the number > of > > > candidate items returned. I understand the logic of that -- Sebastian > > what > > > do you think? (PS you can probably make that run slightly faster by > using > > > LongPrimitiveIterator instead of Iterator<Long>.) > > > > > > > > > Something still feels a bit off here, that's a very long time. Your JVM > > > params are impeccable, you have a good amount of RAM and strong > machine. > > > > > > Since you're getting speed-up by directly reducing the number of > > candidate > > > items, I get the idea that your similarity computations are the > > bottleneck. > > > Does any of your profiling confirm that? > > > > > > Are you using the latest code? I can think of one change in the last
-
Re: Mahout performance issuesDaniel Zohar 2011-11-30, 15:18
I just tested the app with Mahout 0.6.
There seems to be a small performance improvement, but still recommendations for the 'heavy users' take between 1-5 seconds. On Wed, Nov 30, 2011 at 4:50 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > I will now try using the latest snapshot from > http://svn.apache.org/repos/asf/mahout/trunk . > > I would really prefer to avoid pre-computing the item similarities at the > moment. Do you believe I can achieve good performance without it? > > Is there any specific pruning method you would recommend? I guess this is > only relevant in recommendation-time, as the all data is needed in order > for all of my users to be able get recommendations. > > Lastly, as I wrote in before, if a user had chose in the past an item > which many other users had chose as well (we have a few items with > 100k-400k associated users) the recommendation is significantly slower > (unless he chose only very few items). Maybe it's a hint for the bottleneck > in the similarities computations? > > > On Wed, Nov 30, 2011 at 4:24 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > >> Yeah, I agree that using just a handful of candidates is far too few and >> that's not a solution. It should not be so slow even with a reasonable >> number of prefs and users. >> >> Multi-threading *is* a problem insofar as there is no multi-threading >> helping speed up your request. But that's a side issue. >> >> Definitely use the version in Subversion instead of 0.5; I think it will >> help directly. >> 5000 item-item similarities shouldn't take that long to compute. >> >> You can try pre-computing similarities as you mentioned earlier, if you >> can >> find more RAM. >> >> Of course, I always also recommend looking at pruning; often 90% of your >> data says very little and 10% carries most of the information. Figuring >> out >> which is which is hard! But it's often possible to drastically simplify >> things by finding some clever way of deciding what's not useful. >> >> And of course you can always start to look at Hadoop-based recommenders. >> >> >> On Wed, Nov 30, 2011 at 2:19 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: >> >> > Hi Sean, >> > First of all let me thank you for all your help thus far :) >> > >> > I am using Mahout 0.5. >> > At the moment the application is not live yet, so I assume >> multi-threading >> > is not a problem at the moment. >> > >> > I definitely see that the bottleneck is in the similarities >> computations. >> > Looking at TopItems:getTopItems, I can see that the method iterates over >> > all the 'possible items' and evaluates them using the Estimator which in >> > turn iterates over all the past user choices for every possible item. >> > Now lets assume a user chose 50 items before and has 100 possible items, >> > that's already 5k item-item similarities to calculate. If I wouldn't cap >> > the possible items, it can wind up at much larger numbers. >> > >> > I would also like to add that although the solution I post before >> improves >> > performance, it extremely damages the quality of the recommendations as >> it >> > checks a smaller pool of possible items. >> > >> > Thanks! >> > >> > On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[EMAIL PROTECTED]> wrote: >> > >> > > I have a few more thoughts. >> > > >> > > First, I was wrong about what the first parameter to >> > > SamplingCandidateStrategy means. It's effectively a minimum, rather >> than >> > > maximum; setting to 1 just means it will sample at least 1 pref. I >> think >> > > you figured that out. I think values like (5,1) are probably about >> right >> > > for you. >> > > >> > > I see that your change is to further impose a global cap on the >> number of >> > > candidate items returned. I understand the logic of that -- Sebastian >> > what >> > > do you think? (PS you can probably make that run slightly faster by >> using >> > > LongPrimitiveIterator instead of Iterator<Long>.) >> > > >> > > >> > > Something still feels a bit off here, that's a very long time. Your
-
Re: Mahout performance issuesSean Owen 2011-11-30, 16:56
Have you used CachingItemSimilarity? That will hold common similarities in
memory. It's a lot easier than pre-computing and might help. I think something like your change is a good one (Sebastian what do you think) in that it gives you the ultimate lever to control how many candidates are evaluated. That ought to make it go as fast as you like, but it trades off quality. Still I'd be really surprised if there's no viable middle ground -- this works fine at smaller scale, where 100s of candidates are evaluated, perhaps, and you can use your lever to get to 100s of candidates at your scale too. Is that still both slow and inaccurate? On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > I just tested the app with Mahout 0.6. > There seems to be a small performance improvement, but still > recommendations for the 'heavy users' take between 1-5 seconds. > >
-
Re: Mahout performance issuesDan Beaulieu 2011-11-30, 20:23
Hi all, this is a tangent and can mostly be ignored by the people
interested in this problem. I'm new to Machine Learning and especially Mahout. Following this discussion has made me a bit confused. Isn't Mahout used for large datasets where it makes sense to distribute the work? Why then isn't anyone pointing out that the problem may be the use of one single Mahout node? Is it because it's boolean based? Is it because the data set isn't really that large? Even if for whatever reason a single node will do for this case, is it really expected that the recommendation process would finish in less than half a second? This makes me think if that is the expectation then the data set is actually small and Mahout might be overkill... What obvious piece of the Mahout puzzle am I missing? Thanks. Dan On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > Have you used CachingItemSimilarity? That will hold common similarities in > memory. It's a lot easier than pre-computing and might help. > > I think something like your change is a good one (Sebastian what do you > think) in that it gives you the ultimate lever to control how many > candidates are evaluated. That ought to make it go as fast as you like, but > it trades off quality. Still I'd be really surprised if there's no viable > middle ground -- this works fine at smaller scale, where 100s of candidates > are evaluated, perhaps, and you can use your lever to get to 100s of > candidates at your scale too. Is that still both slow and inaccurate? > > On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > I just tested the app with Mahout 0.6. > > There seems to be a small performance improvement, but still > > recommendations for the 'heavy users' take between 1-5 seconds. > > > > >
-
Re: Mahout performance issuesSean Owen 2011-11-30, 20:28
The simple answer is that:
Mahout absorbed a non-distributed recommender project called Taste, which scales up to a point which may be sufficient for a lot of users. It certainly is a lot simpler. Yes it is realistic to do near-real-time recommendations, though it gets harder and harder and requires more tuning, tradeoffs and optimization as this thread shows. The rest, written from scratch, is almost all distributed and Hadoop-based, including distributed re-implementations of the same algorithms. On Wed, Nov 30, 2011 at 8:23 PM, Dan Beaulieu <[EMAIL PROTECTED]>wrote: > Hi all, this is a tangent and can mostly be ignored by the people > interested in this problem. > > I'm new to Machine Learning and especially Mahout. Following this > discussion has made me a bit confused. > Isn't Mahout used for large datasets where it makes sense to distribute the > work? Why then isn't anyone pointing > out that the problem may be the use of one single Mahout node? Is it > because it's boolean based? Is it because the data set > isn't really that large? > > Even if for whatever reason a single node will do for this case, is it > really expected that the recommendation process would finish in less than > half a second? > This makes me think if that is the expectation then the data set is > actually small and Mahout might be overkill... > > What obvious piece of the Mahout puzzle am I missing? > > Thanks. > > Dan > > On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > Have you used CachingItemSimilarity? That will hold common similarities > in > > memory. It's a lot easier than pre-computing and might help. > > > > I think something like your change is a good one (Sebastian what do you > > think) in that it gives you the ultimate lever to control how many > > candidates are evaluated. That ought to make it go as fast as you like, > but > > it trades off quality. Still I'd be really surprised if there's no viable > > middle ground -- this works fine at smaller scale, where 100s of > candidates > > are evaluated, perhaps, and you can use your lever to get to 100s of > > candidates at your scale too. Is that still both slow and inaccurate? > > > > On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > > > > > I just tested the app with Mahout 0.6. > > > There seems to be a small performance improvement, but still > > > recommendations for the 'heavy users' take between 1-5 seconds. > > > > > > > > >
-
Re: Mahout performance issuesSebastian Schelter 2011-12-01, 08:37
Daniel, can you plot two curves showing the distribution of
interactions per user and the distribution of interactions per item? I think we need to get a better picture of your data first. Generally I always recommend to use precomputed similarities. You can still serve new users with realtime recommendations, the only disadvantages are the higher complexity and a delayed inclusion of new items. --sebastian 2011/11/30 Sean Owen <[EMAIL PROTECTED]>: > The simple answer is that: > > Mahout absorbed a non-distributed recommender project called Taste, which > scales up to a point which may be sufficient for a lot of users. It > certainly is a lot simpler. Yes it is realistic to do near-real-time > recommendations, though it gets harder and harder and requires more tuning, > tradeoffs and optimization as this thread shows. > > The rest, written from scratch, is almost all distributed and Hadoop-based, > including distributed re-implementations of the same algorithms. > > On Wed, Nov 30, 2011 at 8:23 PM, Dan Beaulieu > <[EMAIL PROTECTED]>wrote: > >> Hi all, this is a tangent and can mostly be ignored by the people >> interested in this problem. >> >> I'm new to Machine Learning and especially Mahout. Following this >> discussion has made me a bit confused. >> Isn't Mahout used for large datasets where it makes sense to distribute the >> work? Why then isn't anyone pointing >> out that the problem may be the use of one single Mahout node? Is it >> because it's boolean based? Is it because the data set >> isn't really that large? >> >> Even if for whatever reason a single node will do for this case, is it >> really expected that the recommendation process would finish in less than >> half a second? >> This makes me think if that is the expectation then the data set is >> actually small and Mahout might be overkill... >> >> What obvious piece of the Mahout puzzle am I missing? >> >> Thanks. >> >> Dan >> >> On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <[EMAIL PROTECTED]> wrote: >> >> > Have you used CachingItemSimilarity? That will hold common similarities >> in >> > memory. It's a lot easier than pre-computing and might help. >> > >> > I think something like your change is a good one (Sebastian what do you >> > think) in that it gives you the ultimate lever to control how many >> > candidates are evaluated. That ought to make it go as fast as you like, >> but >> > it trades off quality. Still I'd be really surprised if there's no viable >> > middle ground -- this works fine at smaller scale, where 100s of >> candidates >> > are evaluated, perhaps, and you can use your lever to get to 100s of >> > candidates at your scale too. Is that still both slow and inaccurate? >> > >> > On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[EMAIL PROTECTED]> >> wrote: >> > >> > > I just tested the app with Mahout 0.6. >> > > There seems to be a small performance improvement, but still >> > > recommendations for the 'heavy users' take between 1-5 seconds. >> > > >> > > >> > >>
-
Re: Mahout performance issuesManuel Blechschmidt 2011-12-01, 08:52
Hello,
On 01.12.2011, at 09:37, Sebastian Schelter wrote: > Daniel, can you plot two curves showing the distribution of > interactions per user and the distribution of interactions per item? I > think we need to get a better picture of your data first. > > Generally I always recommend to use precomputed similarities. You can > still serve new users with realtime recommendations, the only > disadvantages are the higher complexity and a delayed inclusion of new > items. In this paper: Fast Online Learning through Offline Initialization for Time-sensitive Recommendation http://users.cs.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p703.pdf Deepak Agarwal et. al. describes a solution how to include new items quickly into the recommendations. This is used for personalizing the news stories on the yahoo start page. @Daniel: I would also recommend to profile your application with JVisualVM: http://visualvm.java.net/ After I did this with my recommender. I figured out that the default cache size for item similarities was far to low. The details are described in this ticket: https://issues.apache.org/jira/browse/MAHOUT-905 > > --sebastian /Manuel > > 2011/11/30 Sean Owen <[EMAIL PROTECTED]>: >> The simple answer is that: >> >> Mahout absorbed a non-distributed recommender project called Taste, which >> scales up to a point which may be sufficient for a lot of users. It >> certainly is a lot simpler. Yes it is realistic to do near-real-time >> recommendations, though it gets harder and harder and requires more tuning, >> tradeoffs and optimization as this thread shows. >> >> The rest, written from scratch, is almost all distributed and Hadoop-based, >> including distributed re-implementations of the same algorithms. >> >> On Wed, Nov 30, 2011 at 8:23 PM, Dan Beaulieu >> <[EMAIL PROTECTED]>wrote: >> >>> Hi all, this is a tangent and can mostly be ignored by the people >>> interested in this problem. >>> >>> I'm new to Machine Learning and especially Mahout. Following this >>> discussion has made me a bit confused. >>> Isn't Mahout used for large datasets where it makes sense to distribute the >>> work? Why then isn't anyone pointing >>> out that the problem may be the use of one single Mahout node? Is it >>> because it's boolean based? Is it because the data set >>> isn't really that large? >>> >>> Even if for whatever reason a single node will do for this case, is it >>> really expected that the recommendation process would finish in less than >>> half a second? >>> This makes me think if that is the expectation then the data set is >>> actually small and Mahout might be overkill... >>> >>> What obvious piece of the Mahout puzzle am I missing? >>> >>> Thanks. >>> >>> Dan >>> >>> On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <[EMAIL PROTECTED]> wrote: >>> >>>> Have you used CachingItemSimilarity? That will hold common similarities >>> in >>>> memory. It's a lot easier than pre-computing and might help. >>>> >>>> I think something like your change is a good one (Sebastian what do you >>>> think) in that it gives you the ultimate lever to control how many >>>> candidates are evaluated. That ought to make it go as fast as you like, >>> but >>>> it trades off quality. Still I'd be really surprised if there's no viable >>>> middle ground -- this works fine at smaller scale, where 100s of >>> candidates >>>> are evaluated, perhaps, and you can use your lever to get to 100s of >>>> candidates at your scale too. Is that still both slow and inaccurate? >>>> >>>> On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[EMAIL PROTECTED]> >>> wrote: >>>> >>>>> I just tested the app with Mahout 0.6. >>>>> There seems to be a small performance improvement, but still >>>>> recommendations for the 'heavy users' take between 1-5 seconds. >>>>> >>>>> >>>> >>> -- Manuel Blechschmidt Dortustr. 57 14467 Potsdam Mobil: 0173/6322621 Twitter: http://twitter.com/Manuel_B
-
Re: Mahout performance issuesDaniel Zohar 2011-12-01, 14:11
@Sean, yes I am using CachingItemSimilarity, and I can see that over time
performance is better. @Manuel thanks for the tips. I have installed VisualVM and followed are the results I did two sampling - - With the optimized SamplingCandidateItemsStrategy ( http://pastebin.com/6n9C8Pw1): http://static.inky.ws/image/934/image.jpg - Without the optimized SamplingCandidateItemsStrategy: http://static.inky.ws/image/935/image.jpg @Sebastian, here are the two curves you asked for. Item-users: http://static.inky.ws/image/932/image.jpg User-items: http://static.inky.ws/image/932/image.jpg I think from the above curves one can clearly see that a lot of my data is not needed to be checked when looking for similar items. That's because if a user had only a single choice in the past, there's no point of checking for his other choices at all while doing item similarities. I would think it's something that should be integrated into the DataModel. Maybe there should be one Set that holds only users which had made more than one choice. This will greatly improve performance in my case. What do you think? On Thu, Dec 1, 2011 at 10:52 AM, Manuel Blechschmidt < [EMAIL PROTECTED]> wrote: > Hello, > > On 01.12.2011, at 09:37, Sebastian Schelter wrote: > > > Daniel, can you plot two curves showing the distribution of > > interactions per user and the distribution of interactions per item? I > > think we need to get a better picture of your data first. > > > > Generally I always recommend to use precomputed similarities. You can > > still serve new users with realtime recommendations, the only > > disadvantages are the higher complexity and a delayed inclusion of new > > items. > > In this paper: > Fast Online Learning through Offline Initialization for Time-sensitive > Recommendation > http://users.cs.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p703.pdf > Deepak Agarwal et. al. describes a solution how to include new items > quickly into the recommendations. > This is used for personalizing the news stories on the yahoo start page. > > @Daniel: I would also recommend to profile your application with JVisualVM: > http://visualvm.java.net/ > > After I did this with my recommender. I figured out that the default cache > size for item similarities was > far to low. The details are described in this ticket: > https://issues.apache.org/jira/browse/MAHOUT-905 > > > > > > --sebastian > > /Manuel > > > > > 2011/11/30 Sean Owen <[EMAIL PROTECTED]>: > >> The simple answer is that: > >> > >> Mahout absorbed a non-distributed recommender project called Taste, > which > >> scales up to a point which may be sufficient for a lot of users. It > >> certainly is a lot simpler. Yes it is realistic to do near-real-time > >> recommendations, though it gets harder and harder and requires more > tuning, > >> tradeoffs and optimization as this thread shows. > >> > >> The rest, written from scratch, is almost all distributed and > Hadoop-based, > >> including distributed re-implementations of the same algorithms. > >> > >> On Wed, Nov 30, 2011 at 8:23 PM, Dan Beaulieu > >> <[EMAIL PROTECTED]>wrote: > >> > >>> Hi all, this is a tangent and can mostly be ignored by the people > >>> interested in this problem. > >>> > >>> I'm new to Machine Learning and especially Mahout. Following this > >>> discussion has made me a bit confused. > >>> Isn't Mahout used for large datasets where it makes sense to > distribute the > >>> work? Why then isn't anyone pointing > >>> out that the problem may be the use of one single Mahout node? Is it > >>> because it's boolean based? Is it because the data set > >>> isn't really that large? > >>> > >>> Even if for whatever reason a single node will do for this case, is it > >>> really expected that the recommendation process would finish in less > than > >>> half a second? > >>> This makes me think if that is the expectation then the data set is > >>> actually small and Mahout might be overkill... > >>> > >>> What obvious piece of the Mahout puzzle am I missing?
-
Re: Mahout performance issuesTed Dunning 2011-12-01, 14:21
These are the same curve.
On Thu, Dec 1, 2011 at 6:11 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Sebastian, here are the two curves you asked for. > Item-users: http://static.inky.ws/image/932/image.jpg > User-items: http://static.inky.ws/image/932/image.jpg >
-
Re: Mahout performance issuesTed Dunning 2011-12-01, 14:22
Try here instead:
http://static.inky.ws/image/933/image.jpg On Thu, Dec 1, 2011 at 6:21 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: > These are the same curve. > > > On Thu, Dec 1, 2011 at 6:11 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > >> Sebastian, here are the two curves you asked for. >> Item-users: http://static.inky.ws/image/932/image.jpg >> User-items: http://static.inky.ws/image/932/image.jpg >> > >
-
Re: Mahout performance issuesTed Dunning 2011-12-01, 14:26
On Thu, Dec 1, 2011 at 6:11 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote:
> I think from the above curves one can clearly see that a lot of my data is > not needed to be checked when looking for similar items. That's because if > a user had only a single choice in the past, there's no point of checking > for his other choices at all while doing item similarities. > It would be fine to not load those users. > > I would think it's something that should be integrated into the DataModel. > Maybe there should be one Set that holds only users which had made more > than one choice. This will greatly improve performance in my case. What do > you think? > But there is the other problem that several of your users have made an absurd number of choices. This is commonly due to QA processes or spiders. You can moderate this effect by filtering what you consider to be an interaction to be something that requires a human to be engaged with the content. You can also downsample these users without eliminating them. This is done in the off-line processing of recommendation data for item-based recommendations, but I am not sure about on-line recommendations.
-
Re: Mahout performance issuesDaniel Zohar 2011-12-01, 14:32
On Thu, Dec 1, 2011 at 4:26 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> On Thu, Dec 1, 2011 at 6:11 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > I think from the above curves one can clearly see that a lot of my data > is > > not needed to be checked when looking for similar items. That's because > if > > a user had only a single choice in the past, there's no point of checking > > for his other choices at all while doing item similarities. > > > > It would be fine to not load those users. > Does Mahout offer something like that out-of-the-box or should I implement my own DataModel? Do you agree that this is the right place to do so? > > > > > > I would think it's something that should be integrated into the > DataModel. > > Maybe there should be one Set that holds only users which had made more > > than one choice. This will greatly improve performance in my case. What > do > > you think? > > > > But there is the other problem that several of your users have made an > absurd number of choices. This is commonly due to QA processes or spiders. > You can moderate this effect by filtering what you consider to be an > interaction to be something that requires a human to be engaged with the > content. > > You can also downsample these users without eliminating them. This is done > in the off-line processing of recommendation data for item-based > recommendations, but I am not sure about on-line recommendations. >
-
Re: Mahout performance issuesManuel Blechschmidt 2011-12-01, 14:56
Hi Daniel,
actually you are running the profile inside tomcat. You should take a snapshot and then drill down to the functions where the actual recommendation takes place. The current screenshots also contains some profiles from Tomcat threads which are sleeping a lot and therefore taking a lot of time. Further the screenshots does not contain the amount how often the different functions are called. You have to profile multiple requests alone. The CacheItemSimilarity gets filled therefore it should go faster and faster. On 01.12.2011, at 15:11, Daniel Zohar wrote: > @Manuel thanks for the tips. I have installed VisualVM and followed are the > results > I did two sampling - > - With the optimized SamplingCandidateItemsStrategy ( > http://pastebin.com/6n9C8Pw1): http://static.inky.ws/image/934/image.jpg > - Without the optimized SamplingCandidateItemsStrategy: > http://static.inky.ws/image/935/image.jpg > The big hot spot is the function FastIDSet.find(): Optimized: 13,759 s Unoptimized: 246,487 s So you see that your optimization already got you a performance boost of 2000%. Did you play around with the CacheItemSimilarity cache sizes? /Manuel -- Manuel Blechschmidt Dortustr. 57 14467 Potsdam Mobil: 0173/6322621 Twitter: http://twitter.com/Manuel_B
-
Re: Mahout performance issuesDaniel Zohar 2011-12-01, 15:06
Hi Manuel,
I haven't got to the point where CacheItemSimilarity kicks in. That is, I will have to run a lot of recommendations in order to get a real benefit from it. I would first like to optimize the 'cold start' so it's at least serves at reasonable time. Usually cache is used to prevent repeated calculations, but personally I dont think it's a replacement for optimized performance. Don't you agree? Also, I will try to profile the app now as you suggest and send the results asap. Thanks! On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt < [EMAIL PROTECTED]> wrote: > Hi Daniel, > actually you are running the profile inside tomcat. You should take a > snapshot and then drill down to the functions where the actual > recommendation takes place. The current screenshots also contains some > profiles from Tomcat threads which are sleeping a lot and therefore taking > a lot of time. > > Further the screenshots does not contain the amount how often the > different functions are called. > > You have to profile multiple requests alone. The CacheItemSimilarity gets > filled therefore it should go faster and faster. > > On 01.12.2011, at 15:11, Daniel Zohar wrote: > > > @Manuel thanks for the tips. I have installed VisualVM and followed are > the > > results > > I did two sampling - > > - With the optimized SamplingCandidateItemsStrategy ( > > http://pastebin.com/6n9C8Pw1): http://static.inky.ws/image/934/image.jpg > > - Without the optimized SamplingCandidateItemsStrategy: > > http://static.inky.ws/image/935/image.jpg > > > > The big hot spot is the function FastIDSet.find(): > > Optimized: 13,759 s > Unoptimized: 246,487 s > > So you see that your optimization already got you a performance boost of > 2000%. > > Did you play around with the CacheItemSimilarity cache sizes? > > /Manuel > > -- > Manuel Blechschmidt > Dortustr. 57 > 14467 Potsdam > Mobil: 0173/6322621 > Twitter: http://twitter.com/Manuel_B > >
-
Re: Mahout performance issuesSebastian Schelter 2011-12-01, 15:11
'Cold start' in collaborative filtering lingo refers to the fact that
you need to see some interactions for an item first before you can use in the recommendation computation. Let's better not use it as a technical term to avoid confusion :) --sebastian On 01.12.2011 16:06, Daniel Zohar wrote: > Hi Manuel, > I haven't got to the point where CacheItemSimilarity kicks in. That is, I > will have to run a lot of recommendations in order to get a real benefit > from it. I would first like to optimize the 'cold start' so it's at least > serves at reasonable time. Usually cache is used to prevent repeated > calculations, but personally I dont think it's a replacement for optimized > performance. Don't you agree? > > Also, I will try to profile the app now as you suggest and send the results > asap. > > Thanks! > > On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt < > [EMAIL PROTECTED]> wrote: > >> Hi Daniel, >> actually you are running the profile inside tomcat. You should take a >> snapshot and then drill down to the functions where the actual >> recommendation takes place. The current screenshots also contains some >> profiles from Tomcat threads which are sleeping a lot and therefore taking >> a lot of time. >> >> Further the screenshots does not contain the amount how often the >> different functions are called. >> >> You have to profile multiple requests alone. The CacheItemSimilarity gets >> filled therefore it should go faster and faster. >> >> On 01.12.2011, at 15:11, Daniel Zohar wrote: >> >>> @Manuel thanks for the tips. I have installed VisualVM and followed are >> the >>> results >>> I did two sampling - >>> - With the optimized SamplingCandidateItemsStrategy ( >>> http://pastebin.com/6n9C8Pw1): http://static.inky.ws/image/934/image.jpg >>> - Without the optimized SamplingCandidateItemsStrategy: >>> http://static.inky.ws/image/935/image.jpg >>> >> >> The big hot spot is the function FastIDSet.find(): >> >> Optimized: 13,759 s >> Unoptimized: 246,487 s >> >> So you see that your optimization already got you a performance boost of >> 2000%. >> >> Did you play around with the CacheItemSimilarity cache sizes? >> >> /Manuel >> >> -- >> Manuel Blechschmidt >> Dortustr. 57 >> 14467 Potsdam >> Mobil: 0173/6322621 >> Twitter: http://twitter.com/Manuel_B >> >> >
-
Re: Mahout performance issuesSean Owen 2011-12-01, 15:12
You can 'tickle' the cache asynchronously if you like.
I am still not clear on why you are doing so many item-item similarity calculations. Your change ought to let you do 1, or 10, or 100 per calculation if you like. That, we know, is fast. And a few hundred similarities should start to give reasonable recommendations. What is preventing you from making this tradeoff (with your change)? Yes, it is essential for reasonable performance. On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Hi Manuel, > I haven't got to the point where CacheItemSimilarity kicks in. That is, I > will have to run a lot of recommendations in order to get a real benefit > from it. I would first like to optimize the 'cold start' so it's at least > serves at reasonable time. Usually cache is used to prevent repeated > calculations, but personally I dont think it's a replacement for optimized > performance. Don't you agree? > > Also, I will try to profile the app now as you suggest and send the results > asap. > > Thanks! > > On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt < > [EMAIL PROTECTED]> wrote: > > > Hi Daniel, > > actually you are running the profile inside tomcat. You should take a > > snapshot and then drill down to the functions where the actual > > recommendation takes place. The current screenshots also contains some > > profiles from Tomcat threads which are sleeping a lot and therefore > taking > > a lot of time. > > > > Further the screenshots does not contain the amount how often the > > different functions are called. > > > > You have to profile multiple requests alone. The CacheItemSimilarity gets > > filled therefore it should go faster and faster. > > > > On 01.12.2011, at 15:11, Daniel Zohar wrote: > > > > > @Manuel thanks for the tips. I have installed VisualVM and followed are > > the > > > results > > > I did two sampling - > > > - With the optimized SamplingCandidateItemsStrategy ( > > > http://pastebin.com/6n9C8Pw1): > http://static.inky.ws/image/934/image.jpg > > > - Without the optimized SamplingCandidateItemsStrategy: > > > http://static.inky.ws/image/935/image.jpg > > > > > > > The big hot spot is the function FastIDSet.find(): > > > > Optimized: 13,759 s > > Unoptimized: 246,487 s > > > > So you see that your optimization already got you a performance boost of > > 2000%. > > > > Did you play around with the CacheItemSimilarity cache sizes? > > > > /Manuel > > > > -- > > Manuel Blechschmidt > > Dortustr. 57 > > 14467 Potsdam > > Mobil: 0173/6322621 > > Twitter: http://twitter.com/Manuel_B > > > > >
-
Re: Mahout performance issuesSebastian Schelter 2011-12-01, 15:16
If I remember correctly, you 12M users and 18M interactions.
If I interpret the plots correctly there is one single item that accounts for 8.5M interactions (nearly half of the overall interactions) and more than two thirds of the users like it? --sebastian On 01.12.2011 16:12, Sean Owen wrote: > You can 'tickle' the cache asynchronously if you like. > > I am still not clear on why you are doing so many item-item similarity > calculations. Your change ought to let you do 1, or 10, or 100 per > calculation if you like. That, we know, is fast. And a few hundred > similarities should start to give reasonable recommendations. > > What is preventing you from making this tradeoff (with your change)? > Yes, it is essential for reasonable performance. > > On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > >> Hi Manuel, >> I haven't got to the point where CacheItemSimilarity kicks in. That is, I >> will have to run a lot of recommendations in order to get a real benefit >> from it. I would first like to optimize the 'cold start' so it's at least >> serves at reasonable time. Usually cache is used to prevent repeated >> calculations, but personally I dont think it's a replacement for optimized >> performance. Don't you agree? >> >> Also, I will try to profile the app now as you suggest and send the results >> asap. >> >> Thanks! >> >> On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt < >> [EMAIL PROTECTED]> wrote: >> >>> Hi Daniel, >>> actually you are running the profile inside tomcat. You should take a >>> snapshot and then drill down to the functions where the actual >>> recommendation takes place. The current screenshots also contains some >>> profiles from Tomcat threads which are sleeping a lot and therefore >> taking >>> a lot of time. >>> >>> Further the screenshots does not contain the amount how often the >>> different functions are called. >>> >>> You have to profile multiple requests alone. The CacheItemSimilarity gets >>> filled therefore it should go faster and faster. >>> >>> On 01.12.2011, at 15:11, Daniel Zohar wrote: >>> >>>> @Manuel thanks for the tips. I have installed VisualVM and followed are >>> the >>>> results >>>> I did two sampling - >>>> - With the optimized SamplingCandidateItemsStrategy ( >>>> http://pastebin.com/6n9C8Pw1): >> http://static.inky.ws/image/934/image.jpg >>>> - Without the optimized SamplingCandidateItemsStrategy: >>>> http://static.inky.ws/image/935/image.jpg >>>> >>> >>> The big hot spot is the function FastIDSet.find(): >>> >>> Optimized: 13,759 s >>> Unoptimized: 246,487 s >>> >>> So you see that your optimization already got you a performance boost of >>> 2000%. >>> >>> Did you play around with the CacheItemSimilarity cache sizes? >>> >>> /Manuel >>> >>> -- >>> Manuel Blechschmidt >>> Dortustr. 57 >>> 14467 Potsdam >>> Mobil: 0173/6322621 >>> Twitter: http://twitter.com/Manuel_B >>> >>> >> >
-
Re: Mahout performance issuesManuel Blechschmidt 2011-12-01, 15:18
Hi Dan,
On 30.11.2011, at 21:23, Dan Beaulieu wrote: > Hi all, this is a tangent and can mostly be ignored by the people > interested in this problem. > > I'm new to Machine Learning and especially Mahout. Following this > discussion has made me a bit confused. > Isn't Mahout used for large datasets where it makes sense to distribute the > work? Why then isn't anyone pointing > out that the problem may be the use of one single Mahout node? Is it > because it's boolean based? Is it because the data set > isn't really that large? Isabel already gave a good explanation. Nevertheless as it turns out at the moment the problem of this performance issues seams to be the item similarity. There is a distributed approach of calculating this data: https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html Sebastian Schelter wrote a tutorial how to use this job: http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/ Nevertheless not everybody is maintaining a hadoop cluster. For example I did not use a cluster yet. As a rule of thumb (by Sean Owen) you can calculate everything until 100.000.000 Ratings on your normal machine. > > Even if for whatever reason a single node will do for this case, is it > really expected that the recommendation process would finish in less than > half a second? Yes, it is. Recommendation is a real time problem but how to do it in realtime is still a question where a lot of research is put in. A lot of people from mahout are working in an academic context so it is unclear yet how to handle the different problems. Mahout has a lot of possibilities to tweak. For a small dataset I did a benchmark published here: http://thread.gmane.org/gmane.comp.apache.mahout.user/10433 Actually for every recommender there is a trade off between: - accuracy - space - time It is a tough task to find the sweet spot. > This makes me think if that is the expectation then the data set is > actually small and Mahout might be overkill... > > What obvious piece of the Mahout puzzle am I missing? Hope that helps Manuel > > Thanks. > > Dan > > On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > >> Have you used CachingItemSimilarity? That will hold common similarities in >> memory. It's a lot easier than pre-computing and might help. >> >> I think something like your change is a good one (Sebastian what do you >> think) in that it gives you the ultimate lever to control how many >> candidates are evaluated. That ought to make it go as fast as you like, but >> it trades off quality. Still I'd be really surprised if there's no viable >> middle ground -- this works fine at smaller scale, where 100s of candidates >> are evaluated, perhaps, and you can use your lever to get to 100s of >> candidates at your scale too. Is that still both slow and inaccurate? >> >> On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: >> >>> I just tested the app with Mahout 0.6. >>> There seems to be a small performance improvement, but still >>> recommendations for the 'heavy users' take between 1-5 seconds. >>> >>> >> -- Manuel Blechschmidt Dortustr. 57 14467 Potsdam Mobil: 0173/6322621 Twitter: http://twitter.com/Manuel_B
-
Re: Mahout performance issuesDaniel Zohar 2011-12-01, 15:24
Here is the correct snapshot - http://static.inky.ws/image/937/image.jpg
If I read this correctly, the FastIDSet.intersectionSize takes most of the time. I really think this can be extremely improved if the following code in GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor will return only users which have made a choice for more than one item: FastIDSet userIDs1 = preferenceForItems.get(itemID1); ..... FastIDSet userIDs2 = preferenceForItems.get(itemID2); ..... return userIDs1.intersectionSize(userIDs2); @Sebastian - No it's the other way around, ~8.5M users had only chosen a single item. The item with the users associated is about 400k. @Sean, Yes, my soution can solves the problem, but I feel that the optimization mentioned above can really boost the performance, and I think that can contribute for all of Mahout's users. What do you think? On Thu, Dec 1, 2011 at 5:16 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > If I remember correctly, you 12M users and 18M interactions. > > If I interpret the plots correctly there is one single item that > accounts for 8.5M interactions (nearly half of the overall interactions) > and more than two thirds of the users like it? > > --sebastian > > On 01.12.2011 16:12, Sean Owen wrote: > > You can 'tickle' the cache asynchronously if you like. > > > > I am still not clear on why you are doing so many item-item similarity > > calculations. Your change ought to let you do 1, or 10, or 100 per > > calculation if you like. That, we know, is fast. And a few hundred > > similarities should start to give reasonable recommendations. > > > > What is preventing you from making this tradeoff (with your change)? > > Yes, it is essential for reasonable performance. > > > > On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > >> Hi Manuel, > >> I haven't got to the point where CacheItemSimilarity kicks in. That is, > I > >> will have to run a lot of recommendations in order to get a real benefit > >> from it. I would first like to optimize the 'cold start' so it's at > least > >> serves at reasonable time. Usually cache is used to prevent repeated > >> calculations, but personally I dont think it's a replacement for > optimized > >> performance. Don't you agree? > >> > >> Also, I will try to profile the app now as you suggest and send the > results > >> asap. > >> > >> Thanks! > >> > >> On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt < > >> [EMAIL PROTECTED]> wrote: > >> > >>> Hi Daniel, > >>> actually you are running the profile inside tomcat. You should take a > >>> snapshot and then drill down to the functions where the actual > >>> recommendation takes place. The current screenshots also contains some > >>> profiles from Tomcat threads which are sleeping a lot and therefore > >> taking > >>> a lot of time. > >>> > >>> Further the screenshots does not contain the amount how often the > >>> different functions are called. > >>> > >>> You have to profile multiple requests alone. The CacheItemSimilarity > gets > >>> filled therefore it should go faster and faster. > >>> > >>> On 01.12.2011, at 15:11, Daniel Zohar wrote: > >>> > >>>> @Manuel thanks for the tips. I have installed VisualVM and followed > are > >>> the > >>>> results > >>>> I did two sampling - > >>>> - With the optimized SamplingCandidateItemsStrategy ( > >>>> http://pastebin.com/6n9C8Pw1): > >> http://static.inky.ws/image/934/image.jpg > >>>> - Without the optimized SamplingCandidateItemsStrategy: > >>>> http://static.inky.ws/image/935/image.jpg > >>>> > >>> > >>> The big hot spot is the function FastIDSet.find(): > >>> > >>> Optimized: 13,759 s > >>> Unoptimized: 246,487 s > >>> > >>> So you see that your optimization already got you a performance boost > of > >>> 2000%. > >>> > >>> Did you play around with the CacheItemSimilarity cache sizes? > >>> > >>> /Manuel > >>> > >>> -- > >>> Manuel Blechschmidt > >>> Dortustr. 57 > >>> 14467 Potsdam > >>> Mobil: 0173/6322621 > >>> Twitter: http://twitter.com/Manuel_B
-
Re: Mahout performance issuesManuel Blechschmidt 2011-12-01, 15:24
Hello Daniel,
On 01.12.2011, at 16:06, Daniel Zohar wrote: > Hi Manuel, > I haven't got to the point where CacheItemSimilarity kicks in. That is, I > will have to run a lot of recommendations in order to get a real benefit > from it. I would first like to optimize the 'cold start' so it's at least > serves at reasonable time. Usually cache is used to prevent repeated > calculations, but personally I dont think it's a replacement for optimized > performance. Don't you agree? I agree but making recommendations work in real time is not an engineering problem. It is an academic problem. Mahout is already implemented in an optimized manner. They are not using the Java Collections Framework and are using the colt library for math calculations. What you are currently doing is tuning different parameters of the recommender to get the best fit between: - accuracy - space - time It would be great if you could just specify your requirements for example: 90% of recommendations in less then 0.5s and 512m of RAM and the system automatically adjusts the different tuning parameters to get the best accuracy with this set up. /Manuel > > Also, I will try to profile the app now as you suggest and send the results > asap. > > Thanks! > > On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt < > [EMAIL PROTECTED]> wrote: > >> Hi Daniel, >> actually you are running the profile inside tomcat. You should take a >> snapshot and then drill down to the functions where the actual >> recommendation takes place. The current screenshots also contains some >> profiles from Tomcat threads which are sleeping a lot and therefore taking >> a lot of time. >> >> Further the screenshots does not contain the amount how often the >> different functions are called. >> >> You have to profile multiple requests alone. The CacheItemSimilarity gets >> filled therefore it should go faster and faster. >> >> On 01.12.2011, at 15:11, Daniel Zohar wrote: >> >>> @Manuel thanks for the tips. I have installed VisualVM and followed are >> the >>> results >>> I did two sampling - >>> - With the optimized SamplingCandidateItemsStrategy ( >>> http://pastebin.com/6n9C8Pw1): http://static.inky.ws/image/934/image.jpg >>> - Without the optimized SamplingCandidateItemsStrategy: >>> http://static.inky.ws/image/935/image.jpg >>> >> >> The big hot spot is the function FastIDSet.find(): >> >> Optimized: 13,759 s >> Unoptimized: 246,487 s >> >> So you see that your optimization already got you a performance boost of >> 2000%. >> >> Did you play around with the CacheItemSimilarity cache sizes? >> >> /Manuel >> >> -- >> Manuel Blechschmidt >> Dortustr. 57 >> 14467 Potsdam >> Mobil: 0173/6322621 >> Twitter: http://twitter.com/Manuel_B >> >> -- Manuel Blechschmidt Dortustr. 57 14467 Potsdam Mobil: 0173/6322621 Twitter: http://twitter.com/Manuel_B
-
Re: Mahout performance issuesSebastian Schelter 2011-12-01, 15:28
> Sebastian Schelter wrote a tutorial how to use this job: > http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/ Please not that this not a general description of how to use the job, but a description of an architecture for a certain class of usecases.
-
Re: Mahout performance issuesSebastian Schelter 2011-12-01, 18:10
If I remember correctly, you have 12M users and 18M interactions.
If I interpret the plots correctly there is one single item that accounts for 8.5M interactions (nearly half of the overall interactions) and more than two thirds of the users like it? If that is true, this item will co-occurr with virtually every other item in the dataset, ruining the runtime as you will have to estimate the preference for every item each time you compute recommendations. Normally the sampling done by SamplingCandidateItemStrategy should hit such 'top-sellers' harder then the rest and therefore mitigate the impact of them on the runtime, but I guess your dataset has so few per-user interactions overall that the sampling doesn't really help here. This top item is also of no real value as everybody seems to already know it and was able to find it. You can't really learn a lot from an item that everybody likes. Can you check my findings and try to simply throw the item away? --sebastian On 01.12.2011 16:16, Sebastian Schelter wrote: > > --sebastian > > On 01.12.2011 16:12, Sean Owen wrote: >> You can 'tickle' the cache asynchronously if you like. >> >> I am still not clear on why you are doing so many item-item similarity >> calculations. Your change ought to let you do 1, or 10, or 100 per >> calculation if you like. That, we know, is fast. And a few hundred >> similarities should start to give reasonable recommendations. >> >> What is preventing you from making this tradeoff (with your change)? >> Yes, it is essential for reasonable performance. >> >> On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: >> >>> Hi Manuel, >>> I haven't got to the point where CacheItemSimilarity kicks in. That is, I >>> will have to run a lot of recommendations in order to get a real benefit >>> from it. I would first like to optimize the 'cold start' so it's at least >>> serves at reasonable time. Usually cache is used to prevent repeated >>> calculations, but personally I dont think it's a replacement for optimized >>> performance. Don't you agree? >>> >>> Also, I will try to profile the app now as you suggest and send the results >>> asap. >>> >>> Thanks! >>> >>> On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt < >>> [EMAIL PROTECTED]> wrote: >>> >>>> Hi Daniel, >>>> actually you are running the profile inside tomcat. You should take a >>>> snapshot and then drill down to the functions where the actual >>>> recommendation takes place. The current screenshots also contains some >>>> profiles from Tomcat threads which are sleeping a lot and therefore >>> taking >>>> a lot of time. >>>> >>>> Further the screenshots does not contain the amount how often the >>>> different functions are called. >>>> >>>> You have to profile multiple requests alone. The CacheItemSimilarity gets >>>> filled therefore it should go faster and faster. >>>> >>>> On 01.12.2011, at 15:11, Daniel Zohar wrote: >>>> >>>>> @Manuel thanks for the tips. I have installed VisualVM and followed are >>>> the >>>>> results >>>>> I did two sampling - >>>>> - With the optimized SamplingCandidateItemsStrategy ( >>>>> http://pastebin.com/6n9C8Pw1): >>> http://static.inky.ws/image/934/image.jpg >>>>> - Without the optimized SamplingCandidateItemsStrategy: >>>>> http://static.inky.ws/image/935/image.jpg >>>>> >>>> >>>> The big hot spot is the function FastIDSet.find(): >>>> >>>> Optimized: 13,759 s >>>> Unoptimized: 246,487 s >>>> >>>> So you see that your optimization already got you a performance boost of >>>> 2000%. >>>> >>>> Did you play around with the CacheItemSimilarity cache sizes? >>>> >>>> /Manuel >>>> >>>> -- >>>> Manuel Blechschmidt >>>> Dortustr. 57 >>>> 14467 Potsdam >>>> Mobil: 0173/6322621 >>>> Twitter: http://twitter.com/Manuel_B >>>> >>>> >>> >> >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-01, 22:35
Sebastian, as I wrote before, it's the other way around. ~8.5M users had
only chosen a single item. The item with the most interactions is about 400k. This is why I'm looking now into improving GenericBooleanPrefDataModel to not take into account users which made one interaction under the 'preferenceForItems' Map. What do you think about this approach? On Thu, Dec 1, 2011 at 8:10 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > If I remember correctly, you have 12M users and 18M interactions. > > If I interpret the plots correctly there is one single item that > accounts for 8.5M interactions (nearly half of the overall interactions) > and more than two thirds of the users like it? > > If that is true, this item will co-occurr with virtually every other > item in the dataset, ruining the runtime as you will have to estimate > the preference for every item each time you compute recommendations. > > Normally the sampling done by SamplingCandidateItemStrategy should hit > such 'top-sellers' harder then the rest and therefore mitigate the > impact of them on the runtime, but I guess your dataset has so few > per-user interactions overall that the sampling doesn't really help here. > > This top item is also of no real value as everybody seems to already > know it and was able to find it. You can't really learn a lot from an > item that everybody likes. > > Can you check my findings and try to simply throw the item away? > > --sebastian > > > > On 01.12.2011 16:16, Sebastian Schelter wrote: > > > > > --sebastian > > > > On 01.12.2011 16:12, Sean Owen wrote: > >> You can 'tickle' the cache asynchronously if you like. > >> > >> I am still not clear on why you are doing so many item-item similarity > >> calculations. Your change ought to let you do 1, or 10, or 100 per > >> calculation if you like. That, we know, is fast. And a few hundred > >> similarities should start to give reasonable recommendations. > >> > >> What is preventing you from making this tradeoff (with your change)? > >> Yes, it is essential for reasonable performance. > >> > >> On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > >> > >>> Hi Manuel, > >>> I haven't got to the point where CacheItemSimilarity kicks in. That > is, I > >>> will have to run a lot of recommendations in order to get a real > benefit > >>> from it. I would first like to optimize the 'cold start' so it's at > least > >>> serves at reasonable time. Usually cache is used to prevent repeated > >>> calculations, but personally I dont think it's a replacement for > optimized > >>> performance. Don't you agree? > >>> > >>> Also, I will try to profile the app now as you suggest and send the > results > >>> asap. > >>> > >>> Thanks! > >>> > >>> On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt < > >>> [EMAIL PROTECTED]> wrote: > >>> > >>>> Hi Daniel, > >>>> actually you are running the profile inside tomcat. You should take a > >>>> snapshot and then drill down to the functions where the actual > >>>> recommendation takes place. The current screenshots also contains some > >>>> profiles from Tomcat threads which are sleeping a lot and therefore > >>> taking > >>>> a lot of time. > >>>> > >>>> Further the screenshots does not contain the amount how often the > >>>> different functions are called. > >>>> > >>>> You have to profile multiple requests alone. The CacheItemSimilarity > gets > >>>> filled therefore it should go faster and faster. > >>>> > >>>> On 01.12.2011, at 15:11, Daniel Zohar wrote: > >>>> > >>>>> @Manuel thanks for the tips. I have installed VisualVM and followed > are > >>>> the > >>>>> results > >>>>> I did two sampling - > >>>>> - With the optimized SamplingCandidateItemsStrategy ( > >>>>> http://pastebin.com/6n9C8Pw1): > >>> http://static.inky.ws/image/934/image.jpg > >>>>> - Without the optimized SamplingCandidateItemsStrategy: > >>>>> http://static.inky.ws/image/935/image.jpg > >>>>> > >>>> > >>>> The big hot spot is the function FastIDSet.find():
-
Re: Mahout performance issuesTed Dunning 2011-12-01, 22:46
This may or may not help much. My guess is that the improvement will be
very modest. The most serious problem is going to be recommendations for anybody who has rated one of these excessively popular items. That item will bring in a huge number of other users and thus a huge number of items to consider. If you down-sample ratings of the prolific users and kill super-common items, I think you will see much more improvement than simply eliminating the singleton users. The basic issue is that cooccurrence based algorithms have run-time proportional to O(n_max^2) where n_max is the maximum number of items per user. On Thu, Dec 1, 2011 at 2:35 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > This is why I'm looking now into improving GenericBooleanPrefDataModel to > not take into account users which made one interaction under the > 'preferenceForItems' Map. What do you think about this approach? >
-
Re: Mahout performance issuesSean Owen 2011-12-01, 22:46
These users should cause problems though. They don't add anything to a set
of candidates. Taking them away means you can't recommend anything to them. I doubt this is quite the issue. (That item with 400K interactions might be just fine to remove!) You are certainly bottleneck on item-item similarity, from your graph -- intersectionSize() is the heart of the loglikelihood computation. I still do not understand why your proposed change does not solve the problem! You can turn down the candidate set size as low as you want. At a "reasonable" size quality will still be OK. I'm missing something here. On Thu, Dec 1, 2011 at 10:35 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Sebastian, as I wrote before, it's the other way around. ~8.5M users had > only chosen a single item. The item with the most interactions is about > 400k. > This is why I'm looking now into improving GenericBooleanPrefDataModel to > not take into account users which made one interaction under the > 'preferenceForItems' Map. What do you think about this approach? > >
-
Re: Mahout performance issuesSean Owen 2011-12-01, 22:46
I meant, "These users should *not* cause problems ..."
On Thu, Dec 1, 2011 at 10:46 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > These users should cause problems though. They don't add anything to a set > of candidates. Taking them away means you can't recommend anything to them. > I doubt this is quite the issue.
-
Re: Mahout performance issuesSean Owen 2011-12-01, 22:51
(Agree, and the sampling happens at the user level now -- so if you sample
one of these users, it slows down a lot. The spirit of the proposed change is to make sampling more fine-grained, at the individual item level. That seems to certainly fix this.) On Thu, Dec 1, 2011 at 10:46 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > This may or may not help much. My guess is that the improvement will be > very modest. > > The most serious problem is going to be recommendations for anybody who has > rated one of these excessively popular items. That item will bring in a > huge number of other users and thus a huge number of items to consider. If > you down-sample ratings of the prolific users and kill super-common items, > I think you will see much more improvement than simply eliminating the > singleton users. > > The basic issue is that cooccurrence based algorithms have run-time > proportional to O(n_max^2) where n_max is the maximum number of items per > user. > > On Thu, Dec 1, 2011 at 2:35 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > This is why I'm looking now into improving GenericBooleanPrefDataModel to > > not take into account users which made one interaction under the > > 'preferenceForItems' Map. What do you think about this approach? > > >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 11:02
Hi guys,
@Sean, You are obviously right by saying that reducing the cap limit would yield better performance. However I believe it would yield worse accuracy. This is because the more items a user interacted with, the smaller is the percentage of the capped possible items relatively to the actual possible items. @Ted, your approach is also good and I will think now how to integrate it in my solution. I just ran the fix I proposed earlier and I got great results! The query time was reduced to about a third for the 'heavy users'. Before it was 1-5 secs and now it's 0.5-1.5. The best part is that the accuracy level should remain exactly the same. I also believe it should reduce memory consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets significantly smaller (in my case at least). The fix is merely adding two lines of code to one of the GenericBooleanPrefDataModel constructors. See http://pastebin.com/K5PB68Et, the lines I added are #11, #22. The only problem I see at the moment, is that the similarities implementations are using the num of users per item in the item-item similarity calculation. This _can_ be mitigated by creating an additional Map in the DataModel which maps itemID to numUsers. What do you think about the proposed solution? Perhaps I am missing some other implications? Thanks! On Fri, Dec 2, 2011 at 12:51 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > (Agree, and the sampling happens at the user level now -- so if you sample > one of these users, it slows down a lot. The spirit of the proposed change > is to make sampling more fine-grained, at the individual item level. That > seems to certainly fix this.) > > On Thu, Dec 1, 2011 at 10:46 PM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > > > This may or may not help much. My guess is that the improvement will be > > very modest. > > > > The most serious problem is going to be recommendations for anybody who > has > > rated one of these excessively popular items. That item will bring in a > > huge number of other users and thus a huge number of items to consider. > If > > you down-sample ratings of the prolific users and kill super-common > items, > > I think you will see much more improvement than simply eliminating the > > singleton users. > > > > The basic issue is that cooccurrence based algorithms have run-time > > proportional to O(n_max^2) where n_max is the maximum number of items per > > user. > > > > On Thu, Dec 1, 2011 at 2:35 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > > > This is why I'm looking now into improving GenericBooleanPrefDataModel > to > > > not take into account users which made one interaction under the > > > 'preferenceForItems' Map. What do you think about this approach? > > > > > >
-
Re: Mahout performance issuesSean Owen 2011-12-02, 11:10
On Fri, Dec 2, 2011 at 11:02 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote:
> Hi guys, > > @Sean, You are obviously right by saying that reducing the cap limit would > yield better performance. However I believe it would yield worse accuracy. > This is because the more items a user interacted with, the smaller is > the percentage of the capped possible items relatively to the actual > possible items. > That's right. I'm saying there must be a middle ground that works on both counts, since it works fine at smaller scales, where you only have hundreds of interactions per recommendation computation. So, if you tune it to use 100, for example, I imagine you get "good" recommendations and it should be pretty fast, right? I don't see why this isn't the solution. > > I just ran the fix I proposed earlier and I got great results! The query > time was reduced to about a third for the 'heavy users'. Before it was 1-5 > secs and now it's 0.5-1.5. The best part is that the accuracy level should > remain exactly the same. I also believe it should reduce memory > consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets > significantly smaller (in my case at least). > > The fix is merely adding two lines of code to one of > the GenericBooleanPrefDataModel constructors. See > http://pastebin.com/K5PB68Et, the lines I added are #11, #22. > I don't think this works though, because you've deleted the one data point you have for those users. They can't get recommendations now. I can't figure out how that speeds up recommendations though, what am I missing? these users aren't providing any more item-item interactions to consider.
-
Re: Mahout performance issuesManuel Blechschmidt 2011-12-02, 11:20
Hello Daniel,
On 02.12.2011, at 12:02, Daniel Zohar wrote: > Hi guys, > > ... > I just ran the fix I proposed earlier and I got great results! The query > time was reduced to about a third for the 'heavy users'. Before it was 1-5 > secs and now it's 0.5-1.5. The best part is that the accuracy level should > remain exactly the same. I also believe it should reduce memory > consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets > significantly smaller (in my case at least). It would be great if you could measure your run time performance and your accuracy with the provided Mahout tools. In your case because you only have boolean feedback precision and recall would make sense. https://cwiki.apache.org/MAHOUT/recommender-documentation.html RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator(); IRStatistics stats = evaluator.evaluate(builder, null, myModel, null, 3, RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0); Here is some example code from me: public void testEvaluateRecommender() { try { DataModel myModel = new MyModelImplementationDataModel(); // Users: 12858 // Items: 5467 // MaxPreference: 85850.0 // MinPreference: 50.0 System.out.println("Users: "+myModel.getNumUsers()); System.out.println("Items: "+myModel.getNumItems()); System.out.println("MaxPreference: "+myModel.getMaxPreference()); System.out.println("MinPreference: "+myModel.getMinPreference()); RecommenderBuilder randomBased = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here try { return new RandomRecommender(model); } catch (TasteException e) { // TODO Auto-generated catch block e.printStackTrace(); return null; } } }; RecommenderBuilder genericItemBased = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here try { return new GenericItemBasedRecommender(model, new PearsonCorrelationSimilarity(model)); } catch (TasteException e) { // TODO Auto-generated catch block e.printStackTrace(); return null; } } }; RecommenderBuilder genericItemBasedCosine = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here try { return new GenericItemBasedRecommender(model, new UncenteredCosineSimilarity(model)); } catch (TasteException e) { // TODO Auto-generated catch block e.printStackTrace(); return null; } } }; RecommenderBuilder genericItemBasedLikely = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here return new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model)); } }; RecommenderBuilder genericUserBasedNN3 = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here try { return new GenericUserBasedRecommender( model, new NearestNUserNeighborhood( 3, new PearsonCorrelationSimilarity(model), model), new PearsonCorrelationSimilarity(model)); } catch (TasteException e) { // TODO Auto-generated catch block e.printStackTrace(); return null; } } }; RecommenderBuilder genericUserBasedNN20 = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here try { return new GenericUserBasedRecommender( model, new NearestNUserNeighborhood( 20, new PearsonCorrelationSimilarity(model), model), new PearsonCorrelationSimilarity(model)); } catch (TasteException e) { // TODO Auto-generated catch block e.printStackTrace(); return null; } } }; RecommenderBuilder slopeOneBased = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here try { return new SlopeOneRecommender(model); } catch (TasteException e) { // TODO Auto-generated catch block e.printStackTrace(); return null; } } }; RecommenderBuilder svdBased = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here try { return new SVDRecommender(model, new ALSWRFactorizer( model, 100, 0.3, 5)); } catch (TasteException e) { // TODO Auto-generated catch block e.printStackTrace(); return null; } } }; // Data Set Summary: // 12858 users // 121304 preferences RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator(); double evaluation = evaluator.evaluate(randomBased, null, myModel, 0.9, 1.0); // Evaluation of randomBased (baseline): 43045.380570443434 // (RandomRecommender(model)) System.out.println("Evaluation of randomBased (baseline): " + evaluation); // evaluation = evaluator.evaluate(genericItemBased, null, myModel, // 0.9, 1.0); // Evaluation of ItemBased with Pearson Correlation: // 315.5804958647985 (GenericItemBasedRecommender(model, // PearsonCorrelationSimilarity(model)) // System.out // .println("Evaluation of ItemBased with Pearson Correlation: " // + evaluation); // evaluation = evaluator.evaluate(genericItemBasedCosine, null, // myModel, 0.9, 1.0); // Evaluation of ItemBase with uncentered Cosine: 198.25393235323375 // (GenericItemBasedRecommender(model, // UncenteredCosineSimilarity(mo
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 11:28
On Fri, Dec 2, 2011 at 1:10 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> On Fri, Dec 2, 2011 at 11:02 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > Hi guys, > > > > @Sean, You are obviously right by saying that reducing the cap limit > would > > yield better performance. However I believe it would yield worse > accuracy. > > This is because the more items a user interacted with, the smaller is > > the percentage of the capped possible items relatively to the actual > > possible items. > > > > That's right. I'm saying there must be a middle ground that works on both > counts, since it works fine at smaller scales, where you only have hundreds > of interactions per recommendation computation. So, if you tune it to use > 100, for example, I imagine you get "good" recommendations and it should be > pretty fast, right? > > I don't see why this isn't the solution. > > I'm already capping it at 100. If this will be my last resort, I will decrease it more :) > > > > > I just ran the fix I proposed earlier and I got great results! The query > > time was reduced to about a third for the 'heavy users'. Before it was > 1-5 > > secs and now it's 0.5-1.5. The best part is that the accuracy level > should > > remain exactly the same. I also believe it should reduce memory > > consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets > > significantly smaller (in my case at least). > > > > The fix is merely adding two lines of code to one of > > the GenericBooleanPrefDataModel constructors. See > > http://pastebin.com/K5PB68Et, the lines I added are #11, #22. > > > > I don't think this works though, because you've deleted the one data point > you have for those users. They can't get recommendations now. > > I can't figure out how that speeds up recommendations though, what am I > missing? these users aren't providing any more item-item interactions to > consider. > You know this code way better than I do, so perhaps I am missing something here. But as I see it (and I tested it as well) the users data point remains intact. That's because the preferenceFromUsers Set remains the same while only preferenceForItems is optimized. The main reason it improves performance is because of the bottleneck we diagnosed before - `GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor` which in turn calls `FastIDSet.intersectionSize`. Now, if we know _for sure_ that a user interacted with a single item only, what's the point of checking every time if it had interacted with other items? (I hope I make myself clear) Because in my data set, we have over 80% of users which had a single interaction, it gives such a performance boost. (I believe this case might be more common than one might think in web apps)
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 11:29
Hello Manuel,
I will run the tests as requested and post the results later. On Fri, Dec 2, 2011 at 1:20 PM, Manuel Blechschmidt < [EMAIL PROTECTED]> wrote: > Hello Daniel, > > On 02.12.2011, at 12:02, Daniel Zohar wrote: > > > Hi guys, > > > > ... > > I just ran the fix I proposed earlier and I got great results! The query > > time was reduced to about a third for the 'heavy users'. Before it was > 1-5 > > secs and now it's 0.5-1.5. The best part is that the accuracy level > should > > remain exactly the same. I also believe it should reduce memory > > consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets > > significantly smaller (in my case at least). > > It would be great if you could measure your run time performance and your > accuracy with the provided Mahout tools. > > In your case because you only have boolean feedback precision and recall > would make sense. > > https://cwiki.apache.org/MAHOUT/recommender-documentation.html > > RecommenderIRStatsEvaluator evaluator = new > GenericRecommenderIRStatsEvaluator(); > IRStatistics stats = evaluator.evaluate(builder, null, myModel, null, 3, > RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0); > > > Here is some example code from me: > > public void testEvaluateRecommender() { > try { > DataModel myModel = new > MyModelImplementationDataModel(); > > // Users: 12858 > // Items: 5467 > // MaxPreference: 85850.0 > // MinPreference: 50.0 > System.out.println("Users: "+myModel.getNumUsers()); > System.out.println("Items: "+myModel.getNumItems()); > System.out.println("MaxPreference: > "+myModel.getMaxPreference()); > System.out.println("MinPreference: > "+myModel.getMinPreference()); > > RecommenderBuilder randomBased = new > RecommenderBuilder() { > public Recommender > buildRecommender(DataModel model) { > // build and return the Recommender > to evaluate here > try { > return new > RandomRecommender(model); > } catch (TasteException e) { > // TODO Auto-generated > catch block > e.printStackTrace(); > return null; > } > } > }; > > RecommenderBuilder genericItemBased = new > RecommenderBuilder() { > public Recommender > buildRecommender(DataModel model) { > // build and return the Recommender > to evaluate here > try { > return new > GenericItemBasedRecommender(model, > new > PearsonCorrelationSimilarity(model)); > } catch (TasteException e) { > // TODO Auto-generated > catch block > e.printStackTrace(); > return null; > } > } > }; > > RecommenderBuilder genericItemBasedCosine = new > RecommenderBuilder() { > public Recommender > buildRecommender(DataModel model) { > // build and return the Recommender > to evaluate here > try { > return new
-
Re: Mahout performance issuesSean Owen 2011-12-02, 11:53
On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote:
> I'm already capping it at 100. If this will be my last resort, I will > decrease it more :) > This just can't be... 100 item-item similarities takes milliseconds to compute. Something else is going on. I should make a JIRA to propose my own version of this filtering just to make sure we're talking about the same thing. > > You know this code way better than I do, so perhaps I am missing something > here. But as I see it (and I tested it as well) the users data point > remains intact. That's because the preferenceFromUsers Set remains the same > while only preferenceForItems is optimized. The main reason it improves > performance is because of the bottleneck we diagnosed before - > `GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor` which in turn > calls `FastIDSet.intersectionSize`. Now, if we know _for sure_ that a user > interacted with a single item only, what's the point of checking every time > if it had interacted with other items? (I hope I make myself clear) > Because in my data set, we have over 80% of users which had a single > interaction, it gives such a performance boost. (I believe this case might > be more common than one might think in web apps) > Let me propose a better way to address that bottleneck. I think the problem is that the intersection computation is dumb, and should really compute "larger.intersectionSize(smaller)". Try ending getNumUsersWithPreferenceFor() with: return userIDs1.size() < userIDs2.size() ? userIDs2.intersectionSize(userIDs1) : userIDs1.intersectionSize(userIDs2); It won't produce the same speedup, but it's more correct than omitting this data just to get this effect. If it gets 80% of the speedup, that's a great win.
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 14:33
On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > I'm already capping it at 100. If this will be my last resort, I will > > decrease it more :) > > > > This just can't be... 100 item-item similarities takes milliseconds to > compute. Something else is going on. > I should make a JIRA to propose my own version of this filtering just to > make sure we're talking about the same thing. > > > I limit only the possibleItemIDs in my altered version of SamplingCandidateItemsStrategy Sine TopItems.getTopItems() computes similarities for every previously interacted item with the set of 'possibleItemIDs', you are correct only when the user have a single interaction. However, if the user had made 20 interactions, we're talking about 2000 item-item similarities. > > > > You know this code way better than I do, so perhaps I am missing > something > > here. But as I see it (and I tested it as well) the users data point > > remains intact. That's because the preferenceFromUsers Set remains the > same > > while only preferenceForItems is optimized. The main reason it improves > > performance is because of the bottleneck we diagnosed before - > > `GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor` which in turn > > calls `FastIDSet.intersectionSize`. Now, if we know _for sure_ that a > user > > interacted with a single item only, what's the point of checking every > time > > if it had interacted with other items? (I hope I make myself clear) > > Because in my data set, we have over 80% of users which had a single > > interaction, it gives such a performance boost. (I believe this case > might > > be more common than one might think in web apps) > > > > Let me propose a better way to address that bottleneck. I think the problem > is that the intersection computation is dumb, and should really compute > "larger.intersectionSize(smaller)". > > Try ending getNumUsersWithPreferenceFor() with: > > return userIDs1.size() < userIDs2.size() ? > userIDs2.intersectionSize(userIDs1) : > userIDs1.intersectionSize(userIDs2); > > It won't produce the same speedup, but it's more correct than omitting this > data just to get this effect. If it gets 80% of the speedup, that's a great > win. > You nailed it! It extremely improves the performance. Without my last fix, we're talking about max 2s with my fix, it didn't go over 0.5s! I still don't see any problem with not including 'singleton users' inside preferenceForItems as long as preferenceFromUsers stays intact. Can you please elaborate more on the problem as you see it? I feel we're some kind of communication problem :P
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 15:26
Manuel, I starting running the evaluation as proposed. But it seems it will
take forever to complete. It does the evaluation for each user which takes well over a minute. What am I doing wrong? This is my code : RecommenderBuilder itemBasedBuilder = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) { // build and return the Recommender to evaluate here try { ItemSimilarity itemSimilarity = newCachingItemSimilarity( new LogLikelihoodSimilarity(model), model); CandidateItemsStrategy candidateItemsStrategy = new OptimizedItemStrategy(20, 2, 100); MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new OptimizedItemStrategy(20, 2, 100); ItemBasedRecommender recommender newGenericBooleanPrefItemBasedRecommender( dataModel, itemSimilarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy); return recommender; } catch (TasteException e) { // TODO Auto-generated catch block e.printStackTrace(); return null; } } }; RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator(); try { IRStatistics stats = evaluator.evaluate(itemBasedBuilder, null, this.dataModel, null, 3, 0, 1.0); logger.info("Evalute returned:" + stats.toString()); } catch (TasteException e) { // TODO Auto-generated catch block logger.error("",e); } On Fri, Dec 2, 2011 at 1:29 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Hello Manuel, > I will run the tests as requested and post the results later. > > > On Fri, Dec 2, 2011 at 1:20 PM, Manuel Blechschmidt < > [EMAIL PROTECTED]> wrote: > >> Hello Daniel, >> >> On 02.12.2011, at 12:02, Daniel Zohar wrote: >> >> > Hi guys, >> > >> > ... >> > I just ran the fix I proposed earlier and I got great results! The query >> > time was reduced to about a third for the 'heavy users'. Before it was >> 1-5 >> > secs and now it's 0.5-1.5. The best part is that the accuracy level >> should >> > remain exactly the same. I also believe it should reduce memory >> > consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets >> > significantly smaller (in my case at least). >> >> It would be great if you could measure your run time performance and your >> accuracy with the provided Mahout tools. >> >> In your case because you only have boolean feedback precision and recall >> would make sense. >> >> https://cwiki.apache.org/MAHOUT/recommender-documentation.html >> >> RecommenderIRStatsEvaluator evaluator = new >> GenericRecommenderIRStatsEvaluator(); >> IRStatistics stats = evaluator.evaluate(builder, null, myModel, null, 3, >> RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0); >> >> >> Here is some example code from me: >> >> public void testEvaluateRecommender() { >> try { >> DataModel myModel = new >> MyModelImplementationDataModel(); >> >> // Users: 12858 >> // Items: 5467 >> // MaxPreference: 85850.0 >> // MinPreference: 50.0 >> System.out.println("Users: >> "+myModel.getNumUsers()); >> System.out.println("Items: >> "+myModel.getNumItems()); >> System.out.println("MaxPreference: >> "+myModel.getMaxPreference()); >> System.out.println("MinPreference: >> "+myModel.getMinPreference()); >> >> RecommenderBuilder randomBased = new >> RecommenderBuilder() { >> public Recommender >> buildRecommender(DataModel model) { >> // build and return the >> Recommender to evaluate here >> try {
-
Re: Mahout performance issuesSean Owen 2011-12-02, 15:38
On Fri, Dec 2, 2011 at 2:33 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote:
> On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > > > > > I'm already capping it at 100. If this will be my last resort, I will > > > decrease it more :) > > > > > > > This just can't be... 100 item-item similarities takes milliseconds to > > compute. Something else is going on. > > I should make a JIRA to propose my own version of this filtering just to > > make sure we're talking about the same thing. > > > > > > > I limit only the possibleItemIDs in my altered version > of SamplingCandidateItemsStrategy > Sine TopItems.getTopItems() computes similarities for every previously > interacted item with the set of 'possibleItemIDs', you are correct only > when the user have a single interaction. However, if the user had made 20 > interactions, we're talking about 2000 item-item similarities. > Sure, but why not limit those 2000 interactions to 100? I'm not sure "100" is the optimal number or not, but clearly, at smaller scale, this all runs quite fast and produces reasonable results. Say that, at some smaller scale, it considers on average X interactions. If you turn this down such that only X interactions are considered here, I don't see why you wouldn't get similar performance and quality. I was actually considering a patch that would simply apply the max both to the number of users sampled per item, but the number of items from each of those users. If you clamped it at 10, then each item you've rated would produce at most 10*10 interactions. We only limit one of those things now. Well... maybe with the change below it speeds up further so that you can get away with more data in less time, so that it is much easier to find a good tradeoff. > > You nailed it! It extremely improves the performance. Without my last fix, > we're talking about max 2s with my fix, it didn't go over 0.5s! > > > .. but does this change alone produce a good speedup? > I still don't see any problem with not including 'singleton users' inside > preferenceForItems as long as preferenceFromUsers stays intact. Can you > please elaborate more on the problem as you see it? I feel we're some kind > of communication problem :P > The calculations are just wrong then, since you don't have the right user counts per item. Methods that answer that question give the wrong answer; similarity metrics like log-likelihood give slightly wrong results too. At this point I don't think it's good to sacrifice correctness for an optimization when there seems to be (?) a way to have most of the speed up without any correctness problem.
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 15:58
It's already hard to say what has the most significant impact here. Just to
summarize, we had basically 3 changes which made an enormous improvement in performance : 1. Limit the number of 'possibleItemIDs' returned by SamplingCandidateStrategy 2. Optimizing getNumUsersWithPreferenceFor() with "larger.intersectionSize( smaller)" 3. Excluding 'singleton users' from GenericBooleanPrefDataModel.preferenceForItems I think we already agreed about 1 & 2 (correct me if I'm wrong). Regarding 3, if we had a Map that mapped itemID to numOfUsers that might solve the problem and still give a great performance boost. You can see in my previous results that this change diminishes query time to at least a quarter (2s vs 0.5s). What do you think? On Fri, Dec 2, 2011 at 5:38 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > On Fri, Dec 2, 2011 at 2:33 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > > > On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[EMAIL PROTECTED]> > > wrote: > > > > > > > I'm already capping it at 100. If this will be my last resort, I will > > > > decrease it more :) > > > > > > > > > > This just can't be... 100 item-item similarities takes milliseconds to > > > compute. Something else is going on. > > > I should make a JIRA to propose my own version of this filtering just > to > > > make sure we're talking about the same thing. > > > > > > > > > > > I limit only the possibleItemIDs in my altered version > > of SamplingCandidateItemsStrategy > > Sine TopItems.getTopItems() computes similarities for every previously > > interacted item with the set of 'possibleItemIDs', you are correct only > > when the user have a single interaction. However, if the user had made 20 > > interactions, we're talking about 2000 item-item similarities. > > > > Sure, but why not limit those 2000 interactions to 100? > > I'm not sure "100" is the optimal number or not, but clearly, at smaller > scale, this all runs quite fast and produces reasonable results. Say that, > at some smaller scale, it considers on average X interactions. If you turn > this down such that only X interactions are considered here, I don't see > why you wouldn't get similar performance and quality. > > I was actually considering a patch that would simply apply the max both to > the number of users sampled per item, but the number of items from each of > those users. If you clamped it at 10, then each item you've rated would > produce at most 10*10 interactions. We only limit one of those things now. > > Well... maybe with the change below it speeds up further so that you can > get away with more data in less time, so that it is much easier to find a > good tradeoff. > > > > > > > You nailed it! It extremely improves the performance. Without my last > fix, > > we're talking about max 2s with my fix, it didn't go over 0.5s! > > > > > > > .. but does this change alone produce a good speedup? > > > > > I still don't see any problem with not including 'singleton users' inside > > preferenceForItems as long as preferenceFromUsers stays intact. Can you > > please elaborate more on the problem as you see it? I feel we're some > kind > > of communication problem :P > > > > The calculations are just wrong then, since you don't have the right user > counts per item. Methods that answer that question give the wrong answer; > similarity metrics like log-likelihood give slightly wrong results too. At > this point I don't think it's good to sacrifice correctness for an > optimization when there seems to be (?) a way to have most of the speed up > without any correctness problem. >
-
Re: Mahout performance issuesSean Owen 2011-12-02, 16:42
I'll commit #2 since that's an obvious win.
I'll make a patch with a proposal for #1, which is sort of like what you're doing, just slightly different -- it gives a much stronger cap on the number of interactions considered. For #3: You can't just store counts -- you need the user IDs to compute intersection sizes. If I'm right about the reason #3 makes it run faster, then #2 should make it run faster by about the same amount -- so #2 subsumes #3. And if that's true, I'd strongly prefer to not do #3 as it would add little and make things incorrect. I don't think that the reason that #3 helps is because it makes fewer items available as candidates, because it shouldn't. Say user A rates items X, Y and Z. Say user B has rated only Y. When recommending for A, we'll find the set of users who have rated Y (A, B and others). And when we look at B, we'll add Y as a candidate item (the only item B rated). But Y will be removed at the end since A already rated it. So, B didn't add any more item-item interactions to consider. If you remove user B, you still have the same work to do. I think #3 "helps" because it sets the count of users per item, for some items, to 0 instead of 1, which avoids an intersection computation. But that's later, after you have gotten the list of candidates. It's not reducing the number of candidates, but ignoring a lot of them later by assuming their similarity is 0 (when it isn't!) But that intersection computation doesn't have to be so slow. It's slow to check whether each of 1000 items in one set are members of another set of 1 item. But checking 1 item for membership in another set of 1000 is just one quick hash lookup. Now, I agree that pretending that set has 0 items instead of 1 is even a little faster still! And perhaps you could argue it's a reasonable approximation. But my guess is that it's barely faster, at the price of breaking correctness. On Fri, Dec 2, 2011 at 3:58 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > It's already hard to say what has the most significant impact here. Just to > summarize, we had basically 3 changes which made an enormous improvement in > performance : > 1. Limit the number of 'possibleItemIDs' returned > by SamplingCandidateStrategy > 2. Optimizing getNumUsersWithPreferenceFor() with "larger.intersectionSize( > smaller)" > 3. Excluding 'singleton users' from > GenericBooleanPrefDataModel.preferenceForItems > > I think we already agreed about 1 & 2 (correct me if I'm wrong). > Regarding 3, if we had a Map that mapped itemID to numOfUsers that might > solve the problem and still give a great performance boost. You can see in > my previous results that this change diminishes query time to at least a > quarter (2s vs 0.5s). What do you think? > >
-
Re: Mahout performance issuesSean Owen 2011-12-02, 16:46
(PS why don't we move this to JIRA:
https://issues.apache.org/jira/browse/MAHOUT-910) On Fri, Dec 2, 2011 at 4:42 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > I'll commit #2 since that's an obvious win. > > I'll make a patch with a proposal for #1, which is sort of like what > you're doing, just slightly different -- it gives a much stronger cap on > the number of interactions considered. > >
-
Re: Mahout performance issuesTed Dunning 2011-12-02, 17:56
On Fri, Dec 2, 2011 at 3:10 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> > I just ran the fix I proposed earlier and I got great results! The query > > time was reduced to about a third for the 'heavy users'. Before it was > 1-5 > > secs and now it's 0.5-1.5. The best part is that the accuracy level > should > > remain exactly the same. I also believe it should reduce memory > > consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets > > significantly smaller (in my case at least). > > > > The fix is merely adding two lines of code to one of > > the GenericBooleanPrefDataModel constructors. See > > http://pastebin.com/K5PB68Et, the lines I added are #11, #22. > > > > I don't think this works though, because you've deleted the one data point > you have for those users. They can't get recommendations now. > > I can't figure out how that speeds up recommendations though, what am I > missing? these users aren't providing any more item-item interactions to > consider. > Actually, if these users single item is a fantastically popular item, then all of those users will be roped into the computation (with no effect). Sean's argument would be correct if the users were each interacting with some item that is way out in the low frequency tail. By Murphy, this won't be the case. Better to dump the uninformative items using a kill list.
-
Re: Mahout performance issuesTed Dunning 2011-12-02, 18:00
Does #1 mean down-sample the items in each user? Or does it only
down-sample the number of items for the user that we are producing recommendations for? I recommend down-sampling for all users. IF you down-sample biased toward low frequency items, then this will also kill the problem of high frequency items and you get all the performance gains you are talking about and more, without significant error. On Fri, Dec 2, 2011 at 7:58 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > It's already hard to say what has the most significant impact here. Just to > summarize, we had basically 3 changes which made an enormous improvement in > performance : > 1. Limit the number of 'possibleItemIDs' returned > by SamplingCandidateStrategy > 2. Optimizing getNumUsersWithPreferenceFor() with "larger.intersectionSize( > smaller)" > 3. Excluding 'singleton users' from > GenericBooleanPrefDataModel.preferenceForItems > > I think we already agreed about 1 & 2 (correct me if I'm wrong). > Regarding 3, if we had a Map that mapped itemID to numOfUsers that might > solve the problem and still give a great performance boost. You can see in > my previous results that this change diminishes query time to at least a > quarter (2s vs 0.5s). What do you think? > > On Fri, Dec 2, 2011 at 5:38 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > On Fri, Dec 2, 2011 at 2:33 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > > > On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > > > > > On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[EMAIL PROTECTED]> > > > wrote: > > > > > > > > > I'm already capping it at 100. If this will be my last resort, I > will > > > > > decrease it more :) > > > > > > > > > > > > > This just can't be... 100 item-item similarities takes milliseconds > to > > > > compute. Something else is going on. > > > > I should make a JIRA to propose my own version of this filtering just > > to > > > > make sure we're talking about the same thing. > > > > > > > > > > > > > > > I limit only the possibleItemIDs in my altered version > > > of SamplingCandidateItemsStrategy > > > Sine TopItems.getTopItems() computes similarities for every previously > > > interacted item with the set of 'possibleItemIDs', you are correct only > > > when the user have a single interaction. However, if the user had made > 20 > > > interactions, we're talking about 2000 item-item similarities. > > > > > > > Sure, but why not limit those 2000 interactions to 100? > > > > I'm not sure "100" is the optimal number or not, but clearly, at smaller > > scale, this all runs quite fast and produces reasonable results. Say > that, > > at some smaller scale, it considers on average X interactions. If you > turn > > this down such that only X interactions are considered here, I don't see > > why you wouldn't get similar performance and quality. > > > > I was actually considering a patch that would simply apply the max both > to > > the number of users sampled per item, but the number of items from each > of > > those users. If you clamped it at 10, then each item you've rated would > > produce at most 10*10 interactions. We only limit one of those things > now. > > > > Well... maybe with the change below it speeds up further so that you can > > get away with more data in less time, so that it is much easier to find a > > good tradeoff. > > > > > > > > > > > > You nailed it! It extremely improves the performance. Without my last > > fix, > > > we're talking about max 2s with my fix, it didn't go over 0.5s! > > > > > > > > > > > .. but does this change alone produce a good speedup? > > > > > > > > > I still don't see any problem with not including 'singleton users' > inside > > > preferenceForItems as long as preferenceFromUsers stays intact. Can you > > > please elaborate more on the problem as you see it? I feel we're some > > kind > > > of communication problem :P > > > > > > > The calculations are just wrong then, since you don't have the right user > > counts per item. Methods that answer that question give the wrong answer;
-
Re: Mahout performance issuesSean Owen 2011-12-02, 18:03
Yes, but those users will bring no more candidate items to consider, and
the apparent bottleneck is not touching those users, but later computing all those similarities. That's my argument. On Fri, Dec 2, 2011 at 5:56 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > Actually, if these users single item is a fantastically popular item, then > all of those users will be roped into the computation (with no effect). > > Sean's argument would be correct if the users were each interacting with > some item that is way out in the low frequency tail. By Murphy, this won't > be the case. > > Better to dump the uninformative items using a kill list. >
-
Re: Mahout performance issuesSean Owen 2011-12-02, 18:05
Say we're recommending for user A. User A is connected to items 1, 2, 3.
Those items are connected to other users X, Y, Z. And those users in turn are connected to items 100, 101, 102, 103.... You can down-sample three things: 1. The 1,2,3 2. The X,Y,Z 3. The 100,101,102 We already do #2. I am suggesting we add #3. On Fri, Dec 2, 2011 at 6:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Does #1 mean down-sample the items in each user? Or does it only > down-sample the number of items for the user that we are producing > recommendations for? > > I recommend down-sampling for all users. IF you down-sample biased toward > low frequency items, then this will also kill the problem of high frequency > items and you get all the performance gains you are talking about and more, > without significant error.
-
Re: Mahout performance issuesTed Dunning 2011-12-02, 18:07
Since touching them adds nothing but cost, then not touching them is
better. Kill the item! In practical terms, we had this problem at Veoh. Everybody got the same intro video. It provided no information. Likewise at Musicmatch, everybody got the same startup noise during the splash screen. It added no information. Both of these cases would kill performance in lots of recommendation engines because a vast number of users would get sucked into computations where it made no difference at all. Better to kill these items. On Fri, Dec 2, 2011 at 10:03 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > Yes, but those users will bring no more candidate items to consider, and > the apparent bottleneck is not touching those users, but later computing > all those similarities. That's my argument. > > On Fri, Dec 2, 2011 at 5:56 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > > > Actually, if these users single item is a fantastically popular item, > then > > all of those users will be roped into the computation (with no effect). > > > > Sean's argument would be correct if the users were each interacting with > > some item that is way out in the low frequency tail. By Murphy, this > won't > > be the case. > > > > Better to dump the uninformative items using a kill list. > > >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 18:07
On Fri, Dec 2, 2011 at 6:42 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> I'll commit #2 since that's an obvious win. > > Great. > I'll make a patch with a proposal for #1, which is sort of like what you're > doing, just slightly different -- it gives a much stronger cap on the > number of interactions considered. > > Alright. > > For #3: You can't just store counts -- you need the user IDs to compute > intersection sizes. > > I'm suggesting to have an additional Map<Long, Long> under GenericBooleanPrefDataModel. There will still be GenericBooleanPrefDataModel.preferenceFromUsers and GenericBooleanPrefDataModel.preferenceForItems only that the latter might be a lot smaller in cases like mine. > If I'm right about the reason #3 makes it run faster, then #2 should make > it run faster by about the same amount -- so #2 subsumes #3. And if that's > true, I'd strongly prefer to not do #3 as it would add little and make > things incorrect. > > I don't think that the reason that #3 helps is because it makes fewer items > available as candidates, because it shouldn't. Say user A rates items X, Y > and Z. Say user B has rated only Y. When recommending for A, we'll find the > set of users who have rated Y (A, B and others). And when we look at B, > we'll add Y as a candidate item (the only item B rated). But Y will be > removed at the end since A already rated it. So, B didn't add any more > item-item interactions to consider. If you remove user B, you still have > the same work to do. > > I think #3 "helps" because it sets the count of users per item, for some > items, to 0 instead of 1, which avoids an intersection computation. But > that's later, after you have gotten the list of candidates. It's not > reducing the number of candidates, but ignoring a lot of them later by > assuming their similarity is 0 (when it isn't!) > > But that intersection computation doesn't have to be so slow. It's slow to > check whether each of 1000 items in one set are members of another set of 1 > item. But checking 1 item for membership in another set of 1000 is just one > quick hash lookup. > > Now, I agree that pretending that set has 0 items instead of 1 is even a > little faster still! And perhaps you could argue it's a reasonable > approximation. But my guess is that it's barely faster, at the price of > breaking correctness. > > I definitely agree that the correctness should not be broken. My solution is not meant to decrease the number of possible items like you stated in your example. It was meant to reduce the amount of item-user associations (while preserving user-item associations) which will results much less effort on intersectionSize(). Even in the case that we have two popular items, item A with 50k interactions and item B with 100k interactions, there are still 50k lookups inside intersectionSize(). If 50% of users interacted with that item interacted _only_ with that item, we are talking about a huge saving, are we not? For example, how many of the people who bought The Da Vinci Code on Amazon, had bought other books there as well? > On Fri, Dec 2, 2011 at 3:58 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > It's already hard to say what has the most significant impact here. Just > to > > summarize, we had basically 3 changes which made an enormous improvement > in > > performance : > > 1. Limit the number of 'possibleItemIDs' returned > > by SamplingCandidateStrategy > > 2. Optimizing getNumUsersWithPreferenceFor() with > "larger.intersectionSize( > > smaller)" > > 3. Excluding 'singleton users' from > > GenericBooleanPrefDataModel.preferenceForItems > > > > I think we already agreed about 1 & 2 (correct me if I'm wrong). > > Regarding 3, if we had a Map that mapped itemID to numOfUsers that might > > solve the problem and still give a great performance boost. You can see > in > > my previous results that this change diminishes query time to at least a > > quarter (2s vs 0.5s). What do you think? > > > > >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 18:09
And how do you purpose to kill these items? I mean, we should still keep
all the user-item associations, shouldn't we? If it's that popular, how would we recommend items for users which had interacted only with that item alone? On Fri, Dec 2, 2011 at 8:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Since touching them adds nothing but cost, then not touching them is > better. Kill the item! > > In practical terms, we had this problem at Veoh. Everybody got the same > intro video. It provided no information. Likewise at Musicmatch, > everybody got the same startup noise during the splash screen. It added no > information. Both of these cases would kill performance in lots of > recommendation engines because a vast number of users would get sucked into > computations where it made no difference at all. > > Better to kill these items. > > On Fri, Dec 2, 2011 at 10:03 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > Yes, but those users will bring no more candidate items to consider, and > > the apparent bottleneck is not touching those users, but later computing > > all those similarities. That's my argument. > > > > On Fri, Dec 2, 2011 at 5:56 PM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > > > > > > Actually, if these users single item is a fantastically popular item, > > then > > > all of those users will be roped into the computation (with no effect). > > > > > > Sean's argument would be correct if the users were each interacting > with > > > some item that is way out in the low frequency tail. By Murphy, this > > won't > > > be the case. > > > > > > Better to dump the uninformative items using a kill list. > > > > > >
-
Re: Mahout performance issuesTed Dunning 2011-12-02, 18:10
This isn't really the problem.
Suppose that user A is connected to all items. Suppose that all users are connected to item 1 even if no other. Any recommendation will pull in A and thus all items become candidate recommendations. My suggestion is two-fold: (1) eliminate item 1 entirely and (2) down-sample the items that A is connected to. Both are important. (1) is important to avoid bringing all users into a basically pointless computation and (2) is important because A's history makes the cooccurrence matrix dense which hurts even if you don't compute the entire matrix. On Fri, Dec 2, 2011 at 10:05 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > Say we're recommending for user A. User A is connected to items 1, 2, 3. > Those items are connected to other users X, Y, Z. And those users in turn > are connected to items 100, 101, 102, 103.... > > You can down-sample three things: > > 1. The 1,2,3 > 2. The X,Y,Z > 3. The 100,101,102 > > We already do #2. I am suggesting we add #3. > > On Fri, Dec 2, 2011 at 6:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > > Does #1 mean down-sample the items in each user? Or does it only > > down-sample the number of items for the user that we are producing > > recommendations for? > > > > I recommend down-sampling for all users. IF you down-sample biased > toward > > low frequency items, then this will also kill the problem of high > frequency > > items and you get all the performance gains you are talking about and > more, > > without significant error. >
-
Re: Mahout performance issuesTed Dunning 2011-12-02, 18:13
On Fri, Dec 2, 2011 at 10:09 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote:
> And how do you purpose to kill these items? I mean, we should still keep > all the user-item associations, shouldn't we? > No. These items bring no information whatsoever. > If it's that popular, how would we recommend items for users which had > interacted only with that item alone? > Two answers: - it will appear on the most popular page. Recommendations are for telling people what they might be interested in that is *different* from the population at large. - if a person has only interacted with a single item that is vastly popular, we have no useful information about this person. Indeed, the result of the recommendations will be almost the same as the most popular page. Better to admit that we know nothing here and move on. > > On Fri, Dec 2, 2011 at 8:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > > Since touching them adds nothing but cost, then not touching them is > > better. Kill the item! > > > > In practical terms, we had this problem at Veoh. Everybody got the same > > intro video. It provided no information. Likewise at Musicmatch, > > everybody got the same startup noise during the splash screen. It added > no > > information. Both of these cases would kill performance in lots of > > recommendation engines because a vast number of users would get sucked > into > > computations where it made no difference at all. > > > > Better to kill these items. > > > > On Fri, Dec 2, 2011 at 10:03 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > > > Yes, but those users will bring no more candidate items to consider, > and > > > the apparent bottleneck is not touching those users, but later > computing > > > all those similarities. That's my argument. > > > > > > On Fri, Dec 2, 2011 at 5:56 PM, Ted Dunning <[EMAIL PROTECTED]> > > wrote: > > > > > > > > Actually, if these users single item is a fantastically popular item, > > > then > > > > all of those users will be roped into the computation (with no > effect). > > > > > > > > Sean's argument would be correct if the users were each interacting > > with > > > > some item that is way out in the low frequency tail. By Murphy, this > > > won't > > > > be the case. > > > > > > > > Better to dump the uninformative items using a kill list. > > > > > > > > > >
-
Re: Mahout performance issuesSean Owen 2011-12-02, 18:14
I agree, though that's something slightly different to the question at hand
here. If just about every user viewed X, you could probably forget X, yes. But if some user happened to only view one item, Y, can you drop that? It affects correctness. I argue that it doesn't really affect the bottleneck in question here, which is not quite the point you are getting at. On Fri, Dec 2, 2011 at 6:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Since touching them adds nothing but cost, then not touching them is > better. Kill the item! > > In practical terms, we had this problem at Veoh. Everybody got the same > intro video. It provided no information. Likewise at Musicmatch, > everybody got the same startup noise during the splash screen. It added no > information. Both of these cases would kill performance in lots of > recommendation engines because a vast number of users would get sucked into > computations where it made no difference at all. > > Better to kill these items.
-
Re: Mahout performance issuesTed Dunning 2011-12-02, 18:18
On Fri, Dec 2, 2011 at 10:14 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> I agree, though that's something slightly different to the question at hand > here. If just about every user viewed X, you could probably forget X, yes. > > But if some user happened to only view one item, Y, can you drop that? It > affects correctness. I argue that it doesn't really affect the bottleneck > in question here, which is not quite the point you are getting at. > No. Dropping that does not affect correctness if that item is highly popular. If that item is not vastly popular, it does provide some information and would not be affected by my suggestions. > > On Fri, Dec 2, 2011 at 6:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > > Since touching them adds nothing but cost, then not touching them is > > better. Kill the item! > > > > In practical terms, we had this problem at Veoh. Everybody got the same > > intro video. It provided no information. Likewise at Musicmatch, > > everybody got the same startup noise during the splash screen. It added > no > > information. Both of these cases would kill performance in lots of > > recommendation engines because a vast number of users would get sucked > into > > computations where it made no difference at all. > > > > Better to kill these items. >
-
Re: Mahout performance issuesSean Owen 2011-12-02, 18:19
On Fri, Dec 2, 2011 at 6:07 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote:
> > I definitely agree that the correctness should not be broken. My solution > is not meant to decrease the number of possible items like you stated in > your example. It was meant to reduce the amount of item-user associations > (while preserving user-item associations) which will results much less > effort on intersectionSize(). Even in the case that we have two popular > My point is that intersectionSize() is called as part of a similarity computation. Yes, that's the bottleneck. But, that happens after the stage where candidate items are identified. And you are talking about changing the candidate identification stage, which is not the bottleneck. I think your change *happens* to also reduce the number of similarity computations since it assumes some are 0, when they are not! sure that saves time, in the same way that you'll finish an exam faster if you don't answer half the questions. I am instead suggesting to optimize intersectionSize(), such that for all of these 1-item cases, the answer is computed extremely fast. Which also addresses the bottleneck of course. I suppose this could be proven or disproven quickly -- do you get the same speed up with the change I committed, without your change? if you do, great, we have a solution. If not then I am wrong and you have some example that pinpoints where the new bottleneck is.
-
Re: Mahout performance issuesSean Owen 2011-12-02, 18:22
Yes, and there is no assumption here that the item is vastly popular. The
proposal was to drop any user-item interaction where the item was the only one rated by the user, whether the item is popular or not. There's a non-zero, and probably non-trivial, affect on correctness as a result. On Fri, Dec 2, 2011 at 6:18 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > On Fri, Dec 2, 2011 at 10:14 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > I agree, though that's something slightly different to the question at > hand > > here. If just about every user viewed X, you could probably forget X, > yes. > > > > But if some user happened to only view one item, Y, can you drop that? It > > affects correctness. I argue that it doesn't really affect the bottleneck > > in question here, which is not quite the point you are getting at. > > > > No. Dropping that does not affect correctness if that item is highly > popular. > > If that item is not vastly popular, it does provide some information and > would not be affected by my suggestions. > > > > > > On Fri, Dec 2, 2011 at 6:07 PM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > > > > > Since touching them adds nothing but cost, then not touching them is > > > better. Kill the item! > > > > > > In practical terms, we had this problem at Veoh. Everybody got the > same > > > intro video. It provided no information. Likewise at Musicmatch, > > > everybody got the same startup noise during the splash screen. It > added > > no > > > information. Both of these cases would kill performance in lots of > > > recommendation engines because a vast number of users would get sucked > > into > > > computations where it made no difference at all. > > > > > > Better to kill these items. > > >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-02, 18:43
Sean, can you tell me which files have you committed the changes to? Thanks
On Fri, Dec 2, 2011 at 8:22 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > Yes, and there is no assumption here that the item is vastly popular. The > proposal was to drop any user-item interaction where the item was the only > one rated by the user, whether the item is popular or not. There's a > non-zero, and probably non-trivial, affect on correctness as a result. > > On Fri, Dec 2, 2011 at 6:18 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > > On Fri, Dec 2, 2011 at 10:14 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > > > I agree, though that's something slightly different to the question at > > hand > > > here. If just about every user viewed X, you could probably forget X, > > yes. > > > > > > But if some user happened to only view one item, Y, can you drop that? > It > > > affects correctness. I argue that it doesn't really affect the > bottleneck > > > in question here, which is not quite the point you are getting at. > > > > > > > No. Dropping that does not affect correctness if that item is highly > > popular. > > > > If that item is not vastly popular, it does provide some information and > > would not be affected by my suggestions. > > > > > > > > > > On Fri, Dec 2, 2011 at 6:07 PM, Ted Dunning <[EMAIL PROTECTED]> > > wrote: > > > > > > > Since touching them adds nothing but cost, then not touching them is > > > > better. Kill the item! > > > > > > > > In practical terms, we had this problem at Veoh. Everybody got the > > same > > > > intro video. It provided no information. Likewise at Musicmatch, > > > > everybody got the same startup noise during the splash screen. It > > added > > > no > > > > information. Both of these cases would kill performance in lots of > > > > recommendation engines because a vast number of users would get > sucked > > > into > > > > computations where it made no difference at all. > > > > > > > > Better to kill these items. > > > > > >
-
Re: Mahout performance issuesSean Owen 2011-12-02, 21:15
For your purposes, it's LogLikelihoodSimilarity. I made similar changes in
other files. Ideally, just svn update to get all recent changes. On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Sean, can you tell me which files have you committed the changes to? Thanks
-
Re: Mahout performance issuesSebastian Schelter 2011-12-04, 09:13
I created a jira to supply a non-distributed counterpart of the
sampling that is done in the distributed item similarity computation: https://issues.apache.org/jira/browse/MAHOUT-914 2011/12/2 Sean Owen <[EMAIL PROTECTED]>: > For your purposes, it's LogLikelihoodSimilarity. I made similar changes in > other files. Ideally, just svn update to get all recent changes. > > On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > >> Sean, can you tell me which files have you committed the changes to? Thanks
-
Re: Mahout performance issuesDaniel Zohar 2011-12-04, 11:43
Combining the latest commits with my
optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1) I achieved satisfying results. All the queries were under one second. Sebastian, I took a look at your patch and I think it's more practical than the current SamplingCandidateItemsStrategy, however it still doesn't put a strict cap on the number of possible item IDs like my implementation does. Perhaps there is room for both implementations? On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > I created a jira to supply a non-distributed counterpart of the > sampling that is done in the distributed item similarity computation: > > https://issues.apache.org/jira/browse/MAHOUT-914 > > > 2011/12/2 Sean Owen <[EMAIL PROTECTED]>: > > For your purposes, it's LogLikelihoodSimilarity. I made similar changes > in > > other files. Ideally, just svn update to get all recent changes. > > > > On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > >> Sean, can you tell me which files have you committed the changes to? > Thanks >
-
Re: Mahout performance issuesSean Owen 2011-12-04, 12:06
Are you referring to my patch, MAHOUT-910?
It does let you specify a hard cap, really -- if you place a limit of X, then at most X^2 item-item associations come out. Before you could not bound the result, really, since one user could rate a lot of items. I think it's slightly more efficient and unbiased as users with few ratings will not have their ratings sampled out, and all users are equally likely to be sampled out. What do you think? Yes you could easily add a secondary cap though as a final filter. On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Combining the latest commits with my > optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1) > I achieved satisfying results. All the queries were under one second. > > Sebastian, I took a look at your patch and I think it's more practical than > the current SamplingCandidateItemsStrategy, however it still doesn't put a > strict cap on the number of possible item IDs like my implementation does. > Perhaps there is room for both implementations? > > > > On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[EMAIL PROTECTED]> > wrote: > > > I created a jira to supply a non-distributed counterpart of the > > sampling that is done in the distributed item similarity computation: > > > > https://issues.apache.org/jira/browse/MAHOUT-914 > > > > > > 2011/12/2 Sean Owen <[EMAIL PROTECTED]>: > > > For your purposes, it's LogLikelihoodSimilarity. I made similar changes > > in > > > other files. Ideally, just svn update to get all recent changes. > > > > > > On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > > > > > >> Sean, can you tell me which files have you committed the changes to? > > Thanks > > >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-04, 12:12
Actually I was referring to Sebastian's. I haven't seen you committed
anything to SamplingCandidateItemsStrategy. Can you tell me in which class the change appears? On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > Are you referring to my patch, MAHOUT-910? > > It does let you specify a hard cap, really -- if you place a limit of X, > then at most X^2 item-item associations come out. Before you could not > bound the result, really, since one user could rate a lot of items. > > I think it's slightly more efficient and unbiased as users with few ratings > will not have their ratings sampled out, and all users are equally likely > to be sampled out. > > What do you think? > Yes you could easily add a secondary cap though as a final filter. > > On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > Combining the latest commits with my > > optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1) > > I achieved satisfying results. All the queries were under one second. > > > > Sebastian, I took a look at your patch and I think it's more practical > than > > the current SamplingCandidateItemsStrategy, however it still doesn't put > a > > strict cap on the number of possible item IDs like my implementation > does. > > Perhaps there is room for both implementations? > > > > > > > > On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[EMAIL PROTECTED]> > > wrote: > > > > > I created a jira to supply a non-distributed counterpart of the > > > sampling that is done in the distributed item similarity computation: > > > > > > https://issues.apache.org/jira/browse/MAHOUT-914 > > > > > > > > > 2011/12/2 Sean Owen <[EMAIL PROTECTED]>: > > > > For your purposes, it's LogLikelihoodSimilarity. I made similar > changes > > > in > > > > other files. Ideally, just svn update to get all recent changes. > > > > > > > > On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[EMAIL PROTECTED]> > > wrote: > > > > > > > >> Sean, can you tell me which files have you committed the changes to? > > > Thanks > > > > > >
-
Re: Mahout performance issuesSean Owen 2011-12-04, 12:20
Have a look at the patch attached to MAHOUT-910. I have not committed it
yet so as to allow review. https://issues.apache.org/jira/browse/MAHOUT-910 The current implementation samples users. MAHOUT-914 samples items from users. MAHOUT-910 samples both. What's most ideal? I had supposed we want to sample both since you might have a lot of users for one rated item, or many items from each user. That would let you bound both. On Sun, Dec 4, 2011 at 12:12 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Actually I was referring to Sebastian's. I haven't seen you committed > anything to SamplingCandidateItemsStrategy. Can you tell me in which class > the change appears? > > On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > Are you referring to my patch, MAHOUT-910? > > > > It does let you specify a hard cap, really -- if you place a limit of X, > > then at most X^2 item-item associations come out. Before you could not > > bound the result, really, since one user could rate a lot of items. > > > > I think it's slightly more efficient and unbiased as users with few > ratings > > will not have their ratings sampled out, and all users are equally likely > > to be sampled out. > > > > What do you think? > > Yes you could easily add a secondary cap though as a final filter. > > > > On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > > > > > Combining the latest commits with my > > > optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1 > ) > > > I achieved satisfying results. All the queries were under one second. > > > > > > Sebastian, I took a look at your patch and I think it's more practical > > than > > > the current SamplingCandidateItemsStrategy, however it still doesn't > put > > a > > > strict cap on the number of possible item IDs like my implementation > > does. > > > Perhaps there is room for both implementations? > > > > > > > > > > > > On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[EMAIL PROTECTED]> > > > wrote: > > > > > > > I created a jira to supply a non-distributed counterpart of the > > > > sampling that is done in the distributed item similarity computation: > > > > > > > > https://issues.apache.org/jira/browse/MAHOUT-914 > > > > > > > > > > > > 2011/12/2 Sean Owen <[EMAIL PROTECTED]>: > > > > > For your purposes, it's LogLikelihoodSimilarity. I made similar > > changes > > > > in > > > > > other files. Ideally, just svn update to get all recent changes. > > > > > > > > > > On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[EMAIL PROTECTED]> > > > wrote: > > > > > > > > > >> Sean, can you tell me which files have you committed the changes > to? > > > > Thanks > > > > > > > > > >
-
Re: Mahout performance issuesSebastian Schelter 2011-12-04, 12:30
Hi Daniel,
My view is this: I think you can pretty safely down-sample power users like it is done in https://issues.apache.org/jira/browse/MAHOUT-914 I did some experiments on the movielens1M dataset that showed that you get a negligible error given you look at enough interactions per user: https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png I could also verify this on the movielens10M dataset. I think this kind of sampling works because the distribution of interactions with items in the power-users and in the whole dataset is very similar. Therefore you don't really learn anything new from the 'power-users'. The 'power-users' might also be crawlers or people sharing accounts in practice. However, I am not sure what happens when you also sample the number of items you look at. If I had to decide, I'd rather follow Ted's advice and kill super-popular items, as they are not helpful per-se. But if the additional item sampling helps in your usecase, I don't oppose including it in Mahout. I think its good to have a variety of candidate item strategies. You should however do some experimenting to see how much the sampling affects quality. An A/B test in a real application would be the best thing to do. --sebastian On 04.12.2011 13:12, Daniel Zohar wrote: > Actually I was referring to Sebastian's. I haven't seen you committedI can > anything to SamplingCandidateItemsStrategy. Can you tell me in which classI can > the change appears? > > On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > >> Are you referring to my patch, MAHOUT-910? >> >> It does let you specify a hard cap, really -- if you place a limit of X, >> then at most X^2 item-item associations come out. Before you could not >> bound the result, really, since one user could rate a lot of items. >> >> I think it's slightly more efficient and unbiased as users with few ratings >> will not have their ratings sampled out, and all users are equally likely >> to be sampled out. >> >> What do you think? >> Yes you could easily add a secondary cap though as a final filter. >> >> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: >> >>> Combining the latest commits with my >>> optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1) >>> I achieved satisfying results. All the queries were under one second. >>> >>> Sebastian, I took a look at your patch and I think it's more practical >> than >>> the current SamplingCandidateItemsStrategy, however it still doesn't put >> a >>> strict cap on the number of possible item IDs like my implementation >> does. >>> Perhaps there is room for both implementations? >>> >>> >>> >>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[EMAIL PROTECTED]> >>> wrote: >>> >>>> I created a jira to supply a non-distributed counterpart of the >>>> sampling that is done in the distributed item similarity computation: >>>> >>>> https://issues.apache.org/jira/browse/MAHOUT-914 >>>> >>>> >>>> 2011/12/2 Sean Owen <[EMAIL PROTECTED]>: >>>>> For your purposes, it's LogLikelihoodSimilarity. I made similar >> changes >>>> in >>>>> other files. Ideally, just svn update to get all recent changes. >>>>> >>>>> On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[EMAIL PROTECTED]> >>> wrote: >>>>> >>>>>> Sean, can you tell me which files have you committed the changes to? >>>> Thanks >>>> >>> >> >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-04, 12:59
Sean, your impl. is indeed better than mine but for some reason when I ran
it with for a user with a lot of interactions, I got 2023 possibleItemIDs (although I used 10,2 in the constructor). Sebastian, I will try and expriment also with your patch. I would just like to add that in my opinion, as long as 'killing items' has to be done manually, it is not scalable by definition. I personally would always prefer to avoid these kind of solutions. Also, in my case, the most popular item has only 3% of the users interacted with, so I suppose that's not exactly the case as well.. On Sun, Dec 4, 2011 at 2:30 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > Hi Daniel, > > My view is this: I think you can pretty safely down-sample power users > like it is done in https://issues.apache.org/jira/browse/MAHOUT-914 > I did some experiments on the movielens1M dataset that showed that you > get a negligible error given you look at enough interactions per user: > > https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png > > I could also verify this on the movielens10M dataset. I think this kind > of sampling works because the distribution of interactions with items in > the power-users and in the whole dataset is very similar. Therefore you > don't really learn anything new from the 'power-users'. The > 'power-users' might also be crawlers or people sharing accounts in > practice. > > However, I am not sure what happens when you also sample the number of > items you look at. If I had to decide, I'd rather follow Ted's advice > and kill super-popular items, as they are not helpful per-se. > > But if the additional item sampling helps in your usecase, I don't > oppose including it in Mahout. I think its good to have a variety of > candidate item strategies. You should however do some experimenting to > see how much the sampling affects quality. An A/B test in a real > application would be the best thing to do. > > --sebastian > > > > On 04.12.2011 13:12, Daniel Zohar wrote: > > Actually I was referring to Sebastian's. I haven't seen you committedI > can > > anything to SamplingCandidateItemsStrategy. Can you tell me in which > classI can > > the change appears? > > > > On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > >> Are you referring to my patch, MAHOUT-910? > >> > >> It does let you specify a hard cap, really -- if you place a limit of X, > >> then at most X^2 item-item associations come out. Before you could not > >> bound the result, really, since one user could rate a lot of items. > >> > >> I think it's slightly more efficient and unbiased as users with few > ratings > >> will not have their ratings sampled out, and all users are equally > likely > >> to be sampled out. > >> > >> What do you think? > >> Yes you could easily add a secondary cap though as a final filter. > >> > >> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > >> > >>> Combining the latest commits with my > >>> optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1 > ) > >>> I achieved satisfying results. All the queries were under one second. > >>> > >>> Sebastian, I took a look at your patch and I think it's more practical > >> than > >>> the current SamplingCandidateItemsStrategy, however it still doesn't > put > >> a > >>> strict cap on the number of possible item IDs like my implementation > >> does. > >>> Perhaps there is room for both implementations? > >>> > >>> > >>> > >>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[EMAIL PROTECTED]> > >>> wrote: > >>> > >>>> I created a jira to supply a non-distributed counterpart of the > >>>> sampling that is done in the distributed item similarity computation: > >>>> > >>>> https://issues.apache.org/jira/browse/MAHOUT-914 > >>>> > >>>> > >>>> 2011/12/2 Sean Owen <[EMAIL PROTECTED]>: > >>>>> For your purposes, it's LogLikelihoodSimilarity. I made similar > >> changes > >>>> in > >>>>> other files. Ideally, just svn update to get all recent changes.
-
Re: Mahout performance issuesDaniel Zohar 2011-12-04, 13:04
I assume the parameter does not affect the possibleItemIDs because of the
following line: max = (int) Math.max(defaultMaxPrefsPerItemConsidered, userItemCountMultiplier * Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems()))); On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Sean, your impl. is indeed better than mine but for some reason when I ran > it with for a user with a lot of interactions, I got 2023 possibleItemIDs > (although I used 10,2 in the constructor). > > Sebastian, I will try and expriment also with your patch. I would just > like to add that in my opinion, as long as 'killing items' has to be done > manually, it is not scalable by definition. I personally would always > prefer to avoid these kind of solutions. Also, in my case, the most popular > item has only 3% of the users interacted with, so I suppose that's not > exactly the case as well.. > > > On Sun, Dec 4, 2011 at 2:30 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > >> Hi Daniel, >> >> My view is this: I think you can pretty safely down-sample power users >> like it is done in https://issues.apache.org/jira/browse/MAHOUT-914 >> I did some experiments on the movielens1M dataset that showed that you >> get a negligible error given you look at enough interactions per user: >> >> https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png >> >> I could also verify this on the movielens10M dataset. I think this kind >> of sampling works because the distribution of interactions with items in >> the power-users and in the whole dataset is very similar. Therefore you >> don't really learn anything new from the 'power-users'. The >> 'power-users' might also be crawlers or people sharing accounts in >> practice. >> >> However, I am not sure what happens when you also sample the number of >> items you look at. If I had to decide, I'd rather follow Ted's advice >> and kill super-popular items, as they are not helpful per-se. >> >> But if the additional item sampling helps in your usecase, I don't >> oppose including it in Mahout. I think its good to have a variety of >> candidate item strategies. You should however do some experimenting to >> see how much the sampling affects quality. An A/B test in a real >> application would be the best thing to do. >> >> --sebastian >> >> >> >> On 04.12.2011 13:12, Daniel Zohar wrote: >> > Actually I was referring to Sebastian's. I haven't seen you committedI >> can >> > anything to SamplingCandidateItemsStrategy. Can you tell me in which >> classI can >> > the change appears? >> > >> > On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[EMAIL PROTECTED]> wrote: >> > >> >> Are you referring to my patch, MAHOUT-910? >> >> >> >> It does let you specify a hard cap, really -- if you place a limit of >> X, >> >> then at most X^2 item-item associations come out. Before you could not >> >> bound the result, really, since one user could rate a lot of items. >> >> >> >> I think it's slightly more efficient and unbiased as users with few >> ratings >> >> will not have their ratings sampled out, and all users are equally >> likely >> >> to be sampled out. >> >> >> >> What do you think? >> >> Yes you could easily add a secondary cap though as a final filter. >> >> >> >> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[EMAIL PROTECTED]> >> wrote: >> >> >> >>> Combining the latest commits with my >> >>> optimized-SamplingCandidateItemsStrategy ( >> http://pastebin.com/6n9C8Pw1) >> >>> I achieved satisfying results. All the queries were under one second. >> >>> >> >>> Sebastian, I took a look at your patch and I think it's more practical >> >> than >> >>> the current SamplingCandidateItemsStrategy, however it still doesn't >> put >> >> a >> >>> strict cap on the number of possible item IDs like my implementation >> >> does. >> >>> Perhaps there is room for both implementations? >> >>> >> >>> >> >>> >> >>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[EMAIL PROTECTED]> >> >>> wrote: >>
-
Re: Mahout performance issuesSean Owen 2011-12-04, 13:42
To talk about this clearly, let me go back to my example and add to it:
--- Say we're recommending for user A. User A is connected to items 1, 2, 3. Those items are connected to other users X, Y, Z. And those users in turn are connected to items 100, 101, 102, 103.... You can down-sample three things: 1. The 1,2,3 2. The X,Y,Z 3. The 100,101,102 4. ... the result of downsampling 1-3, again --- The current implementation samples #2. My proposal samples #2 and #3. Sebastian's samples #3. Your proposal does #2 and #4. I believe that doing all 4 is redundant. You probably need to do at least #2 and #3 to avoid the prolific-user and prolific-item problem. The reason you are still seeing a fair number of IDs is that #1 is not also sampled, in my implementation. I think I suggest that we still have one solution for this, since it's all small variants on the same theme, and let's make in SamplingCandidateItemStrategy. To me, the remaining question is just, which of these 4 do you want to do? I suggest 2, 3, and maybe 1. Follow on question: should we make separately settable limits for each, or does this get complex without much use? On Sun, Dec 4, 2011 at 1:04 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > I assume the parameter does not affect the possibleItemIDs because of the > following line: > > max = (int) > Math.max(defaultMaxPrefsPerItemConsidered, userItemCountMultiplier * > Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems()))); > > On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > Sean, your impl. is indeed better than mine but for some reason when I > ran > > it with for a user with a lot of interactions, I got 2023 possibleItemIDs > > (although I used 10,2 in the constructor). > > > > Sebastian, I will try and expriment also with your patch. I would just > > like to add that in my opinion, as long as 'killing items' has to be done > > manually, it is not scalable by definition. I personally would always > > prefer to avoid these kind of solutions. Also, in my case, the most > popular > > item has only 3% of the users interacted with, so I suppose that's not > > exactly the case as well.. >
-
Re: Mahout performance issuesTed Dunning 2011-12-04, 21:01
Sean,
You can also do #1. That is what I have used in the past and what I recommend. That achieves a large part of #2, but what is most important is that it *directly* addresses the key cost factor in off-line recommendations since the number of item pairs emitted is proportional to the sum of the number of items squared for each user. Specifically, I think that each user should have at most N items and if they have more, the number they have should be down-sampled to the point that they have N. I also think that there are some cases were strategy #2 is important even if #1 is implemented. If #1 and #2 are done, then it is a matter of convenience to limit the number of items in each row of the item-item matrix. This is #4 which I endorse and which Sebastian has endorsed. On Sun, Dec 4, 2011 at 5:42 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > To talk about this clearly, let me go back to my example and add to it: > > --- > Say we're recommending for user A. User A is connected to items 1, 2, 3. > Those items are connected to other users X, Y, Z. And those users in turn > are connected to items 100, 101, 102, 103.... You can down-sample three > things: > > 1. The 1,2,3 > 2. The X,Y,Z > 3. The 100,101,102 > 4. ... the result of downsampling 1-3, again > --- > > The current implementation samples #2. My proposal samples #2 and #3. > Sebastian's samples #3. Your proposal does #2 and #4. I believe that doing > all 4 is redundant. You probably need to do at least #2 and #3 to avoid the > prolific-user and prolific-item problem. > > The reason you are still seeing a fair number of IDs is that #1 is not also > sampled, in my implementation. > > I think I suggest that we still have one solution for this, since it's all > small variants on the same theme, and let's make in > SamplingCandidateItemStrategy. > > To me, the remaining question is just, which of these 4 do you want to do? > I suggest 2, 3, and maybe 1. > Follow on question: should we make separately settable limits for each, or > does this get complex without much use? > > On Sun, Dec 4, 2011 at 1:04 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > I assume the parameter does not affect the possibleItemIDs because of the > > following line: > > > > max = (int) > > Math.max(defaultMaxPrefsPerItemConsidered, userItemCountMultiplier * > > Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems()))); > > > > On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > > > Sean, your impl. is indeed better than mine but for some reason when I > > ran > > > it with for a user with a lot of interactions, I got 2023 > possibleItemIDs > > > (although I used 10,2 in the constructor). > > > > > > Sebastian, I will try and expriment also with your patch. I would just > > > like to add that in my opinion, as long as 'killing items' has to be > done > > > manually, it is not scalable by definition. I personally would always > > > prefer to avoid these kind of solutions. Also, in my case, the most > > popular > > > item has only 3% of the users interacted with, so I suppose that's not > > > exactly the case as well.. > > >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-05, 08:20
I agree with with Ted that users with many preferences should be
down-sampled. I think that if we do go with with #1,#2 and #3 then there's not much point in #4. We just have to make sure that the final size of possibleItemIDs is under control, so it eliminates the performance bottleneck. Another issue to take into account, is to try and not down-sample too much so users with 1-2 preferences still get decent results. On Sun, Dec 4, 2011 at 11:01 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Sean, > > You can also do #1. That is what I have used in the past and what I > recommend. That achieves a large part of #2, but what is most important is > that it *directly* addresses the key cost factor in off-line > recommendations since the number of item pairs emitted is proportional to > the sum of the number of items squared for each user. > > Specifically, I think that each user should have at most N items and if > they have more, the number they have should be down-sampled to the point > that they have N. > > I also think that there are some cases were strategy #2 is important even > if #1 is implemented. > > If #1 and #2 are done, then it is a matter of convenience to limit the > number of items in each row of the item-item matrix. This is #4 which I > endorse and which Sebastian has endorsed. > > On Sun, Dec 4, 2011 at 5:42 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > To talk about this clearly, let me go back to my example and add to it: > > > > --- > > Say we're recommending for user A. User A is connected to items 1, 2, 3. > > Those items are connected to other users X, Y, Z. And those users in turn > > are connected to items 100, 101, 102, 103.... You can down-sample three > > things: > > > > 1. The 1,2,3 > > 2. The X,Y,Z > > 3. The 100,101,102 > > 4. ... the result of downsampling 1-3, again > > --- > > > > The current implementation samples #2. My proposal samples #2 and #3. > > Sebastian's samples #3. Your proposal does #2 and #4. I believe that > doing > > all 4 is redundant. You probably need to do at least #2 and #3 to avoid > the > > prolific-user and prolific-item problem. > > > > The reason you are still seeing a fair number of IDs is that #1 is not > also > > sampled, in my implementation. > > > > I think I suggest that we still have one solution for this, since it's > all > > small variants on the same theme, and let's make in > > SamplingCandidateItemStrategy. > > > > To me, the remaining question is just, which of these 4 do you want to > do? > > I suggest 2, 3, and maybe 1. > > Follow on question: should we make separately settable limits for each, > or > > does this get complex without much use? > > > > On Sun, Dec 4, 2011 at 1:04 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > > > I assume the parameter does not affect the possibleItemIDs because of > the > > > following line: > > > > > > max = (int) > > > Math.max(defaultMaxPrefsPerItemConsidered, userItemCountMultiplier * > > > Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems()))); > > > > > > On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > > > > > > > Sean, your impl. is indeed better than mine but for some reason when > I > > > ran > > > > it with for a user with a lot of interactions, I got 2023 > > possibleItemIDs > > > > (although I used 10,2 in the constructor). > > > > > > > > Sebastian, I will try and expriment also with your patch. I would > just > > > > like to add that in my opinion, as long as 'killing items' has to be > > done > > > > manually, it is not scalable by definition. I personally would always > > > > prefer to avoid these kind of solutions. Also, in my case, the most > > > popular > > > > item has only 3% of the users interacted with, so I suppose that's > not > > > > exactly the case as well.. > > > > > >
-
Re: Mahout performance issuesTed Dunning 2011-12-05, 08:27
The downsampling should have a target size after sampling so that users
with that many or fewer ratings are not down-sampled at all. This is easy to do using reservoir sampling or anything similar. You can also just keep the first or most recent ratings. Or you can use a sampler biased toward either of those extremes. On Mon, Dec 5, 2011 at 12:20 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Another issue to take into account, is to try and not down-sample too much > so users with 1-2 preferences still get decent results. >
-
Re: Mahout performance issuesDaniel Zohar 2011-12-05, 08:34
I think it would be extremely good to be able to recommend on items related
to more recent preferences. That is, if the timestamps are provided. On Mon, Dec 5, 2011 at 10:27 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: > The downsampling should have a target size after sampling so that users > with that many or fewer ratings are not down-sampled at all. > > This is easy to do using reservoir sampling or anything similar. You can > also just keep the first or most recent ratings. Or you can use a sampler > biased toward either of those extremes. > > On Mon, Dec 5, 2011 at 12:20 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > Another issue to take into account, is to try and not down-sample too > much > > so users with 1-2 preferences still get decent results. > > >
-
Re: Mahout performance issuesTed Dunning 2011-12-05, 08:58
I have heard arguments both ways for the off-line cooccurrence analysis.
Certainly, the user getting the recommendations should get them based on the most recent items. On Mon, Dec 5, 2011 at 12:34 AM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > I think it would be extremely good to be able to recommend on items related > to more recent preferences. That is, if the timestamps are provided. > > On Mon, Dec 5, 2011 at 10:27 AM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > > > The downsampling should have a target size after sampling so that users > > with that many or fewer ratings are not down-sampled at all. > > > > This is easy to do using reservoir sampling or anything similar. You can > > also just keep the first or most recent ratings. Or you can use a > sampler > > biased toward either of those extremes. > > > > On Mon, Dec 5, 2011 at 12:20 AM, Daniel Zohar <[EMAIL PROTECTED]> > wrote: > > > > > Another issue to take into account, is to try and not down-sample too > > much > > > so users with 1-2 preferences still get decent results. > > > > > >
-
Re: Mahout performance issuesManuel Blechschmidt 2011-12-05, 12:46
Hi Daniel,
unfortunately I did not tried yet the IRStatistics analyzer so I am not able to diagnose the performance problems. At the moment I also have to work on some other other stuff. I know that it uses a thread pool for parallelizing the evaluation. Perhaps you could sample down your data set or let it run over night. Sorry Manuel On 02.12.2011, at 16:26, Daniel Zohar wrote: > Manuel, I starting running the evaluation as proposed. But it seems it will > take forever to complete. It does the evaluation for each user which takes > well over a minute. What am I doing wrong? > This is my code : > > RecommenderBuilder itemBasedBuilder = new RecommenderBuilder() { > > public Recommender buildRecommender(DataModel model) { > > // build and return the Recommender to evaluate here > > try { > > ItemSimilarity itemSimilarity = newCachingItemSimilarity( > new LogLikelihoodSimilarity(model), model); > > CandidateItemsStrategy candidateItemsStrategy = new > OptimizedItemStrategy(20, > 2, 100); > > MostSimilarItemsCandidateItemsStrategy > mostSimilarItemsCandidateItemsStrategy = new OptimizedItemStrategy(20, 2, > 100); > > ItemBasedRecommender recommender > newGenericBooleanPrefItemBasedRecommender( > dataModel, itemSimilarity, candidateItemsStrategy, > > mostSimilarItemsCandidateItemsStrategy); > > return recommender; > > } catch (TasteException e) { > > // TODO Auto-generated catch block > > e.printStackTrace(); > > return null; > > } > > } > > }; > > RecommenderIRStatsEvaluator evaluator = new > GenericRecommenderIRStatsEvaluator(); > > try { > > IRStatistics stats = evaluator.evaluate(itemBasedBuilder, null, > this.dataModel, null, 3, 0, 1.0); > > logger.info("Evalute returned:" + stats.toString()); > > } catch (TasteException e) { > > // TODO Auto-generated catch block > > logger.error("",e); > > } > > On Fri, Dec 2, 2011 at 1:29 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > >> Hello Manuel, >> I will run the tests as requested and post the results later. >> >> >> On Fri, Dec 2, 2011 at 1:20 PM, Manuel Blechschmidt < >> [EMAIL PROTECTED]> wrote: >> >>> Hello Daniel, >>> >>> On 02.12.2011, at 12:02, Daniel Zohar wrote: >>> >>>> Hi guys, >>>> >>>> ... >>>> I just ran the fix I proposed earlier and I got great results! The query >>>> time was reduced to about a third for the 'heavy users'. Before it was >>> 1-5 >>>> secs and now it's 0.5-1.5. The best part is that the accuracy level >>> should >>>> remain exactly the same. I also believe it should reduce memory >>>> consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets >>>> significantly smaller (in my case at least). >>> >>> It would be great if you could measure your run time performance and your >>> accuracy with the provided Mahout tools. >>> >>> In your case because you only have boolean feedback precision and recall >>> would make sense. >>> >>> https://cwiki.apache.org/MAHOUT/recommender-documentation.html >>> >>> RecommenderIRStatsEvaluator evaluator = new >>> GenericRecommenderIRStatsEvaluator(); >>> IRStatistics stats = evaluator.evaluate(builder, null, myModel, null, 3, >>> RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0); >>> >>> >>> Here is some example code from me: >>> >>> public void testEvaluateRecommender() { >>> try { >>> DataModel myModel = new >>> MyModelImplementationDataModel(); >>> >>> // Users: 12858 >>> // Items: 5467 >>> // MaxPreference: 85850.0 >>> // MinPreference: 50.0 >>> System.out.println("Users: >>> "+myModel.getNumUsers()); >>> System.out.println("Items: Manuel Blechschmidt Dortustr. 57 14467 Potsdam Mobil: 0173/6322621 Twitter: http://twitter.com/Manuel_B
-
Re: Mahout performance issuesDaniel Zohar 2011-12-05, 12:50
No worries Manual.
I think we have almost done cracking the problem. Lets wait for Sean's response. Cheers On Mon, Dec 5, 2011 at 2:46 PM, Manuel Blechschmidt < [EMAIL PROTECTED]> wrote: > Hi Daniel, > unfortunately I did not tried yet the IRStatistics analyzer so I am not > able to diagnose the performance problems. At the moment I also have to > work on some other other stuff. > > I know that it uses a thread pool for parallelizing the evaluation. > Perhaps you could sample down your data set or let it run over night. > > Sorry > Manuel > > On 02.12.2011, at 16:26, Daniel Zohar wrote: > > > Manuel, I starting running the evaluation as proposed. But it seems it > will > > take forever to complete. It does the evaluation for each user which > takes > > well over a minute. What am I doing wrong? > > This is my code : > > > > RecommenderBuilder itemBasedBuilder = new RecommenderBuilder() { > > > > public Recommender buildRecommender(DataModel model) { > > > > // build and return the Recommender to evaluate here > > > > try { > > > > ItemSimilarity itemSimilarity > newCachingItemSimilarity( > > new LogLikelihoodSimilarity(model), model); > > > > CandidateItemsStrategy candidateItemsStrategy = new > > OptimizedItemStrategy(20, > > 2, 100); > > > > MostSimilarItemsCandidateItemsStrategy > > mostSimilarItemsCandidateItemsStrategy = new OptimizedItemStrategy(20, 2, > > 100); > > > > ItemBasedRecommender recommender > > newGenericBooleanPrefItemBasedRecommender( > > dataModel, itemSimilarity, candidateItemsStrategy, > > > > mostSimilarItemsCandidateItemsStrategy); > > > > return recommender; > > > > } catch (TasteException e) { > > > > // TODO Auto-generated catch block > > > > e.printStackTrace(); > > > > return null; > > > > } > > > > } > > > > }; > > > > RecommenderIRStatsEvaluator evaluator = new > > GenericRecommenderIRStatsEvaluator(); > > > > try { > > > > IRStatistics stats = evaluator.evaluate(itemBasedBuilder, null, > > this.dataModel, null, 3, 0, 1.0); > > > > logger.info("Evalute returned:" + stats.toString()); > > > > } catch (TasteException e) { > > > > // TODO Auto-generated catch block > > > > logger.error("",e); > > > > } > > > > On Fri, Dec 2, 2011 at 1:29 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > >> Hello Manuel, > >> I will run the tests as requested and post the results later. > >> > >> > >> On Fri, Dec 2, 2011 at 1:20 PM, Manuel Blechschmidt < > >> [EMAIL PROTECTED]> wrote: > >> > >>> Hello Daniel, > >>> > >>> On 02.12.2011, at 12:02, Daniel Zohar wrote: > >>> > >>>> Hi guys, > >>>> > >>>> ... > >>>> I just ran the fix I proposed earlier and I got great results! The > query > >>>> time was reduced to about a third for the 'heavy users'. Before it was > >>> 1-5 > >>>> secs and now it's 0.5-1.5. The best part is that the accuracy level > >>> should > >>>> remain exactly the same. I also believe it should reduce memory > >>>> consumption, as the GenericBooleanPrefDataModel.preferenceForItems > gets > >>>> significantly smaller (in my case at least). > >>> > >>> It would be great if you could measure your run time performance and > your > >>> accuracy with the provided Mahout tools. > >>> > >>> In your case because you only have boolean feedback precision and > recall > >>> would make sense. > >>> > >>> https://cwiki.apache.org/MAHOUT/recommender-documentation.html > >>> > >>> RecommenderIRStatsEvaluator evaluator = new > >>> GenericRecommenderIRStatsEvaluator(); > >>> IRStatistics stats = evaluator.evaluate(builder, null, myModel, null, > 3, > >>> RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0); > >>> > >>> > >>> Here is some example code from me: > >>> > >>> public void testEvaluateRecommender() { > >>> try {
-
Re: Mahout performance issuesSean Owen 2011-12-05, 13:04
I am posting a new patch to MAHOUT-910 in a second that shows efficient
sampling of all three. On Mon, Dec 5, 2011 at 12:50 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > No worries Manual. > I think we have almost done cracking the problem. > Lets wait for Sean's response. > Cheers
-
Re: Mahout performance issuesDaniel Zohar 2011-12-05, 14:20
Hi Sean,
I have been playing around with your patch. It looks good. >From the little testing I did, I can also say that the recommendations seem to be more accurate than in my initial proposal (#4). I just have one suggestion though. I think the current parameters (int defaultMaxPrefsPerItemConsidered, int userItemCountMultiplier) are not so clear and don't give enough control over the sampling. I would introduce two other parameters (it won't be backwards-compatible though) - - maxSourcePrefsConsidered: which will be used in conjunction with SamplingLongPrimitiveIterator to do #1. - maxFinalPrefs : which will set the value for 'int max' in your patch (i.e. get rid of max = (int) Math.max(defaultMaxPrefsPerItemConsidered, userItemCountMultiplier * Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems()))); ) In the future it would be possible to add a strategy that will affect the way maxSourcePrefsConsidered is sampled. For example, most recent items or least recent items or random sampling (like we have now). Even though that might not be the place to do so.. (since it's not in the context of the user) What do you think? On Mon, Dec 5, 2011 at 3:04 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > I am posting a new patch to MAHOUT-910 in a second that shows efficient > sampling of all three. > > On Mon, Dec 5, 2011 at 12:50 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > No worries Manual. > > I think we have almost done cracking the problem. > > Lets wait for Sean's response. > > Cheers >
-
Re: Mahout performance issuesSean Owen 2011-12-05, 17:46
Let me continue this discussion at MAHOUT-910 in JIRA.
On Mon, Dec 5, 2011 at 2:20 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > Hi Sean, > I have been playing around with your patch. It looks good. > From the little testing I did, I can also say that the recommendations seem > to be more accurate than in my initial proposal (#4).
-
Re: Mahout performance issuesDaniel Zohar 2011-12-08, 07:58
I'd like to thank everyone for helping out!
I posted the solution back at Stackoverflow. http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues/8427850#8427850 Cheers, Daniel On Mon, Dec 5, 2011 at 7:46 PM, Sean Owen <[EMAIL PROTECTED]> wrote: > Let me continue this discussion at MAHOUT-910 in JIRA. > > On Mon, Dec 5, 2011 at 2:20 PM, Daniel Zohar <[EMAIL PROTECTED]> wrote: > > > Hi Sean, > > I have been playing around with your patch. It looks good. > > From the little testing I did, I can also say that the recommendations > seem > > to be more accurate than in my initial proposal (#4). > |