-
MinHash/ItemBased | Vishal Santoshi | 2011-10-25, 13:59
Hello Folks,

Item-based recommendation for my dataset is excruciatingly slow on an 8-node cluster. Yes, the number of items is big, and the dataset churn does not allow for a long asynchronous process. Recommendations cannot be stale (a 30-minute delay is stale). I have tried out MinHash clustering, and it is scalable, but without a "degree of association" with the multiple clusters any user may belong to, it seems less tight than a pure item-based (and thus similarity-probability) algorithm.

Any ideas how we pull this off, where

* The item churn is frequent. New items enter the dataset all the time.
* There is no "preference" apart from opt-in.
* Anonymous users enter the system almost all the time.

Scale is very important.

I am leaning towards MinHash, with additional algorithms that are executed offline, and co-occurrence.
-
Re: MinHash/ItemBased | Sean Owen | 2011-10-25, 14:07
Can you put any more numbers around this? How slow is slow, how big is big? What part of Mahout are you using, or are you using Mahout at all?

Item-based recommendation sounds fine. Anonymous users aren't a problem as long as you can distinguish them reasonably. I think your challenge is to have a data model that quickly drops out data from old items and can bring new items in.

Is this small enough to do in memory? That's the simple, easy place to start.
-
Re: MinHash/ItemBased | Vishal Santoshi | 2011-10-25, 14:27
The data is big, as in, for a single day (and I picked an arbitrary day):

8,335,013 users.
256,010 distinct items.

I am using the item-based recommender (the RecommenderJob), with no preference values (an opt-in is the signal of preference; multiple opt-ins are counted as 1).

  <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
  <arg>recommender</arg>
  <arg>--input</arg>
  <arg>${out}/items/bag</arg>
  <arg>--output</arg>
  <arg>${out}/items_similarity</arg>
  <arg>-u</arg>
  <arg>${out}/items/users/part-r-00000</arg>
  <arg>-b</arg>
  <arg>-n</arg>
  <arg>2</arg>
  <arg>--similarityClassname</arg>
  <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
  <arg>--tempDir</arg>
  <arg>${out}/temp</arg>

Of course the recommendations are computed for every user, and thus the RecommenderJob-PartialMultiplyMapper-AggregateAndRecommendReducer step is the most expensive of all. Further, I am not sure why the user file is taken in as a distributed file, especially when it may actually be bigger than a typical TaskTracker JVM memory limit.

In the case of MinHash (MinHashDriver):

  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <delete path="${out}/minhash"/>
    </prepare>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
    <arg>minhash_local</arg>
    <arg>--input</arg>
    <arg>${out}/bag</arg>
    <arg>--output</arg>
    <arg>${out}/minhash</arg>
    <arg>--keyGroups</arg> <!-- Key Groups -->
    <arg>2</arg>
    <arg>-r</arg> <!-- Number of Reducers -->
    <arg>40</arg>
    <arg>--minClusterSize</arg> <!-- A legitimate cluster must have this number of members -->
    <arg>5</arg>
    <arg>--hashType</arg> <!-- murmur and linear are the other 2 options -->
    <arg>polynomial</arg>
  </java>

This of course scales. I still have to work with the clusters created, and a fair amount of work has to be done to figure out which cluster is relevant.

A week of data in this case created the MinHash on our cluster in about 20 minutes.

Regards.
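
For readers unfamiliar with the similarity measure named in the configuration above: with boolean (opt-in only) data, the Tanimoto coefficient between two items is the ratio of users who clicked both items to users who clicked either. The following is a minimal Python sketch of that ratio only. The helper name `tanimoto` and the click data are made up for illustration, and this is not Mahout's implementation.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) similarity of two items, given the users who opted in to each."""
    a, b = set(users_a), set(users_b)  # sets: multiple opt-ins by one user count once
    union = len(a | b)
    if union == 0:
        return 0.0  # neither item has any users; define the similarity as 0
    return len(a & b) / union


# Hypothetical click data: item -> users who clicked it.
clicks = {
    "article-1": ["u1", "u2", "u3", "u3"],
    "article-2": ["u2", "u3", "u4"],
}
print(tanimoto(clicks["article-1"], clicks["article-2"]))  # 2 shared / 4 total = 0.5
```

Because only set membership matters, repeated clicks by the same user do not change the score, which matches the "multiple opt-ins are counted as 1" convention described in the message.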
-
Re: MinHash/ItemBased | Sebastian Schelter | 2011-10-25, 14:42
Hello Vishal,

How many interactions do you have between those users and items? I'd definitely recommend that you try out the current trunk of Mahout, as the performance of RecommenderJob has been significantly improved.

The most important parameter (performance-wise) is the newly introduced --maxPrefsPerUserInItemSimilarity, which causes RecommenderJob to downsample "power" users, who can slow down the recommendation computation (without contributing much to the quality of the results).

I'm currently running tests with a patched version of the new RecommenderJob on a Yahoo Music dataset consisting of more than 700 million ratings from 2 million users for 140 thousand items (which seems to be similar to your user/item ratio), and I am seeing nice results even though I run it on a small research cluster.

If the phase after the item similarity computation takes too long (I think you suspected this), then you can also try the patch from https://issues.apache.org/jira/browse/MAHOUT-827, which broadcasts the similarity matrix via the distributed cache and computes the recommendations in a map-only job. This could work well for your use case, as you have a relatively small number of items.

--sebastian
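
To illustrate why capping "power" users helps: the number of item pairs a single user contributes to a co-occurrence computation grows quadratically with that user's history, so a user with 10,000 items contributes 49,995,000 pairs, while the same user capped at 500 items contributes 124,750. The sketch below shows the general idea in Python. The function name `cap_prefs_per_user`, the use of uniform random sampling, and the data are assumptions for illustration; how RecommenderJob actually downsamples is not described in this thread.

```python
import random


def cap_prefs_per_user(prefs_by_user, max_prefs, seed=42):
    """Return a copy in which each user keeps at most max_prefs randomly sampled items."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    capped = {}
    for user, items in prefs_by_user.items():
        unique_items = sorted(set(items))  # opt-in data: duplicates count once
        if len(unique_items) > max_prefs:
            unique_items = rng.sample(unique_items, max_prefs)
        capped[user] = unique_items
    return capped


# Hypothetical data: one casual user and one "power" user.
prefs = {
    "casual": ["a", "b"],
    "power": [f"item-{i}" for i in range(10_000)],
}
capped = cap_prefs_per_user(prefs, max_prefs=500)
print({user: len(items) for user, items in capped.items()})  # {'casual': 2, 'power': 500}
```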
-
Re: MinHash/ItemBased | Sean Owen | 2011-10-25, 14:43
Why recommend for all users? Why not just new ones, or ones that have been updated? And yes, you're not intended to list all users into memory when using "-u".

A very crude rule of thumb is that you can compute about 100 recs per second on a normal machine with normal-sized data (no Hadoop). 8 machines would crank through 8.3M recs in about 3 hours at best. Hadoop is going to be 3-4x slower than this due to its overheads.

This pipeline probably takes 10 minutes or so to finish even with 0 input; that's the Hadoop overhead. If you're trying to finish computations in minutes, Hadoop probably isn't suitable.

But I think this all works much, much better if you recompute only the users that have changed their prefs.
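
The back-of-envelope estimate in this message can be reproduced with a few lines of Python. The user count comes from earlier in the thread, and the 100 recs/second figure and the 3-4x Hadoop slowdown are the crude rules of thumb quoted above, not measurements. The function name `hours_to_recommend` is made up for this sketch.

```python
users = 8_335_013      # daily users reported earlier in the thread
recs_per_second = 100  # crude per-machine rule of thumb from this message
machines = 8


def hours_to_recommend(users, recs_per_second, machines, slowdown=1.0):
    """Wall-clock hours to compute one set of recommendations per user."""
    seconds = users / (recs_per_second * machines) * slowdown
    return seconds / 3600


print(round(hours_to_recommend(users, recs_per_second, machines), 1))     # 2.9
print(round(hours_to_recommend(users, recs_per_second, machines, 3), 1))  # 8.7
print(round(hours_to_recommend(users, recs_per_second, machines, 4), 1))  # 11.6
```

Under these assumptions a full recomputation on Hadoop would take roughly 9 to 12 hours, which is why the message suggests recomputing only for users whose preferences changed.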
-
Re: MinHash/ItemBased | Vishal Santoshi | 2011-10-25, 15:08
>> But, I think this all works much much better if you can only recompute
>> users that have changed their prefs.

In our case a preference is a user clicking on an article (which doubles as an item), and these articles are introduced at a frequent rate. Thus new items churn through the dataset very frequently and do not necessarily have any history. Of course, we need to recommend the latest items.

So the issues are:

* We have users that have a click history.
* We have new items that users will potentially click on.
* Brand new users may have to be checked against users that have a history (to find similarity).
* Recommendations of old items, though OK, have staleness associated with them. Unlike for Amazon or Netflix, staleness, churn, etc. are a really big deal.

I realize the overhead of Hadoop, yet the data is cumulative, in that we would rather go for a sliding window of 3 weeks. We do want to do it every 3-4 hours for every user (a user may come back at any time). We do realize that an offline computation is likely to be part of the solution.

What would you do based on our requirements? For me:

* Offline clustering (PLSI + MinHash / item-based recommendation).
* Co-occurrence on items (for new users who have no history).
-
Re: MinHash/ItemBased | Vishal Santoshi | 2011-10-25, 15:14
The number of interactions is about 30 million a day.

>> The most important parameter (performancewise) is the newly introduced
>> --maxPrefsPerUserInItemSimilarity

Will use it and keep you posted.

>> broadcasts the similarity matrix via distributed cache

This can be done, as the similarity matrix is unlikely to blow up (the number of items does not grow exponentially).

>> definitely recommend to you try out the current trunk of Mahout

I have built Mahout from the latest svn checkout (mahout-core-0.6-SNAPSHOT.jar).
-
Re: MinHash/ItemBased | Sean Owen | 2011-10-25, 15:32
On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi <[EMAIL PROTECTED]> wrote:
> In our case the preferences is a user clicking on an article ( which
> doubles as an item ).
> And these articles are introduced at a frequent rate. Thus the number of new
> items that occur in the dataset has a very frequent churn and thus not
> necessarily having any history.
> Of course we need to recommend the latest item.

OK, but I'm still not seeing why all users need an update every time. Surely most of the 8.3M users aren't even active in a given day.
-
Re: MinHash/ItemBased | Vishal Santoshi | 2011-10-25, 15:44
They are all active in a day. I am talking about 8.3 million active users a day. A significant fraction of them will be new users (say about 2-3 million of them). Further, the churn on items is likely to make historical recommendations obsolete. Thus if I have recommendations that were good for user A yesterday, they are likely to carry far less weight today.
-
Re: MinHash/ItemBased | Sean Owen | 2011-10-25, 15:55
Oh I see, right.

Well, one general strategy is to use Hadoop to compute the recommendations regularly, but not nearly in real-time. Then, use the latest data to imperfectly update the recommendations in real-time. So, you always have slightly stale recommendations, and item-item similarities to fall back on, and are reloading those periodically. Then you're trying to update any recently changed item or user in real-time using item-based recommendation, which can be fast.

It's a really big topic in its own right, and there's no complete answer for you here, but you can piece this together from Mahout rather than build it from scratch.

(This is more or less exactly what I have been working on separately: a hybrid Hadoop-based / real-time recommender that can handle this scale but also respond reasonably to new data.)
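
The real-time half of this strategy can be sketched as follows, assuming the periodic batch job has already produced an item-item similarity table that is loaded in memory. The function name `recommend`, the table layout, and the scores are hypothetical, and the scoring rule (sum of similarities to recently clicked items) is just one simple choice, not necessarily what Mahout's item-based recommenders do.

```python
from collections import defaultdict


def recommend(recent_clicks, item_similarities, top_n=2):
    """Score unseen items by summing their similarity to the user's recent clicks."""
    seen = set(recent_clicks)
    scores = defaultdict(float)
    for clicked in seen:
        for other, sim in item_similarities.get(clicked, {}).items():
            if other not in seen:
                scores[other] += sim
    # Highest score first; ties broken by item ID for deterministic output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical output of the periodic batch job: item -> {similar item: similarity}.
item_similarities = {
    "a": {"b": 0.6, "c": 0.2},
    "b": {"a": 0.6, "d": 0.5},
    "c": {"a": 0.2, "d": 0.4},
}

# An anonymous user who has clicked "a" and then "c" in this session.
print(recommend(["a", "c"], item_similarities))  # ['b', 'd']
```

Because the lookup only touches the rows for the items just clicked, it works for brand new or anonymous users without waiting for the next batch run. Brand new items, however, have no row in the table until the similarities are reloaded.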
-
Re: MinHash/ItemBased | Vishal Santoshi | 2011-10-25, 16:07
Yep, please keep me posted.

BTW, this is exactly why MinHash piqued my curiosity, and that seems to be affirmed by

http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce

MinHash scales, so the offline periodic component (based on Hadoop/Mahout; yes, Mahout has a MinHash-based clustering driver) seems promising. Again, please keep the forum posted on how you go about doing this.

Regards,

Vishal.
-
Re: MinHash/ItemBased | Sebastian Schelter | 2011-10-25, 16:16
The Google News paper you cite follows an approach very different from the one implemented in RecommenderJob. Their approach has a very high complexity, and they chose to use it because of the extreme item churn in the news domain.

The techniques in the Google paper (MinHash and PLSI) are used to compute user similarities (clusters of users): MinHash just looks at the ratio of co-read stories, while PLSI tries to cluster the users according to some latent features in their interactions. A third component tracks co-read stories in real time, and a user is recommended stories that were co-read by other users in his clusters.

--sebastian
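
As a rough illustration of "the ratio of co-read stories": MinHash keeps, for each of several hash functions, the minimum hash value over a user's set of read stories, and the fraction of positions where two users' signatures agree estimates the Jaccard ratio of their reading histories. The sketch below is a generic Python version under stated assumptions (MD5-based seeded hashing, 200 hash functions, made-up story IDs). It is not the Google News implementation and not Mahout's MinHashDriver.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=200):
    """One minimum per seeded hash function over a non-empty set of read stories."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two users agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical reading histories; true Jaccard = 50 shared / 150 total = 1/3.
user_a = {f"story-{i}" for i in range(0, 100)}
user_b = {f"story-{i}" for i in range(50, 150)}

sig_a = minhash_signature(user_a)
sig_b = minhash_signature(user_b)
print(estimate_jaccard(sig_a, sig_b))  # an estimate close to 1/3, not exact
```

Clustering then amounts to grouping users whose signatures (or chosen slices of them) collide, which is cheap to do in a single MapReduce pass. The estimate gets tighter as the number of hash functions grows.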
-
Re: MinHash/ItemBasedTed Dunning 2011-10-25, 16:22
My own preference for this kind of recommendation would be to recommend
words and phrases and then use a search engine to find the articles that have those words and phrases in them. Engaging with an article would be tantamount to showing interest in all the words and phrases associated with the article. To avoid floods of data there, I would sparsify that by using LLR to find characteristic terms and phrases for each article. Item based recommendation would only require recent history for new users. Their first page view would not be very informative, but after their first search or document view, they would be good to go. The virtue of this approach is that the set of words and phrases is fairly static and thus the recommendations would not need frequent update. A slightly simpler approach would be to simply search for words and phrases that occur anomalously often in the documents the user has engaged with. That can work, but it will not exhibit any spreading of terms to related terms and thus will present only very similar documents. With either of these approaches, your data volumes would be fairly modest. Minhash finds very, very similar documents. It is usually considered for tasks such as duplicate detection. For recommendation, I would think that you would like something broader. On Tue, Oct 25, 2011 at 9:07 AM, Vishal Santoshi <[EMAIL PROTECTED]>wrote: > Yep, Please keep me posted. > BTW , this is exactly why MinHash picked my curiosity and that seems to be > affirmed by > > > http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce > > MinHash scales , such that the offline periodic component ( based on > hadoop/mahout yes mahout has a Minhash based clustering Driver ) seems > promising. > Again please keep the forum posted on how you go about doing this. > > Regards, > > Vishal. > > On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > > > Oh I see, right. 
> >
> > Well, one general strategy is to use Hadoop to compute the recommendations regularly, but not nearly in real-time. Then, use the latest data to imperfectly update the recommendations in real-time. So, you always have slightly stale recommendations, and item-item similarities to fall back on, and are reloading those periodically. Then you're trying to update any recently changed item or user in real-time using item-based recommendation, which can be fast.
> >
> > It's a really big topic in its own right, and there's no complete answer for you here, but you can piece this together from Mahout rather than build it from scratch.)
> >
> > (This is more or less exactly what I have been working on separately, a hybrid Hadoop-based / real-time recommender that can handle this scale but also respond reasonably to new data.)
> >
> > On Tue, Oct 25, 2011 at 4:44 PM, Vishal Santoshi <[EMAIL PROTECTED]> wrote:
> > > They are all active in a day. I am talking about 8.3 million active users a day.
> > > A significant fraction of them will be new users ( say about 2-3 million of them ).
> > > Further the churn on items is likely to make historical recommendations obsolete.
> > > Thus if I have recommendations that were good of user A yesterday, they are likely to be far less a weight as of today.
> > >
> > > On Tue, Oct 25, 2011 at 11:32 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> > >
> > >> On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi <[EMAIL PROTECTED]> wrote:
> > >> > In our case the preferences is a user clicking on an article ( which doubles as an item ).
> > >> > And these articles are introduced at a frequent rate. Thus the number of new items that occur in the dataset has a very frequent churn and thus not necessarily having any history.
> > >> > Of course we need to recommend the latest item.
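
The LLR sparsification Ted describes in this message could look roughly like the Python sketch below. It assumes the 2x2 contingency-table form of the log-likelihood ratio test, comparing how often a term occurs in one article against how often it occurs in the rest of the corpus. The function names (`llr`, `characteristic_terms`), the whitespace tokenisation and the toy corpus are illustrative assumptions and are not taken from Mahout or from this thread.

```python
import math
from collections import Counter


def x_log_x(x):
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalised entropy of a set of counts: N ln N - sum(k ln k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: occurrences of the term in this article
    k12: occurrences of the term in the rest of the corpus
    k21: other tokens in this article
    k22: other tokens in the rest of the corpus
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def characteristic_terms(doc_tokens, corpus_counts, corpus_total, top_n=5):
    """Return the terms that are most over-represented in one article.

    corpus_counts and corpus_total describe the whole corpus,
    including the article itself.
    """
    doc_counts = Counter(doc_tokens)
    doc_total = len(doc_tokens)
    rest_total = corpus_total - doc_total
    scored = []
    for term, k11 in doc_counts.items():
        k12 = corpus_counts[term] - k11
        k21 = doc_total - k11
        k22 = rest_total - k12
        # Skip terms that are no more frequent here than elsewhere.
        if k11 * (k12 + k22) <= k12 * (k11 + k21):
            continue
        scored.append((llr(k11, k12, k21, k22), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_n]]


# Toy corpus (hypothetical data, for illustration only).
articles = {
    "a1": "minhash minhash clustering scales on hadoop".split(),
    "a2": "the cat sat on the mat".split(),
    "a3": "the dog sat on the log".split(),
}
corpus_counts = Counter(t for tokens in articles.values() for t in tokens)
corpus_total = sum(corpus_counts.values())
top_terms = characteristic_terms(articles["a1"], corpus_counts, corpus_total, top_n=3)
print(top_terms)
```

Only the few highest-scoring terms per article are kept, which is what keeps the data volume modest: a click on an article then counts as interest in a handful of terms rather than in every word it contains.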
-
Re: MinHash/ItemBasedTed Dunning 2011-10-25, 16:24
The upshot of the Google approach, however, is that users have profiles (expressed in latent semantic space) and new documents are matched against profiles. The word and phrase recommendation system would be very similar in effect to that, but much simpler to implement.

On Tue, Oct 25, 2011 at 9:16 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

> The Google News paper you cite follows an approach very different from the one implemented in RecommenderJob.
>
> Their approach has a very high complexity and they chose to use it because of the extreme item churn in the news domain.
>
> The techniques in the Google paper (MinHash and PLSI) are used compute user similarities (clusters of users, MinHash just looks at the ratio of co-read stories, PLSI tries to cluster the users according to some latent features in their interactions). A third component tracks co-read stories in realtime and a user is recommended stories that are co-read from other users in his clusters.
>
> --sebastian
>
> On 25.10.2011 18:07, Vishal Santoshi wrote:
> > Yep, Please keep me posted.
> > BTW , this is exactly why MinHash picked my curiosity and that seems to be affirmed by
> >
> > http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce
> >
> > MinHash scales , such that the offline periodic component ( based on hadoop/mahout yes mahout has a Minhash based clustering Driver ) seems promising.
> > Again please keep the forum posted on how you go about doing this.
> >
> > Regards,
> >
> > Vishal.
> >
> > On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> >
> >> Oh I see, right.
> >>
> >> Well, one general strategy is to use Hadoop to compute the recommendations regularly, but not nearly in real-time. Then, use the latest data to imperfectly update the recommendations in real-time. So, you always have slightly stale recommendations, and item-item similarities to fall back on, and are reloading those periodically.
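
The "recommend words and phrases, then match new documents against the profile" idea could be sketched as below. This is a minimal illustration under stated assumptions: a plain in-memory scorer stands in for the search engine, each article is assumed to have already been reduced to a short list of characteristic terms, and the names `build_profile`, `score` and `recommend` as well as the sample data are hypothetical.

```python
from collections import Counter


def build_profile(engaged_article_terms):
    """Build a term profile from the articles a user has engaged with.

    Each engaged article contributes one count per distinct term.
    """
    profile = Counter()
    for terms in engaged_article_terms:
        profile.update(set(terms))
    return profile


def score(profile, article_terms):
    """Score an article by summing the profile weight of its terms."""
    return sum(profile[t] for t in set(article_terms))


def recommend(profile, candidates, top_n=3):
    """Rank candidate articles against a profile, dropping zero scores."""
    scored = [(score(profile, terms), article_id)
              for article_id, terms in candidates.items()]
    scored = [(s, a) for s, a in scored if s > 0]
    # Highest score first; ties broken by article id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [a for _, a in scored[:top_n]]


# Hypothetical data: two articles the user clicked, three brand-new articles.
profile = build_profile([["minhash", "hadoop"], ["hadoop", "mahout"]])
new_articles = {
    "n1": ["hadoop", "cluster"],
    "n2": ["cooking", "recipes"],
    "n3": ["mahout", "hadoop", "release"],
}
print(recommend(profile, new_articles))
```

The point of the design is that a brand-new article needs no click history to be recommended: it only needs terms, which is why item churn matters much less here than in co-occurrence-based item similarity.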
-
Re: MinHash/ItemBasedVishal Santoshi 2011-10-25, 16:27
Exactly as you said. And as you may have deciphered, the domain I am working in is very akin to Google's. MinHash (and thus Jaccard similarity) does scale, since it reduces the user-cluster computation to each user's own data, but it has a different set of issues, hence the PLSI as well as the co-occurrence component (and that pushes us towards a NoSQL store such as Cassandra, MongoDB or HBase). For me, item-based recommendation is fairly precise with little or no complexity (apart from the scale issue) and thus pretty straightforward. As Sean predicted, the problem that we and Google face is not really tailor-made for item-based recommendation. A hybrid has to be found, IMHO.

On Tue, Oct 25, 2011 at 12:16 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

> The Google News paper you cite follows an approach very different from the one implemented in RecommenderJob.
>
> Their approach has a very high complexity and they chose to use it because of the extreme item churn in the news domain.
>
> The techniques in the Google paper (MinHash and PLSI) are used compute user similarities (clusters of users, MinHash just looks at the ratio of co-read stories, PLSI tries to cluster the users according to some latent features in their interactions). A third component tracks co-read stories in realtime and a user is recommended stories that are co-read from other users in his clusters.
>
> --sebastian
>
> On 25.10.2011 18:07, Vishal Santoshi wrote:
> > Yep, Please keep me posted.
> > BTW , this is exactly why MinHash picked my curiosity and that seems to be affirmed by
> >
> > http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce
> >
> > MinHash scales , such that the offline periodic component ( based on hadoop/mahout yes mahout has a Minhash based clustering Driver ) seems promising.
> > Again please keep the forum posted on how you go about doing this.
> >
> > Regards,
> >
> > Vishal.
> >
> > On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> >
> >> Oh I see, right.
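
To make the MinHash and Jaccard connection concrete, here is a small Python sketch of how a MinHash signature estimates the Jaccard similarity between two users' click sets. It is a generic illustration and not Mahout's MinHash driver or the Google News implementation: the salted-MD5 hash family, the signature length of 256 and the sample click sets are all assumptions made for the example, and the item sets are assumed to be non-empty.

```python
import hashlib


def _hash(seed, item):
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash over the (non-empty) set."""
    return [min(_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """|A intersect B| / |A union B|, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical click sets: two users who share 50 of 150 distinct articles.
user_a = {f"article-{i}" for i in range(0, 100)}
user_b = {f"article-{i}" for i in range(50, 150)}

sig_a = minhash_signature(user_a)
sig_b = minhash_signature(user_b)
print(exact_jaccard(user_a, user_b), estimated_jaccard(sig_a, sig_b))
```

Each signature depends only on one user's own clicks, which is why the computation scales. Two users land in the same cluster only when several signature positions agree, and that happens mostly for users whose click sets overlap heavily, which is the "very, very similar" behaviour discussed in this thread.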
-
Re: MinHash/ItemBasedVishal Santoshi 2011-10-25, 16:41
>> Minhash finds very, very similar documents. It is usually considered for tasks such as duplicate detection. For recommendation, I would think that you would like something broader.

That is essentially my grouse with MinHash. Google has enough history on a user (or, might I say, the ability to browse through a year's worth of a user's click patterns). That makes their recommendations broader (the more data there is, the more likely it is to yield well-defined clusters). The exactness of MinHash is what worries me. It does not seem to create "realms", or a sense of what a user wants, but rather a sense of what a user ends up seeing. They do say, though, that they "minhash" within the broad universe an article falls under (science etc.). I have not looked at PLSI and thus am not sure how combining it with MinHash modifies the recommendation.

On Tue, Oct 25, 2011 at 12:22 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> My own preference for this kind of recommendation would be to recommend words and phrases and then use a search engine to find the articles that have those words and phrases in them. Engaging with an article would be tantamount to showing interest in all the words and phrases associated with the article. To avoid floods of data there, I would sparsify that by using LLR to find characteristic terms and phrases for each article.
>
> Item based recommendation would only require recent history for new users. Their first page view would not be very informative, but after their first search or document view, they would be good to go.
>
> The virtue of this approach is that the set of words and phrases is fairly static and thus the recommendations would not need frequent update.
>
> A slightly simpler approach would be to simply search for words and phrases that occur anomalously often in the documents the user has engaged with. That can work, but it will not exhibit any spreading of terms to related terms and thus will present only very similar documents.