-
Mahout beginner questions...Razon, Oren 2012-03-22, 11:35
Hi,

As a data mining developer who needs to build a recommender engine POC (proof of concept) to support several future use cases, I've found the Mahout framework an appealing place to start. But as I'm new to Mahout and to Hadoop in general, I have a couple of questions...

1. In "Mahout in Action", section 3.2.5 (Database-based data), it says: "...Several classes in Mahout's recommender implementation will attempt to push computations into the database for performance...". I've looked in the documentation and in the code itself, but didn't find any reference to which calculations are pushed into the DB. Could you please explain what can be done inside the DB?

2. My future use will include use cases with small to medium data volumes (where I guess the non-distributed algorithms will do the job), but also use cases with huge amounts of data (over 500,000,000 ratings). From my understanding, this is where the distributed code should come in handy. My question is: because I will need both distributed and non-distributed code, how could I build a good design? Should I build two different solutions on different machines? Could I do part of the job distributed (for example, the similarity calculation) and use the output in the non-distributed code? Is it a BKM? Also, if I deploy the entire Mahout code on a Hadoop environment, what does that mean for the non-distributed code? Will it all run as a separate Java process on the name node?

3. For now, besides the Hadoop cluster we are building, we have some strong SQL machines (a Netezza appliance) that can handle big (structured) data and offer good integration with third-party analytics providers or development on the Java platform, but that don't include a recommender framework as rich as Mahout. I'm trying to understand how I could use both solutions (Netezza & Mahout) to handle big data recommender system use cases. I thought of moving the data into Netezza, doing all the data manipulation and transformation there, and at the end preparing a file that contains the classic data model structure needed for Mahout. But can you think of a better solution or architecture? Maybe keeping the data only inside Netezza and extracting it to Mahout over JDBC when needed?

I will be glad to hear your ideas :)

Thanks,
Oren

---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
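[Editor's sketch] The file hand-off described in question 3 could look roughly like the following. This is a minimal illustration, not Mahout or Netezza code: it assumes the comma-separated userID,itemID,rating layout commonly used with Mahout's file-based data model, and it uses an in-memory SQLite database as a stand-in for the warehouse. The table name `ratings` and its column names are hypothetical.

```python
import csv
import io
import sqlite3


def export_ratings(conn, out):
    """Write (user_id, item_id, rating) rows as headerless CSV lines.

    Returns the number of rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    cursor = conn.execute(
        "SELECT user_id, item_id, rating FROM ratings ORDER BY user_id, item_id"
    )
    count = 0
    for user_id, item_id, rating in cursor:
        writer.writerow([user_id, item_id, rating])
        count += 1
    return count


def demo_export():
    # In-memory SQLite stands in for the warehouse (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (user_id INTEGER, item_id INTEGER, rating REAL)"
    )
    conn.executemany(
        "INSERT INTO ratings VALUES (?, ?, ?)",
        [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 4.5)],
    )
    buf = io.StringIO()
    n = export_ratings(conn, buf)
    conn.close()
    return n, buf.getvalue()
```

In a real deployment the cursor would come from the warehouse's own driver and the output would go to a file that is then copied to wherever the recommender job reads its input.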
-
Re: Mahout beginner questions...Sean Owen 2012-03-22, 11:51
1. These are the JDBC-related classes. For example, see MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/.

2. The distributed and non-distributed code are quite separate. At this scale I don't think you can use the non-distributed code to a meaningful degree. For example, you could pre-compute item-item similarities over this data and use a non-distributed item-based recommender, but you probably have enough items that this will strain memory. You would probably be looking at pre-computing recommendations in batch.

3. I don't think Netezza will help much here. It's still not fast enough at this scale to use with a real-time recommender (nothing is). If it's just a place where you store data to feed into Hadoop, it's not adding value. All the JDBC-related integrations ultimately load data into memory, and that's out of the question with 500M data points.

I'd also suggest you have a think about whether you "really" have 500M data points. Often you know that most of the data is noise or not useful, and you can get useful recommendations from a fraction of the data (maybe 5M). That makes a lot of things easier.
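[Editor's sketch] One way to act on the suggestion to work from a fraction of the data is to filter before building any model. The thread does not say which filters to use, so the two heuristics below (dropping items with very few ratings and keeping only each user's most recent ratings) are assumptions chosen for illustration, as are the function name and thresholds.

```python
from collections import Counter, defaultdict


def downsample(ratings, min_item_count=2, max_per_user=3):
    """Reduce a rating set with two simple, assumed heuristics.

    ratings: iterable of (user, item, value, timestamp) tuples.
    1. Drop items rated fewer than min_item_count times.
    2. Keep only each user's max_per_user most recent remaining ratings.
    """
    ratings = list(ratings)
    item_counts = Counter(item for _, item, _, _ in ratings)

    by_user = defaultdict(list)
    for r in ratings:
        if item_counts[r[1]] >= min_item_count:
            by_user[r[0]].append(r)

    kept = []
    for user in sorted(by_user):
        # Newest first, then truncate.
        recent = sorted(by_user[user], key=lambda r: r[3], reverse=True)
        kept.extend(recent[:max_per_user])
    return kept
```

Whether such filtering preserves recommendation quality depends on the data, so it is worth evaluating on a held-out sample before committing to thresholds.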
-
RE: Mahout beginner questions...Razon, Oren 2012-03-22, 15:16
Hi Sean,

Thanks for your fast response. I really appreciate the quality of your book ("Mahout in Action") and the support you give in forums like this one.

Just to clarify my second question... I want to build a recommender framework that will support different use cases. So my intention is to have both a distributed and a non-distributed solution in one framework. The question is: is it a good design to put them both on the same machine (one of the machines in the Hadoop cluster)?

BTW, another question: it seems that a good solution to recommender scalability would be to use model-based recommenders. That said, I wonder why there are so few model-based recommenders, especially considering that Mahout already contains several data mining models?
-
Re: Mahout beginner questions...Sean Owen 2012-03-22, 15:57
A distributed and non-distributed recommender are really quite separate. They perform the same task in quite different ways. I don't think you would mix them per se.

Depends on what you mean by a model-based recommender... I would call the matrix-factorization-based and clustering-based approaches "model-based" in the sense that they assume the existence of some underlying structure and discover it. There are no Bayesian-style approaches in the code.

They scale in different ways; I am not sure they are unilaterally a solution to scale, no. I do agree in general that these have good scaling properties for real-world use cases, like the matrix-factorization approaches.

A "real" scalable architecture would have a real-time component and a big distributed computation component. Mahout has elements of both and can be the basis for piecing that together, but it's not a question of strapping together the distributed and non-distributed implementations. It's a bit harder than that.

I am actually quite close to being ready to show off something in this area -- I have been working separately on a more complete rec system that has a real-time element integrated directly with a distributed element to handle the large-scale computation. I think this is typical of big data architectures. You have (at least) a real-time distributed "Serving Layer" and a big distributed batch "Computation Layer". More on this in about... 2 weeks.
-
RE: Mahout beginner questions...Razon, Oren 2012-03-25, 13:04
Thanks for the detailed answer, Sean.

I want to understand the limitations of the non-distributed code more clearly. I saw that you advise that for more than 100,000,000 ratings the non-distributed engine won't do the job. The question is why? Is it a memory issue (in which case I could scale up with a bigger machine), or is it because of the time it takes to produce recommendations?

Thanks,
Oren
-
Re: Mahout beginner questions...Sean Owen 2012-03-25, 19:25
It is memory. You will need a pretty large heap to put 100M data points in memory -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM). You can go bigger if you have more memory, but that size seems about the biggest it is reasonable to assume people have.

Of course more data slows things down, and past about 10M data points you need to tune things to sample data rather than try every possibility. This is most of what CandidateItemStrategy has to do with. It is relatively easy to tune this, though, so speed doesn't have to be an issue.

Again, you can go bigger and tune it to down-sample more; somehow I still believe that 100M is a crude but useful rule of thumb as to the point beyond which it's just hard to get good speed and quality.

Sean
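[Editor's sketch] The rule of thumb above can be turned into rough arithmetic. The snippet only restates the figures from the message (about 4GB of heap for 100M ratings); reading "4GB" as 4 GiB and scaling linearly to 500M ratings are assumptions made for illustration, not measurements.

```python
GIB = 1024 ** 3


def implied_bytes_per_rating(heap_gib, ratings):
    """Bytes of heap per rating implied by a heap size and a rating count."""
    return heap_gib * GIB / ratings


def estimated_heap_gib(ratings, bytes_per_rating):
    """Heap estimate in GiB, assuming memory grows linearly with ratings."""
    return ratings * bytes_per_rating / GIB


# 4 GiB for 100M ratings works out to roughly 43 bytes per rating.
per_rating = implied_bytes_per_rating(4, 100_000_000)

# Under the linear-scaling assumption, 500M ratings would need about 20 GiB.
heap_500m = estimated_heap_gib(500_000_000, per_rating)
```

The estimate ignores JVM overhead and any down-sampling, so it is best read as an order-of-magnitude check on why 500M ratings is out of reach for a single in-memory model on the hardware discussed here.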
-
RE: Mahout beginner questions...Razon, Oren 2012-03-25, 19:28
Correct me if I'm wrong, but a good way to boost speed could be to use a caching recommender, meaning computing the recommendations in advance (refreshing them every X minutes or hours) and always serving the most recently computed recommendations, right?
-
Re: Mahout beginner questions...Ted Dunning 2012-03-25, 19:35
Not really. See my previous posting.

The best way to get fast recommendations is to use an item-based recommender. Pre-computing recommendations for all users is not usually a win because you wind up doing a lot of wasted work and you still don't have anything for new users who appear between refreshes. If you build up a service to handle the new users, you might as well just serve all users from that service so that you get up-to-date recommendations for everyone.

There IS a large off-line computation. But that doesn't produce recommendations for USERs. It typically produces recommendations for ITEMs. Then those item-item recommendations are combined to produce recommendations for users.
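[Editor's sketch] The split described above, an offline job that produces item-to-item lists and a serving step that combines them for a user, can be illustrated as follows. This is a toy stand-in, not Mahout's implementation: plain co-occurrence counts are used as the item-item score, and the function names and parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_item_neighbors(user_items, top_n=10):
    """Offline step: count item co-occurrences, keep top_n neighbors per item.

    user_items maps a user to the items they interacted with. The result
    maps item -> [(other_item, count), ...] and contains no user IDs.
    """
    cooc = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in combinations(sorted(set(items)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return {
        item: sorted(nbrs.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for item, nbrs in cooc.items()
    }


def recommend(history, neighbors, how_many=3):
    """Online step: merge the neighbor lists of the items in a user's history."""
    seen = set(history)
    scores = defaultdict(int)
    for item in seen:
        for other, weight in neighbors.get(item, []):
            if other not in seen:
                scores[other] += weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]
```

Because the serving step needs only the requesting user's recent history, a user who appeared after the last offline run can still be served as soon as they have interacted with a single known item.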
-
RE: Mahout beginner questions...Razon, Oren 2012-03-25, 22:36
OK, that was a good clarification, which leads me to new questions :)

The system I need should of course deliver the recommendations themselves in no time. And as Sean said, it needs some real-time components so that recommendations can change after the user interacts with the application. But because I'm talking about very large scale, I guess I want to push much of my model computation offline (refreshed every X minutes).

So my options are as follows (considering I want to build a truly scalable solution):

Use the non-distributed or distributed code to compute part of my model in advance (for example, similarity between items, or KNN for each user). I guess that for this part, since it runs offline, the MapReduce code is ideal because of its scalability.

Then use non-distributed online code to calculate the final recommendations from the precomputed part, with some final computation (weighting the KNN ratings for items my user hasn't experienced yet).

To be able to do so, I will probably need a machine with enough memory to hold all the calculations in memory. I can even go further and prepare a cached recommender that is refreshed whenever I want my recommendations to be updated.

Am I right here?

I know the "glue" between the two parts is not quite there (as Sean said), but my question is: how much does the current framework support this kind of architecture? Meaning, what kinds of computation can I really prepare in advance before continuing to the final computation? Besides the co-occurrence matrix and matrix factorization, what other computations are available to me in MapReduce form? Does it mean I will have two separate environments in that case, a Hadoop cluster for the offline computation and an online machine that uses the distributed output to produce the final recommendations (which means I need to move data between machines, which is not so ideal...)?

Also, as I mentioned earlier, I might need to store my data on a SQL machine. If so, what drivers are currently supported? I saw only JDBC & PostgreSQL; are there any others? As you said in the book, using a SQL machine will probably slow things down because of the data movement through the drivers... Could you estimate how much slower it is compared to using a file? Again, I might do the reading from the DB offline, so I'm not too worried about losing some speed...
-
Re: Mahout beginner questions...Ted Dunning 2012-03-25, 22:55
On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:
> ...
> The system I need should of course give the recommendation itself in no
> time.
> ...

> But because I'm talking about very large scales, I guess that I want to
> push much of my model computation to offline mode (which will be refreshed
> every X minutes).

Actually, you aren't talking about all that large a scale. At Veoh, we built our models from several billion interactions on a tiny cluster.

> So my options are like that (considering I want to build a real scalable
> solution):
> Use the non-distributed \ distributed code to compute some of my model in
> advance (for example similarity between items \ KNN for each users) --> I
> guess that for that part, considering I'm offline, the mapreduce code is
> idle, because of his scalability.

Repeating what I said earlier, the offline part produces item-item information only. It does not produce KNN data for any users. There is no reference to a user in the result.

> Than use a non-distributed online code to calculate the final
> recommendations based on the pre computed part and do some final
> computation (weighting the KNN ratings for items my user didn't experienced
> yet)

All that happens here is that item => item* lists are combined.

> In order to be able to do so, I will probably need a machine that have
> high memory capacity to contain all the calculations inside the memory.

Not really.

> I can even go further and prepare a cached recommender that will be
> refreshed whenever I really want my recommendations to be updated.

This is correct.

> ...
> I know the "glue" between the 2 parts is not quite there (as Sean said),
> but my question is, how much does the current framework support this kind
> of architecture?

Yes.

> Meaning what kind of actions can I really prepare in advance before
> continuing to the final computation? If so, beside of co-occurrence matrix
> and matrix factorization what other computations are available to me to do
> in a mapreduce manner? Does it mean I will have 2 separate machines for
> that case, one as an Hadoop cluster for the offline computation and an
> online one that will use the distributed output to do final recommendations
> (but then it mean I need to move data between machines, which is not so
> idle...)?

Yes. You will need off-line and on-line machines if you want to have serious guarantees about response times. And yes, you will need to do some copying if you use standard Hadoop. If you use MapR's version of Hadoop, you can serve data directly out of the cluster with no copying because you can access files via NFS.

> Also, as I mentioned earlier I might need to store my data in a SQL
> machine. If so, what drivers are currently supported? I saw only JDBC &
> PostgreSQL, is there anyone else?

You don't need to store your data ONLY on an SQL machine and storing logs in SQL is generally a bad mistake.

> As you said in the book, using a SQL machine will probably slow things
> down because of the data movement using the drivers... Could you estimate
> how much slower is it comparing to using a file?

100x, roughly. SQL is generally not usable as the source for parallel computations.
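To make the "item => item* lists are combined" remark concrete, here is a minimal sketch of what the online step might look like. It is only an illustration, not Mahout code: the function name `recommend`, the `similar_items` dictionary layout, and the choice to sum similarities as the score are all assumptions made for this example.

```python
from collections import defaultdict


def recommend(user_items, similar_items, top_n=3):
    """Combine precomputed item => item* lists for one user's items.

    user_items:    set of item IDs the user has already interacted with
    similar_items: dict mapping item ID -> list of (neighbor ID, similarity),
                   as produced by some offline job (hypothetical layout)
    """
    scores = defaultdict(float)
    for item in user_items:
        for neighbor, similarity in similar_items.get(item, []):
            # Skip items the user already has; only new items are candidates.
            if neighbor not in user_items:
                scores[neighbor] += similarity
    # Highest combined score first; ties broken by item ID for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


similar_items = {
    "A": [("B", 0.9), ("C", 0.4)],
    "B": [("A", 0.9), ("D", 0.7)],
    "C": [("A", 0.4), ("D", 0.5)],
}
print(recommend({"A", "C"}, similar_items))
```

Note that no user appears in `similar_items`; the user's own history is only needed at request time, which is why the online component can stay small.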
RE: Mahout beginner questions...Razon, Oren 2012-03-25, 23:02
Thanks Ted,
So let's continue with your example... I will compute the item-to-item (I2I) similarity matrix on Hadoop and then do online recommendation based on it and on the items the user has rated. So where should the online part sit? Is it a good design to implement it on the same machine that Hadoop runs on (the name node, for example)? Or do you suggest building 2 different applications on 2 different machines (one of them being the cluster) and transferring the data between them?
---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
Re: Mahout beginner questions...Ted Dunning 2012-03-25, 23:16
On Sun, Mar 25, 2012 at 4:02 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:
> So let's continue with your example... I will do I 2 I similarity matrix
> on Hadoop and then will do online recommendation based on it and the user
> ranked items.

Yes.

> So where does the online part will sit at? Is it a good design to
> implement it on the same machine that Hadoop run on (name node for
> example)? Or you suggest to build 2 different applications on 2 different
> machines (one of them is the cluster) and transfer the data between them?

I recommend that you separate the off-line computation away from the on-line component. The reason is that the off-line computation can put a severe strain on the resources of the machines it runs on. You can isolate this load somewhat, but it is better to simply use different machines unless you are really absolutely desperate for hardware. Even then, it is probably more cost effective to drive your off-line resources as hard as possible and simply use a relatively small machine for the on-line component.
RE: Mahout beginner questions...Razon, Oren 2012-03-26, 08:27
By saying "At Veoh, we built our models from several billion interactions on a tiny cluster", did you mean that you used the distributed code on your cluster as an online recommender?
From what I've understood so far, I can't rely only on the Hadoop part if I want a truly real-time recommender that modifies its recommendations and models on every user click (because you would need to rebuild the data in HDFS, run your batch job, and return an answer).
Re: Mahout beginner questions...Sean Owen 2012-03-26, 09:47
I'm sure he's referring to the off-line model-building bit, not an online component.
RE: Mahout beginner questions...Razon, Oren 2012-03-26, 10:05
That said, my conclusion so far (sorry if I'm a bit slow here :)) is that I need both parts (offline and online) in place if I plan to have a truly scalable system that can do some of the recommendation calculations in real time in order to interact with the user dynamically.

But I'm still not quite sure I've understood how I can scale with that... The more computation I push offline, the less I need to worry about retrieval time. From that perspective I can scale.

But I'm still not sure how it helps me scale from a memory perspective... Even if I compute all similarities in advance, I still need to load the entire similarity result file into memory so that the online part can do its calculation. Maybe I'm wrong here, and I don't necessarily need to load the entire intermediate file (the similarity results) into memory?!
Re: Mahout beginner questions...Sean Owen 2012-03-26, 10:17
Yes, my position is that you need at least these two layers in the end.
To get straight to your point, no you don't have to load all item-item pairs in memory, necessarily. At one extreme, if you completely pre-computed recommendations and didn't calculate anything in real-time, you wouldn't need any of that in memory. Even if you did load it in memory, you could sample, and only retain the tiny fraction of similarities that are significant. The more you down-sample, the less accurate the results become of course.

One point I'm driving at is that in this hybrid model, where you periodically recompute "best" results based on all data, off-line, you can get away with much more approximate updates in real-time. A new datum ought to have some effect, and some roughly correct effect, but it's not such a big deal if it's not perfect, since the right-er answer is coming soon anyway and will overwrite.

And of course the properties of an item-item similarity-based approach aren't necessarily those of other approaches. For example with matrix-factorization approaches there is a much more well-defined (and faster) way to fold in new data. And the data that must live in memory is also bounded and relatively smaller.
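As a rough illustration of retaining only the significant similarities, the sketch below keeps the top `k` neighbors per item and drops anything under a threshold. The function name `prune_similarities`, the parameters `k` and `min_similarity`, and the dictionary layout are assumptions for this example, not part of Mahout; a real system might sample differently.

```python
import heapq


def prune_similarities(similar_items, k=2, min_similarity=0.1):
    """Keep at most k neighbors per item, dropping weak similarities.

    similar_items: dict mapping item ID -> list of (neighbor ID, similarity)
    """
    pruned = {}
    for item, neighbors in similar_items.items():
        # First discard pairs that are too weak to matter.
        significant = [(n, s) for n, s in neighbors if s >= min_similarity]
        # Then keep only the k strongest remaining neighbors.
        pruned[item] = heapq.nlargest(k, significant, key=lambda pair: pair[1])
    return pruned


full = {"A": [("B", 0.9), ("C", 0.05), ("D", 0.4), ("E", 0.6)]}
print(prune_similarities(full))
```

With a fixed `k`, memory on the online machine grows linearly with the number of items rather than with the number of item pairs, at the cost of some accuracy.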
Re: Mahout beginner questions...Ted Dunning 2012-03-26, 13:52
No. I meant that I used the same sort of combined offline and online processes that I have recommended to you. The cluster did the offline part and a web tier did the online part.
RE: Mahout beginner questions...Razon, Oren 2012-03-26, 15:42
Another question that crossed my mind.
Considering everything you've said so far, I'm not quite sure when I would want to use a SQL machine as my data source at all.
Response perspective --> You said it will take much longer than reading from a file.
Memory perspective --> In the end you need to move the data from the DB into memory anyway.

So what are the pros of doing so? When should I consider it?
Re: Mahout beginner questions...Sean Owen 2012-03-26, 15:54
An SQL database doesn't have much role to play in this kind of system, and that's no criticism of RDBMSes. The algorithms operate on very simple, nearly unstructured data and are essentially read-only. So the complexity of keys and transactions is just overhead.

The simple, non-distributed implementations need a huge amount of random access to data. Even lean fast NoSQL stores aren't really suitable; these are just going to be in-memory problems. If you're just going to read into memory, well, it's certainly possible and simple to read that out of an RDBMS. But it might as well come from a file; there's no advantage to having bothered to put it in a table.

(Of course at tiny scale, a DB can keep up fine. 100K data points? no problem. That's why things like MySQLJDBCDataModel even exist I suppose.)

Once you go to the trouble of parallelizing the algorithm, and breaking it up so that every computation doesn't touch so much data (and, this is often the hard, clever part) you can split it up using MapReduce / Hadoop and those tiny workers can meaningfully crunch through parts of the problem. There too, they are simple beasts and have a simple sequential read-only input model. You could make them too read out of an RDBMS, but at *best*, it's overkill; it might as well have come from a dumber store like HDFS. At *worst* it will still fall over when 1000 workers try to pull (unrelated) data out of the same table and overwhelm the RDBMS machine, when the whole point of parallelizing it was to be able to read in parallel chunks of unrelated data from many storage servers -- a la HDFS.
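To show how little structure the input actually needs, here is a small sketch that reads preference data from a flat file straight into memory. The `user,item,value` CSV layout, the function name `load_preferences`, and the nested-dictionary result are assumptions chosen for this example; an in-memory string stands in for the file so the snippet runs on its own.

```python
import csv
import io
from collections import defaultdict


def load_preferences(source):
    """Read 'user,item,value' rows sequentially into a nested dict.

    source: any iterable of text lines, e.g. an open file object
    """
    prefs = defaultdict(dict)
    for row in csv.reader(source):
        if not row:
            continue  # tolerate blank lines
        user_id, item_id, value = row[0], row[1], float(row[2])
        prefs[user_id][item_id] = value
    return dict(prefs)


# A StringIO stands in for a real file here; open("prefs.csv") would work the same way.
sample = io.StringIO("u1,A,5.0\nu1,B,3.0\nu2,A,4.0\n")
prefs = load_preferences(sample)
print(prefs["u1"])
```

The read is a single sequential pass with no keys, joins, or transactions, which is the access pattern described above for both the in-memory and the Hadoop case.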
Re: Mahout beginner questions...Sean Owen 2012-03-25, 23:21
On Sun, Mar 25, 2012 at 11:36 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:
> In order to be able to do so, I will probably need a machine that have
> high memory capacity to contain all the calculations inside the memory.
> I can even go further and prepare a cached recommender that will be
> refreshed whenever I really want my recommendations to be updated.
> Am I right here?

Maybe -- the memory requirements are lower than if one machine is doing everything but yes I generally agree that the front-end often has to keep a load of stuff in memory to do what it does quickly.

> I know the "glue" between the 2 parts is not quite there (as Sean said),
> but my question is, how much does the current framework support this kind
> of architecture? Meaning what kind of actions can I really prepare in
> advance before continuing to the final computation? If so, beside of
> co-occurrence matrix and matrix factorization what other computations are
> available to me to do in a mapreduce manner? Does it mean I will have 2
> separate machines for that case, one as an Hadoop cluster for the offline
> computation and an online one that will use the distributed output to do
> final recommendations (but then it mean I need to move data between
> machines, which is not so idle...)?

Item similarity (based on co-occurrence or otherwise) and matrix-factorization stuff is more or less exactly what's available.

It's easy to integrate the output of the distributed item-item similarity computation. That plugs right in to the non-distributed item-based recommender. Well, you have to write some code to read the result off HDFS and construct some objects. And you probably have to do some pruning. Etc.

It's the last 20%, the wiring and mortar that isn't necessarily handed to you. That's kind of open-ended since how it's glued together is something you may need or want to control.

Your Hadoop cluster is definitely not the same sort of beast as a front-end server. Logically, quite different, and in practice almost surely separate machines. I suppose you could run both on one machine for testing or experiments.

> Also, as I mentioned earlier I might need to store my data in a SQL
> machine. If so, what drivers are currently supported? I saw only JDBC &
> PostgreSQL, is there anyone else?
> As you said in the book, using a SQL machine will probably slow things
> down because of the data movement using the drivers... Could you estimate
> how much slower is it comparing to using a file? Again I might do the
> reading from the DB offline so I'm not too afraid from losing some of my
> speed...

For you, your question is what can be used as an input to Hadoop. I think there are InputFormats for generic SQL databases, yes, but that's a question for Hadoop not Mahout.

A SQL database is not the best place to store and read your input for Hadoop. It's overkill. HDFS is the right sort of place to have this data.

There is no question of reading from a DB "online" -- it's way too slow. The 'drivers' you see are for reading info from a DB into memory mostly. And they are for non-distributed stuff. It's such simple SQL that I think it will work on just about any DB, with perhaps a tiny tweak here or there.
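The "read the result and construct some objects" step could be sketched roughly as follows. This assumes, purely for illustration, that the offline job's output has already been copied off HDFS as text lines of the form `itemA<TAB>itemB<TAB>similarity` with each pair listed once; the actual output format should be checked against the job you run. The function name `load_similarities` is hypothetical.

```python
from collections import defaultdict


def load_similarities(lines):
    """Build an item -> [(neighbor, similarity)] map from tab-separated lines.

    Assumes each unordered pair appears once, so both directions are stored.
    """
    similar_items = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item_a, item_b, value = line.split("\t")
        similarity = float(value)
        similar_items[item_a].append((item_b, similarity))
        similar_items[item_b].append((item_a, similarity))
    return dict(similar_items)


output_lines = ["1\t2\t0.8", "1\t3\t0.3", ""]
print(load_similarities(output_lines))
```

Any pruning would naturally happen right after this load, before the map is handed to the online recommender.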
Sean Owen 2012-03-25, 23:21
-
Re: Mahout beginner questions...Sean Owen 2012-03-25, 19:41
Caching recommendations is a good use of memory, sure. It doesn't decrease memory requirements and doesn't speed up the initial recommendation though.

Yes, pre-computing recommendations is also possible. This is more or less what the Hadoop-based implementation is for. That scales just fine but is not real-time. You're waiting X minutes/hours to see any reaction to new data a user inputs. For some contexts, that's fine. For many it's not; I expect my recs to change every time I rate a book on Amazon. That's almost the fun of it.

Ted is right that you may more commonly pre-compute some big piece of the puzzle like item-item similarities or a matrix factorization. Then you can finish the rec computation quite quickly, and it can respond to new data straight away (at least approximately).

This is the sort of setup I was alluding to earlier, and what a 'real' and complete, scalable system resembles. It's more complex. This does not exist per se in the project. The pieces are there, in fact 80% of it I'd say, but the stitching together is still mostly up to the developer.

On Sun, Mar 25, 2012 at 8:28 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:

> Correct me if I'm wrong but a good way to boost up speed could be to use
> caching recommender, meaning computing the recommendations in advance
> (refresh it every X min\hours) and always recommend using the most updated
> recommendations, right?!
>
> -----Original Message-----
> From: Sean Owen [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, March 25, 2012 21:25
> To: [EMAIL PROTECTED]
> Subject: Re: Mahout beginner questions...
>
> It is memory. You will need a pretty large heap to put 100M data in memory
> -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
> You can go bigger if you have more memory but that size seems about the
> biggest to reasonably assume people have.
>
> Of course more data slows things down and past about 10M data points you
> need to tune things to sample data rather than try every possibility. This
> is most of what CandidateItemStrategy has to do with. It is relatively easy
> to tune this though so speed doesn't have to be an issue.
>
> Again you can go bigger and tune it to down-sample more; somehow I still
> believe that 100M is a crude but useful rule of thumb, as to the point
> beyond which it's just hard to get good speed and quality.
>
> Sean
>
> On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:
>
> > Thanks for the detailed answer Sean.
> > I want to understand more clearly the non-distributed code limitations.
> > I saw that you advise that for more than 100,000,000 ratings the
> > non-distributed engine won't do the job.
> > The question is why? Is it memory issue (and then if I will have a bigger
> > machine, meaning I could scale up), or is it because of the recommendation
> > time it takes?
>
> ---------------------------------------------------------------------
> Intel Electronics Ltd.
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
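The point about caching (it helps repeat requests, not the first one, and should give way when a user adds new data) can be illustrated with a small sketch. Mahout ships its own `CachingRecommender`; the class below is a hypothetical plain-Python stand-in written only for illustration, and `slow_compute` is an invented placeholder for the real recommendation computation.

```python
class CachingRecommender:
    """Hypothetical wrapper: remembers results per user until new data arrives."""

    def __init__(self, compute):
        self._compute = compute   # the expensive recommendation function
        self._cache = {}          # user_id -> cached recommendation list

    def recommend(self, user_id):
        # The first request for a user pays the full cost; repeats are free.
        if user_id not in self._cache:
            self._cache[user_id] = self._compute(user_id)
        return self._cache[user_id]

    def on_new_data(self, user_id):
        # Drop the stale entry so the next request reflects the new rating.
        self._cache.pop(user_id, None)


calls = []


def slow_compute(user_id):
    # Placeholder for the real computation; records each invocation.
    calls.append(user_id)
    return ["item-%d" % (user_id + n) for n in range(3)]


rec = CachingRecommender(slow_compute)
first = rec.recommend(7)    # computed
second = rec.recommend(7)   # served from the cache
rec.on_new_data(7)          # the user rated something new
third = rec.recommend(7)    # computed again
print(len(calls))
```

Note that the cache adds memory on top of whatever the underlying recommender needs, which matches the remark that caching does not reduce memory requirements.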
-
Re: Mahout beginner questions...Ted Dunning 2012-03-25, 19:32
It sounds like the original poster isn't clear about the division between off-line and on-line work.

Almost all production recommendation systems have a large off-line component which analyzes logs of behavior and produces a recommendation model. This model typically consists of item-item relationships stored in a form that is usable by the on-line component of the system. This part is preparation for recommendation, but is not itself recommendation.

This off-line component can run sequentially or in parallel using map-reduce. In my experience, with decent down-sampling of excessively active users and excessively popular items, it isn't unreasonable to reach 100M non-zeros in the user x item history in the off-line component.

The actual recommendations are produced using the on-line component. This component reads in the recommendation model, possibly all at once, possibly on demand and possibly as the model is changed. The model may be read from a database or from flat files or many other sources.

To make a recommendation, a user history or user id is presented to the recommendation system. If an id is presented, it is presumed that the history is available somewhere or that the recommendations have been pre-computed for that user. In any case, the history is combined with the recommendation model to produce a recommendation list for the user of the moment.
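The off-line/on-line split described above can be sketched in a few lines. This is a toy under stated assumptions, not anyone's production design: the model is a plain co-occurrence count, the down-sampling of very active users is reduced to a deterministic cap on history length (real systems would typically sample), and `build_model`, `recommend`, and the `logs` data are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_model(histories, max_items_per_user=100):
    """Off-line step: count item-item co-occurrences across user histories."""
    model = defaultdict(Counter)
    for items in histories.values():
        # Crude stand-in for down-sampling very active users: cap the history.
        capped = sorted(set(items))[:max_items_per_user]
        for a, b in combinations(capped, 2):
            model[a][b] += 1
            model[b][a] += 1
    return model


def recommend(model, history, how_many=3):
    """On-line step: combine one user's history with the prepared model."""
    scores = Counter()
    for item in history:
        for other, count in model.get(item, {}).items():
            if other not in history:
                scores[other] += count
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]


# Hypothetical behaviour logs: user -> items seen.
logs = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c", "d"],
}
model = build_model(logs)          # would run in batch, e.g. on Hadoop
print(recommend(model, ["a"]))     # cheap per-request work
```

Only `build_model` touches the full logs; `recommend` needs just the model and one user's history, which is why a brand-new user can be served as soon as some history for them exists.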
-
RE: Mahout beginner questions...Razon, Oren 2012-04-05, 07:27
OK, so here is the point I'm still not getting.

The architecture we are talking about pushes the heavy computation to offline work, for which I could use the Hadoop part. Besides that, there is an online part, which will retrieve the recommendations from the pre-computed results, or even do some more computation online to adjust the recommendations to the current user context.

But as you said for the JDBC connector, in order to serve recommendations fast, the online recommender needs to have all pre-computed results in memory. So isn't that a limit on scaling up? It means that as my recommender service grows, I will need more and more memory to hold it all in memory in the online part...

Am I wrong here?

-----Original Message-----
From: Sean Owen [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 22, 2012 17:57
To: [EMAIL PROTECTED]
Subject: Re: Mahout beginner questions...

A distributed and non-distributed recommender are really quite separate. They perform the same task in quite different ways. I don't think you would mix them per se.

Depends on what you mean by a model-based recommender... I would call the matrix-factorization-based and clustering-based approaches "model-based" in the sense that they assume the existence of some underlying structure and discover it. There's no Bayesian-style approaches in the code.

They scale in different ways; I am not sure they are unilaterally a solution to scale, no. I do agree in general that these have good scaling properties for real-world use cases, like the matrix-factorization approaches.

A "real" scalable architecture would have a real-time component and a big distributed computation component. Mahout has elements of both and can be the basis for piecing that together, but it's not a question of strapping together the distributed and non-distributed implementation. It's a bit harder than that.

I am actually quite close to being ready to show off something in this area -- I have been working separately on a more complete rec system that has both the real-time element but integrated directly with a distributed element to handle the large-scale computation. I think this is typical of big data architectures. You have (at least) a real-time distributed "Serving Layer" and a big distributed batch "Computation Layer". More on this in about... 2 weeks.

On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:

> Hi Sean,
> Thanks for your fast response, I really appreciate the quality of your book ("Mahout in action"), and the support you give in such forums.
> Just to clear my second question...
> I want to build a recommender framework that will support different use cases. So my intention is to have both distributed and non-distributed solution in one framework, the question is, is it a good design to put them both in the same machine (one of the machines in the Hadoop cluster)?
>
> BTW... another question, it seem that a good solution to the recommender scalability will be to use model based recommenders.
> Saying this, I wonder why there is such few model based recommenders, especially considering the fact that Mahout contain several data mining models implemented already?
>
> -----Original Message-----
> From: Sean Owen [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, March 22, 2012 13:51
> To: [EMAIL PROTECTED]
> Subject: Re: Mahout beginner questions...
>
> 1. These are the JDBC-related classes. For example see
> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
>
> 2. The distributed and non-distributed code are quite separate. At
> this scale I don't think you can use the non-distributed code to a
> meaningful degree. For example you could pre-compute item-item
> similarities over this data and use a non-distributed item-based
> recommender but you probably have enough items that this will strain
> memory. You would probably be looking at pre-computing recommendations
> in batch.
>
> 3. I don't think Netezza will help much here. It's still not fast
-
Re: Mahout beginner questions...Sebastian Schelter 2012-04-05, 07:34
Hi Oren,

If you use an item-based approach, it's sufficient to use the top-k similar items per item (with k somewhere between 25 and 100). That means the data to hold in memory is num_items * k data points.

While this is a theoretical limitation, it should not be a problem in practical scenarios, as you can easily fit a few hundred million of those data points in a few gigabytes of RAM.

--sebastian

On 05.04.2012 09:27, Razon, Oren wrote:

> Ok, so here is the point I still not getting.
> [...]
> Am I wrong here?
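The num_items * k estimate from the reply above is easy to turn into a back-of-the-envelope calculation. The 12 bytes per entry used here is an assumption for illustration (an 8-byte item ID plus a 4-byte similarity value, ignoring any per-object overhead), and `similarity_model_bytes` is a hypothetical helper.

```python
def similarity_model_bytes(num_items, k, bytes_per_entry=12):
    """Rough size of a top-k item similarity model.

    bytes_per_entry=12 is an assumed figure: an 8-byte item ID plus a
    4-byte similarity value, with no allowance for container overhead.
    """
    return num_items * k * bytes_per_entry


for k in (25, 100):
    size = similarity_model_bytes(1_000_000, k)
    print("k=%d: %.1f GB" % (k, size / 1e9))
```

Under those assumed sizes, one million items with k = 100 is 100 million entries and about 1.2 GB, which is consistent with the "few gigabytes" figure in the message. The key property is that the size grows with the number of items, not with the number of users or ratings.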
-
RE: Mahout beginner questions...Razon, Oren 2012-04-05, 07:44
Thanks for the answer, but still...

I will need to keep the rating matrix in memory so that I can combine the ratings a user gave to items with the item similarities.

-----Original Message-----
From: Sebastian Schelter [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 05, 2012 10:34
To: [EMAIL PROTECTED]
Subject: Re: Mahout beginner questions...

[...]
-
Re: Mahout beginner questions...Sebastian Schelter 2012-04-05, 07:47
You don't have to hold the rating matrix in memory. When computing recommendations for a user, fetch all of that user's ratings from some datastore (database, key-value store, memcache...) with a single query and use the item similarities that are held in memory to compute the recommendations.

--sebastian

On 05.04.2012 09:44, Razon, Oren wrote:

> Thanks for the answer, but still...
> I will need to keep in memory the rating matrix so I will be able to utilize the ranking a user gave to items together with the item similarity.
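The approach in the reply above (one datastore lookup per request for the user's ratings, combined with item similarities kept in memory) can be sketched as a similarity-weighted average. This is a generic item-based estimate written for illustration, not Mahout's implementation; `fetch_ratings`, `ratings_store`, and the sample data are hypothetical stand-ins for a real database or key-value store.

```python
def recommend_for_user(user_ratings, similar_items, how_many=2):
    """Estimate ratings for unseen items as a similarity-weighted average.

    user_ratings:  {item: rating}, fetched per request from a datastore.
    similar_items: {item: [(other_item, similarity), ...]}, held in memory.
    """
    weighted = {}
    weights = {}
    for item, rating in user_ratings.items():
        for other, sim in similar_items.get(item, []):
            if other in user_ratings:
                continue  # never recommend something already rated
            weighted[other] = weighted.get(other, 0.0) + sim * rating
            weights[other] = weights.get(other, 0.0) + sim
    estimates = {i: weighted[i] / weights[i] for i in weighted if weights[i] > 0}
    # Best estimate first; ties broken by item name for a stable order.
    ranked = sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:how_many]


# In-memory model: top-k similar items per item (hypothetical values).
similar_items = {
    "a": [("c", 0.5), ("d", 0.5)],
    "b": [("c", 0.5)],
}

# Stand-in for the external datastore; only one user's row is read per request.
ratings_store = {"u1": {"a": 4.0, "b": 2.0}}


def fetch_ratings(user_id):
    return ratings_store[user_id]


result = recommend_for_user(fetch_ratings("u1"), similar_items)
print(result)
```

The memory held by the serving process is just `similar_items`; the full rating matrix stays in the datastore and is read one user at a time.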
-
Re: Mahout beginner questions...Sean Owen 2012-04-05, 07:57
It might or might not be interesting to comment on this discussion in light of the new product/project I mentioned last night, Myrrix. It's definitely an example of precisely this two-layered architecture we've been discussing on this thread.

http://myrrix.com/design/

The nice thing about a matrix-factorization-based approach is that it's feasible to load this entire 'model' into memory -- the two factored matrices. Everything can be done from these: recommendation, most-similar, estimates, even fast approximate updates to the model for new data. Being able to work in memory keeps it fast and simple.

If even those get too big for memory, you can shard across servers, by user ID (and include only part of the user-feature matrix on each). Sharding the item-feature matrix gets hard.

Sean
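To make the "two factored matrices in memory" idea concrete, here is a small sketch in plain Python. It is not Myrrix or Mahout code: the feature values are made up, the modulo-based `shard_for` function is just one possible way to shard by user ID, and each shard is assumed to hold the full item-feature matrix alongside its slice of the user-feature matrix.

```python
def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_vec, item_features, seen, how_many=2):
    """Score every unseen item against the user's feature vector."""
    scores = [
        (dot(user_vec, vec), item)
        for item, vec in item_features.items()
        if item not in seen
    ]
    # Highest score first; ties broken by item name for a stable order.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [item for _, item in scores[:how_many]]


def shard_for(user_id, num_shards):
    """Hypothetical sharding rule: assign users to servers by ID modulo."""
    return user_id % num_shards


# Made-up factor matrices with two latent features.
user_features = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
item_features = {"x": [2.0, 0.0], "y": [0.0, 3.0], "z": [1.0, 1.0]}

# Split only the user-feature matrix; every shard keeps all item features.
num_shards = 2
shards = [dict() for _ in range(num_shards)]
for uid, vec in user_features.items():
    shards[shard_for(uid, num_shards)][uid] = vec

uid = 3
user_vec = shards[shard_for(uid, num_shards)][uid]
top = recommend(user_vec, item_features, seen={"z"})
print(top)
```

A request for a given user only needs that user's row plus the item-feature matrix, which is why splitting users across servers is straightforward while splitting items would force every request to touch several servers.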