-
Persisting trained models in Mahout
Vinod 2011-12-08, 12:07
Hi,

This is my first day of experimentation with Mahout. I am following the book "Mahout in Action", and from the sample code it seems that models, e.g. a recommender, need to be trained at the start of the program (on every start or restart). The Recommender interface extends Refreshable, which doesn't extend Serializable. So I am wondering whether Mahout provides an alternate mechanism to persist trained models (a recommender instance in this case).

Apologies if this is a very silly question.

Thanks & regards,
Vinod
-
Re: Persisting trained models in Mahout
Sean Owen 2011-12-08, 12:13
The classes aren't Serializable, no. In the case of DataModel, it's assumed that you already have a persisted model somewhere, in a DB or a file, so serializing it would be redundant.
-
Re: Persisting trained models in Mahout
Vinod 2011-12-08, 12:27
Hi Sean,

Thanks for the quick response.

By model, I am not referring to the data model but to a "trained" recommender instance.

Weka, for example, has the ability to save and load models:
http://weka.wikispaces.com/Serialization
http://weka.wikispaces.com/Saving+and+loading+models

This avoids the need to train the model (recommender) every time a server is bounced or a program is restarted.

regards,
Vinod
-
Re: Persisting trained models in Mahout
Sean Owen 2011-12-08, 12:30
Ah right. No, there's still no provision for this. You would have to serialize it yourself if you like.

Most of the implementations don't have a great deal of startup overhead, so they don't really need this. The exception is perhaps slope-one, but there you can actually save and supply pre-computed diffs.

Still, it would be valid to store and re-supply user-user similarities or something similar. You can do this manually by querying for user-user similarities, saving them, then loading them and supplying them via GenericUserSimilarity, for instance.
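A minimal sketch of the save-and-reload step described in this reply, written in plain Java with no Mahout dependency. The class name SimilarityStore, the "userA,userB,similarity" file layout and the string key are assumptions made for illustration. Handing the reloaded values to GenericUserSimilarity is deliberately left out, since its exact constructor should be checked against the Mahout Javadoc.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SimilarityStore {

    /** Builds an order-independent key for a pair of user IDs (hypothetical layout). */
    static String key(long a, long b) {
        return Math.min(a, b) + "," + Math.max(a, b);
    }

    /** Writes one "userA,userB,similarity" line per stored pair. */
    static void save(Map<String, Double> sims, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Double> e : sims.entrySet()) {
            lines.add(e.getKey() + "," + e.getValue());
        }
        Files.write(file, lines);
    }

    /** Reads the file back into the same map layout. */
    static Map<String, Double> load(Path file) throws IOException {
        Map<String, Double> sims = new LinkedHashMap<>();
        for (String line : Files.readAllLines(file)) {
            String[] p = line.split(",");
            sims.put(key(Long.parseLong(p[0]), Long.parseLong(p[1])),
                     Double.parseDouble(p[2]));
        }
        return sims;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Double> sims = new LinkedHashMap<>();
        sims.put(key(1, 2), 0.94);
        sims.put(key(4, 1), -0.5);

        Path file = Files.createTempFile("user-sims", ".csv");
        save(sims, file);
        // The reloaded map should equal the one that was written.
        System.out.println(load(file).equals(sims));
        Files.delete(file);
    }
}
```

A text file is used here rather than Java serialization so that the saved similarities stay readable and independent of class versions.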
-
Re: Persisting trained models in Mahout
Vinod 2011-12-08, 13:02
Hi Sean,

Neither Recommender nor any of its parent interfaces extends Serializable, so there is no way that I'd be able to serialize it.

I agree that the implementations may not have startup overhead. However, training a model on millions of rows is a CPU-, memory- and time-consuming activity. For example, when the data set is changed from 100K to 1M in chapter 4, the program crashes with an OutOfMemory error after a significant amount of time.

I feel that training should be done in development only. Once a developer is OK with the test results, he should be able to save an instance of the trained and tested model (e.g. a recommender or classifier).

Only these saved instances of trained and tested models should be deployed to production.

Thoughts?

regards,
Vinod
-
Re: Persisting trained models in Mahout
Sean Owen 2011-12-08, 13:19
Yes, I mean you need to write it and read it in your own code.

What do you mean by training a model? Computing similarities? I don't know if there's such a thing here as "training" on one data set and running on another. The implementations always use all currently available info. Is this a cold-start issue?

The OutOfMemoryError has nothing to do with this; on such a small data set it indicates you didn't set your JVM heap size above the default.
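To check the heap-size point, a small sketch that prints the maximum heap the running JVM is allowed to use. The class name HeapCheck is made up for illustration; Runtime.maxMemory() and the -Xmx launcher option are standard.

```java
public class HeapCheck {

    /** Returns the maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // The value depends on the machine and on any -Xmx setting.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it once as `java HeapCheck` and once as `java -Xmx2g HeapCheck` shows whether the larger limit is actually in effect for the program that loads the data set.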
-
Re: Persisting trained models in Mahout
Vinod 2011-12-08, 13:46
I'll use the first example from Chapter 2 of your book to clarify what I mean by training.

The following code trains the recommender:

    DataModel model = new FileDataModel(new File("intro.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(
        model, neighborhood, similarity);

At this point, the recommender is trained on the preferences of users 1 to 5 in intro.csv.

We should now be able to serialize this recommender instance into a file, say "Movie Recommender.model", using the steps mentioned here (http://java.sun.com/developer/technicalArticles/Programming/serialization/).

All we need to do now is deploy "Movie Recommender.model" to production.

If I understand the behavior correctly, this model should now be able to predict recommendations for a new user.

As an example, let's assume that production has a different user base. If the recommender instance is loaded from the "Movie Recommender.model" file and asked to provide recommendations for user 7, who has rated 101 and 102 as 4 and 3 respectively, it should be able to predict recommendations for 7. Right?

regards,
Vinod
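A short sketch of why the serialize step proposed here fails for a class that does not implement Serializable, as noted earlier in the thread for the recommender classes. PlainModel and SerializableModel are hypothetical stand-ins, not Mahout classes; only the standard java.io serialization API is used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationCheck {

    /** Stand-in for a class that does not implement Serializable. */
    static class PlainModel {
        double weight = 0.5;
    }

    /** Same shape, but opted in to Java serialization. */
    static class SerializableModel implements Serializable {
        private static final long serialVersionUID = 1L;
        double weight = 0.5;
    }

    /** Tries to write the object to an in-memory stream and reports success. */
    static boolean canSerialize(Object o) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            // Thrown when the object's class does not implement Serializable.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(canSerialize(new PlainModel()));        // prints false
        System.out.println(canSerialize(new SerializableModel())); // prints true
    }
}
```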
-
Re: Persisting trained models in Mahout
Sean Owen 2011-12-08, 13:49
That's right, you could get this effect by computing and saving off all the user-user similarities, then reading them back in, putting them in a GenericUserSimilarity, and proceeding as in your code. Those similarities are the closest thing to a model here.

It's going to take a while to compute all those pairs, most of which will be unused, so reloading them is going to take a lot of time and memory. You could prune the small ones, I suppose. It might be faster to recompute!
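One way the pruning mentioned above could look, as a plain-Java sketch. The class name, the map layout and the cutoff on absolute value are all assumptions for illustration; the thread does not say which threshold or criterion would be appropriate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SimilarityPruning {

    /** Keeps only pairs whose similarity magnitude is at least minAbs. */
    static Map<String, Double> prune(Map<String, Double> sims, double minAbs) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : sims.entrySet()) {
            if (Math.abs(e.getValue()) >= minAbs) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Double> sims = new LinkedHashMap<>();
        sims.put("1,2", 0.9);
        sims.put("1,3", 0.05);
        sims.put("2,3", -0.7);
        // Only the pair "1,3" falls below the 0.1 cutoff.
        System.out.println(prune(sims, 0.1).keySet()); // prints [1,2, 2,3]
    }
}
```

Strong negative similarities are kept here because they still carry information; dropping them as well would be a different design choice.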
-
Re: Persisting trained models in Mahout
Sebastian Schelter 2011-12-08, 14:19
A model for item-based collaborative filtering simply consists of the precomputed item similarities. We currently support such a precomputation only as a Hadoop job, but it should be a matter of an hour to create a class that precalculates the item similarities sequentially using an ItemBasedRecommender.

You can either store these similarities in the database and load them via MySQLJDBCInMemoryItemSimilarity/SQL92JDBCInMemoryItemSimilarity, or you can write them to a .csv file and load them via FileItemSimilarity.

A model for recommenders that use matrix factorization consists of the user and item feature vectors. You can use a FilePersistenceStrategy with any SVDRecommender to read and write these.

In the future we could also support loading the results of ParallelALSFactorizationJob into an SVDRecommender.

--sebastian
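A rough sketch of what a sequential precomputation of item similarities could look like, in plain Java rather than through Mahout's ItemBasedRecommender. Cosine similarity, the dense rating arrays (with 0 standing for "not rated") and the "itemA,itemB,similarity" line layout are assumptions chosen for illustration; the layout that FileItemSimilarity expects should be confirmed in its Javadoc before using such a file with it.

```java
import java.util.ArrayList;
import java.util.List;

public class ItemSimilarityPrecompute {

    /** Cosine similarity of two equally long rating vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Produces one "itemA,itemB,similarity" line per unordered item pair. */
    static List<String> similarityLines(long[] itemIds, double[][] ratingsByItem) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < itemIds.length; i++) {
            for (int j = i + 1; j < itemIds.length; j++) {
                double sim = cosine(ratingsByItem[i], ratingsByItem[j]);
                lines.add(itemIds[i] + "," + itemIds[j] + "," + sim);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        long[] itemIds = {101, 102, 103};
        // Row i holds the ratings that each of two users gave item i.
        double[][] ratings = {{5, 3}, {5, 3}, {0, 4}};
        for (String line : similarityLines(itemIds, ratings)) {
            System.out.println(line);
        }
    }
}
```

The double loop visits every unordered pair once, so the work grows quadratically with the number of items; that is the cost being moved offline.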
-
Re: Persisting trained models in Mahout
Jens Grivolla 2011-12-09, 10:17
On 12/08/2011 03:19 PM, Sebastian Schelter wrote:
> [...]
>
> A model for recommenders that use matrix factorization consists of the
> user and item feature vectors. You can use a FilePersistenceStrategy
> with any SVDRecommender to read and write these.
>
> In the future we could also support loading the results of
> ParallelALSFactorizationJob into an SVDRecommender.

I was actually looking for this. I guess this is the one case where there actually is a "model", and calculating the factorization can be costly.

I would expect that doing the "SVD" offline (e.g. on Hadoop) and then providing online recommendations, which only need a simple linear projection, is a pretty common use case, isn't it? You can even take new user preferences into account in real time (when projecting the user vector into the feature space) at very little cost, and just update the transformation matrices (which should be quite static) periodically.

Bye,
Jens
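A minimal sketch of the projection idea in this message: a new user's ratings are mapped into the feature space using a precomputed item feature matrix, and items are then scored by a dot product. The class and method names are hypothetical, and the plain projection u[f] = sum_i r[i] * V[i][f] is only adequate when the item feature columns are roughly orthonormal; otherwise a least-squares fit would be needed. This is not how Mahout's SVDRecommender is implemented.

```java
public class FoldIn {

    /** Projects a rating vector onto the item feature space: u[f] = sum_i r[i] * V[i][f]. */
    static double[] project(double[] ratings, double[][] itemFeatures) {
        int k = itemFeatures[0].length;
        double[] u = new double[k];
        for (int i = 0; i < ratings.length; i++) {
            for (int f = 0; f < k; f++) {
                u[f] += ratings[i] * itemFeatures[i][f];
            }
        }
        return u;
    }

    /** Scores an item as the dot product of user and item feature vectors. */
    static double score(double[] userVector, double[] itemVector) {
        double s = 0;
        for (int f = 0; f < userVector.length; f++) {
            s += userVector[f] * itemVector[f];
        }
        return s;
    }

    public static void main(String[] args) {
        // Two rated items with two features each (toy, orthonormal values).
        double[][] itemFeatures = {{1, 0}, {0, 1}};
        double[] ratings = {4, 3};
        double[] u = project(ratings, itemFeatures);
        // Score a third, unrated item described by its own feature vector.
        System.out.println(score(u, new double[]{1, 1})); // prints 7.0
    }
}
```

Only the small user vector is recomputed when new preferences arrive; the item feature matrix stays fixed until the next offline factorization.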
-
Re: Persisting trained models in Mahout
Sebastian Schelter 2011-12-09, 14:20
Yes, you describe it perfectly. I think the only reason this has not been done yet is that the model computation is not very fast on Hadoop because of its iterative nature.

Would you like to work on integrating the SVD recommenders?

--sebastian
-
Re: Persisting trained models in Mahout
Jens Grivolla 2011-12-09, 15:56
I'm just getting started on Mahout for a new project. I used Taste a few years back, but things have changed a lot since then. So basically, I'll be working on getting all the basic functionality I need first, and am not really ready to take on such development right now.

I may look into persisting the transformation matrices if I need to, but that's at least a few months away still. So if it's ready to use by the time I need it, all the better ;-)

I'll be mostly working on integrating external user/content features (demographics, etc.) to deal with cold start, and will rely as heavily as possible on existing algorithms and implementations for the core CF stuff.

Bye,
Jens
-
Re: Persisting trained models in Mahout
Ted Dunning 2011-12-08, 14:23
This is a fair statement of the traditional way of doing business for *small* models of the sort used in classification. The insistence on using serialization is kind of silly, since there are many downsides to Java serialization and it is becoming rare for systems that need to serialize large amounts of data to use it.

The fact is, however, that this is not general practice with recommendations. It is common to do lots of off-line computation that you could characterize as "learning", and it is common to save the results of this off-line computation for later deployment, but it is also common to do the learning on the fly, since it is generally pretty trivial stuff.

The earliest examples highlight the simpler approach. Keep going to see more interesting examples.

On Thu, Dec 8, 2011 at 6:46 AM, Vinod <[EMAIL PROTECTED]> wrote:
> [...]
> We should now be able to serialize() this recommender instance into a file,
> say "Movie Recommender.model" [...]
>
> All we need to do now is deploy "Movie Recommender.model" to production.
> [...]
-
Re: Persisting trained models in Mahout
Vinod 2011-12-08, 17:17
Hi Ted,

Sure. I'll continue reading and try the examples in later chapters. Thanks.

regards,
Vinod
-
Re: Persisting trained models in Mahout
Suneel Marthi 2011-12-08, 14:30
Would ModelSerializer class in Mahout be what you are looking for? I had used it to persist trained models for SGD classifiers, you may want to look into it.
________________________________
From: Vinod <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, December 8, 2011 8:46 AM
Subject: Re: Persisting trained models in Mahout

I'll use the first example from Chapter 2 of your book to clarify what I mean by training:-

Following code trains the recommender:-
DataModel model = new FileDataModel(new File("intro.csv"));

UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(2, similarity, model);

Recommender recommender = new GenericUserBasedRecommender(
    model, neighborhood, similarity);

At this point, recommender is trained on preferences of users 1 to 5 in intro.csv.

We should now be able to serialize() this recommender instance into a file, say "Movie Recommender.model" using steps mentioned here (
http://java.sun.com/developer/technicalArticles/Programming/serialization/)

All we need to do now is deploy "Movie Recommender.model" to production.

If I understand the behavior correctly, this model should now be able to predict recommendation for a new user.

As an example, lets assume that production has a different user base. If recommender instance is loaded from "Movie Recommender.model" file and asked to provide recommendations for user '7' who has rated 101 and 102 as 4 and 3 respectively, it should be able to predict recommendations for 7. right?

regards,
Vinod

On Thu, Dec 8, 2011 at 6:49 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

> Yes, I mean you need to write it and read it in your own code.
>
> What do you mean by training a model? computing similarities? I don't know
> if there's such a thing here as "training" on one data set and running on
> another. The implementations always use all currently available info. Is
> this a cold-start issue?
>
> OutOfMemoryError is nothing to do with this; on such a small data set it
> indicates you didn't set your JVM heap size above the default.
>
> On Thu, Dec 8, 2011 at 1:02 PM, Vinod <[EMAIL PROTECTED]> wrote:
>
> > Hi Sean,
> >
> > Neither Recommender nor any of its parent interface extends serializable so
> > there is no way that I'd be able to serialize it.
> >
> > I agree that the implementations may not have startup overhead. However,
> > training a model on millions of row is a cpu, memory & time consuming
> > activity. For example, when data set is changed from 100K to 1M in chapter
> > 4, program crashes with OutOfMemory after significant amount of time.
> >
> > I feel that training should be done in development only. Once a developer
> > is ok with test results, he should be able to save instance of the trained
> > and tested model (for ex:- recommender or classifier).
> >
> > These saved instances of trained and tested models only should be deployed
> > to production.
> >
> > Thought?
> >
> > regards,
> > Vinod
> >
> > On Thu, Dec 8, 2011 at 6:00 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> >
> > > Ah right. No, there's still not a provision for this. You would just have
> > > to serialize it yourself if you like.
> > > Most of the implementations don't have a great deal of startup overhead, so
> > > don't really need this. The exception is perhaps slope-one, but there you
> > > can actually save and supply pre-computed diffs.
> > > Still it would be valid to store and re-supply user-user similarities or
> > > something. You can do this, manually, by querying for user-user
> > > similarities, saving them, then loading them and supplying them via
> > > GenericUserSimilarity for instance.
> > >
> > > On Thu, Dec 8, 2011 at 12:27 PM, Vinod <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi Sean,
> > > >
> > > > Thanks for the quick response.
> > > >
> > > > By model, I am not referring to data model but, a "trained" recommender
-
Re: Persisting trained models in Mahout
Vinod 2011-12-08, 17:20
Sure Suneel. Thanks.

On Thu, Dec 8, 2011 at 8:00 PM, Suneel Marthi <[EMAIL PROTECTED]> wrote:

> Would ModelSerializer class in Mahout be what you are looking for? I had
> used it to persist trained models for SGD classifiers, you may want to look
> into it.
-
Re: Persisting trained models in Mahout
Lance Norskog 2011-12-08, 22:52
It would also be useful to load and cache often-used items and compute
rarely-used items online. The Caching classes are the natural fit for this.

On Thu, Dec 8, 2011 at 9:20 AM, Vinod <[EMAIL PROTECTED]> wrote:

> Sure Suneel. Thanks.
>
> On Thu, Dec 8, 2011 at 8:00 PM, Suneel Marthi <[EMAIL PROTECTED]> wrote:
>
> > Would ModelSerializer class in Mahout be what you are looking for? I had
> > used it to persist trained models for SGD classifiers, you may want to look
> > into it.

Lance Norskog
[EMAIL PROTECTED]
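
A minimal sketch of the caching idea in the message above: keep results for often-requested IDs in memory and compute the rest on demand. This is plain Java, not one of Mahout's own Caching classes (such as `CachingRecommender`); the class name `LruResultCache` and its methods are hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical helper (not a Mahout class): a small least-recently-used result cache. */
final class LruResultCache<V> {
    private final LongFunction<V> compute; // the expensive on-line computation
    private final Map<Long, V> cache;
    private int misses = 0;

    LruResultCache(final int capacity, LongFunction<V> compute) {
        this.compute = compute;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<Long, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, V> eldest) {
                return size() > capacity; // evict the least recently used entry
            }
        };
    }

    /** Returns the cached value, computing and storing it on a miss. */
    synchronized V get(long id) {
        V value = cache.get(id);
        if (value == null) {
            misses++;
            value = compute.apply(id);
            cache.put(id, value);
        }
        return value;
    }

    /** Number of times the underlying computation had to run. */
    synchronized int misses() {
        return misses;
    }
}
```

Frequently requested IDs stay resident and are served from memory, while rarely requested ones are recomputed when they fall out of the cache. The capacity is the knob that trades memory for recomputation.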
-
Re: Persisting trained models in Mahout
Suneel Marthi 2011-12-08, 23:04
That's correct.
Thanks for pointing this out, Lance.

________________________________
From: Lance Norskog <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, December 8, 2011 5:52 PM
Subject: Re: Persisting trained models in Mahout

It would also be useful to load and cache often-used items and compute rarely-used items online. The Caching classes are the natural fit for this.

Lance Norskog
[EMAIL PROTECTED]
-
Re: Persisting trained models in Mahout
Ted Dunning 2011-12-08, 14:19
There are other ways to store data structures than extending serializable.
The classifiers, for instance, can be saved and loaded at will. See Chapter 16.

Recommenders allow off-line computation of item-item similarities which is the major cost for a recommender. The on-line component starts quickly and provides fast access given this data.

Your problems with memory usage were probably due to using the paradigm in which the entire computation is done on-line. That is fine for small problems, but not for large.

Keep in mind also that recommendation models are not small.

On Thu, Dec 8, 2011 at 6:02 AM, Vinod <[EMAIL PROTECTED]> wrote:

> Neither Recommender nor any of its parent interface extends serializable so
> there is no way that I'd be able to serialize it.
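
The off-line/on-line split described in this message can be sketched roughly as follows. This is plain Java and deliberately does not use Mahout's API: the class `SimilarityStore`, its method names, the cosine measure, and the CSV layout are all assumptions made for illustration. The off-line step computes every item pair once and writes it to a file; the on-line step only reads that file back.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Mahout class): persists item-item similarities as CSV. */
final class SimilarityStore {

    /** Cosine similarity between two items' ratings, each keyed by user ID. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Off-line step: compute each item pair once, write "itemA,itemB,similarity" lines. */
    static void computeAndSave(Map<Long, Map<Long, Double>> ratingsByItem, Path out) {
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Long i : ratingsByItem.keySet()) {
                for (Long j : ratingsByItem.keySet()) {
                    if (i < j) { // store each unordered pair once
                        w.write(i + "," + j + "," + cosine(ratingsByItem.get(i), ratingsByItem.get(j)));
                        w.newLine();
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** On-line step: load the precomputed pairs; the raw ratings are not needed here. */
    static Map<String, Double> load(Path in) {
        Map<String, Double> sims = new HashMap<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] p = line.split(",");
                sims.put(p[0] + "," + p[1], Double.parseDouble(p[2]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sims;
    }

    /** Symmetric lookup; pairs that were never stored are treated as similarity 0. */
    static double lookup(Map<String, Double> sims, long a, long b) {
        if (a == b) {
            return 1.0;
        }
        Double s = sims.get(Math.min(a, b) + "," + Math.max(a, b));
        return s == null ? 0.0 : s;
    }
}
```

The quadratic pair loop is where the cost lives, which is why it belongs in a batch job. The serving process pays only for reading the file, so it starts quickly regardless of how long the similarities took to compute.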