-
MongoDBDataModel in memory? Mridul Kapoor 2012-03-18, 20:12
Hi,
I want to build an item-based recommender using Mahout. I have a humongous amount of data in a MongoDB collection, but I am not sure that the MongoDBDataModel provided with Mahout will be able to handle my case. I see that the buildModel() function creates a

    FastByIDMap<Collection<Preference>> userIDPrefMap = new FastByIDMap<Collection<Preference>>();

[line 556]

Does the subsequent code build an in-memory model of the data from the MongoDB collection (which I think it does)? If so, is there any immediate alternative to that?

Thanks
Mridul
-
Re: MongoDBDataModel in memory? Sean Owen 2012-03-18, 20:43
Yep, it's all in memory -- it would be too slow to access it out of Mongo. The purpose is just to make it easy to read and re-read data from Mongo, and to facilitate updates.

If the data is too big to fit in memory, you should look first at pruning your data -- can sampling 10% of it still give you good results? If not, you are in Hadoop territory and would want to look at a distributed algorithm.
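
One way to try the 10% sampling idea is to keep a fixed fraction of users and drop the rest, so that every retained user keeps a complete history. The sketch below is only an illustration: it assumes the data has already been exported as "user,item" lines (the form described later in this thread), and PreferenceSampler, keepUser and sample are hypothetical names, not Mahout classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: keeps a fixed share of users.
class PreferenceSampler {

    // Returns true for roughly `percent` out of every 100 user IDs.
    // The decision depends only on the ID, so it is stable across runs.
    static boolean keepUser(long userID, int percent) {
        // Multiply by a large odd constant to mix the bits of sequential IDs,
        // then map the hash into a bucket 0..99 (floorMod avoids negatives).
        int bucket = Math.floorMod(Long.hashCode(userID * 0x9E3779B97F4A7C15L), 100);
        return bucket < percent;
    }

    // Filters "user,item" lines, keeping only the lines of sampled users.
    static List<String> sample(List<String> lines, int percent) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            long userID = Long.parseLong(line.split(",")[0].trim());
            if (keepUser(userID, percent)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Calling PreferenceSampler.sample(lines, 10) would keep roughly a tenth of the users; whether the resulting recommendations are still good enough has to be checked against the real data.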
-
Re: MongoDBDataModel in memory? Ted Dunning 2012-03-18, 20:54
Mridul,
What is the humongous amount of data in Mongo? Is it really item->item links? Or is it session information?

With a recommender, it is unusual to have more than a few hundred links to other items for any given item. This means that even for 10 million items, you only have about a billion links in total, and that can usually fit in memory on a single machine pretty easily. Recommenders with 10 million items are pretty rare and can often be factored down by some content characteristic.

So, are you sure your data is too large for memory?
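
The estimate above can be turned into a rough byte count. This is a back-of-the-envelope sketch only: the 10 million items come from the message, while 100 links per item (the low end of "a few hundred", which gives the billion links mentioned) and 12 bytes per link (an 8-byte item ID plus a 4-byte score, ignoring any object overhead) are assumptions made here for illustration.

```java
// Hypothetical calculator for the size of an in-memory item-item link table.
class LinkMemoryEstimate {

    // Bytes needed to hold items * linksPerItem links at bytesPerLink each.
    static long estimateBytes(long items, long linksPerItem, long bytesPerLink) {
        return items * linksPerItem * bytesPerLink;
    }

    public static void main(String[] args) {
        long items = 10_000_000L;  // figure from the message above
        long linksPerItem = 100L;  // assumption: yields about a billion links
        long bytesPerLink = 12L;   // assumption: 8-byte ID + 4-byte float score
        long bytes = estimateBytes(items, linksPerItem, bytesPerLink);
        System.out.println(bytes / 1_000_000_000.0 + " GB");
    }
}
```

Under these assumptions the table needs on the order of 12 GB; a real data structure adds overhead, and a few hundred links per item would multiply the figure accordingly.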
-
Re: MongoDBDataModel in memory? Mridul Kapoor 2012-03-19, 05:19
On 19 March 2012 02:24, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Mridul,
>
> What is the humongous amount of data in Mongo? Is it really item->item
> links? Or is it session information?

It is session information, so I suppose it would exceed memory anyway. In my case, I need to pre-compute the item-item similarities. These would be refreshed periodically, maybe three times a week. I then need a recommender to run as a web service, which would read from these pre-computed similarities (which should be stored in some persistent place, right?). Could you suggest the way forward here?

Thanks
Mridul
-
Re: MongoDBDataModel in memory? Sebastian Schelter 2012-03-19, 06:38
I've created a guide for scaling out a recommender system; maybe it is useful for you:
http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/
-
Re: MongoDBDataModel in memory? Ted Dunning 2012-03-19, 07:55
Session data never needs to be in memory. It can be processed sequentially or using MapReduce. The item-item data is all you need in memory.
-
Re: MongoDBDataModel in memory? Mridul Kapoor 2012-03-20, 05:06
Thanks Sebastian,
That guide turned out to be useful for me, and I have changed my approach accordingly. I have written a script that exports the virtual session data from MongoDB into a preferences.csv file. Now I need to pre-compute the item similarities. I probably won't be using a Hadoop cluster. Is there a way to run the ItemSimilarityJob on a single machine?

Thanks a lot!
Mridul
-
Re: MongoDBDataModel in memory? Ted Dunning 2012-03-20, 05:54
On Mon, Mar 19, 2012 at 10:06 PM, Mridul Kapoor <[EMAIL PROTECTED]> wrote:
> .... Is there a way that I run the ItemSimilarityJob on a single
> machine ?

Yes. There is a sequential invocation as well.
-
Re: MongoDBDataModel in memory? Mridul Kapoor 2012-03-20, 06:25
On 20 March 2012 11:24, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Yes. There is a sequential invocation as well.

Pardon me for this, but I couldn't really follow that, and I couldn't find it in Mahout in Action either. Could someone point me to the job/class I should use to pre-compute item similarities, something like org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob? I really do not want to use Hadoop right now; I want to run it on a single machine.

Thanks
Mridul
-
Re: MongoDBDataModel in memory? Sean Owen 2012-03-20, 09:47
If you don't need Hadoop then this is pretty simple. You can just write a nested loop that computes all pairs off an ItemSimilarity implementation. If I recall rightly, GenericItemSimilarity will do that for you off an existing ItemSimilarity and then hold the results in memory as a new ItemSimilarity.
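
The nested loop can be sketched without any Mahout classes. The example below is an illustration under stated assumptions, not the Mahout implementation: it takes a plain map from item ID to the set of users who touched that item, and it uses the Tanimoto (Jaccard) coefficient as a stand-in similarity rather than the log-likelihood measure discussed elsewhere in this thread. AllPairsSimilarity and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical all-pairs computation over an in-memory item -> users map.
class AllPairsSimilarity {

    // Tanimoto coefficient: shared users divided by users of either item.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    // Nested loop over every unordered item pair; returns one
    // "itemA<TAB>itemB<TAB>score" row per pair with a non-zero score.
    static List<String> allPairs(Map<Long, Set<Long>> usersByItem) {
        Set<Long> sorted = new TreeSet<>(usersByItem.keySet());
        List<Long> items = new ArrayList<>(sorted);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                long first = items.get(i);
                long second = items.get(j);
                double score = tanimoto(usersByItem.get(first), usersByItem.get(second));
                if (score > 0.0) {
                    rows.add(first + "\t" + second + "\t" + score);
                }
            }
        }
        return rows;
    }
}
```

The loop is quadratic in the number of items, so it only suits catalogues small enough that every pair can be visited in reasonable time.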
-
Re: MongoDBDataModel in memory? Sebastian Schelter 2012-03-20, 09:51
There's no magic involved in precomputing item similarities. Just set up a recommender, ask it for similar items and store them somewhere:

    DataModel dataModel = ...
    ItemSimilarity similarity = new CachingItemSimilarity(...);
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);

    LongPrimitiveIterator itemIDs = dataModel.getItemIDs();
    while (itemIDs.hasNext()) {
      long itemID = itemIDs.nextLong();
      for (RecommendedItem similarItem : recommender.mostSimilarItems(itemID, ...)) {
        // save similar item to a file or a database
      }
    }

With a little bit of engineering, this code can also run multithreaded.

Best,
Sebastian
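
One possible shape for the multithreaded variant is to fan the per-item lookups out over a parallel stream and collect the rows in item order. This is a sketch under assumptions, not something confirmed in this thread: the lookup function stands in for the call to recommender.mostSimilarItems plus row formatting, it must be safe to call from several threads at once, and ParallelPrecompute is a hypothetical name.

```java
import java.util.List;
import java.util.function.LongFunction;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// Hypothetical helper: runs a per-item lookup in parallel.
class ParallelPrecompute {

    // Applies `lookup` to every item ID on the common fork-join pool and
    // returns all produced rows, in the same order as `itemIDs`.
    // Assumption: `lookup` is thread-safe.
    static List<String> precompute(long[] itemIDs, LongFunction<List<String>> lookup) {
        return LongStream.of(itemIDs)
                .parallel()
                .mapToObj(lookup)
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}
```

Collecting the rows first and writing them from a single thread afterwards keeps the output file free of interleaved lines.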
-
Re: MongoDBDataModel in memory? Mridul Kapoor 2012-03-20, 09:56
On 20 March 2012 15:21, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
> There's no magic involved in precomputing item similarities, just setup
> a recommender, ask it for similar items and store them somewhere:
> [...]
> With a little bit of engineering, this code can also run multithreaded.

I did a bit of exploring. Could I just do this instead? I mean, this would obviously just save to a file instead of a database, which might be the better choice for larger data. But is the following OK?

    $ bin/mahout itemsimilarity --input ~/path/to/preferences.csv --output ~/path/to/output --similarityClassname SIMILARITY_LOGLIKELIHOOD --maxSimilaritiesPerItem 10 --tempDir ~/path/to/temp

Thanks
Mridul
-
Re: MongoDBDataModel in memory? Sebastian Schelter 2012-03-20, 10:01
You can do this, but it will be awfully slow on a single box...
-
Re: MongoDBDataModel in memory? Mridul Kapoor 2012-03-20, 20:39
Thanks a ton, people!
Mridul
-
Re: MongoDBDataModel in memory? Mridul Kapoor 2012-03-21, 10:27
Hi,
Thanks a lot Sebastian, Sean and Ted for your continuing help! Following your examples, I tried to create my own implementation. I ran it over a part of the dataset.

    DataModel dataModel = new FileDataModel(new File("preferences.csv"));
    LogLikelihoodSimilarity llls = new LogLikelihoodSimilarity(dataModel);
    ItemSimilarity similarity = new CachingItemSimilarity(llls, dataModel);
    ItemBasedRecommender recommender = new GenericBooleanPrefItemBasedRecommender(dataModel, similarity);

    // append mode (true): re-running adds rows to the existing file
    BufferedWriter f = new BufferedWriter(new FileWriter("output-similarities.txt", true));

    LongPrimitiveIterator itemIDs = dataModel.getItemIDs();
    while (itemIDs.hasNext()) {
      long itemID = itemIDs.nextLong();
      for (RecommendedItem similarItem : recommender.mostSimilarItems(itemID, 10)) {
        // one tab-separated row per (item, similar item, score)
        f.write(itemID + "\t" + similarItem.getItemID() + "\t" + similarItem.getValue());
        f.newLine();
      } // inner for loop ends
    } // outer while loop ends
    f.close();

My preferences.csv is of the form

    user, item
    user, item

So I successfully ran this over a small part of my dataset.

From what I have read in the Mahout API docs, the FileDataModel documentation says:

[Class FileDataModel]
> This class will also look for update "delta" files in the same directory,
> with file names that start the same way (up to the first period). These
> files have the same format, and provide updated data that supersedes what
> is in the main data file.

Please correct me if I got the code or the concept wrong. In this case I want to update output-similarities.txt with similarities calculated from new update data. Can I keep adding data files with names like preferences.csv.001, preferences.csv.002 and so on, and then just run this code again to update the output? Will it recalculate everything from scratch, or just use these update files to modify and add to the item-item similarities?

Thanks
Mridul
-
Re: MongoDBDataModel in memory? Mridul Kapoor 2012-03-21, 10:31
Correction:
On 21 March 2012 15:57, Mridul Kapoor <[EMAIL PROTECTED]> wrote:
> So I successfully ran this over a small part of my dataset.

I ran this using
    $ bin/mahout itemsimilarity --
and now want to run it the way Sebastian suggested.
-
Re: MongoDBDataModel in memory? Mridul Kapoor 2012-03-21, 11:22
On 21 March 2012 16:01, Mridul Kapoor <[EMAIL PROTECTED]> wrote:
>> Can I have more and more datafiles with names like
>> preferences.csv.001, preferences.csv.002 and so on.. and then just run this
>> code again to update the output_similarity ? Will it totally recalculate
>> everything, or just use these update files to modify and add the
>> item-item-similarities.

I think I found the answer to my questions. Mahout in Action to the rescue! Really well written. I guess I'll name the files preferences.01.csv, preferences.02.csv and so on. And whenever I want to use these updates, I should run recommender.refresh(null). That should pick up the data from these update files. Correct me if I am wrong here.

Mridul
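
The naming rule quoted from the FileDataModel documentation ("file names that start the same way (up to the first period)") can be checked with a few lines of plain Java. This is only one reading of that sentence, written as a hypothetical helper; it is not Mahout's actual matching code, so the FileDataModel source remains the authority on which names are picked up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the documented prefix rule for update files.
class UpdateFileNames {

    // Part of a file name before its first period (whole name if none).
    static String prefix(String fileName) {
        int period = fileName.indexOf('.');
        return period < 0 ? fileName : fileName.substring(0, period);
    }

    // Names from `candidates` that share the main file's prefix,
    // excluding the main data file itself.
    static List<String> updateFiles(String mainFile, List<String> candidates) {
        String start = prefix(mainFile);
        List<String> matches = new ArrayList<>();
        for (String name : candidates) {
            if (!name.equals(mainFile) && name.startsWith(start)) {
                matches.add(name);
            }
        }
        return matches;
    }
}
```

Under this reading, both preferences.01.csv and preferences.csv.001 would count as update files for preferences.csv, since each starts with "preferences".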