Mahout, mail # user - Mahout beginner questions...


Re: Mahout beginner questions...
Sean Owen 2012-03-25, 19:41
Caching recommendations is a good use of memory, sure. It doesn't decrease
memory requirements and doesn't speed up the initial recommendation though.
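
To make the caching point concrete, here is a minimal sketch of a memoizing wrapper, written against plain Java rather than Mahout's own classes. The names `CachingSketch`, `delegate` and `computations` are hypothetical and exist only for illustration. It shows that the first request for a user still pays the full computation cost, and that the cache adds to memory use rather than reducing it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; this is not Mahout's caching implementation.
class CachingSketch {
    // The underlying (slow) recommender, modeled as userId -> recommended items.
    private final Function<Long, List<String>> delegate;
    // Cached results; these are held in addition to whatever the delegate keeps in memory.
    private final Map<Long, List<String>> cache = new HashMap<>();
    // Counts how often the delegate was actually invoked.
    int computations = 0;

    CachingSketch(Function<Long, List<String>> delegate) {
        this.delegate = delegate;
    }

    List<String> recommend(long userId) {
        List<String> cached = cache.get(userId);
        if (cached == null) {
            // Cache miss: the first call for a user is as slow as without caching.
            computations++;
            cached = delegate.apply(userId);
            cache.put(userId, cached);
        }
        return cached;
    }

    // Drop everything, e.g. on a timer or when new ratings arrive.
    void refresh() {
        cache.clear();
    }
}
```

Until `refresh()` is called, a user who adds a rating keeps seeing the old list, which is the staleness trade-off discussed in the rest of this message. The sketch is not thread-safe.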

Yes pre-computing recommendations is also possible. This is more or less
what the Hadoop-based implementation is for. That scales just fine but is
not real-time. You're waiting X minutes/hours to see any reaction to new
data a user inputs. For some contexts, that's fine. For many it's not; I
expect my recs to change every time I rate a book on Amazon. That's almost
the fun of it.

Ted is right that you may more commonly pre-compute some big piece of the
puzzle like item-item similarities or a matrix factorization. Then you can
finish the rec computation quite quickly, and it can respond to new data
straight away (at least approximately).
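
As a rough sketch of that split, assume the item-item similarities were computed offline and loaded into a map. The online step is then just a similarity-weighted average over the user's current ratings, so a new rating changes the result on the next request. The class and method names below are hypothetical and use plain Java, not Mahout's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: precomputed similarities, cheap online scoring.
class ItemItemSketch {

    // sims: item -> (similar item -> similarity), assumed to come from a batch job.
    // userRatings: the user's ratings as of right now.
    static List<String> recommend(Map<String, Map<String, Double>> sims,
                                  Map<String, Double> userRatings,
                                  int howMany) {
        Map<String, Double> weightedSum = new HashMap<>();
        Map<String, Double> similaritySum = new HashMap<>();
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            Map<String, Double> neighbors =
                sims.getOrDefault(rated.getKey(), Collections.<String, Double>emptyMap());
            for (Map.Entry<String, Double> n : neighbors.entrySet()) {
                if (userRatings.containsKey(n.getKey())) {
                    continue; // never recommend something the user already rated
                }
                weightedSum.merge(n.getKey(), n.getValue() * rated.getValue(), Double::sum);
                similaritySum.merge(n.getKey(), Math.abs(n.getValue()), Double::sum);
            }
        }
        // Estimated preference = similarity-weighted average of the user's ratings.
        final Map<String, Double> estimates = new HashMap<>();
        for (Map.Entry<String, Double> e : weightedSum.entrySet()) {
            double total = similaritySum.get(e.getKey());
            if (total > 0.0) {
                estimates.put(e.getKey(), e.getValue() / total);
            }
        }
        List<String> ranked = new ArrayList<>(estimates.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(estimates.get(b), estimates.get(a));
            return byScore != 0 ? byScore : a.compareTo(b); // ties broken by name
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    static Map<String, Double> pair(String k1, double v1, String k2, double v2) {
        Map<String, Double> m = new HashMap<>();
        m.put(k1, v1);
        m.put(k2, v2);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> sims = new HashMap<>();
        sims.put("A", pair("C", 0.9, "D", 0.1));
        sims.put("B", pair("C", 0.2, "D", 0.8));

        // The user liked A and disliked B.
        System.out.println(recommend(sims, pair("A", 5.0, "B", 1.0), 2)); // prints [C, D]
        // The ratings flip; the next request reflects it without any batch recomputation.
        System.out.println(recommend(sims, pair("A", 1.0, "B", 5.0), 2)); // prints [D, C]
    }
}
```

The similarities themselves stay fixed until the next batch run, which is why the response to new data is only approximate.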

This is the sort of setup I was alluding to earlier, and what a 'real' and
complete, scalable system resembles. It's more complex.

This does not exist per se in the project. The pieces are there, in fact
80% of it I'd say, but the stitching together is still mostly up to the
developer.

On Sun, Mar 25, 2012 at 8:28 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:

> Correct me if I'm wrong, but a good way to boost speed could be to use a
> caching recommender, meaning computing the recommendations in advance
> (refreshing them every X minutes/hours) and always serving the most recent
> recommendations, right?
>
> -----Original Message-----
> From: Sean Owen [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, March 25, 2012 21:25
> To: [EMAIL PROTECTED]
> Subject: Re: Mahout beginner questions...
>
> It is memory. You will need a pretty large heap to put 100M data in memory
> -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM).
> You can go bigger if you have more memory but that size seems about the
> biggest to reasonably assume people have.
>
> Of course more data slows things down, and past about 10M data points you
> need to tune things to sample data rather than try every possibility. This
> is mostly what CandidateItemStrategy is for. It is relatively easy to tune,
> though, so speed doesn't have to be an issue.
>
> Again, you can go bigger and tune it to down-sample more; still, I believe
> that 100M is a crude but useful rule of thumb for the point beyond which
> it's just hard to get good speed and quality.
>
> Sean
>
> On Sun, Mar 25, 2012 at 2:04 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:
>
> > Thanks for the detailed answer Sean.
> > I want to understand more clearly the non-distributed code limitations.
> > I saw that you advise that for more than 100,000,000 ratings the
> > non-distributed engine won't do the job.
> > The question is why? Is it a memory issue (in which case I could scale up
> > with a bigger machine), or is it because of the time recommendation
> > takes?