-
RecommenderJob output
First Qaxy 2010-05-11, 06:11
Hello,

When running the RecommenderJob with --booleanData false on this input:

101,1001
101,1002
101,1003
101,1004
101,1005
102,1002
102,1003
103,1002
103,1003
103,1004
105,1001
105,1002
105,1003
105,1004
105,1015
106,1002
106,1003
106,1004
106,1020
106,1021

the output that I'm getting has:

101 [1015:4.0,1021:3.0,1020:3.0,1005:-Infinity,1004:-Infinity,1003:-Infinity,1001:-Infinity,1002:-Infinity]
102 [1004:10.0,1005:8.0,1020:2.0,1021:2.0,1015:2.0,1003:-Infinity,1002:-Infinity]
103 [1005:12.0,1021:3.0,1020:3.0,1015:3.0,1004:-Infinity,1002:-Infinity,1003:-Infinity]
105 [1005:14.0,1020:3.0,1021:3.0,1015:-Infinity,1004:-Infinity,1003:-Infinity,1001:-Infinity,1002:-Infinity]
106 [1005:12.0,1021:4.0,1015:3.0,1004:-Infinity,1002:-Infinity,1003:-Infinity,1020:-Infinity]

What is the meaning (formula) of the float number?

> 101 [1015:4.0 <= what is 4.0 ?

Thanks,
-qf
-
Re: RecommenderJob output
Sean Owen 2010-05-11, 07:14
The values are entries in the final recommendation vector. They don't have a good interpretation by themselves, but larger values should mean better recommendation. So the recommendations are ordered by this value. It's included just in case it is useful. In other recommender systems (like .pseudo), this would be the actual estimated preference.

However I don't immediately see why the result would be negative infinity, ever. I'd have to look into that.

On Tue, May 11, 2010 at 7:11 AM, First Qaxy <[EMAIL PROTECTED]> wrote:
> What is the meaning(formula) of the float number?
> 101 [1015:4.0 <= what is 4.0 ?
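For readers wondering where a number like 4.0 could come from, here is a minimal sketch; it is not the RecommenderJob code. It assumes the score of an unseen item is simply the sum of its co-occurrence counts with the items the user already has (the thread later mentions a co-occurrence counting phase). The function names are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations

# (userID, itemID) pairs as posted at the start of the thread.
PREFS = [
    (101, 1001), (101, 1002), (101, 1003), (101, 1004), (101, 1005),
    (102, 1002), (102, 1003),
    (103, 1002), (103, 1003), (103, 1004),
    (105, 1001), (105, 1002), (105, 1003), (105, 1004), (105, 1015),
    (106, 1002), (106, 1003), (106, 1004), (106, 1020), (106, 1021),
]


def user_items(prefs):
    """Group item IDs by user ID."""
    items = defaultdict(set)
    for user, item in prefs:
        items[user].add(item)
    return items


def cooccurrence(items_by_user):
    """counts[a][b] = number of users who have both item a and item b."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts


def recommend(user, items_by_user, counts):
    """Score each unseen item by summing its co-occurrence with the user's items.

    Assumption: this plain sum stands in for the job's recommendation vector.
    """
    seen = items_by_user[user]
    scores = defaultdict(float)
    for owned in seen:
        for other, n in counts[owned].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken by item ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Under that assumption, user 101 gets item 1015 with score 4.0 and items 1020 and 1021 with 3.0 each, which matches the first output row. It does not reproduce every value posted above (for user 102 this plain count gives 8.0 for item 1004, not 10.0), so the job at that revision evidently computed something beyond this sketch.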
-
Re: RecommenderJob output
Sean Owen 2010-05-11, 07:23
Er, wait, why are you setting booleanData = false? Though the formatting got messed up here, it looks like you do not have explicit ratings. So you should set it to true.

On Tue, May 11, 2010 at 7:11 AM, First Qaxy <[EMAIL PROTECTED]> wrote:
> When running the RecommenderJob with --booleanData false on this input:
-
Re: RecommenderJob output
Sean Owen 2010-05-11, 07:55
I just committed more of my local changes, since I'm actively improving and fixing things here.

My output looks more reasonable:

101 [1015:4.0,1021:3.0,1020:3.0]
102 [1004:10.0,1005:8.0,1021:2.0,1020:2.0,1015:2.0]
103 [1005:12.0,1021:3.0,1015:3.0,1020:3.0]
105 [1005:14.0,1021:3.0,1020:3.0]
106 [1005:12.0,1021:4.0,1015:3.0]

So you might just try the code from head. booleanData doesn't really affect the output, it just enables optimizations for this case.
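If you are stuck on a build that still emits the -Infinity entries, a small post-processing step can drop them. This is a sketch only: it assumes each output line is a user ID, whitespace, and a bracketed list of item:score pairs, as in the samples in this thread, and the helper names are hypothetical.

```python
import math


def parse_line(line):
    """Parse 'userID [item:score,item:score,...]' into (userID, [(item, score), ...])."""
    user, rest = line.split(None, 1)
    body = rest.strip().strip("[]")
    recs = []
    if body:
        for entry in body.split(","):
            item, _, score = entry.partition(":")
            # float() understands 'Infinity' and '-Infinity' as well as plain numbers.
            recs.append((int(item), float(score)))
    return int(user), recs


def finite_only(recs):
    """Keep only recommendations whose score is a finite number."""
    return [(item, score) for item, score in recs if math.isfinite(score)]
```

Applied to the rows posted at the start of the thread, this leaves the same item/score pairs as the output above, apart from the order of tied items.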
-
Re: RecommenderJob output
Grant Ingersoll 2010-05-11, 12:10
Please, when starting a new thread, start a new message.

See http://people.apache.org/~hossman/#threadhijack

<snip>
When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is "hidden" in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
</snip>
-
Re: RecommenderJob output
Sean Owen 2010-05-11, 12:15
(Did that happen? I only see my three replies to the original message -- sure, maybe that could have been one -- but all were directly relevant to the first message.)

(Or is this somehow looking connected to another thread because it shares the same subject? didn't happen for me in Gmail at least)

On Tue, May 11, 2010 at 1:10 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Please, when starting a new thread, start a new message.
-
Re: RecommenderJob output
First Qaxy 2010-05-11, 12:27
Hi Grant,

I wasn't aware of that. Thanks. I'll do that going forward.

-qf

--- On Tue, 5/11/10, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Please, when starting a new thread, start a new message.
-
Re: RecommenderJob output
First Qaxy 2010-05-11, 14:49
Sorry, typed the wrong thing - yes, it is true in fact.

--- On Tue, 5/11/10, Sean Owen <[EMAIL PROTECTED]> wrote:
> Er, wait why are you setting booleanData = false? Though the
> formatting got messed up here, it looks like you do not have explicit
> ratings. So you should set to true..
-
Re: RecommenderJob output
First Qaxy 2010-05-11, 14:52
Thanks, I've tested it and it did stop showing the -Infinity values.

-qf

--- On Tue, 5/11/10, Sean Owen <[EMAIL PROTECTED]> wrote:
> So you might just try the code from head. booleanData doesn't really
> affect the output, it just enables optimizations for this case.
-
Re: RecommenderJob output
First Qaxy 2010-05-11, 21:00
One question on the recommendation lifecycle: once a RecommenderJob has been run and the intermediate/temp model has been created, what is the process of maintaining it? Can I update it, or parts of it, to reflect new data?

For example, if I have a new user, or new preferences for an existing user, that I want to compute recommendations for, can I do that by incrementally updating the internal model and regenerating recommendations only for the user that I'm interested in?

Thanks.
-qf
-
Re: RecommenderJob output
Sean Owen 2010-05-11, 21:08
Can you update it while it's running? Not really. It's a multi-phase batch job and I don't think you could meaningfully change it on the fly.

Do you need to run the whole thing every time? No, not at all.

Phase 1 (item IDs to item indices) doesn't need to run every time, nor does phase 3 (count co-occurrence). It's OK if these are a little out of date.

Phase 2 is user vector generation; while I didn't write any ability to simply append a new user vector to its output, it's easy to write. So you don't have to run that every time.

Phases 4 and 5 are really where the recommendation happens. Those go together. You can limit which users they process, though, with a file of user IDs, --usersFile.

I'd say the core job is nearing maturity -- I think it's tuned and debugged. But these kinds of practical hooks, like being able to incrementally update aspects of the pipeline, are exactly what's needed next. I'd welcome your input and patches in this regard.

Sean

On Tue, May 11, 2010 at 10:00 PM, First Qaxy <[EMAIL PROTECTED]> wrote:
> One question on the recommendation lifecycle: once a RecommendationJob is being run with the intermediate/temp model being created what is the process of maintaining it? Can I update it or parts of it to reflect new data?
-
Re: RecommenderJob output
First Qaxy 2010-05-12, 02:01
Great info. No, I'm not looking into having multiple active processes trying to update it. It's more of a single worker process that needs to update the "model" as new data becomes available (every few hours, days, ... depending on the customer's needs). Ideally I should be able to tell which users were affected so only their recommendations would end up being updated back to Solr. I am getting closer to the end of the evaluation process of Mahout and will soon proceed with the implementation, at which point I hope I'll be able to provide better feedback and contribute more.

On a different thread - I have a high level / best practices question: when doing clustering or classification with large datasets, is the expectation that the algorithms would run on the whole data set available or on a (carefully selected) subset, i.e. the training model? I'm interested in the "model" deployed in production, not just for the purpose of training. If the answer is a subset, what is usually a good size relative to the full data set, and how do people approach this in order to get a representative smaller set?

-qf
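One way the "which users were affected" step could look is sketched below. Everything here is an assumption for illustration: a user counts as affected only if their own preference pairs changed (co-occurrence counts are allowed to lag, as suggested earlier in the thread), the file passed to --usersFile is taken to hold one user ID per line, and the function names are hypothetical.

```python
def affected_users(old_prefs, new_prefs):
    """Return sorted IDs of users whose (user, item) pairs differ between snapshots."""
    changed = set(old_prefs) ^ set(new_prefs)  # pairs added or removed
    return sorted({user for user, _ in changed})


def write_users_file(user_ids, path):
    """Write one user ID per line (assumed format for a --usersFile input)."""
    with open(path, "w", encoding="utf-8") as out:
        for uid in user_ids:
            out.write(f"{uid}\n")
```

The worker process would then rerun only the recommendation phases for the IDs in that file and push those users' results back to the search index.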
-
Re: RecommenderJob output
First Qaxy 2010-05-12, 02:05
> the training model. I'm interested in the "model" deployed in production, not just for the purpose of training.

Err, I meant to say: not just for the purpose of *testing*.