|
Grant Ingersoll
2011-10-11, 16:34
Sean Owen
2011-10-11, 16:36
Grant Ingersoll
2011-10-11, 18:49
Sean Owen
2011-10-11, 18:54
Grant Ingersoll
2011-10-11, 18:55
Grant Ingersoll
2011-10-11, 19:15
Grant Ingersoll
2011-10-12, 13:28
Ken Krugler
2011-10-12, 18:10
Grant Ingersoll
2011-10-12, 18:30
Lance Norskog
2011-10-13, 06:33
Sean Owen
2011-10-13, 06:37
Sebastian Schelter
2011-10-13, 08:01
Grant Ingersoll
2011-10-13, 10:47
Lance Norskog
2011-10-13, 20:11
Ted Dunning
2011-10-13, 20:14
Grant Ingersoll
2011-10-13, 23:17
Grant Ingersoll
2011-10-14, 01:31
Grant Ingersoll
2011-10-14, 02:35
Grant Ingersoll
2011-10-14, 03:00
Lance Norskog
2011-10-14, 03:19
Sebastian Schelter
2011-10-14, 06:04
Lance Norskog
2011-10-14, 06:28
Grant Ingersoll
2011-10-14, 12:42
Grant Ingersoll
2011-10-14, 15:10
Lance Norskog
2011-10-15, 03:32
|
-
RecommenderJob and NaN | Grant Ingersoll | 2011-10-11, 16:34
I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations because NaNs are being calculated in the AggregateAndRecommend step. I'm not quite sure what is going on, as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.

The data maps user IDs to email thread IDs. My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread). It seems like I have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN, and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding). I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.

Any ideas where I should be looking to eliminate these NaNs? If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)

Thanks,
Grant
-
Re: RecommenderJob and NaN | Sean Owen | 2011-10-11, 16:36
Where is the NaN coming up -- what has this value?
It should be propagated in some cases but not others. I'm not aware of any changes here.

Generally small data sets will have this problem of not being able to compute much of anything useful, so NaN might be right here. But you say it was different recently, which seems to rule that out.
-
Re: RecommenderJob and NaN | Grant Ingersoll | 2011-10-11, 18:49
On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> Where is the NaN coming up -- what has this value?

simColumn seems to be the originator in the Aggregate step. For instance, my current breakpoint shows:
{309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}

I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.

Is that set by SimilarityMatrixRowWrapperMapper?
<code>
/* remove self similarity */
similarityMatrixRow.set(key.get(), Double.NaN);
</code>

> It should be propagated in some cases but not others. I'm not aware of
> any changes here.

yeah, me neither. This is all related to MAHOUT-798.

> Generally small data sets will have this problem of not being able to
> compute much of anything useful, so NaN might be right here.
> But you say it was different recently, which seems to rule that out.

I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's just that's a whole lot harder to debug.
-
Re: RecommenderJob and NaN | Sean Owen | 2011-10-11, 18:54
NaN is added for all user-item pairs that already exist in the input, to make them ineligible for recommendation. That's normal. Could this be the case?
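To illustrate the masking Sean describes, here is a minimal standalone sketch. It is not Mahout's implementation: the class `NanMaskSketch`, its method names, and the dense `double[]` score array are all assumptions made for the example. It marks items the user already has with NaN and then treats NaN as "ineligible" when collecting candidates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and data layout are not taken from Mahout.
class NanMaskSketch {

    /** Returns a copy of the scores with every already-known item set to NaN. */
    static double[] maskKnownItems(double[] scores, int[] knownItems) {
        double[] masked = scores.clone();
        for (int item : knownItems) {
            masked[item] = Double.NaN; // NaN marks "do not recommend"
        }
        return masked;
    }

    /** Collects the indices of items whose score is not NaN. */
    static List<Integer> eligibleItems(double[] masked) {
        List<Integer> eligible = new ArrayList<>();
        for (int i = 0; i < masked.length; i++) {
            if (!Double.isNaN(masked[i])) {
                eligible.add(i);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.4, 0.7};
        int[] alreadySeen = {1}; // the user already interacted with item 1
        System.out.println(eligibleItems(maskKnownItems(scores, alreadySeen)));
    }
}
```

The scheme only works if every consumer of the masked values checks for NaN before doing arithmetic on them, since any sum or product that includes a NaN is itself NaN.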
-
Re: RecommenderJob and NaN | Grant Ingersoll | 2011-10-11, 18:55
On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> Is that set by SimilarityMatrixRowWrapperMapper?
> <code>
> /* remove self similarity */
> similarityMatrixRow.set(key.get(), Double.NaN);
> </code>

Ah, but that is just taking care of itself, so maybe not the issue.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
-
Re: RecommenderJob and NaN | Grant Ingersoll | 2011-10-11, 19:15
On Oct 11, 2011, at 2:54 PM, Sean Owen wrote:
> NaN is added for all user item pairs that already exist in the input, to
> make them ineligible for recommendation. That's normal - could this be the
> case?

Trying to track it down. I don't think it is the self case, but I'm not 100% sure.
-
Re: RecommenderJob and NaN | Grant Ingersoll | 2011-10-12, 13:28
Digging some more:

In AggregateAndRecommend, around line 143, I have, for userId 0, a simColumn of:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}

which then becomes the numerator and the denom.

Looping, my next simCol is:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}

and then
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}

...

Each time, those are added into the numerators/denoms values, so that by the time we are done looping (line 161), we have:
numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}

numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}

I'm not sure how to interpret this, as I haven't dug into the math here yet or figured out where those NaNs come from originally.
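The trace above is consistent with plain floating-point behaviour: once a NaN is added into an accumulator slot, that slot stays NaN, so columns whose NaN sits at a different index each time eventually poison every slot. The following standalone sketch reproduces that effect with three items. It is an illustration only, not the AggregateAndRecommend code; the class `NanAccumulationSketch` and both methods are hypothetical, and the NaN-skipping variant is shown for contrast, not as the confirmed fix.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not Mahout's AggregateAndRecommend logic.
class NanAccumulationSketch {

    /** Element-wise sum using plain addition, so any NaN propagates into the total. */
    static double[] naiveSum(double[][] columns) {
        double[] total = new double[columns[0].length];
        for (double[] column : columns) {
            for (int i = 0; i < total.length; i++) {
                total[i] += column[i];
            }
        }
        return total;
    }

    /** Element-wise sum that ignores NaN entries instead of adding them. */
    static double[] nanSkippingSum(double[][] columns) {
        double[] total = new double[columns[0].length];
        for (double[] column : columns) {
            for (int i = 0; i < total.length; i++) {
                if (!Double.isNaN(column[i])) {
                    total[i] += column[i];
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double s = 0.9566912651062012; // similarity value taken from the trace
        double[][] columns = {
            {s, s, Double.NaN},
            {s, Double.NaN, s},
            {Double.NaN, s, s},
        };
        System.out.println(Arrays.toString(naiveSum(columns)));       // every slot ends up NaN
        System.out.println(Arrays.toString(nanSkippingSum(columns))); // every slot is finite
    }
}
```

Whether skipping is appropriate at this point in the job is exactly the open question in the thread, since NaN is also used deliberately as a marker elsewhere in the pipeline.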
-
Re: RecommenderJob and NaN | Ken Krugler | 2011-10-12, 18:10
Hi Grant,

Just curious, are you running this locally or distributed?

I'd run into a similar issue, though in a completely different algorithm (Jimmy Lin's PageRank implementation) due to the use of a static variable.

When running locally, this wasn't getting cleared between loops, and thus I got wonky results.

The same thing would have happened with JVM reuse enabled.

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
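The failure mode Ken describes can be sketched in a few lines. This is a generic illustration, not code from Mahout or from the PageRank implementation he mentions; `StaticStateSketch`, `LeakyTask`, and `CleanTask` are hypothetical names. A static field lives as long as the JVM, so a second run in the same JVM (local mode, or task JVM reuse) starts from the previous run's leftover value.

```java
// Hypothetical illustration only; class names are not from any real library.
class StaticStateSketch {

    /** Keeps its running total in a static field shared by every instance in the JVM. */
    static class LeakyTask {
        private static int total = 0; // never reset between runs

        int run(int[] values) {
            for (int v : values) {
                total += v;
            }
            return total;
        }
    }

    /** Keeps its running total in a local variable, so each run starts from zero. */
    static class CleanTask {
        int run(int[] values) {
            int total = 0;
            for (int v : values) {
                total += v;
            }
            return total;
        }
    }

    public static void main(String[] args) {
        int[] input = {1, 2};
        // Two "iterations" in the same JVM: the leaky task carries state across them.
        System.out.println(new LeakyTask().run(input));
        System.out.println(new LeakyTask().run(input));
        // The clean task gives the same answer both times.
        System.out.println(new CleanTask().run(input));
        System.out.println(new CleanTask().run(input));
    }
}
```

When each task gets a fresh JVM, the static field is reinitialized every time, which is why this kind of bug can stay hidden on a cluster and only show up locally or with JVM reuse enabled.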
-
Re: RecommenderJob and NaN | Grant Ingersoll | 2011-10-12, 18:30
Both local and on EC2

On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> Just curious, are you running this locally or distributed?
-
Re: RecommenderJob and NaN | Lance Norskog | 2011-10-13, 06:33
Is this job working well for anyone now?
When was the last time this job worked for someone?

Lance Norskog
[EMAIL PROTECTED]
-
Re: RecommenderJob and NaN | Sean Owen | 2011-10-13, 06:37
RecommenderJob? The unit tests run it all the time.
There should not be any glitches with static variables -- don't think there are any.

On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
> Is this job working well for anyone now?
> When was the last time this job worked for someone?
-
Re: RecommenderJob and NaNSebastian Schelter 2011-10-13, 08:01
Grant,
Can you share a little more detail about the results? Do you get any exceptions, or do you just get no results?
Using the NaNs inside the similarity matrix vectors has been part of the job for a very long time and should not cause any problems. As Sean already mentioned, we have unit tests with toy data that should catch the very obvious errors in this code.
Can you share the dataset? I can do a test run on my research cluster.
--sebastian
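For readers following the numerators/denoms discussion, here is a minimal sketch of the arithmetic involved. It is not Mahout code: the class and method names are hypothetical, and the array only mimics the shape of the simColumn values quoted in this thread. It shows that a single NaN entry turns a plain sum into NaN, and that a guarded loop skipping the NaN marker keeps the remaining values usable.

```java
public class NaNAccumulationSketch {

    // Plain summation: under IEEE 754, adding NaN to anything yields NaN,
    // so one NaN entry contaminates the whole total.
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Guarded summation: entries marked as NaN are skipped, so the
    // remaining similarities still contribute to the total.
    static double sumSkippingNaN(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double s = 0.9566912651062012;
        // Hypothetical column: four real similarities plus one NaN marker.
        double[] simColumn = {s, s, s, s, Double.NaN};
        System.out.println("naive:   " + naiveSum(simColumn));
        System.out.println("guarded: " + sumSkippingNaN(simColumn));
    }
}
```

Whether the job's own aggregation skips these entries in the same way is exactly what is being debugged here, so treat this only as an illustration of the NaN behaviour, not as a description of the fix.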
-
Re: RecommenderJob and NaNGrant Ingersoll 2011-10-13, 10:47
On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> Can you share a little more details about the results, do you get any
> exceptions? Or do you just get no results?
No results.
> Using the NaNs inside the similarity matrix vectors has been included in
> the job for a very long time and should not cause any problems. As Sean
> already mentioned we have unit tests with toy data that should catch the
> very obvious errors in this code.
Yeah, I don't know what happened. I know I was getting results as little as two weeks ago. I will try rolling back to an earlier commit.
> Can you share the dataset? I can do a testrun on my research cluster.
I already did, earlier in this thread. There is a small set via the link in my first message, or you can use the ASF email public dataset on Amazon or any subset of it.
Grant Ingersoll http://www.lucidimagination.com Lucene Eurocon 2011: http://www.lucene-eurocon.com
-
Re: RecommenderJob and NaNLance Norskog 2011-10-13, 20:11
Is the Apache public download bandwidth donated by Amazon? Or should we try
to keep usage within AWS?
-
Re: RecommenderJob and NaNTed Dunning 2011-10-13, 20:14
Usage within AWS is a neighborly thing to do.
But yes, Amazon donates this bandwidth.
-
Re: RecommenderJob and NaNGrant Ingersoll 2011-10-13, 23:17
Were you able to get the data, Sebastian?
-
Re: RecommenderJob and NaNGrant Ingersoll 2011-10-14, 01:31
OK, I can confirm that an earlier version (54300025dbdd6e688a4eb3d043016eb641067c7e in github/lucidimagination/mahout) worked. Now, to figure out why.
-Grant
-
Re: RecommenderJob and NaNGrant Ingersoll 2011-10-14, 02:35
Note, the next version (13df29e4fe97b4370f24d7e91ab5909de76f0f3b) doesn't work. Debugging.
-
Re: RecommenderJob and NaNGrant Ingersoll 2011-10-14, 03:00
Looks like it is me. Still not sure why, but getting there.
-
Re: RecommenderJob and NaN Lance Norskog 2011-10-14, 03:19
I meant running with real data.
On Wed, Oct 12, 2011 at 11:37 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> RecommenderJob? The unit tests run it all the time.
> There should not be any glitches with static variables -- don't think
> there are any.
-
Re: RecommenderJob and NaN Sebastian Schelter 2011-10-14, 06:04
Only got the raw data, how did you convert it to our standard recommender input?
--sebastian
On 14.10.2011 01:17, Grant Ingersoll wrote:
> Were you able to get the data, Sebastian?
-
Re: RecommenderJob and NaN Lance Norskog 2011-10-14, 06:28
cd mahout/examples/bin
./build-asf-email.sh content/ out/ over/
select 1 for recommender
where content/ is
content/coccoon.apache.org
content/commons.apache.org
and out/ and over/ are output directories. Run the shell script with -x as you will probably have to tweak it.
Lance
On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
> Only got the raw data, how did you convert it to our standard
> recommender input?
-
Re: RecommenderJob and NaN Grant Ingersoll 2011-10-14, 12:42
FYI, I think I see the problem. Working on a fix.
-
Re: RecommenderJob and NaN Grant Ingersoll 2011-10-14, 15:10
OK, I believe I checked in a fix. The issue came down to me generalizing the SeqFilesFromMailArchives in terms of the metadata extraction (from, to, references, etc.) and the fact that the code I use to extract preferences (MailToRecMapper) depended on things being in a specific order.
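To illustrate the kind of failure described above, here is a minimal sketch of reading a field by position versus by name. It is not the actual MailToRecMapper or SeqFilesFromMailArchives code; the class name, the `key=value` layout, and the sample values are all hypothetical and chosen only for illustration.

```java
public class FieldOrderSketch {

    // Fragile: assumes the wanted field always sits at a fixed position.
    static String valueAt(String[] fields, int index) {
        String field = fields[index];
        return field.substring(field.indexOf('=') + 1);
    }

    // Sturdier: finds the field by its name, wherever it appears.
    static String valueOf(String[] fields, String name) {
        String prefix = name + "=";
        for (String field : fields) {
            if (field.startsWith(prefix)) {
                return field.substring(prefix.length());
            }
        }
        return null; // field not present
    }

    public static void main(String[] args) {
        // Hypothetical layout before the extraction was generalized.
        String[] oldOrder = {"from=alice", "references=thread-1"};
        // Hypothetical layout afterwards: an extra field now comes first.
        String[] newOrder = {"to=dev-list", "from=alice", "references=thread-1"};

        System.out.println(valueAt(oldOrder, 0));      // alice
        System.out.println(valueAt(newOrder, 0));      // dev-list (wrong field)
        System.out.println(valueOf(newOrder, "from")); // alice
    }
}
```

The positional reader silently returns the wrong field once the layout changes, which is why a downstream job can keep running while producing useless values.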
-
Re: RecommenderJob and NaN Lance Norskog 2011-10-15, 03:32
Bingo, I'm getting recs now.