Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Kate Ericson 2012-02-02, 00:58
Hi,
This *may* just be a Hadoop issue - it sounds like the JobTracker is upset that it hasn't heard from one of the workers in too long (over 600 seconds).
Can you check your Hadoop Administration pages for the cluster? Does the cluster still seem to be functioning?
I haven't used Hadoop with EC2, so I'm not sure how difficult it will be to check the cluster :-/
If everything seems to be OK, there's a Hadoop setting to modify how long it's willing to wait before assuming a machine has failed and killing a task.

-Kate
On Wed, Feb 1, 2012 at 5:48 PM, Nicholas Kolegraff <[EMAIL PROTECTED]> wrote:
> Hello,
> I am attempting to run parallelALS on a very large matrix on EC2.
> The matrix is ~8 million x 1 million, very sparse: .007% has data.
> I am attempting to run on 8 nodes with 34.2 GB of memory (m2.2xlarge).
> (I kept getting OutOfMemory exceptions, so I kept upping the ante until I arrived at the above configuration.)
>
> It makes it through the following jobs no problem:
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0001_hadoop_ParallelALSFactorizationJob-ItemRatingVectorsMappe
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0002_hadoop_ParallelALSFactorizationJob-TransposeMapper-Reduce
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0003_hadoop_ParallelALSFactorizationJob-AverageRatingMapper-Re
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0004_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
> ....
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0023_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
>
> Then it crashes here with only the following error message:
> Task attempt_201201311814_0023_m_000000_0 failed to report status for 600 seconds. Killing!
>
> Each map attempt in the 23rd 'SolveExplicitFeedback' fails to report its status?
>
> I'm not sure what is causing this -- I am still trying to wrap my head around the Mahout API.
>
> Could this still be a memory issue?
>
> Hopefully I'm not missing something trivial?!?!
Ted Dunning 2012-02-02, 01:12
So the total size of the data is modest at about 560 M non-zero elements. Total data should be small compared to your node sizes.
But the distribution of your data can be important as well.
Can you say whether any rows or columns are extremely dense?
On Wed, Feb 1, 2012 at 4:58 PM, Kate Ericson <[EMAIL PROTECTED]> wrote:
> > The matrix is ~8 million x 1 million, very sparse: .007% has data.
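
For readers following the arithmetic: the figure of roughly 560 M non-zero elements follows directly from the quoted dimensions and density. The sketch below is only a back-of-the-envelope check; the function name `nonzero_count` is made up for illustration, and the inputs are the approximate numbers quoted above.

```python
def nonzero_count(rows, cols, density_percent):
    """Estimate the number of non-zero cells in a sparse matrix.

    density_percent is the share of cells that hold data, in percent
    (so .007 means 0.007% of all cells).
    """
    return round(rows * cols * density_percent / 100)


# Approximate figures quoted in the thread: ~8 million x 1 million, .007% filled.
estimate = nonzero_count(8_000_000, 1_000_000, 0.007)
print(f"{estimate:,} non-zero elements")
```

The same function can be rerun with a corrected density if the sparsity figure turns out to be off.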
Nicholas Kolegraff 2012-02-02, 01:23
Thanks for the prompt reply Kate!
The cluster has since been torn down on EC2, but I did monitor it during the job execution and all seemed to be OK. The JobTracker and NameNode continued to report status.
I was aware of the configuration setting and was hoping to refrain from playing with it :-) I get scared to set it too large, since that time could get unnecessarily charged to my EC2 account. :S
Do you know if it should still report status in the midst of a complex task? Seems questionable that it wouldn't just send a friendly hello?
Kate Ericson 2012-02-02, 01:32
If it's thrashing on something, there's a good chance it might miss a checkpoint. Like Ted brought up, there may be some very dense areas of your input causing this problem. How much memory are you giving to your Hadoop workers? The default value is rather small.
-Kate
Ted Dunning 2012-02-02, 01:44
Status reporting happens automatically when output is generated. In a long computation, it is good form to occasionally update a counter or otherwise indicate that the computation is still progressing.
On Wed, Feb 1, 2012 at 5:23 PM, Nicholas Kolegraff <[EMAIL PROTECTED]> wrote:
> Do you know if it should still report status in the midst of a complex task? Seems questionable that it wouldn't just send a friendly hello?
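
To make the "update a counter" advice concrete, here is a minimal sketch of a keep-alive in Python. It assumes a Hadoop Streaming task, where a line of the form `reporter:counter:<group>,<counter>,<amount>` written to stderr updates a counter; the Mahout job in this thread is a Java job and would report progress through the task's own reporter or context object instead, so treat this as an illustration of the idea rather than a patch for the job. The group and counter names (`ALS`, `rows_processed`) and the helper names are hypothetical.

```python
import sys


def counter_line(group, counter, amount=1):
    """Format a Hadoop Streaming counter update (meant to be written to stderr)."""
    return f"reporter:counter:{group},{counter},{amount}"


def process_with_keepalive(records, work, every=1000, stream=sys.stderr):
    """Yield work(record) for each record, pinging a counter every `every` records.

    The periodic write shows the framework that a long-running task is
    still making progress, even when it emits little or no output.
    """
    for i, record in enumerate(records, start=1):
        yield work(record)
        if i % every == 0:
            print(counter_line("ALS", "rows_processed", every), file=stream)
            stream.flush()


if __name__ == "__main__":
    # Toy run: square five numbers and ping the counter after every second one.
    for value in process_with_keepalive(range(5), lambda x: x * x, every=2):
        print(value)
```

A ping like this only helps while the task's own code is running; it cannot fire during phases the framework runs on the task's behalf.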
Nicholas Kolegraff 2012-02-02, 02:03
The densest row contains about 55K elements of the 1M. There are about 5 other rows with about 10K each; it then drops off considerably for the others, to ~2K.
I am using the memory-intensive bootstrap action on EC2 -- which bumps heap space for the child JVMs to around 4G, I believe.
Sean Owen 2012-02-02, 08:53
I have seen this happen in "normal" operation when the sorting on the mapper is taking a long, long time because the output is large. You can tell it to increase the timeout. If this is what is happening, you won't have a chance to update a counter as a keep-alive ping, but yes, that is generally right otherwise. If this is the case, it means a mapper is outputting a whole lot of info, perhaps 'too much'. I don't know for sure; it's just another guess for the pile.
Sebastian Schelter 2012-02-02, 09:47
Nicholas,
can you give us the detailed arguments you start the job with? I'd especially be interested in the number of features (--numFeatures) you use. Do you use the job with implicit feedback data (--implicitFeedback=true)?
The memory requirements of the job are the following:
In each iteration either the item-features matrix (items x features) or the user-features matrix (users x features) is loaded into the memory of each mapper. Then the original user-item matrix (or its transpose) is read row-wise by the mappers and they recompute the features via AlternatingLeastSquaresSolver/ImplicitFeedbackAlternatingLeastSquaresSolver.
--sebastian
Nicholas Kolegraff 2012-02-02, 16:25
I will up the ante with the time out and report back -- thanks all for the suggestions
Hey, Sebastian -- Here are the arguments I am using:
--input matrix --output ALS --numFeatures 25 --numIterations 10 --lambda 0.065
When the mapper loads the matrix into memory it only loads the actual non-zero data, correct?
Hey Ted -- I messed up on the sparsity. Turns out there are only 70M non-zero elements.
Oh, and, I only have binary data -- I wasn't sure of the implications with ALS-WR on binary data -- I couldn't find anything to suggest otherwise.
I am using data of the format user,item,1
I have read about probabilistic factorization -- which works with binary data -- and, perhaps naively, thought ALS-WR was similar, so what-the-heck :-)
I'd love nothing more than to share the data, however, I'd probably get in some trouble :-) Perhaps I could generate a matrix with a similar distribution? -- I'll have to check on that and see if it is ok #bureaucracy
Stay tuned...
Sebastian Schelter 2012-02-02, 16:31
Your parameters look good, except if you have binary data, you should set --implicitFeedback=true. You could also set numFeatures to a very small value (like 5) just to see if that helps.
The mappers load one of the feature matrices into memory, and these matrices are dense (#items x #features entries or #users x #features entries). Are you sure that the mappers have enough memory for that?
It's really strange that you have problems with such small data. I tested this with Netflix (> 100M non-zeros) on a few machines and it worked quite well.
--sebastian
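
As a rough illustration of why the number of features matters, the sketch below estimates the raw size of one dense feature matrix held in a mapper. It is a lower bound under stated assumptions: 8 bytes per entry (double precision), no JVM object or container overhead, and the ~8 million dimension taken to be users and the ~1 million dimension items. The function names are made up for this example.

```python
BYTES_PER_DOUBLE = 8  # assumption: entries stored as double-precision floats


def dense_matrix_bytes(rows, num_features):
    """Raw size of a dense rows x num_features matrix of doubles, in bytes."""
    return rows * num_features * BYTES_PER_DOUBLE


def to_gib(num_bytes):
    """Convert bytes to GiB (1024^3 bytes)."""
    return num_bytes / 1024**3


# Matrix dimensions as described in the thread (approximate).
dimensions = {"users": 8_000_000, "items": 1_000_000}

for num_features in (25, 5):
    for name, rows in dimensions.items():
        size = dense_matrix_bytes(rows, num_features)
        print(f"numFeatures={num_features:>2} {name}: {to_gib(size):.2f} GiB")
```

The actual footprint inside a mapper will be higher than these raw numbers, so the child JVM heap needs headroom beyond them.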
Nicholas Kolegraff 2012-02-02, 16:37
Sounds good. Thanks Sebastian
The interesting thing is -- I tried sampling the matrix down one time to about 10% of the non-zeros -- and it worked with no problem.
Sebastian Schelter 2012-02-02, 16:40
Hmm, are you sure that the mappers have enough memory? You can set that via -Dmapred.child.java.opts=-Xmx[some number]m
--sebastian
Nicholas Kolegraff 2012-02-02, 18:56
OK, I took a deeper look into this after changing some parameters and kicking off the new job.
Seems plausible that I didn't have enough memory for some of the mappers -- unless I'm missing something here.
An upper bound on the memory would be (assuming my original parameter of 25 features):
8Mil * 25 features = 200Mil
(multiply by 8 bytes, assuming double-precision floating point) and we get 1.6 billion bytes
1.6B / (1024^3) = ~1.5GB memory needed
The tasktracker heapsize and datanode heap sizes were only set to: 1GB
So I have changed the bootstrap action on EC2 as follows (this is a diff between the original and the changes I made):

# Parameters of the array:
# [mapred.child.java.opts, mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum]
29c29
< "m2.2xlarge" => ["-Xmx4096m", "6", "2"],
---
> "m2.2xlarge" => ["-Xmx8192m", "3", "2"],
# Parameters of the array (Vars modified in hadoop.env.sh)
# [HADOOP_JOBTRACKER_HEAPSIZE, HADOOP_NAMENODE_HEAPSIZE, HADOOP_TASKTRACKER_HEAPSIZE, HADOOP_DATANODE_HEAPSIZE]
47c47
< "m2.2xlarge" => ["2048", "8192", "1024", "1024"],
---
> "m2.2xlarge" => ["4096", "16384", "2048", "2048"]
On Thu, Feb 2, 2012 at 8:40 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
> Hmm, are you sure that the mappers have enough memory? You can set that > via Dmapred.child.java.opts=-Xmx[some number]m > > --sebastian > > On 02.02.2012 17:37, Nicholas Kolegraff wrote: > > Sounds good. Thanks Sebastian > > > > The interesting thing is -- I tried to sample the matrix down one time to > > about 10% of non-zeros -- and worked no problem. > > > > On Thu, Feb 2, 2012 at 8:31 AM, Sebastian Schelter <[EMAIL PROTECTED]> > wrote: > > > >> Your parameters look good, except if you have binary data, you should > >> set --implicitFeedback=true. You could also set numFeatures to a very > >> small value (like 5) just to see if that helps. > >> > >> The mappers load one of the feature matrices into memory which are dense > >> (#items x #features entries or #users x #features entries). Are you sure > >> that the mappers have enough memory for that? > >> > >> It's really strange that you have problems with such small data, I > >> tested this with Netflix (> 100M non-zeros) on a few machines and it > >> worked quite well. > >> > >> --sebastian > >> > >> > >> > >> On 02.02.2012 17:25, Nicholas Kolegraff wrote: > >>> I will up the ante with the time out and report back -- thanks all for > >> the > >>> suggestions > >>> > >>> Hey, Sebastian -- Here are the arguments I am using: > >>> --input matrix --output ALS --numFeatures 25 --numIterations 10 > --lambda > >>> 0.065 > >>> When the mapper loads the matrix into memory it only loads the actual > >>> non-zero data, correct? > >>> > >>> Hey Ted -- I messed up on the sparsity. Turns out there are only 70M > >>> non-zero elements. > >>> > >>> Oh, and, I only have binary data -- I wasn't sure of the implications > >> with > >>> ALS-WR on binary data -- I couldn't find anything to suggest otherwise. 
> >>> I am using data of the format user,item,1 > >>> I have read about probabilistic factorization -- which works with > binary > >>> data -- and perhaps naively, thought ALS-WR was similar so > what-the-heck > >> :-) > >>> > >>> I'd love nothing more than to share the data, however, I'd probably get > >> in > >>> some trouble :-) > >>> Perhaps I could generate a matrix with a similar distribution? -- I'll > >> have > >>> to check on that and see if it is ok #bureaucracy > >>> > >>> Stay tuned... > >>> > >>> On Thu, Feb 2, 2012 at 1:47 AM, Sebastian Schelter <[EMAIL PROTECTED]> > >> wrote: > >>> > >>>> Nicholas, > >>>> > >>>> can you give us the detailed arguments you start the job with? I'd > >>>> especially be interested in the number of features (--numFeatures) you > >>>> use. Do you use the job with implicit feedback data > >>>> (--implicitFeedback=true)? > >>>> > >>>> The memory requirements of the job are the following: > >>>> > >>>> In each iteration either the item-features matrix (items x features)
-
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Ken Krugler 2012-02-02, 19:25
Hi Nicholas,

On Feb 2, 2012, at 10:56am, Nicholas Kolegraff wrote:

> Ok, I took a bit deeper look into this having changed some parameters and
> kicked off the new job..
>
> Seems plausible that I didn't have enough memory for some of the mappers --
> unless I'm missing something here.
> An upper bound on the memory would be (assuming my original parameter of 25
> features)
> 8Mil * 25 Features = 200Mil
> (multiply by 8 bytes assuming double precision floating point) and we get:
> 1.6billion
> 1.6B / (1024^3) = ~1.5GB memory needed
>
> The tasktracker heapsize and datanode heap sizes were only set to: 1GB

The memory you need for this task is based on the mapred.child.java.opts setting (the -Xmx setting), not what's allocated for the NameNode, JobTracker, DataNode or TaskTracker.

In fact, increasing the DataNode & TaskTracker sizes removes memory that could/should be used by the child JVMs that the TaskTracker creates to run your map & reduce tasks.

Currently it looks like you have 4GB allocated for m2.2xlarge tasks, which should be sufficient given your analysis above.

-- Ken

> So I have changed the bootstrap action on EC2 as follows (this is a diff
> between the original and the changes I made)
> # Parameters of the array:
> # [mapred.child.java.opts, mapred.tasktracker.map.tasks.maximum,
> mapred.tasktracker.reduce.tasks.maximum]
> 29c29
> < "m2.2xlarge" => ["-Xmx4096m", "6", "2"],
> ---
> > "m2.2xlarge" => ["-Xmx8192m", "3", "2"],
> # Parameters of the array (Vars modified in hadoop.env.sh)
> # [HADOOP_JOBTRACKER_HEAPSIZE, HADOOP_NAMENODE_HEAPSIZE,
> HADOOP_TASKTRACKER_HEAPSIZE, HADOOP_DATANODE_HEAPSIZE]
> 47c47
> < "m2.2xlarge" => ["2048", "8192", "1024", "1024"],
> ---
> > "m2.2xlarge" => ["4096", "16384", "2048", "2048"]

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
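The upper-bound estimate quoted in this message can be reproduced with a few lines of Python. This is only a sketch of the arithmetic from the thread (rows x features x 8 bytes per double). The function names are made up for illustration, and the result ignores JVM object overhead and anything else the mapper holds in memory, so real usage is likely to be higher.

```python
def dense_matrix_bytes(rows: int, features: int, bytes_per_value: int = 8) -> int:
    """Raw size of a dense rows x features matrix of doubles (no JVM overhead)."""
    return rows * features * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB using 1024^3, as in the estimate from the thread."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    large_side = 8_000_000  # ~8 million, the larger matrix dimension from the thread
    small_side = 1_000_000  # ~1 million, the smaller matrix dimension from the thread
    for features in (25, 5):  # the original setting and the smaller value suggested
        large = to_gib(dense_matrix_bytes(large_side, features))
        small = to_gib(dense_matrix_bytes(small_side, features))
        print(f"numFeatures={features}: {large:.2f} GiB (large side), "
              f"{small:.2f} GiB (small side)")
```

With 25 features the larger feature matrix comes to roughly 1.5 GiB of raw doubles, which matches the figure above and is well under a 4 GB child heap only if the per-entry overhead stays modest.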
-
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Nicholas Kolegraff 2012-02-03, 01:48
Success! Thanks all!!

I changed the --numFeatures option to 5 and it went through with no problems. However, the final step of 'SolveExplicitFeedback' took a very long time relative to the others, so I suspect the other suggestion of raising mapred.task.timeout to something much larger than 600 seconds would also have fixed the issue (provided you have enough memory).
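For anyone wanting to try the timeout route mentioned above, here is a hedged sketch that only assembles a command line and does not run anything. It assumes the job is started through the `mahout` launcher script under the name `parallelALS` and that Hadoop `-D` generic options are accepted before the job's own options; neither is confirmed in this thread. The helper name and the 1800-second and 4096 MB values are purely illustrative. `mapred.task.timeout` is given in milliseconds, which is why the default shows up as 600 seconds in the error message.

```python
import shlex


def build_als_command(input_path, output_path, num_features, num_iterations,
                      lam, timeout_seconds, child_heap_mb):
    """Assemble (but do not run) a parallelALS command with Hadoop -D overrides."""
    return [
        "mahout", "parallelALS",  # assumed launcher script and job name
        # Hadoop generic options; mapred.task.timeout is in milliseconds.
        f"-Dmapred.task.timeout={timeout_seconds * 1000}",
        f"-Dmapred.child.java.opts=-Xmx{child_heap_mb}m",
        # Job options as quoted in the thread.
        "--input", input_path,
        "--output", output_path,
        "--numFeatures", str(num_features),
        "--numIterations", str(num_iterations),
        "--lambda", str(lam),
    ]


if __name__ == "__main__":
    cmd = build_als_command("matrix", "ALS", 5, 10, 0.065,
                            timeout_seconds=1800, child_heap_mb=4096)
    print(shlex.join(cmd))
```

Building the argument list first makes it easy to inspect the exact properties being passed before submitting a long-running job.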
-
Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Nicholas Kolegraff 2012-02-09, 02:50
Sorry to re-surface. When I try to evaluate the factorization, I now run into this error:

java.lang.ArrayIndexOutOfBoundsException: 2
    at org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator$PredictRatingsMapper.map(FactorizationEvaluator.java:137)
    at org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator$PredictRatingsMapper.map(FactorizationEvaluator.java:117)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:629)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:310)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

However, this index (assuming it is a userID?) does exist in both the training and test sets? (Not sure if that matters.)
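In Java, the number in an ArrayIndexOutOfBoundsException message is the array position that was out of range, so the 2 here is not necessarily a user ID. One possible reading, which is a guess and not confirmed anywhere in this thread, is that the evaluator split an input line into fields and found fewer than three. The sketch below scans text lines for that condition. The function name is hypothetical, and the comma delimiter and three-field layout are assumptions taken from the user,item,1 format quoted earlier.

```python
def find_short_lines(lines, expected_fields=3, delimiter=","):
    """Yield (line_number, text) for lines with fewer than expected_fields fields."""
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text.strip():
            # Blank lines would also break a naive split-and-index parser.
            yield number, text
            continue
        if len(text.split(delimiter)) < expected_fields:
            yield number, text


if __name__ == "__main__":
    sample = ["1,10,1\n", "2,11\n", "\n", "3,12,1\n"]
    for number, text in find_short_lines(sample):
        print(f"line {number}: {text!r}")
```

Running something like this over a sample of the evaluation input would at least rule malformed lines in or out before digging into the Mahout source.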