|
Dmitriy Lyubimov
2011-12-19, 15:58
Sebastian Schelter
2011-12-17, 14:00
Ted Dunning
2011-12-17, 20:40
Dmitriy Lyubimov
2011-12-17, 16:27
Sebastian Schelter
2011-12-17, 22:58
Dmitriy Lyubimov
2011-12-18, 03:24
Dmitriy Lyubimov
2011-12-18, 03:27
Sebastian Schelter
2011-12-18, 08:05
Dmitriy Lyubimov
2011-12-19, 08:09
Sebastian Schelter
2011-12-19, 10:00
Dmitriy Lyubimov
2011-12-19, 16:14
Sebastian Schelter
2011-12-19, 16:27
Dmitriy Lyubimov
2011-12-19, 16:34
Dmitriy Lyubimov
2011-12-19, 17:48
Sebastian Schelter
2011-12-19, 14:15
Dmitriy Lyubimov
2011-12-19, 16:07
Dmitriy Lyubimov
2011-12-17, 23:04
Dmitriy Lyubimov
2011-12-17, 23:13
Dmitriy Lyubimov
2011-12-17, 23:25
Dmitriy Lyubimov
2011-12-18, 01:49
|
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-19, 15:58
Also of course I am running 1 map task per core which creates full CPU
load. If your have 4 cores well then running 8 mappers will double the map task running time in your case. These tasks are CPU bound. The optimal running time for ssvd is achieved with 1 map task per core. On Dec 19, 2011 12:09 AM, "Dmitriy Lyubimov" <[EMAIL PROTECTED]> wrote: +
Dmitriy Lyubimov 2011-12-19, 15:58
-
Decomposing large graphsSebastian Schelter 2011-12-17, 14:00
Hi there,
I played with Mahout to decompose the adjacency matrices of large graphs lately. I stumbled on a paper of Christos Faloutsos that describes a variation of the Lanczos algorithm they use for this on top of Hadoop. They even explicitly mention Mahout: "Very recently(March 2010), the Mahout project [2] provides SVD on top of HADOOP. Due to insufficient documentation, we were not able to find the input format and run a head-to-head comparison. But, reading the source code, we discovered that Mahout suffers from two major issues: (a) it assumes that the vector (b, with n=O(billion) entries) fits in the memory of a single machine, and (b) it implements the full re-orthogonalization which is inefficient." http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf --sebastian +
Sebastian Schelter 2011-12-17, 14:00
-
Re: Decomposing large graphsTed Dunning 2011-12-17, 20:40
Both of those complaints are dealt with by Dmitriy's work on the stochastic
decomposition. On Sat, Dec 17, 2011 at 6:00 AM, Sebastian Schelter <[EMAIL PROTECTED] > wrote: > Hi there, > > I played with Mahout to decompose the adjacency matrices of large graphs > lately. I stumbled on a paper of Christos Faloutsos that describes a > variation of the Lanczos algorithm they use for this on top of Hadoop. > They even explicitly mention Mahout: > > "Very recently(March 2010), the Mahout project [2] provides > SVD on top of HADOOP. Due to insufficient documentation, we were not > able to find the input format and run a head-to-head comparison. But, > reading the source code, we discovered that Mahout suffers from two > major issues: (a) it assumes that the vector (b, with n=O(billion) > entries) fits in the memory of a single machine, and (b) it implements > the full re-orthogonalization which is inefficient." > > http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf > > --sebastian > +
Ted Dunning 2011-12-17, 20:40
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-17, 16:27
Interesting.
Well so how did your decomposing go? On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> wrote: > Hi there, > > I played with Mahout to decompose the adjacency matrices of large graphs > lately. I stumbled on a paper of Christos Faloutsos that describes a > variation of the Lanczos algorithm they use for this on top of Hadoop. > They even explicitly mention Mahout: > > "Very recently(March 2010), the Mahout project [2] provides > SVD on top of HADOOP. Due to insufficient documentation, we were not > able to find the input format and run a head-to-head comparison. But, > reading the source code, we discovered that Mahout suffers from two > major issues: (a) it assumes that the vector (b, with n=O(billion) > entries) fits in the memory of a single machine, and (b) it implements > the full re-orthogonalization which is inefficient." > > http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf > > --sebastian > +
Dmitriy Lyubimov 2011-12-17, 16:27
-
Re: Decomposing large graphsSebastian Schelter 2011-12-17, 22:58
On 17.12.2011 17:27, Dmitriy Lyubimov wrote:
> Interesting. > > Well so how did your decomposing go? I tested the decomposition of the wikipedia pagelink graph (130M edges, 5.6M vertices making approx. quarter of a billion non-zeros in the symmetric adjacency matrix) on a 6 machine hadoop cluster. Got these running times for k = 10, p = 5 and one power-iteration: Q-job 1mins, 41sec Bt-job 9mins, 30sec ABt-job 37mins, 22sec Bt-job 9mins, 41sec U-job 30sec I think I'd need a couple more machines to handle the twitter graph though... --sebastian > On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> > wrote: > >> Hi there, >> >> I played with Mahout to decompose the adjacency matrices of large graphs >> lately. I stumbled on a paper of Christos Faloutsos that describes a >> variation of the Lanczos algorithm they use for this on top of Hadoop. >> They even explicitly mention Mahout: >> >> "Very recently(March 2010), the Mahout project [2] provides >> SVD on top of HADOOP. Due to insufficient documentation, we were not >> able to find the input format and run a head-to-head comparison. But, >> reading the source code, we discovered that Mahout suffers from two >> major issues: (a) it assumes that the vector (b, with n=O(billion) >> entries) fits in the memory of a single machine, and (b) it implements >> the full re-orthogonalization which is inefficient." >> >> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf >> >> --sebastian >> > +
Sebastian Schelter 2011-12-17, 22:58
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-18, 03:24
Sebastian,
I think my commit got screwed somehow. Here's the benchmark very close to yours (I also added broadcast option for B thru distributed cache which seems to have been a 30% win in my situation, ok, i leave it on by default). input: 4.5M x 4.5M sparse matrix with 40 non-zero elements per row, total size -rw-r--r-- 3 dmitriy supergroup 2353201134 2011-12-17 17:17 /user/dmitriy/drm-sparse.seq This is probably a little smaller than your wikipedia test in terms of dimensions but not in size. ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1 -Dmapred.child.java.opts="-Xmx800m" so this is for 10 useful eigenvalues and U,V with 1 power iteration: with broadcast option: QJob 1 min 11s BtJob 1 min 58s AB' Job 2 min 6s B'Job 1 min 51 s U Job - 0m 12 s, V job - 20 s (U and V run in parallel, not sequential) without broadcast option (only affects AB' step) QJob 1 min 12s BtJob 1 min 59s AB' Job 3 min 3s B'Job 2 min 0 s 10 nodes 4 cpu each. (I had 35 splits of original input, so everything got allocated). It also runs our QA stuff starting something else almost every minute. frequent other jobs so while not loaded, job/task trackers are fairly busy setting up/tearing down. I'll commit shortly. On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > On 17.12.2011 17:27, Dmitriy Lyubimov wrote: >> Interesting. >> >> Well so how did your decomposing go? > > I tested the decomposition of the wikipedia pagelink graph (130M edges, > 5.6M vertices making approx. quarter of a billion non-zeros in the > symmetric adjacency matrix) on a 6 machine hadoop cluster. > > Got these running times for k = 10, p = 5 and one power-iteration: > > Q-job 1mins, 41sec > Bt-job 9mins, 30sec > ABt-job 37mins, 22sec > Bt-job 9mins, 41sec > U-job 30sec > > I think I'd need a couple more machines to handle the twitter graph > though... > > --sebastian > > >> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> >> wrote: >> >>> Hi there, >>> >>> I played with Mahout to decompose the adjacency matrices of large graphs >>> lately. I stumbled on a paper of Christos Faloutsos that describes a >>> variation of the Lanczos algorithm they use for this on top of Hadoop. >>> They even explicitly mention Mahout: >>> >>> "Very recently(March 2010), the Mahout project [2] provides >>> SVD on top of HADOOP. Due to insufficient documentation, we were not >>> able to find the input format and run a head-to-head comparison. But, >>> reading the source code, we discovered that Mahout suffers from two >>> major issues: (a) it assumes that the vector (b, with n=O(billion) >>> entries) fits in the memory of a single machine, and (b) it implements >>> the full re-orthogonalization which is inefficient." >>> >>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf >>> >>> --sebastian >>> >> > +
Dmitriy Lyubimov 2011-12-18, 03:24
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-18, 03:27
also make sure you are not hitting swap. If you specifiy 3G per task,
and you have typical quad core commodity nodes, it probably means that you have at least 4 map / 4 reduce tasks configured. that's 8 task total, unless you have ~32G ram on these machines, i think you are in danger of hitting swap with settings that high when GC gets all excited and all. I never though i would say that, but maybe you can run it a little bit more conservatively. my cluster is cdh3u0. -Dmitriy On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: > Sebastian, > > I think my commit got screwed somehow. Here's the benchmark very close > to yours (I also added broadcast option for B thru distributed cache > which seems to have been a 30% win in my situation, ok, i leave it on > by default). > > input: 4.5M x 4.5M sparse matrix with 40 non-zero elements per row, total size > > -rw-r--r-- 3 dmitriy supergroup 2353201134 2011-12-17 17:17 > /user/dmitriy/drm-sparse.seq > > This is probably a little smaller than your wikipedia test in terms of > dimensions but not in size. > > ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT > --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1 > -Dmapred.child.java.opts="-Xmx800m" > > so this is for 10 useful eigenvalues and U,V with 1 power iteration: > > with broadcast option: > QJob 1 min 11s > BtJob 1 min 58s > AB' Job 2 min 6s > B'Job 1 min 51 s > > U Job - 0m 12 s, V job - 20 s (U and V run in parallel, not sequential) > > without broadcast option (only affects AB' step) > > QJob 1 min 12s > BtJob 1 min 59s > AB' Job 3 min 3s > B'Job 2 min 0 s > > 10 nodes 4 cpu each. (I had 35 splits of original input, so everything > got allocated). It also runs our QA stuff starting something else > almost every minute. frequent other jobs so while not loaded, job/task > trackers are fairly busy setting up/tearing down. > > I'll commit shortly. > > On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >> On 17.12.2011 17:27, Dmitriy Lyubimov wrote: >>> Interesting. >>> >>> Well so how did your decomposing go? >> >> I tested the decomposition of the wikipedia pagelink graph (130M edges, >> 5.6M vertices making approx. quarter of a billion non-zeros in the >> symmetric adjacency matrix) on a 6 machine hadoop cluster. >> >> Got these running times for k = 10, p = 5 and one power-iteration: >> >> Q-job 1mins, 41sec >> Bt-job 9mins, 30sec >> ABt-job 37mins, 22sec >> Bt-job 9mins, 41sec >> U-job 30sec >> >> I think I'd need a couple more machines to handle the twitter graph >> though... >> >> --sebastian >> >> >>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> >>> wrote: >>> >>>> Hi there, >>>> >>>> I played with Mahout to decompose the adjacency matrices of large graphs >>>> lately. I stumbled on a paper of Christos Faloutsos that describes a >>>> variation of the Lanczos algorithm they use for this on top of Hadoop. >>>> They even explicitly mention Mahout: >>>> >>>> "Very recently(March 2010), the Mahout project [2] provides >>>> SVD on top of HADOOP. Due to insufficient documentation, we were not >>>> able to find the input format and run a head-to-head comparison. But, >>>> reading the source code, we discovered that Mahout suffers from two >>>> major issues: (a) it assumes that the vector (b, with n=O(billion) >>>> entries) fits in the memory of a single machine, and (b) it implements >>>> the full re-orthogonalization which is inefficient." >>>> >>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf >>>> >>>> --sebastian >>>> >>> >> +
Dmitriy Lyubimov 2011-12-18, 03:27
-
Re: Decomposing large graphsSebastian Schelter 2011-12-18, 08:05
I ran it with 8 map and 4 reduce tasks allowed per machine using
-Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM). Most of the time was spent in QRReducer not in ABtMapper. I was promised to get better monitoring next week, I'll rerun the tests and hopefully can provide more informative results. --sebastian On 18.12.2011 04:27, Dmitriy Lyubimov wrote: > also make sure you are not hitting swap. If you specifiy 3G per task, > and you have typical quad core commodity nodes, it probably means that > you have at least 4 map / 4 reduce tasks configured. that's 8 task > total, unless you have ~32G ram on these machines, i think you are in > danger of hitting swap with settings that high when GC gets all > excited and all. I never though i would say that, but maybe you can > run it a little bit more conservatively. > > my cluster is cdh3u0. > > -Dmitriy > > On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: >> Sebastian, >> >> I think my commit got screwed somehow. Here's the benchmark very close >> to yours (I also added broadcast option for B thru distributed cache >> which seems to have been a 30% win in my situation, ok, i leave it on >> by default). >> >> input: 4.5M x 4.5M sparse matrix with 40 non-zero elements per row, total size >> >> -rw-r--r-- 3 dmitriy supergroup 2353201134 2011-12-17 17:17 >> /user/dmitriy/drm-sparse.seq >> >> This is probably a little smaller than your wikipedia test in terms of >> dimensions but not in size. >> >> ssvd -i /user/dmitriy/drm-sparse.seq -o /user/dmitriy/SSVD-OUT >> --tempDir /user/dmitriy/temp -ow -br false -k 10 -p 1 -t 20 -q 1 >> -Dmapred.child.java.opts="-Xmx800m" >> >> so this is for 10 useful eigenvalues and U,V with 1 power iteration: >> >> with broadcast option: >> QJob 1 min 11s >> BtJob 1 min 58s >> AB' Job 2 min 6s >> B'Job 1 min 51 s >> >> U Job - 0m 12 s, V job - 20 s (U and V run in parallel, not sequential) >> >> without broadcast option (only affects AB' step) >> >> QJob 1 min 12s >> BtJob 1 min 59s >> AB' Job 3 min 3s >> B'Job 2 min 0 s >> >> 10 nodes 4 cpu each. (I had 35 splits of original input, so everything >> got allocated). It also runs our QA stuff starting something else >> almost every minute. frequent other jobs so while not loaded, job/task >> trackers are fairly busy setting up/tearing down. >> >> I'll commit shortly. >> >> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote: >>>> Interesting. >>>> >>>> Well so how did your decomposing go? >>> >>> I tested the decomposition of the wikipedia pagelink graph (130M edges, >>> 5.6M vertices making approx. quarter of a billion non-zeros in the >>> symmetric adjacency matrix) on a 6 machine hadoop cluster. >>> >>> Got these running times for k = 10, p = 5 and one power-iteration: >>> >>> Q-job 1mins, 41sec >>> Bt-job 9mins, 30sec >>> ABt-job 37mins, 22sec >>> Bt-job 9mins, 41sec >>> U-job 30sec >>> >>> I think I'd need a couple more machines to handle the twitter graph >>> though... >>> >>> --sebastian >>> >>> >>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> >>>> wrote: >>>> >>>>> Hi there, >>>>> >>>>> I played with Mahout to decompose the adjacency matrices of large graphs >>>>> lately. I stumbled on a paper of Christos Faloutsos that describes a >>>>> variation of the Lanczos algorithm they use for this on top of Hadoop. >>>>> They even explicitly mention Mahout: >>>>> >>>>> "Very recently(March 2010), the Mahout project [2] provides >>>>> SVD on top of HADOOP. Due to insufficient documentation, we were not >>>>> able to find the input format and run a head-to-head comparison. But, >>>>> reading the source code, we discovered that Mahout suffers from two >>>>> major issues: (a) it assumes that the vector (b, with n=O(billion) >>>>> entries) fits in the memory of a single machine, and (b) it implements >>>>> the full re-orthogonalization which is inefficient." +
Sebastian Schelter 2011-12-18, 08:05
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-19, 08:09
Ok reducer then, hm. Then it's probably not the memory, for the
pressure holds high mostly in the mapper only. Reducer actually doesn't use much memory at all. The problem i previously had was with memory pressure and load time in the AB' mappers but reducers were always ok, copy+sort+run has always taken about a minute there for this input, nothing changed much there still. --> just in case, please check how many records map phase outputs. If it is more than 1 per mapper, please note by how much and increase -abth option to say 300,000 or until you don't see the multiple records output by map tasks. it should be ok, it is still roughly 24mb for everything for k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[] swathes, so it doesn't take any object overhead so much). Although with 922 patch it should not matter much, that was the point. If you can, please notice if it is a sort, copy or the reducer itself? If it is the sort, you may need to tweak sort buffer size, especially if you didn't set up too many reducers per job (but i assume you set num reducers at at least 20). By default, those buffers are just 100m which may turn out below block size, i am not sure how exactly it would work out in this case then. i think i bumped that up a little some long time ago. What else may be different different? i am at CDH3u0 here. your input seems to be 25% wider, which may affect AB' job map time in particular by at most that, which in my case would constitute about 15s difference, but hopefully not so much anything else. My setup is actually pretty meager, we have 10 quad-cpu nodes with 16G RAM each and they are already running hbase on top of it. which is why i mostly avoid running jobs much over 500m mem there, or i start scratching something on the disk. So memory doesn't seem to be the factor. (But my cluster is a barebone assembly sitting in the same rack which is a big plus). I also currently allow 4x4 map/reduce tasks per node. if you can point me where i can download your input exactly (without or with minimum prep steps, i think i can't spend much energy on that while at the office), i may try to try on your exact input as well. ------ On a side note i found an error with my work on broadcasting via distributed cache. After patching it up, i actually see no difference whatsoever between direct streaming and using distributed cache, just like i initially figured. My timing for AB' multiplication revolves around 3 min 10 s, distributed cache or streaming. Since i did that patch anyway, i will be offering it for commit but i will probably have to put it on the review board as it touches some iterator stuff that Sean previously crafted. the branch name with broadcast option enabled is called MAHOUT-922-2. On Sun, Dec 18, 2011 at 12:05 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > I ran it with 8 map and 4 reduce tasks allowed per machine using > -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM). > > Most of the time was spent in QRReducer not in ABtMapper. I was promised > to get better monitoring next week, I'll rerun the tests and hopefully > can provide more informative results. > > --sebastian > > On 18.12.2011 04:27, Dmitriy Lyubimov wrote: >> also make sure you are not hitting swap. If you specifiy 3G per task, >> and you have typical quad core commodity nodes, it probably means that >> you have at least 4 map / 4 reduce tasks configured. that's 8 task >> total, unless you have ~32G ram on these machines, i think you are in >> danger of hitting swap with settings that high when GC gets all >> excited and all. I never though i would say that, but maybe you can >> run it a little bit more conservatively. >> >> my cluster is cdh3u0. >> >> -Dmitriy >> >> On Sat, Dec 17, 2011 at 7:24 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: >>> Sebastian, >>> >>> I think my commit got screwed somehow. Here's the benchmark very close >>> to yours (I also added broadcast option for B thru distributed cache +
Dmitriy Lyubimov 2011-12-19, 08:09
-
Re: Decomposing large graphsSebastian Schelter 2011-12-19, 10:00
Hi Dmitriy,
I finally found the problem... I did not set --reduceTasks, I ran the job again with 24 reduce tasks and was able to run the whole decomposition in less than 10 minutes. Sorry for causing you so much work. --sebastian hadoop jar mahout-core-0.6-SNAPSHOT-job.jar org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli -Dmapred.tasktracker.map.tasks.maximum=8 -Dmapred.tasktracker.reduce.tasks.maximum=6 -Dmapred.child.java.opts=-Xmx1024m --input hdfs:///ssc/wiki-eig15-tmp/adjacencyMatrix --output hdfs:///ssc/M922-7 --tempDir hdfs:///ssc/M922-tmp-7 --rank 10 --oversampling 5 --computeV false --powerIter 1 --reduceTasks 24 Q-Job 1mins, 49sec Bt-Job 2mins, 39sec ABt-Job 3mins, 51sec Bt-Job 2mins, 41sec U-Job 29sec On 19.12.2011 09:09, Dmitriy Lyubimov wrote: > Ok reducer then, hm. Then it's probably not the memory, for the > pressure holds high mostly in the mapper only. Reducer actually > doesn't use much memory at all. The problem i previously had was with > memory pressure and load time in the AB' mappers but reducers were > always ok, copy+sort+run has always taken about a minute there for > this input, nothing changed much there still. --> just in case, please > check how many records map phase outputs. If it is more than 1 per > mapper, please note by how much and increase -abth option to say > 300,000 or until you don't see the multiple records output by map > tasks. it should be ok, it is still roughly 24mb for everything for > k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[] > swathes, so it doesn't take any object overhead so much). Although > with 922 patch it should not matter much, that was the point. > > If you can, please notice if it is a sort, copy or the reducer itself? > If it is the sort, you may need to tweak sort buffer size, especially > if you didn't set up too many reducers per job (but i assume you set > num reducers at at least 20). By default, those buffers are just 100m > which may turn out below block size, i am not sure how exactly it > would work out in this case then. i think i bumped that up a little > some long time ago. > > > What else may be different different? i am at CDH3u0 here. your input > seems to be 25% wider, which may affect AB' job map time in particular > by at most that, which in my case would constitute about 15s > difference, but hopefully not so much anything else. > > My setup is actually pretty meager, we have 10 quad-cpu nodes with 16G > RAM each and they are already running hbase on top of it. which is why > i mostly avoid running jobs much over 500m mem there, or i start > scratching something on the disk. So memory doesn't seem to be the > factor. (But my cluster is a barebone assembly sitting in the same > rack which is a big plus). I also currently allow 4x4 map/reduce tasks > per node. > > if you can point me where i can download your input exactly (without > or with minimum prep steps, i think i can't spend much energy on that > while at the office), i may try to try on your exact input as well. > > > ------ > On a side note i found an error with my work on broadcasting via > distributed cache. After patching it up, i actually see no difference > whatsoever between direct streaming and using distributed cache, just > like i initially figured. My timing for AB' multiplication revolves > around 3 min 10 s, distributed cache or streaming. Since i did that > patch anyway, i will be offering it for commit but i will probably > have to put it on the review board as it touches some iterator stuff > that Sean previously crafted. the branch name with broadcast option > enabled is called MAHOUT-922-2. > > > > On Sun, Dec 18, 2011 at 12:05 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >> I ran it with 8 map and 4 reduce tasks allowed per machine using >> -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM). >> >> Most of the time was spent in QRReducer not in ABtMapper. I was promised >> to get better monitoring next week, I'll rerun the tests and hopefully +
Sebastian Schelter 2011-12-19, 10:00
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-19, 16:14
No problem. We actually got one patch done that was useful for extra sparse
cases, that's good. I think you still can achieve extra performance if you constrain to 1 task per core and have enough cores for all tasks to run at once. On Dec 19, 2011 2:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> wrote: > Hi Dmitriy, > > I finally found the problem... I did not set --reduceTasks, I ran the > job again with 24 reduce tasks and was able to run the whole > decomposition in less than 10 minutes. Sorry for causing you so much work. > > --sebastian > > hadoop jar mahout-core-0.6-SNAPSHOT-job.jar > org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli > -Dmapred.tasktracker.map.tasks.maximum=8 > -Dmapred.tasktracker.reduce.tasks.maximum=6 > -Dmapred.child.java.opts=-Xmx1024m --input > hdfs:///ssc/wiki-eig15-tmp/adjacencyMatrix --output hdfs:///ssc/M922-7 > --tempDir hdfs:///ssc/M922-tmp-7 --rank 10 --oversampling 5 --computeV > false --powerIter 1 --reduceTasks 24 > > Q-Job 1mins, 49sec > Bt-Job 2mins, 39sec > ABt-Job 3mins, 51sec > Bt-Job 2mins, 41sec > U-Job 29sec > > On 19.12.2011 09:09, Dmitriy Lyubimov wrote: > > Ok reducer then, hm. Then it's probably not the memory, for the > > pressure holds high mostly in the mapper only. Reducer actually > > doesn't use much memory at all. The problem i previously had was with > > memory pressure and load time in the AB' mappers but reducers were > > always ok, copy+sort+run has always taken about a minute there for > > this input, nothing changed much there still. --> just in case, please > > check how many records map phase outputs. If it is more than 1 per > > mapper, please note by how much and increase -abth option to say > > 300,000 or until you don't see the multiple records output by map > > tasks. it should be ok, it is still roughly 24mb for everything for > > k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[] > > swathes, so it doesn't take any object overhead so much). Although > > with 922 patch it should not matter much, that was the point. > > > > If you can, please notice if it is a sort, copy or the reducer itself? > > If it is the sort, you may need to tweak sort buffer size, especially > > if you didn't set up too many reducers per job (but i assume you set > > num reducers at at least 20). By default, those buffers are just 100m > > which may turn out below block size, i am not sure how exactly it > > would work out in this case then. i think i bumped that up a little > > some long time ago. > > > > > > What else may be different different? i am at CDH3u0 here. your input > > seems to be 25% wider, which may affect AB' job map time in particular > > by at most that, which in my case would constitute about 15s > > difference, but hopefully not so much anything else. > > > > My setup is actually pretty meager, we have 10 quad-cpu nodes with 16G > > RAM each and they are already running hbase on top of it. which is why > > i mostly avoid running jobs much over 500m mem there, or i start > > scratching something on the disk. So memory doesn't seem to be the > > factor. (But my cluster is a barebone assembly sitting in the same > > rack which is a big plus). I also currently allow 4x4 map/reduce tasks > > per node. > > > > if you can point me where i can download your input exactly (without > > or with minimum prep steps, i think i can't spend much energy on that > > while at the office), i may try to try on your exact input as well. > > > > > > ------ > > On a side note i found an error with my work on broadcasting via > > distributed cache. After patching it up, i actually see no difference > > whatsoever between direct streaming and using distributed cache, just > > like i initially figured. My timing for AB' multiplication revolves > > around 3 min 10 s, distributed cache or streaming. Since i did that > > patch anyway, i will be offering it for commit but i will probably > > have to put it on the review board as it touches some iterator stuff +
Dmitriy Lyubimov 2011-12-19, 16:14
-
Re: Decomposing large graphsSebastian Schelter 2011-12-19, 16:27
I have 16 cores per machine, I'll try to decrease Xmx a little and see
if I can run more tasks. On 19.12.2011 17:14, Dmitriy Lyubimov wrote: > No problem. We actually got one patch done that was useful for extra sparse > cases, that's good. I think you still can achieve extra performance if you > constrain to 1 task per core and have enough cores for all tasks to run at > once. > On Dec 19, 2011 2:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> wrote: > >> Hi Dmitriy, >> >> I finally found the problem... I did not set --reduceTasks, I ran the >> job again with 24 reduce tasks and was able to run the whole >> decomposition in less than 10 minutes. Sorry for causing you so much work. >> >> --sebastian >> >> hadoop jar mahout-core-0.6-SNAPSHOT-job.jar >> org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli >> -Dmapred.tasktracker.map.tasks.maximum=8 >> -Dmapred.tasktracker.reduce.tasks.maximum=6 >> -Dmapred.child.java.opts=-Xmx1024m --input >> hdfs:///ssc/wiki-eig15-tmp/adjacencyMatrix --output hdfs:///ssc/M922-7 >> --tempDir hdfs:///ssc/M922-tmp-7 --rank 10 --oversampling 5 --computeV >> false --powerIter 1 --reduceTasks 24 >> >> Q-Job 1mins, 49sec >> Bt-Job 2mins, 39sec >> ABt-Job 3mins, 51sec >> Bt-Job 2mins, 41sec >> U-Job 29sec >> >> On 19.12.2011 09:09, Dmitriy Lyubimov wrote: >>> Ok reducer then, hm. Then it's probably not the memory, for the >>> pressure holds high mostly in the mapper only. Reducer actually >>> doesn't use much memory at all. The problem i previously had was with >>> memory pressure and load time in the AB' mappers but reducers were >>> always ok, copy+sort+run has always taken about a minute there for >>> this input, nothing changed much there still. --> just in case, please >>> check how many records map phase outputs. If it is more than 1 per >>> mapper, please note by how much and increase -abth option to say >>> 300,000 or until you don't see the multiple records output by map >>> tasks. it should be ok, it is still roughly 24mb for everything for >>> k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[] >>> swathes, so it doesn't take any object overhead so much). Although >>> with 922 patch it should not matter much, that was the point. >>> >>> If you can, please notice if it is a sort, copy or the reducer itself? >>> If it is the sort, you may need to tweak sort buffer size, especially >>> if you didn't set up too many reducers per job (but i assume you set >>> num reducers at at least 20). By default, those buffers are just 100m >>> which may turn out below block size, i am not sure how exactly it >>> would work out in this case then. i think i bumped that up a little >>> some long time ago. >>> >>> >>> What else may be different different? i am at CDH3u0 here. your input >>> seems to be 25% wider, which may affect AB' job map time in particular >>> by at most that, which in my case would constitute about 15s >>> difference, but hopefully not so much anything else. >>> >>> My setup is actually pretty meager, we have 10 quad-cpu nodes with 16G >>> RAM each and they are already running hbase on top of it. which is why >>> i mostly avoid running jobs much over 500m mem there, or i start >>> scratching something on the disk. So memory doesn't seem to be the >>> factor. (But my cluster is a barebone assembly sitting in the same >>> rack which is a big plus). I also currently allow 4x4 map/reduce tasks >>> per node. >>> >>> if you can point me where i can download your input exactly (without >>> or with minimum prep steps, i think i can't spend much energy on that >>> while at the office), i may try to try on your exact input as well. >>> >>> >>> ------ >>> On a side note i found an error with my work on broadcasting via >>> distributed cache. After patching it up, i actually see no difference >>> whatsoever between direct streaming and using distributed cache, just >>> like i initially figured. My timing for AB' multiplication revolves >>> around 3 min 10 s, distributed cache or streaming. Since i did that +
Sebastian Schelter 2011-12-19, 16:27
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-19, 16:34
Oh that's a nice hardware you have out there then. Then you can indeed try
16 map tasks per node then if you have some CPU time to spare. I don't know if you will have enough slots to run all them at once though but that's nice. At this capacity you may probably even see some effects of having distributed cache enabled (not yet committed, MAHOUT-922-2 branch). On Dec 19, 2011 8:28 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> wrote: > I have 16 cores per machine, I'll try to decrease Xmx a little and see > if I can run more tasks. > > On 19.12.2011 17:14, Dmitriy Lyubimov wrote: > > No problem. We actually got one patch done that was useful for extra > sparse > > cases, that's good. I think you still can achieve extra performance if > you > > constrain to 1 task per core and have enough cores for all tasks to run > at > > once. > > On Dec 19, 2011 2:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> wrote: > > > >> Hi Dmitriy, > >> > >> I finally found the problem... I did not set --reduceTasks, I ran the > >> job again with 24 reduce tasks and was able to run the whole > >> decomposition in less than 10 minutes. Sorry for causing you so much > work. > >> > >> --sebastian > >> > >> hadoop jar mahout-core-0.6-SNAPSHOT-job.jar > >> org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli > >> -Dmapred.tasktracker.map.tasks.maximum=8 > >> -Dmapred.tasktracker.reduce.tasks.maximum=6 > >> -Dmapred.child.java.opts=-Xmx1024m --input > >> hdfs:///ssc/wiki-eig15-tmp/adjacencyMatrix --output hdfs:///ssc/M922-7 > >> --tempDir hdfs:///ssc/M922-tmp-7 --rank 10 --oversampling 5 --computeV > >> false --powerIter 1 --reduceTasks 24 > >> > >> Q-Job 1mins, 49sec > >> Bt-Job 2mins, 39sec > >> ABt-Job 3mins, 51sec > >> Bt-Job 2mins, 41sec > >> U-Job 29sec > >> > >> On 19.12.2011 09:09, Dmitriy Lyubimov wrote: > >>> Ok reducer then, hm. Then it's probably not the memory, for the > >>> pressure holds high mostly in the mapper only. Reducer actually > >>> doesn't use much memory at all. The problem i previously had was with > >>> memory pressure and load time in the AB' mappers but reducers were > >>> always ok, copy+sort+run has always taken about a minute there for > >>> this input, nothing changed much there still. --> just in case, please > >>> check how many records map phase outputs. If it is more than 1 per > >>> mapper, please note by how much and increase -abth option to say > >>> 300,000 or until you don't see the multiple records output by map > >>> tasks. it should be ok, it is still roughly 24mb for everything for > >>> k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[] > >>> swathes, so it doesn't take any object overhead so much). Although > >>> with 922 patch it should not matter much, that was the point. > >>> > >>> If you can, please notice if it is a sort, copy or the reducer itself? > >>> If it is the sort, you may need to tweak sort buffer size, especially > >>> if you didn't set up too many reducers per job (but i assume you set > >>> num reducers at at least 20). By default, those buffers are just 100m > >>> which may turn out below block size, i am not sure how exactly it > >>> would work out in this case then. i think i bumped that up a little > >>> some long time ago. > >>> > >>> > >>> What else may be different different? i am at CDH3u0 here. your input > >>> seems to be 25% wider, which may affect AB' job map time in particular > >>> by at most that, which in my case would constitute about 15s > >>> difference, but hopefully not so much anything else. > >>> > >>> My setup is actually pretty meager, we have 10 quad-cpu nodes with 16G > >>> RAM each and they are already running hbase on top of it. which is why > >>> i mostly avoid running jobs much over 500m mem there, or i start > >>> scratching something on the disk. So memory doesn't seem to be the > >>> factor. (But my cluster is a barebone assembly sitting in the same > >>> rack which is a big plus). I also currently allow 4x4 map/reduce tasks > >>> per node. +
Dmitriy Lyubimov 2011-12-19, 16:34
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-19, 17:48
Also, one of the reasons AB' job runs slower than B' job is because we
underload the reduce phase. AB' job has the same flops scale as B' job but it splits its job more evenly between mappers and reducers. Usually we are subconciously accustomed to a pattern where reducers deal with already aggregated results and do little more but save them into long files. but that's not the case with AB' job. Ideally, if we load reducers there to a maximum of 1 task per core (or, as hadoop recommends, 95% of available cores), then we should be approaching the running times of B' job there. There's littlel effect on B' job, though, it's all map-side. Once i've run my test with # of mappers == # of reducers (35, vs. previously 20), the B' job running time reduced by another average of 30 seconds (~15% running time reduction). So in your case you probably need to try to scale up reduce phases as well and run as many reducers as hadoop recommends (95% of your cluster core capacity, careful about memory though, but -1024m should be plenty). Another thing that i now remember i have on my cluster is enabled native lzo compression for MR jobs. in theory it helps the sorting phase a little bit. Hope that helps. -Dmitriy On Mon, Dec 19, 2011 at 8:34 AM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: > Oh that's a nice hardware you have out there then. Then you can indeed try > 16 map tasks per node then if you have some CPU time to spare. I don't know > if you will have enough slots to run all them at once though but that's > nice. At this capacity you may probably even see some effects of having > distributed cache enabled (not yet committed, MAHOUT-922-2 branch). > > On Dec 19, 2011 8:28 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> wrote: >> >> I have 16 cores per machine, I'll try to decrease Xmx a little and see >> if I can run more tasks. >> >> On 19.12.2011 17:14, Dmitriy Lyubimov wrote: >> > No problem. We actually got one patch done that was useful for extra >> > sparse >> > cases, that's good. I think you still can achieve extra performance if >> > you >> > constrain to 1 task per core and have enough cores for all tasks to run >> > at >> > once. >> > On Dec 19, 2011 2:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> wrote: >> > >> >> Hi Dmitriy, >> >> >> >> I finally found the problem... I did not set --reduceTasks, I ran the >> >> job again with 24 reduce tasks and was able to run the whole >> >> decomposition in less than 10 minutes. Sorry for causing you so much >> >> work. >> >> >> >> --sebastian >> >> >> >> hadoop jar mahout-core-0.6-SNAPSHOT-job.jar >> >> org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli >> >> -Dmapred.tasktracker.map.tasks.maximum=8 >> >> -Dmapred.tasktracker.reduce.tasks.maximum=6 >> >> -Dmapred.child.java.opts=-Xmx1024m --input >> >> hdfs:///ssc/wiki-eig15-tmp/adjacencyMatrix --output hdfs:///ssc/M922-7 >> >> --tempDir hdfs:///ssc/M922-tmp-7 --rank 10 --oversampling 5 --computeV >> >> false --powerIter 1 --reduceTasks 24 >> >> >> >> Q-Job 1mins, 49sec >> >> Bt-Job 2mins, 39sec >> >> ABt-Job 3mins, 51sec >> >> Bt-Job 2mins, 41sec >> >> U-Job 29sec >> >> >> >> On 19.12.2011 09:09, Dmitriy Lyubimov wrote: >> >>> Ok reducer then, hm. Then it's probably not the memory, for the >> >>> pressure holds high mostly in the mapper only. Reducer actually >> >>> doesn't use much memory at all. The problem i previously had was with >> >>> memory pressure and load time in the AB' mappers but reducers were >> >>> always ok, copy+sort+run has always taken about a minute there for >> >>> this input, nothing changed much there still. --> just in case, please >> >>> check how many records map phase outputs. If it is more than 1 per >> >>> mapper, please note by how much and increase -abth option to say >> >>> 300,000 or until you don't see the multiple records output by map >> >>> tasks. it should be ok, it is still roughly 24mb for everything for >> >>> k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[] >> >>> swathes, so it doesn't take any object overhead so much). Although +
Dmitriy Lyubimov 2011-12-19, 17:48
-
Re: Decomposing large graphsSebastian Schelter 2011-12-19, 14:15
I did some further testing today and I was able to decompose the twitter
graph (matrix with approx 3B non-zeros) in less than 2 hours: hadoop jar mahout-core-0.6-SNAPSHOT-job.jar org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli -Dmapred.tasktracker.map.tasks.maximum=8 -Dmapred.tasktracker.reduce.tasks.maximum=6 -Dmapred.child.java.opts=-Xmx2048m --input hdfs:///ssc/preprocessed/twitter/adjacencyMatrix --output hdfs:///ssc/twitter-3 --tempDir hdfs:///ssc/twitter-tmp-3 --rank 10 --oversampling 5 --computeV false --powerIter 1 --reduceTasks 24 Q-Job 9mins, 57sec Bt-Job 19mins, 15sec ABt-Job 1hrs, 8mins, 16sec Bt-Job 17mins, 3sec U-Job 1mins, 35sec I'm really impressed by your great work, Dmitriy! --sebastian On 19.12.2011 09:09, Dmitriy Lyubimov wrote: > Ok reducer then, hm. Then it's probably not the memory, for the > pressure holds high mostly in the mapper only. Reducer actually > doesn't use much memory at all. The problem i previously had was with > memory pressure and load time in the AB' mappers but reducers were > always ok, copy+sort+run has always taken about a minute there for > this input, nothing changed much there still. --> just in case, please > check how many records map phase outputs. If it is more than 1 per > mapper, please note by how much and increase -abth option to say > 300,000 or until you don't see the multiple records output by map > tasks. it should be ok, it is still roughly 24mb for everything for > k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[] > swathes, so it doesn't take any object overhead so much). Although > with 922 patch it should not matter much, that was the point. > > If you can, please notice if it is a sort, copy or the reducer itself? > If it is the sort, you may need to tweak sort buffer size, especially > if you didn't set up too many reducers per job (but i assume you set > num reducers at at least 20). By default, those buffers are just 100m > which may turn out below block size, i am not sure how exactly it > would work out in this case then. i think i bumped that up a little > some long time ago. > > > What else may be different different? i am at CDH3u0 here. your input > seems to be 25% wider, which may affect AB' job map time in particular > by at most that, which in my case would constitute about 15s > difference, but hopefully not so much anything else. > > My setup is actually pretty meager, we have 10 quad-cpu nodes with 16G > RAM each and they are already running hbase on top of it. which is why > i mostly avoid running jobs much over 500m mem there, or i start > scratching something on the disk. So memory doesn't seem to be the > factor. (But my cluster is a barebone assembly sitting in the same > rack which is a big plus). I also currently allow 4x4 map/reduce tasks > per node. > > if you can point me where i can download your input exactly (without > or with minimum prep steps, i think i can't spend much energy on that > while at the office), i may try to try on your exact input as well. > > > ------ > On a side note i found an error with my work on broadcasting via > distributed cache. After patching it up, i actually see no difference > whatsoever between direct streaming and using distributed cache, just > like i initially figured. My timing for AB' multiplication revolves > around 3 min 10 s, distributed cache or streaming. Since i did that > patch anyway, i will be offering it for commit but i will probably > have to put it on the review board as it touches some iterator stuff > that Sean previously crafted. the branch name with broadcast option > enabled is called MAHOUT-922-2. > > > > On Sun, Dec 18, 2011 at 12:05 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >> I ran it with 8 map and 4 reduce tasks allowed per machine using >> -Dmapred.child.java.opts=-Xmx1024m (each machine has 32GB RAM). >> >> Most of the time was spent in QRReducer not in ABtMapper. I was promised >> to get better monitoring next week, I'll rerun the tests and hopefully +
Sebastian Schelter 2011-12-19, 14:15
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-19, 16:07
That is in 6 nodes, 24 cores, right? It would be interesting to have
capacity to map all tasks per core at once, then you probably would be getting optimal thruput.. On Dec 19, 2011 6:16 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> wrote: > I did some further testing today and I was able to decompose the twitter > graph (matrix with approx 3B non-zeros) in less than 2 hours: > > hadoop jar mahout-core-0.6-SNAPSHOT-job.jar > org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli > -Dmapred.tasktracker.map.tasks.maximum=8 > -Dmapred.tasktracker.reduce.tasks.maximum=6 > -Dmapred.child.java.opts=-Xmx2048m --input > hdfs:///ssc/preprocessed/twitter/adjacencyMatrix --output > hdfs:///ssc/twitter-3 --tempDir hdfs:///ssc/twitter-tmp-3 --rank 10 > --oversampling 5 --computeV false --powerIter 1 --reduceTasks 24 > > Q-Job 9mins, 57sec > Bt-Job 19mins, 15sec > ABt-Job 1hrs, 8mins, 16sec > Bt-Job 17mins, 3sec > U-Job 1mins, 35sec > > I'm really impressed by your great work, Dmitriy! > > --sebastian > > On 19.12.2011 09:09, Dmitriy Lyubimov wrote: > > Ok reducer then, hm. Then it's probably not the memory, for the > > pressure holds high mostly in the mapper only. Reducer actually > > doesn't use much memory at all. The problem i previously had was with > > memory pressure and load time in the AB' mappers but reducers were > > always ok, copy+sort+run has always taken about a minute there for > > this input, nothing changed much there still. --> just in case, please > > check how many records map phase outputs. If it is more than 1 per > > mapper, please note by how much and increase -abth option to say > > 300,000 or until you don't see the multiple records output by map > > tasks. it should be ok, it is still roughly 24mb for everything for > > k+p=15 of your case ( (k+p) x abth x 8bytes packed as long double[] > > swathes, so it doesn't take any object overhead so much). Although > > with 922 patch it should not matter much, that was the point. > > > > If you can, please notice if it is a sort, copy or the reducer itself? > > If it is the sort, you may need to tweak sort buffer size, especially > > if you didn't set up too many reducers per job (but i assume you set > > num reducers at at least 20). By default, those buffers are just 100m > > which may turn out below block size, i am not sure how exactly it > > would work out in this case then. i think i bumped that up a little > > some long time ago. > > > > > > What else may be different different? i am at CDH3u0 here. your input > > seems to be 25% wider, which may affect AB' job map time in particular > > by at most that, which in my case would constitute about 15s > > difference, but hopefully not so much anything else. > > > > My setup is actually pretty meager, we have 10 quad-cpu nodes with 16G > > RAM each and they are already running hbase on top of it. which is why > > i mostly avoid running jobs much over 500m mem there, or i start > > scratching something on the disk. So memory doesn't seem to be the > > factor. (But my cluster is a barebone assembly sitting in the same > > rack which is a big plus). I also currently allow 4x4 map/reduce tasks > > per node. > > > > if you can point me where i can download your input exactly (without > > or with minimum prep steps, i think i can't spend much energy on that > > while at the office), i may try to try on your exact input as well. > > > > > > ------ > > On a side note i found an error with my work on broadcasting via > > distributed cache. After patching it up, i actually see no difference > > whatsoever between direct streaming and using distributed cache, just > > like i initially figured. My timing for AB' multiplication revolves > > around 3 min 10 s, distributed cache or streaming. Since i did that > > patch anyway, i will be offering it for commit but i will probably > > have to put it on the review board as it touches some iterator stuff > > that Sean previously crafted. the branch name with broadcast option +
Dmitriy Lyubimov 2011-12-19, 16:07
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-17, 23:04
ABt-job 37mins, 22sec
this guy should run under Bt-job (under 9 minutes in your case) i think . In my tests it was. is this with 922 patch? And it should be mentioned that the cluster size couldn't accomodate all the generated tasks, is this correct assessment? On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > On 17.12.2011 17:27, Dmitriy Lyubimov wrote: >> Interesting. >> >> Well so how did your decomposing go? > > I tested the decomposition of the wikipedia pagelink graph (130M edges, > 5.6M vertices making approx. quarter of a billion non-zeros in the > symmetric adjacency matrix) on a 6 machine hadoop cluster. > > Got these running times for k = 10, p = 5 and one power-iteration: > > Q-job 1mins, 41sec > Bt-job 9mins, 30sec > ABt-job 37mins, 22sec > Bt-job 9mins, 41sec > U-job 30sec > > I think I'd need a couple more machines to handle the twitter graph > though... > > --sebastian > > >> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> >> wrote: >> >>> Hi there, >>> >>> I played with Mahout to decompose the adjacency matrices of large graphs >>> lately. I stumbled on a paper of Christos Faloutsos that describes a >>> variation of the Lanczos algorithm they use for this on top of Hadoop. >>> They even explicitly mention Mahout: >>> >>> "Very recently(March 2010), the Mahout project [2] provides >>> SVD on top of HADOOP. Due to insufficient documentation, we were not >>> able to find the input format and run a head-to-head comparison. But, >>> reading the source code, we discovered that Mahout suffers from two >>> major issues: (a) it assumes that the vector (b, with n=O(billion) >>> entries) fits in the memory of a single machine, and (b) it implements >>> the full re-orthogonalization which is inefficient." >>> >>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf >>> >>> --sebastian >>> >> > +
Dmitriy Lyubimov 2011-12-17, 23:04
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-17, 23:13
In my tests, ABt-job took under 3 minutes per mapper and practically
no time for reducing. so it should be running at about 4 minutes on a cluster with sufficient capacity (in your case, something like 10=11 nodes, it seemed). Ok i'll rerun in our QA on Monday again to see what's happening. On Sat, Dec 17, 2011 at 3:04 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: > ABt-job 37mins, 22sec > this guy should run under Bt-job (under 9 minutes in your case) i > think . In my tests it was. is this with 922 patch? > > And it should be mentioned that the cluster size couldn't accomodate > all the generated tasks, is this correct assessment? > > > On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >> On 17.12.2011 17:27, Dmitriy Lyubimov wrote: >>> Interesting. >>> >>> Well so how did your decomposing go? >> >> I tested the decomposition of the wikipedia pagelink graph (130M edges, >> 5.6M vertices making approx. quarter of a billion non-zeros in the >> symmetric adjacency matrix) on a 6 machine hadoop cluster. >> >> Got these running times for k = 10, p = 5 and one power-iteration: >> >> Q-job 1mins, 41sec >> Bt-job 9mins, 30sec >> ABt-job 37mins, 22sec >> Bt-job 9mins, 41sec >> U-job 30sec >> >> I think I'd need a couple more machines to handle the twitter graph >> though... >> >> --sebastian >> >> >>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> >>> wrote: >>> >>>> Hi there, >>>> >>>> I played with Mahout to decompose the adjacency matrices of large graphs >>>> lately. I stumbled on a paper of Christos Faloutsos that describes a >>>> variation of the Lanczos algorithm they use for this on top of Hadoop. >>>> They even explicitly mention Mahout: >>>> >>>> "Very recently(March 2010), the Mahout project [2] provides >>>> SVD on top of HADOOP. Due to insufficient documentation, we were not >>>> able to find the input format and run a head-to-head comparison. But, >>>> reading the source code, we discovered that Mahout suffers from two >>>> major issues: (a) it assumes that the vector (b, with n=O(billion) >>>> entries) fits in the memory of a single machine, and (b) it implements >>>> the full re-orthogonalization which is inefficient." >>>> >>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf >>>> >>>> --sebastian >>>> >>> >> +
Dmitriy Lyubimov 2011-12-17, 23:13
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-17, 23:25
yeah ok i requested slightly less k+p so not 4 mins for AB' but it
still should be slightly under Bt running time (as in 8 mins perhaps). On Sat, Dec 17, 2011 at 3:13 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: > In my tests, ABt-job took under 3 minutes per mapper and practically > no time for reducing. so it should be running at about 4 minutes on a > cluster with sufficient capacity (in your case, something like 10=11 > nodes, it seemed). Ok i'll rerun in our QA on Monday again to see > what's happening. > > On Sat, Dec 17, 2011 at 3:04 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: >> ABt-job 37mins, 22sec >> this guy should run under Bt-job (under 9 minutes in your case) i >> think . In my tests it was. is this with 922 patch? >> >> And it should be mentioned that the cluster size couldn't accomodate >> all the generated tasks, is this correct assessment? >> >> >> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote: >>>> Interesting. >>>> >>>> Well so how did your decomposing go? >>> >>> I tested the decomposition of the wikipedia pagelink graph (130M edges, >>> 5.6M vertices making approx. quarter of a billion non-zeros in the >>> symmetric adjacency matrix) on a 6 machine hadoop cluster. >>> >>> Got these running times for k = 10, p = 5 and one power-iteration: >>> >>> Q-job 1mins, 41sec >>> Bt-job 9mins, 30sec >>> ABt-job 37mins, 22sec >>> Bt-job 9mins, 41sec >>> U-job 30sec >>> >>> I think I'd need a couple more machines to handle the twitter graph >>> though... >>> >>> --sebastian >>> >>> >>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> >>>> wrote: >>>> >>>>> Hi there, >>>>> >>>>> I played with Mahout to decompose the adjacency matrices of large graphs >>>>> lately. I stumbled on a paper of Christos Faloutsos that describes a >>>>> variation of the Lanczos algorithm they use for this on top of Hadoop. >>>>> They even explicitly mention Mahout: >>>>> >>>>> "Very recently(March 2010), the Mahout project [2] provides >>>>> SVD on top of HADOOP. Due to insufficient documentation, we were not >>>>> able to find the input format and run a head-to-head comparison. But, >>>>> reading the source code, we discovered that Mahout suffers from two >>>>> major issues: (a) it assumes that the vector (b, with n=O(billion) >>>>> entries) fits in the memory of a single machine, and (b) it implements >>>>> the full re-orthogonalization which is inefficient." >>>>> >>>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf >>>>> >>>>> --sebastian >>>>> >>>> >>> +
Dmitriy Lyubimov 2011-12-17, 23:25
-
Re: Decomposing large graphsDmitriy Lyubimov 2011-12-18, 01:49
Oh. my commit for 922 branch did not go thru in its entirety for some
reason. need to fix this. On Sat, Dec 17, 2011 at 3:25 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: > yeah ok i requested slightly less k+p so not 4 mins for AB' but it > still should be slightly under Bt running time (as in 8 mins perhaps). > > On Sat, Dec 17, 2011 at 3:13 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: >> In my tests, ABt-job took under 3 minutes per mapper and practically >> no time for reducing. so it should be running at about 4 minutes on a >> cluster with sufficient capacity (in your case, something like 10=11 >> nodes, it seemed). Ok i'll rerun in our QA on Monday again to see >> what's happening. >> >> On Sat, Dec 17, 2011 at 3:04 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: >>> ABt-job 37mins, 22sec >>> this guy should run under Bt-job (under 9 minutes in your case) i >>> think . In my tests it was. is this with 922 patch? >>> >>> And it should be mentioned that the cluster size couldn't accomodate >>> all the generated tasks, is this correct assessment? >>> >>> >>> On Sat, Dec 17, 2011 at 2:58 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >>>> On 17.12.2011 17:27, Dmitriy Lyubimov wrote: >>>>> Interesting. >>>>> >>>>> Well so how did your decomposing go? >>>> >>>> I tested the decomposition of the wikipedia pagelink graph (130M edges, >>>> 5.6M vertices making approx. quarter of a billion non-zeros in the >>>> symmetric adjacency matrix) on a 6 machine hadoop cluster. >>>> >>>> Got these running times for k = 10, p = 5 and one power-iteration: >>>> >>>> Q-job 1mins, 41sec >>>> Bt-job 9mins, 30sec >>>> ABt-job 37mins, 22sec >>>> Bt-job 9mins, 41sec >>>> U-job 30sec >>>> >>>> I think I'd need a couple more machines to handle the twitter graph >>>> though... >>>> >>>> --sebastian >>>> >>>> >>>>> On Dec 17, 2011 6:00 AM, "Sebastian Schelter" <[EMAIL PROTECTED]> >>>>> wrote: >>>>> >>>>>> Hi there, >>>>>> >>>>>> I played with Mahout to decompose the adjacency matrices of large graphs >>>>>> lately. I stumbled on a paper of Christos Faloutsos that describes a >>>>>> variation of the Lanczos algorithm they use for this on top of Hadoop. >>>>>> They even explicitly mention Mahout: >>>>>> >>>>>> "Very recently(March 2010), the Mahout project [2] provides >>>>>> SVD on top of HADOOP. Due to insufficient documentation, we were not >>>>>> able to find the input format and run a head-to-head comparison. But, >>>>>> reading the source code, we discovered that Mahout suffers from two >>>>>> major issues: (a) it assumes that the vector (b, with n=O(billion) >>>>>> entries) fits in the memory of a single machine, and (b) it implements >>>>>> the full re-orthogonalization which is inefficient." >>>>>> >>>>>> http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf >>>>>> >>>>>> --sebastian >>>>>> >>>>> >>>> +
Dmitriy Lyubimov 2011-12-18, 01:49
|