Tianwei 2012-07-07, 20:43
Hi, all,
I successfully built and ran a Hadoop job based on Nutch 2.0 rc3. I have a very large seed list (around 100K). I set the depth to 4. After two iterations, I found that one reduce task in the fetch phase is always very slow, about 10X slower than the others. As a result, even though the other 11 tasks (I configured 12 reduce tasks) have already finished, the whole job can't advance to the next "parse" phase and then on to the next iteration.
I diagnosed this problem a bit. The main cause seems to be that the task is fetching pages very slowly, as its status line shows: "10/10 spinwaiting/active, 2290 pages, 31 errors, 0.5 0.4 pages/s, 101 71 kb/s, 500 URLs in 1 queues > reduce"
I guess the slowest task is fetching URLs from slow remote websites, is that true? Since the performance of a MapReduce job is determined by its slowest task, I guess it's hard to change anything once the "fetch" map tasks have finished. I am wondering if there is any way to do better load balancing or to dynamically adjust the load on slow tasks? Thanks
Tianwei
-
Re: performance bottleneck
Ferdy Galema 2012-07-09, 13:43
Hi,
There are options to abort the fetcher under certain conditions, for example fetcher.timelimit.mins for a time limit or fetcher.throughput.threshold.* for throughput. The fetcher.max.exceptions.per.queue option seems to be broken in Nutch 2. AFAIK there is no work currently in progress on dynamic balancing of queues or anything like that. Please search the issue tracker for related issues. If you have ideas to improve the fetch behaviour, feel free to share them.
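For reference, a minimal sketch of how the abort options mentioned above might be set in conf/nutch-site.xml. The values are illustrative assumptions, not tuned recommendations, and fetcher.throughput.threshold.pages is only one of the fetcher.throughput.threshold.* properties. Please check nutch-default.xml in your own build for the exact names and defaults.

```xml
<!-- conf/nutch-site.xml: illustrative values only (assumptions, not recommendations) -->
<configuration>
  <!-- Stop fetching after this many minutes; remaining queued URLs are left unfetched. -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>120</value>
  </property>
  <!-- Stop the fetcher when throughput falls below this many pages per second. -->
  <property>
    <name>fetcher.throughput.threshold.pages</name>
    <value>1</value>
  </property>
</configuration>
```

With settings like these, a reduce task stuck on one slow queue would give up instead of holding back the whole job. The URLs it did not fetch should become candidates again in a later generate/fetch round.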
Ferdy
-
Re: performance bottleneck
Tianwei Sheng 2012-07-09, 16:44
Hi, Ferdy,
Got it, thanks a lot for your reply. I will try those options.
For now I just manually monitored my job for a while, collected the slow domains, added them to the URL filter files, and then simply killed and restarted my job.
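The manual workaround above could be partly scripted. Below is a minimal Python sketch that turns a list of slow hosts into exclusion lines in the regex-urlfilter.txt syntax (a leading "-" followed by a regular expression). The function name exclusion_rules and the example hosts are hypothetical. The sketch assumes the generated lines are pasted above the final catch-all "+." rule, since the first matching rule wins.

```python
import re


def exclusion_rules(slow_hosts):
    """Return one '-' (exclude) rule per host, in regex-urlfilter.txt syntax."""
    # Normalise, drop blanks and duplicates, and sort for a stable output order.
    hosts = sorted({h.strip().lower() for h in slow_hosts if h.strip()})
    rules = []
    for host in hosts:
        # Match the host itself and any subdomain, over http or https.
        rules.append("-^https?://([a-z0-9-]+\\.)*" + re.escape(host) + "/")
    return rules


if __name__ == "__main__":
    # Hypothetical slow domains collected while monitoring the job.
    for rule in exclusion_rules(["slow.example.com", "example.org"]):
        print(rule)
```

The output lines can be added to the filter file before restarting the job. URLs from those hosts are then rejected wherever the filters are applied.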
I will try your recommended options and also see if I can come up with any better ideas for improving the load balancing ;-)
Thanks again.
Tianwei