-
Performance Configuration on Focused Web Crawl
Hannes Carl Meyer 2010-11-18, 11:51
Hi,

I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600 pages each. That makes a volume of about 240,000 fetched pages, and I want to get all of them.

Can anyone give me advice on the right threads/delay/per-host configuration for this environment?

My current conf:

<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>90</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>45</value>
</property>

<property>
  <name>fetcher.threads.per.host.by.ip</name>
  <value>false</value>
</property>

The total runtime is about 5 hours.

How can performance be improved? (I still have enough CPU and bandwidth.)

Note: This runs on a single machine; distribution to other machines is not planned.

Thanks and regards,
Hannes
-
Re: Performance Configuration on Focused Web Crawl
Ken Krugler 2010-11-18, 14:36
If you're hitting each host with 45 threads, you better be on really good terms with those webmasters :)

With 90 total threads, that means as few as 2 hosts are active at any time, yes?

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
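As a rough check of the numbers in this exchange, here is the arithmetic behind Ken's remark together with the overall rate implied by the original post. It uses only figures quoted in the thread (90 threads, 45 threads per host, 240,000 pages, 5 hours) and is not a measurement:

```latex
\[
N_{\text{hosts active, min}} = \frac{90\ \text{threads}}{45\ \text{threads/host}} = 2,
\qquad
\text{throughput} \approx \frac{240{,}000\ \text{pages}}{5 \times 3600\ \text{s}} \approx 13.3\ \text{pages/s}.
\]
```

The 13.3 pages/s figure is an average over the whole 5-hour run, so it includes any time spent outside the fetch phase.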
-
Re: Performance Configuration on Focused Web Crawl
Hannes Carl Meyer 2010-11-18, 15:06
Hi Ken,

our crawler is allowed to hit those hosts frequently at night, so we are not getting penalized ;-)

Could you imagine running Nutch in this case with about 400 threads, with 1 thread per host and a delay of 1.0?

I tried it that way but experienced some really long idle times... My idea was one thread per host. That would mean that adding another host would require adding an additional thread.

Regards,
Hannes
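To make the variant Hannes describes trying concrete, the following is a minimal sketch of how it might look as a nutch-site.xml fragment. The property names are the ones from the original post and the values come from this message; it is shown only for illustration, since this is the setup reported to produce long idle times, and it is not a recommendation.

```xml
<!-- Sketch of the "one thread per host" variant described in this message.
     Values are taken from the thread; adjust to your own host count. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>400</value> <!-- roughly one thread for each of the ~400 hosts -->
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value> <!-- at most one concurrent connection per host -->
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value> <!-- delay in seconds between requests to the same server -->
</property>
```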
-
Re: Performance Configuration on Focused Web Crawl
Ye T Thet 2010-11-20, 10:33
Hannes,

I guess it depends on the situation:
- your server specs (where the crawler is running), and
- the hosts' specs

Anyway, I have been crawling around 50 hosts. I tweaked a few settings to get it right for my situation.

Currently I am using 500 threads, and 10 threads per host.

In my opinion, the total number of crawler threads does not matter much, because the crawler does not take much in the way of resources (memory and CPU). As long as your server's network bandwidth can handle it, it should be fine.

In my case, the number of threads per host matters, because some of my servers cannot handle that much bandwidth.

Not sure if it helps, but I had to adjust fetcher.server.delay, fetcher.server.min.delay and fetcher.max.crawl.delay because my hosts sometimes cannot handle that many threads.

Warm regards,
Y.T. Thet
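The settings mentioned in this message could be sketched as the nutch-site.xml fragment below. The thread counts (500 total, 10 per host) are the ones given above; the three delay values are purely illustrative placeholders, since the message does not say what they were set to. Whether each property is honored, and with what exact semantics, depends on the Nutch release, so the nutch-default.xml shipped with your version is the place to check.

```xml
<!-- Thread counts from the message above. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>500</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>

<!-- Delay settings named in the message. The values below are
     hypothetical placeholders, not the author's actual values. -->
<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>0.5</value>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
</property>
```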
-
Re: Performance Configuration on Focused Web Crawl
Hannes Carl Meyer 2010-11-20, 15:51
Thank you for sharing your experiences!

In my case the web servers are pretty stable and we are allowed to perform intensive crawling, which makes it easy to increase the threads per host.

IMHO the fetch process isn't really the bottleneck. It is the step between fetches, when the crawldb is merged and updated.

We are using 16-core hardware. During the fetch process CPU usage is around 1000 %, but in between fetches it is always around 90-100 % on a single core.
-
Re: Performance Configuration on Focused Web Crawl
Ken Krugler 2010-11-20, 18:06
On Nov 20, 2010, at 7:51am, Hannes Carl Meyer wrote:

> We are using a 16 Core Hardware, during fetch process CPUs are being used
> around 1000 % but in between fetching it is always around 90-100 % on a
> single core

In regular map-reduce Hadoop jobs you get this situation if the job has been configured to use a single reducer, and thus only one core is active.

Though it would surprise me if the crawlDB update job was configured this way, as I don't see a reason why the crawlDB has to be a single file in HDFS.

Andrzej and others would know best, of course.

-- Ken
-
Re: Performance Configuration on Focused Web Crawl
Hannes Carl Meyer 2010-11-20, 18:52
Ken, thanks, I guess that's a good hint!

I'm using the simple org.apache.nutch.crawl.Crawl to perform the crawl - I guess the configuration of the Map-Reduce job is then pretty minimal.

@Andrzej, could you give me a hint where to configure the number of reduce tasks in Nutch 0.9? (running on a single machine)

Regards,
Hannes
-
Re: Performance Configuration on Focused Web Crawl
Ken Krugler 2010-11-20, 20:02
On Nov 20, 2010, at 10:52am, Hannes Carl Meyer wrote:

> I'm using the simple org.apache.nutch.crawl.Crawl to perform the crawl - I
> guess the configuration of the Map-Reduce Job then is pretty low.
>
> @Andrzej could you give me a hint where to configure the number of reduce
> tasks in nutch 0.9? (running on a single machine)

Sounds like you're running in local mode.

During fetching, multiple threads are spawned which will then use all your cores.

But during regular map-reduce tasks (such as the CrawlDB update), you'll get a single map and a single reduce running sequentially.

To get reasonable performance from one box, you'd need to set up Hadoop to run in pseudo-distributed mode, and then run your Nutch crawl as a regular/distributed job.

And also tweak the hadoop-site.xml settings, to specify something like 6 mappers and 6 reducers (leave four cores for JobTracker, NameNode, TaskTracker, DataNode).

But I'll confess, I've never tried to run a real job this way.

-- Ken
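A hadoop-site.xml along the lines Ken suggests might look like the sketch below. This is an assumption-laden illustration, not a tested configuration: the property names are those used by Hadoop releases of roughly the 0.16 to 0.20 era, and the Hadoop version bundled with Nutch 0.9 may use different names (for example a single combined task limit), so they should be checked against the hadoop-default.xml of the installed release. The host name and ports are hypothetical.

```xml
<!-- Pseudo-distributed mode: real NameNode/JobTracker daemons on one box.
     localhost and the port numbers are hypothetical examples. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>

<!-- "Something like 6 mappers and 6 reducers" on a 16-core machine,
     as suggested in the message above. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>6</value>
</property>

<!-- Default number of reduce tasks per job, so that reduce-heavy steps
     such as the CrawlDB update are not limited to a single reducer. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>
```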
-
Re: Performance Configuration on Focused Web Crawl
Andrzej Bialecki 2010-11-20, 20:20
On 2010-11-20 21:02, Ken Krugler wrote:

>> @Andrzej could you give me a hint where to configure the number of reduce
>> tasks in nutch 0.9? (running on a single machine)

This is not possible in local mode. In local mode all map tasks are run sequentially, and there is always 1 reduce. As Ken points out, you need to run at least in pseudo-distributed mode, i.e. using a real JobTracker/TaskTracker on a single machine.

> Sounds like you're running in local mode.
>
> During fetching, multiple threads are spawned which will then use all
> your cores.
>
> But during regular map-reduce tasks (such as the CrawlDB update), you'll
> get a single map and a single reduce running sequentially.

(Actually, LocalJobTracker will create multiple map tasks - as many as there are input splits - but running sequentially).

> To get reasonable performance from one box, you'd need to set up Hadoop
> to run in pseudo-distributed mode, and then run your Nutch crawl as a
> regular/distributed job.
>
> And also tweak the hadoop-site.xml settings, to specify something like 6
> mappers and 6 reducers (leave four cores for JobTracker, NameNode,
> TaskTracker, DataNode).
>
> But I'll confess, I've never tried to run a real job this way.

I have. Within the limits of a single machine performance this works reasonably well - if you have a node with 4 cores and enough RAM then you can easily run 4 tasks in parallel. Jobs become then limited by the amount of disk IO.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
-
Re: Performance Configuration on Focused Web Crawl
Ken Krugler 2010-11-20, 21:53
[snip]

>> During fetching, multiple threads are spawned which will then use all
>> your cores.
>>
>> But during regular map-reduce tasks (such as the CrawlDB update), you'll
>> get a single map and a single reduce running sequentially.
>
> (Actually, LocalJobTracker will create multiple map tasks - as many as
> there are input splits - but running sequentially).

Sorry, I was being vague in my wording. I meant one mapper and one reducer, which won't be run in parallel.

You're right that there will be N map tasks, one per split (which typically means one per HDFS block).

>> To get reasonable performance from one box, you'd need to set up Hadoop
>> to run in pseudo-distributed mode, and then run your Nutch crawl as a
>> regular/distributed job.
>>
>> And also tweak the hadoop-site.xml settings, to specify something like 6
>> mappers and 6 reducers (leave four cores for JobTracker, NameNode,
>> TaskTracker, DataNode).
>>
>> But I'll confess, I've never tried to run a real job this way.
>
> I have. Within the limits of a single machine performance this works
> reasonably well - if you have a node with 4 cores and enough RAM then
> you can easily run 4 tasks in parallel. Jobs become then limited by the
> amount of disk IO.

I'd be interested in hearing back from Hannes as to performance with a 16-core box. Based on the paper by the IRLBot team, it seems like this could scale pretty well.

They did wind up having to install a lot of disks in their crawling box. And as Andrzej mentions, disk I/O will become a bottleneck, especially for crawlDB updates (less so for fetching or parsing).

If you have multiple drives, then you could run multiple DataNodes, and configure each one to use a separate disk.

I don't have a good sense of whether it would be worthwhile to use replication, but in the past Hadoop had some issues running with a replication of 1, so I'd probably set this to 2.

-- Ken
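For the disk and replication suggestion in this message, one DataNode's configuration might contain something like the fragment below. It is only a sketch: dfs.data.dir and dfs.replication are standard HDFS properties in Hadoop releases of that era, but the directory path is a hypothetical example, and running several DataNodes on one machine also needs separate ports and working directories per instance, which are not shown because those settings differ between Hadoop versions.

```xml
<!-- Per-DataNode storage location. /disk1/hdfs/data is a hypothetical
     path; a second DataNode instance would point at another drive,
     e.g. /disk2/hdfs/data, in its own copy of the configuration. -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data</value>
</property>

<!-- Replication factor of 2, as tentatively suggested above.
     This needs at least two DataNodes to be satisfiable. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```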
-
Re: Performance Configuration on Focused Web Crawl
Hannes Carl Meyer 2010-11-21, 08:47
I'm going to give it a try and configure a pseudo-distributed env on our testing machine (which also has 16 cores and 24 GB RAM).

I'll get back here after testing it!
-
Re: Performance Configuration on Focused Web Crawl
Ken Krugler 2011-02-07, 21:32
Hi Hannes,

I'm curious as to whether you got this configuration running, any issues you ran into, and what performance you saw.

Thanks,

-- Ken