alxsss@...
2011-01-27, 23:00
Chris Woolum
2011-01-28, 00:01
Alexis
2011-01-28, 03:32
Julien Nioche
2011-01-28, 14:01
alxsss@...
2011-01-28, 21:53
Julien Nioche
2011-01-29, 13:55
Ken Krugler
2011-01-29, 15:03
alxsss@...
2011-01-31, 21:20
Kirby Bohling
2011-02-01, 00:39
Andrzej Bialecki
2011-02-01, 11:39
Ken Krugler
2011-02-07, 20:00
Alexis
2011-02-08, 17:58
alxsss@...
2011-03-14, 18:21
Markus Jelsma
2011-03-14, 18:37
alxsss@...
2011-03-14, 18:39
-
nutch crawl command takes 98% of cpu
alxsss@... 2011-01-27, 23:00
Hello,

I run the crawl command with -depth 7 -topN -1 on my Linux box with 1.5Mps internet, an AMD 3.1 GHz processor, 4 GB of memory, Fedora Linux 14, and Nutch 1.2. After 1-2 days Nutch takes 98% of the CPU. My seed file includes about 3500 domains and I put fetch.external links to false.

Is this normal? If not, what can be done to improve it?

Thanks.
Alex.
-
RE: nutch crawl command takes 98% of cpu
Chris Woolum 2011-01-28, 00:01
If you are looking at the tasktracker control panel, what does it show?
The link is http://localhost:50030
-
Re: nutch crawl command takes 98% of cpu
Alexis 2011-01-28, 03:32
Hi,

I ran into the same issue with Nutch 1.2. You could fix it by upgrading the Tika parser to at least version 0.8. The lib can be found in the plugins/parse-tika/ directory of your Nutch release.

This has already been mentioned twice on the mailing list: see http://lucene.472066.n3.nabble.com/Full-CPU-usage-td1976780.html

I hope this will help you out.

Alexis
-
Re: nutch crawl command takes 98% of cpu
Julien Nioche 2011-01-28, 14:01
That's assuming that the problem comes from the parsing.

Alex, can you either run jstack on the process to see what it is hanging on, or do as Chris suggested?

Note that it is not recommended to upgrade to Tika 0.8 if you want to process PDF docs, because of an issue which will be resolved in the next Tika release. Another solution, if the problem comes from flv files and you are not interested in them, is to add a URLFilter which will prevent such files from being fetched.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
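To illustrate the idea of filtering out unwanted file types before they are fetched, here is a minimal sketch in plain Java. It is not the Nutch URLFilter plugin API: the class name SuffixUrlFilter and the suffix list are hypothetical, and the convention of returning null for a rejected URL is only assumed to resemble how a URL filter would signal rejection.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
final class SuffixUrlFilter {
    // Matches ".flv" or ".zip" at the end of the URL, optionally followed
    // by a query string or fragment.
    private static final Pattern SKIP =
            Pattern.compile("\\.(flv|zip)([?#].*)?$");

    /** Returns the URL unchanged if it may be fetched, or null if it should be skipped. */
    static String filter(String url) {
        boolean unwanted = SKIP.matcher(url.toLowerCase(Locale.ROOT)).find();
        return unwanted ? null : url;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://example.com/index.html",
                "http://example.com/video.flv",
                "http://example.com/archive.zip?download=1");
        for (String url : urls) {
            String verdict = (filter(url) == null) ? "skip" : "fetch";
            System.out.println(verdict + " " + url);
        }
    }
}
```

Filtering on the URL suffix only helps when the file type is visible in the URL; content served under an extension-less URL would still reach the parser.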
-
Re: nutch crawl command takes 98% of cpu
alxsss@... 2011-01-28, 21:53
Hello,

I did jstack and the result is below. Could you please let me know how to interpret it?

----------------------------------------------------------------

2011-01-28 13:46:50
Full thread dump OpenJDK Server VM (19.0-b06 mixed mode):

"Attach Listener" daemon prio=10 tid=0x6cb21800 nid=0x1e95 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"SpillThread" daemon prio=10 tid=0x6053c400 nid=0x1e18 waiting on condition [0x6c3ad000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x7f9a8768> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1169)

   Locked ownable synchronizers:
        - None

"communication thread" daemon prio=10 tid=0x607bd400 nid=0x1e17 waiting on condition [0x6c8ad000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:529)
        at java.lang.Thread.run(Thread.java:636)

   Locked ownable synchronizers:
        - None

"Thread-415331" prio=10 tid=0x6cb96c00 nid=0x175f runnable [0x6c2ba000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.oro.text.regex.Perl5Matcher.__matchUnicodeClass(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__repeat(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__tryExpression(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__interpret(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.contains(Unknown Source)
        at org.apache.oro.text.regex.Util.substitute(Unknown Source)
        at org.apache.oro.text.regex.Util.substitute(Unknown Source)
        at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.substituteUnnecessaryRelativePaths(BasicURLNormalizer.java:166)
        at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:125)
        at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
        at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:69)
        at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:36)
        at org.apache.nutch.crawl.Generator$Selector.getPartition(Generator.java:217)
        at org.apache.nutch.crawl.Generator$Selector.getPartition(Generator.java:109)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
        at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:212)
        at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:109)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

   Locked ownable synchronizers:
        - None

"Thread-414136" daemon prio=10 tid=0x609f8000 nid=0x207b runnable [0x61fad000]
   java.lang.Thread.State: RUNNABLE
        at java.util.zip.Inflater.inflateBytes(Native Method)
        at java.util.zip.Inflater.inflate(Inflater.java:255)
        - locked <0x78fc22d0> (a java.util.zip.ZStreamRef)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:235)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skip(ZipArchiveInputStream.java:261)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.closeEntry(ZipArchiveInputStream.java:302)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:112)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextEntry(ZipArchiveInputStream.java:188)
        at org.apache.tika.parser.pkg.PackageExtractor.unpack(PackageExtractor.java:177)
        at org.apache.tika.parser.pkg.PackageExtractor.parse(PackageExtractor.java:93)
        at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:61)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.lang.Thread.run(Thread.java:636)

   Locked ownable synchronizers:
        - None

"Thread-398562" daemon prio=10 tid=0x611fa000 nid=0x5977 runnable [0x629fe000]
   java.lang.Thread.State: RUNNABLE
        at java.util.zip.Inflater.inflateBytes(Native Method)
        at java.util.zip.Inflater.inflate(Inflater.java:255)
        - locked <0x78f9f6c8> (a java.util.zip.ZStreamRef)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:235)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skip(ZipArchiveInputStream.java:261)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.closeEntry(ZipArchiveInputStream.java:302)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:112)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextEntry(ZipArchiveInputStream.java:188)
        at org.apache.tika.parser.pkg.PackageExtractor.unpack(PackageExtractor.java:177)
        at org.apache.tika.parser.pkg.PackageExtractor.parse(PackageExtractor.java:93)
        at org.apache.ti
-
Re: nutch crawl command takes 98% of cpu
Julien Nioche 2011-01-29, 13:55
Hi,

This shows the state of the various threads within a Java process. Most of them seem to be busy parsing zip archives with Tika. The interesting part is that the main thread is at the generation step:

        at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)

with "Thread-415331" normalizing the URLs as part of the generation.

So why do we see threads busy parsing these archives? I think this is a result of the timeout mechanism (https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing. Before it, the parsing step could loop on a single document and never complete. Thanks to Andrzej's patch, the parsing is done in separate threads which are abandoned if more than X seconds have passed (default 30, I think). Obviously these threads are still lurking around in the background and consuming CPU.

This is an issue only when calling the Crawl command. When using the separate commands for the various steps, the runaway threads die with the main process; however, since the Crawl command uses a single process, these timed-out threads keep going.

I am not an expert in multithreading and don't know whether these threads could be killed somehow. Andrzej, any clue?

It would be interesting from a Tika point of view to know which documents caused this. Alex, is there a trace of the URLs in your logs? It could be something like the content being trimmed and causing the parser to go into a loop; anyway, it would be good to identify the source of the problem.

I have to admit that I am not a big fan of the all-in-one Crawl command. One way to alleviate the problem would be not to use it and to call the separate commands individually, which also has the merit of giving a better idea of what goes on under the bonnet. I'd rather we shipped a nice and tidy shell script to achieve the same goals as the Crawl command; it would also replace the numerous and somewhat faulty scripts that can be found on the Wiki. It seems that this is a feature that people often request or comment on.

Any thoughts?

Alex, would you mind opening an issue on JIRA for this? It would be great if you could check whether the URLs causing the parsing to loop can be found in the logs, and whether the same issue can be reproduced with the latest version of Tika.

Thanks

Julien
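The effect described in this message, a task that is given up on after a timeout but keeps burning CPU, can be reproduced with a small plain-Java sketch. This is not the Nutch code: the class name AbandonedTaskDemo, the thread name and the 200 ms timeout are made up for illustration, and the busy loop merely stands in for a parser stuck on a bad document.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class; names and timings are illustrative assumptions.
final class AbandonedTaskDemo {

    /** Returns true if the worker is still making progress after the caller timed out. */
    static boolean workerKeepsRunningAfterTimeout() {
        AtomicLong progress = new AtomicLong();
        AtomicBoolean stop = new AtomicBoolean(false); // only used to clean up the demo
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-stand-in");
            t.setDaemon(true); // lets the JVM exit even if the worker never stops
            return t;
        });
        try {
            // Stand-in for a parser that loops and never checks for interruption.
            Future<?> parse = pool.submit(() -> {
                while (!stop.get()) {
                    progress.incrementAndGet();
                }
            });
            try {
                parse.get(200, TimeUnit.MILLISECONDS);
                return false; // finished in time, nothing was abandoned
            } catch (TimeoutException timedOut) {
                // The caller moves on. cancel(true) only sets the interrupt
                // flag, which this worker never looks at.
                parse.cancel(true);
            }
            long before = progress.get();
            Thread.sleep(200);
            return progress.get() > before;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            stop.set(true); // a real hung parser offers no such switch
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("worker still running after timeout: "
                + workerKeepsRunningAfterTimeout());
    }
}
```

In a short-lived process the leftover thread disappears when the JVM exits, which matches the observation that running the steps as separate commands hides the problem.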
-
Re: nutch crawl command takes 98% of cpu
Ken Krugler 2011-01-29, 15:03
Some comments below.

On Jan 29, 2011, at 5:55am, Julien Nioche wrote:

> Am not an expert in multithreading and don't have an idea of whether
> these threads could be killed somehow. Andrzej, any clue?

This is a fundamental problem with run-away threads - there is no safe, reliable way to kill them off.

And if you parse enough documents, you will run into a number that currently cause Tika to hang. Zip files for sure, but we ran into the same issue with FLV files.

Over in Tika-land, Jukka has a patch that fires up a child JVM and runs parsers there. See https://issues.apache.org/jira/browse/TIKA-416

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
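The reason a child JVM helps is that a process, unlike a thread, can be killed from the outside without corrupting the parent's state. The following is a rough sketch of that general idea using only the standard ProcessBuilder API. It is not the TIKA-416 patch: the class ChildProcessRunner and its methods are hypothetical, and a real implementation would also have to pass the document to the child and read the parse result back.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; requires Java 9+ for Redirect.DISCARD.
final class ChildProcessRunner {

    /** Returns true if the command exited within the timeout; otherwise kills it and returns false. */
    static boolean runWithTimeout(List<String> command, long timeout, TimeUnit unit) {
        Process child = null;
        try {
            child = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (child.waitFor(timeout, unit)) {
                return true;
            }
            child.destroyForcibly(); // the hung work dies with its process
            child.waitFor();
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            if (child != null) {
                child.destroyForcibly();
            }
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Path of the java launcher belonging to the currently running JVM. */
    static String javaExecutable() {
        return Paths.get(System.getProperty("java.home"), "bin", "java").toString();
    }

    public static void main(String[] args) {
        // "java -version" stands in for a child JVM that would run a parser.
        boolean finished = runWithTimeout(
                List.of(javaExecutable(), "-version"), 30, TimeUnit.SECONDS);
        System.out.println("child finished in time: " + finished);
    }
}
```

The trade-off is the cost of starting a JVM and moving data between processes, which is presumably why this is reserved for work that cannot be trusted to terminate.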
-
Re: nutch crawl command takes 98% of cpu
alxsss@... 2011-01-31, 21:20
Hello,

It was in the generation stage, so I decided to run jstack again in the fetch step. The results are below. I have added zip to my crawl-urlfilter.txt file, so it should not be handling .zip files, though.

Thanks.
Alex.

----

2011-01-31 13:12:19
Full thread dump OpenJDK Server VM (19.0-b06 mixed mode):

"Thread-685455" daemon prio=10 tid=0x5a66e000 nid=0x78ac runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x5fe21400 nid=0x22eb waiting on condition [0x6c269000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0xa94c1df8> (a java.util.concurrent.FutureTask$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:257)
        at java.util.concurrent.FutureTask.get(FutureTask.java:119)
        at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:87)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x5fe20400 nid=0x22ea sleeping[0x621ad000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:575)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x601b6000 nid=0x22e9 sleeping[0x6c85c000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:575)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x60169c00 nid=0x22e8 waiting on condition [0x6caad000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:575)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x601fac00 nid=0x22e7 waiting on condition [0x6c216000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:575)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x5fe09c00 nid=0x22e6 sleeping[0x6c8fe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:575)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x5fe2f000 nid=0x22e5 runnable [0x6c3fe000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
        - locked <0xa896c3d8> (a java.net.SocksSocketImpl)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
        at java.net.Socket.connect(Socket.java:546)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:97)
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x60176800 nid=0x22e4 sleeping[0x6c1c5000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:575)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x60369800 nid=0x22e3 sleeping[0x6c35c000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:575)

   Locked ownable synchronizers:
        - None

"FetcherThread" daemon prio=10 tid=0x601b6c00 nid=0x22e2 sleeping[0x6cafe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:575)

   Locked ownable synchronizers:
        - None

"QueueFeeder" daemon prio=10 tid=0x6039d000 nid=0x22e1 waiting on condition [0x6ca5c000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher$QueueFeeder.run(Fetcher.java:500)

   Locked ownable synchronizers:
        - None

"SpillThread" daemon prio=10 tid=0x60174400 nid=0x22e0 waiting on condition [0x6c3ad000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x7fa3c8d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1169)

   Locked ownable synchronizers:
        - None

"communication thread" daemon prio=10 tid=0x602fa800 nid=0x22df waiting on condition [0x6c8ad000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Me
-
Re: nutch crawl command takes 98% of cpu
Kirby Bohling 2011-02-01, 00:39
On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:

> This is a fundamental problem with run-away threads - there is no safe,
> reliable way to kill them off.

All,

Just an observation, but the general approach to this problem is to use Thread.interrupt(). Virtually all code in the JDK treats the thread being interrupted as a request to cancel. Java Concurrency in Practice (JCIP) has a whole chapter on this topic (Chapter 7). IMHO, any general purpose library code that swallows "InterruptedException" and isn't implementing the thread cancellation policy has a bug in it (the cancellation policy can only be implemented by the owner of the thread; unless the library is a task/thread library, it cannot be implementing the cancellation policy). Any place you see:

    catch (InterruptedException ex) {
        // Ignore
    }

Just plan on having a hard-to-track-down bug at some point in the future. At the very least, just reset the interruption status like so:

    catch (InterruptedException ex) {
        // Resetting the interruption to avoid losing the cancellation request.
        Thread.currentThread().interrupt();
        // Twiddle any state necessary to get a bail out in a timely manner...
    }

The problem with using the interruption status as the cancellation approach is that it fails if there is a bug anywhere in any library that swallows the InterruptedException (in many ways it is similar to a data race). It is a fundamental problem with threading: there is no way to share memory space and have a reliable cancel that a bug can't subvert. An infinite loop while holding a lock is the canonical example of the problem, because killing the thread could leave an invariant invalid.

One trivial and simple way, if you control the creation of Threads, is to override "Thread.interrupt" and record that the interrupt method was called (and thus that cancellation of the thread/work was requested). Then, at the top of the outermost loop, check whether the cancel was set and, if so, bail out. That assumes that at some point you do in fact get back to the top of the loop. If you're stuck in an inner loop, fix the inner loop that is stuck so that it respects cancellation/interruption.

There are several gotchas dealing with interruptions. Most blocking APIs inside of Java respect cancellation (they throw InterruptedException if isInterrupted() is true, rather than start a potentially blocking operation, and will wake up and throw the exception if interrupted in the middle of it). One exception is that socket read/write operations don't operate this way: the socket must be closed to interrupt a read/write. The approach JCIP suggests is to tie the socket and thread together in such a way that interrupt() closes the sockets that would be reading/writing inside that thread. I believe that the NIO code does respect interruption, as long as the Channel is an InterruptibleChannel, which the stock network implementations should be. Selector.select() does not handle interruption; it must have wakeup() called on it, in a way analogous to closing the socket.

Not sure exactly what the problems inside of Tika are, but getting it to respect interruption would be a wonderful thing for everybody that uses it. The problem might be getting all the underlying libraries it uses to do so.

Kirby
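As a small illustration of the cancellation-by-interruption policy described in this message, the sketch below shows a CPU-bound loop that polls the interrupt flag and a blocking call that restores the flag instead of swallowing the exception. The class InterruptibleWorker and its method names are hypothetical, chosen only for this example; the Thread methods used are standard JDK API.

```java
// Hypothetical demo class; only the java.lang.Thread calls are real API.
final class InterruptibleWorker {

    /** CPU-bound loop that polls the interrupt flag and returns once it is set. */
    static long spinUntilInterrupted() {
        long iterations = 0;
        while (!Thread.currentThread().isInterrupted()) {
            iterations++;
        }
        return iterations; // the interrupt flag is still set for the caller
    }

    /** Blocking call that reports cancellation instead of swallowing it. */
    static boolean sleepUnlessCancelled(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException ex) {
            // Restore the flag so code further up the stack can see the request.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Starts the work on a new thread, interrupts it, and reports whether it stopped in time. */
    static boolean stopsWhenInterrupted(Runnable work, long joinMillis) {
        Thread worker = new Thread(work, "worker");
        worker.setDaemon(true);
        worker.start();
        worker.interrupt();
        try {
            worker.join(joinMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("spinning worker stopped: "
                + stopsWhenInterrupted(() -> spinUntilInterrupted(), 2000));
        System.out.println("sleeping worker stopped: "
                + stopsWhenInterrupted(() -> sleepUnlessCancelled(60_000), 2000));
    }
}
```

Both workers stop only because they cooperate; a loop that never checks the flag, or a catch block that ignores the exception, would keep running exactly as described above.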
-
Re: nutch crawl command takes 98% of cpuAndrzej Bialecki 2011-02-01, 11:39
On 2/1/11 1:39 AM, Kirby Bohling wrote:
> On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
>> Some comments below.
>>
>> On Jan 29, 2011, at 5:55am, Julien Nioche wrote:
>>
>>> Hi,
>>>
>>> This shows the state of the various threads within a Java process. Most of
>>> them seem to be busy parsing zip archives with Tika. The interesting part is
>>> that the main thread is at the Generation step:
>>>
>>>   at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
>>>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>>>
>>> with the "Thread-415331" normalizing the URLs as part of the generation.
>>>
>>> So why do we see threads busy at parsing these archives? I think this is a
>>> result of the Timeout mechanism
>>> (https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.
>>> Before it, we used to have the parsing step loop on a single document and
>>> never complete. Thanks to Andrzej's patch, the parsing is done in separate
>>> threads which are abandoned if more than X seconds have passed (default 30
>>> I think). Obviously these threads are still lurking around in the background
>>> and consuming CPU.
>>>
>>> This is an issue when calling the Crawl command only. When using the
>>> separate commands for the various steps, the runaway threads die with the
>>> main process, however since the Crawl uses a single process, these timeout
>>> threads keep going.
>>>
>>> Am not an expert in multithreading and don't have an idea of whether these
>>> threads could be killed somehow. Andrzej, any clue?
>>
>> This is a fundamental problem with run-away threads - there is no safe,
>> reliable way to kill them off.
>>
>> And if you parse enough documents, you will run into a number that currently
>> cause Tika to hang. Zip files for sure, but we ran into the same issue with
>> FLV files.
>>
>> Over in Tika-land, Jukka has a patch that fires up a child JVM and runs
>> parsers there. See https://issues.apache.org/jira/browse/TIKA-416
>>
>> -- Ken
>>
>
> All,
>
> Just an observation, but the general approach to this problem is to
> use Thread.interrupt(). Virtually all code in the JDK treats the
> thread being interrupted as a request to cancel. Java Concurrency in
> Practice (JCIP) has a whole chapter on this topic (Chapter 7). IMHO,
> any general purpose library code that swallows "InterruptedException"
> and isn't implementing the Thread cancellation policy has a bug in it
> (the cancellation policy can only be implemented by the owner of the
> thread, unless the library is a task/thread library it cannot be
> implementing the cancellation policy). Any place you see:
>
>     catch (InterruptedException ex) {
>         // Ignore
>     }
>
> Just plan on having a hard to track down bug at some point in the
> future. At the very least, just reset the interruption status like
> so:
>
>     catch (InterruptedException ex) {
>         // Resetting the interruption to avoid losing the cancellation request.
>         Thread.currentThread().interrupt();
>         // Twiddle any state necessary to get a bail out in a timely manner...
>     }
>
> The problem with using the interruption status as cancellation
> approach is that it fails if there is a bug anywhere in any library
> that swallows the InterruptedException (in many ways it is similar to
> a data race). It is a fundamental problem with threading (there is no
> way to share memory space and have a reliable cancel that a bug can't
> subvert, an infinite loop while holding a lock is the canonical
> example of the problem, killing the thread could lead to an invariant
> being invalid).
>
> One trivial and simple way if you control the creation of Threads
> is to override "Thread.interrupt", and record that the interrupt
> method was called (and thus cancellation of the thread/work was
> requested), and at the top of the outer most loop check if the cancel
> was set, bail out. That assumes at some point you do in fact get back
> to the top of the loop. If you're stuck in an inner loop, fix the

That was very informative and useful, thanks for explaining it.

Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
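The timeout-plus-interrupt approach discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Nutch or Tika code: `InterruptibleParse`, `parseChunks`, `runWithTimeout` and `waitForWorker` are hypothetical names, and the "parser" is a stand-in loop that sleeps. It shows that `Future.cancel(true)` merely interrupts the worker, so the worker only stops because it checks the interrupt flag at the top of its loop and restores the flag after catching `InterruptedException`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a task that treats interruption as a cancel request.
class InterruptibleParse {

    // Set to true once the worker has noticed the interrupt and stopped.
    static final AtomicBoolean stoppedOnInterrupt = new AtomicBoolean(false);

    // Stand-in for a parser loop that would otherwise never finish.
    static void parseChunks() {
        while (true) {
            // Check at the top of the outermost loop, as suggested in the thread.
            if (Thread.currentThread().isInterrupted()) {
                stoppedOnInterrupt.set(true);
                return;
            }
            try {
                Thread.sleep(10); // pretend to process one chunk
            } catch (InterruptedException ex) {
                // Restore the flag so the check above sees the cancel request.
                Thread.currentThread().interrupt();
            }
        }
    }

    // Runs parseChunks() under a timeout; returns true only if it finished in time.
    static boolean runWithTimeout(long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Runnable task = InterruptibleParse::parseChunks;
        Future<?> result = pool.submit(task);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException ex) {
            // cancel(true) interrupts the worker thread; it does not kill it.
            result.cancel(true);
            return false;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return false;
        } catch (ExecutionException ex) {
            return false;
        } finally {
            waitForWorker(pool);
        }
    }

    // Waits briefly for the worker to exit so the caller can observe the outcome.
    static void waitForWorker(ExecutorService pool) {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If `parseChunks` swallowed the exception and never looked at the flag, the same `cancel(true)` call would leave the thread spinning, which is the runaway-thread situation described in the quoted messages.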
-
Re: nutch crawl command takes 98% of cpu
Ken Krugler 2011-02-07, 20:00
Hi Kirby & others,
On Jan 31, 2011, at 4:39pm, Kirby Bohling wrote:

> [snip]
>
> All,
>
> Just an observation, but the general approach to this problem is to
> use Thread.interrupt(). Virtually all code in the JDK treats the
> thread being interrupted as a request to cancel. Java Concurrency in
> Practice (JCIP) has a whole chapter on this topic (Chapter 7). IMHO,
> any general purpose library code that swallows "InterruptedException"
> and isn't implementing the Thread cancellation policy has a bug in it
> (the cancellation policy can only be implemented by the owner of the
> thread, unless the library is a task/thread library it cannot be
> implementing the cancellation policy). Any place you see:

[snip]

> One exception is that
> sockets read/write operations don't operate this way, the socket must
> be closed to interrupt a read/write, the approach JCIP suggests is to
> tie the socket and thread in such a way that interrupt() closes the
> sockets that would be reading/writing inside that thread.

Excellent input, as I need to solve some issues with needing to abort HTTP requests.

[snip]

> Not sure exactly what the problems inside of Tika are, but getting it
> to respect interruption would be a wonderful thing for everybody that
> uses it. The problem might be getting all underlying libraries it
> uses to do so.

Yes, that's exactly the issue in the cases I've seen. The libraries used to do the actual parsing can get caught in loops when processing unexpected data. There are no checks for interrupts, e.g. it's code that is walking some data structure and doesn't realize that it's in a loop (e.g. the offset to the next chunk is set to zero, so the same chunk is endlessly reprocessed).

Occasionally we can get the underlying libraries to fix issues, but each new release has the potential for new and exciting hangs.

That's why Jukka went down the admittedly hard-core and heavy-weight path of providing an option to run parses in a child JVM.

If there's another solution, we'd love to hear about it :)

Thanks,

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
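The socket case quoted above (a blocking read does not react to the interrupt flag, so the socket has to be closed) can be sketched along the lines JCIP suggests. This is a minimal illustration with hypothetical names (`SocketReaderThread`, `demoCancel`), not code from Nutch or any HTTP client: the thread overrides `interrupt()` so that a cancel request also closes the socket it is reading from.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch: interrupt() also closes the socket, because a blocking
// socket read does not respond to the thread's interrupt flag on its own.
class SocketReaderThread extends Thread {
    private final Socket socket;

    SocketReaderThread(Socket socket) {
        this.socket = socket;
    }

    @Override
    public void interrupt() {
        try {
            socket.close(); // unblocks a read() that is waiting for data
        } catch (IOException ignored) {
            // nothing useful to do here; the interrupt is still delivered below
        } finally {
            super.interrupt();
        }
    }

    @Override
    public void run() {
        try {
            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[1024];
            // Blocks until data arrives, the peer closes, or the socket is closed.
            while (in.read(buffer) != -1) {
                // a real client would process the bytes here
            }
        } catch (IOException ex) {
            // Expected when interrupt() closes the socket: let the thread exit.
        }
    }

    // Demo helper: blocks a reader on a silent local server, then cancels it.
    // Returns true if the reader thread exited after interrupt().
    static boolean demoCancel() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            SocketReaderThread reader = new SocketReaderThread(client);
            reader.start();
            Thread.sleep(200);   // give the reader time to block in read()
            reader.interrupt();  // closes the socket and sets the flag
            reader.join(5000);
            return !reader.isAlive();
        } catch (IOException ex) {
            return false;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

This only helps for threads blocked on I/O that the owner can close. It does nothing for a parser spinning in a CPU-bound loop, which is the case the child-JVM approach is meant to cover.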
-
Re: nutch crawl command takes 98% of cpu
Alexis 2011-02-08, 17:58
Hi,
Thanks for all the feedback.

It looks like there is not much you can do if you give the FLV parser some corrupted data. From a practical point of view, this is extremely annoying, as it takes up all the CPU resources and prevents other threads from performing their tasks properly until the TIMEOUT occurs, kills the thread and frees up the CPU.

We can see that this happens when an FLV file is truncated (due to an http.content.limit property lower than its content-length, in bytes).

So the suggestion is to hint to the parser that it is likely to get stuck, and to skip the parsing when the downloaded content size does not match the content-length header.

Besides, I often see errors in the HTML parser when the content is truncated (https://issues.apache.org/jira/browse/TIKA-307). So it does not hurt to save time and avoid errors.

I created the issue here: https://issues.apache.org/jira/browse/NUTCH-965

See attached patch.

Alexis.
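The idea of skipping the parse when the downloaded size does not match the Content-Length header can be sketched as below. This is a hypothetical helper written for illustration, not the NUTCH-965 patch: the names `TruncationCheck`, `isTruncated` and `shouldParse` are assumptions, and the sketch only treats content as truncated when the declared length is larger than the number of bytes actually held.

```java
// Hypothetical helper, not the NUTCH-965 patch: decides whether fetched
// content looks truncated by comparing it with the Content-Length header.
class TruncationCheck {

    // Returns true when the header is a valid number larger than the bytes held.
    static boolean isTruncated(byte[] content, String contentLengthHeader) {
        if (content == null || contentLengthHeader == null) {
            return false; // nothing to compare against
        }
        long declared;
        try {
            declared = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException ex) {
            return false; // malformed header: do not guess
        }
        return declared > content.length;
    }

    // A caller could skip parsing instead of handing truncated bytes to a parser.
    static boolean shouldParse(byte[] content, String contentLengthHeader) {
        return !isTruncated(content, contentLengthHeader);
    }
}
```

For example, with a fetch limit of 65536 bytes and a header value of "1048576", `shouldParse(new byte[65536], "1048576")` returns false, so the truncated FLV would never reach the parser.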
-
Re: nutch crawl command takes 98% of cpu
alxsss@... 2011-03-14, 18:21
Hello,
Which version is this patch applicable to?

Thanks.
Alex.

-----Original Message-----
From: Alexis <[EMAIL PROTECTED]>
To: user <[EMAIL PROTECTED]>
Sent: Tue, Feb 8, 2011 9:59 am
Subject: Re: nutch crawl command takes 98% of cpu

[snip]
-
Re: nutch crawl command takes 98% of cpu
Markus Jelsma 2011-03-14, 18:37
Hi,
There is no -noParse option so your fetcher might actually fetch and parse, depending on the parse option in your Nutch config. Parsing usually takes a lot of CPU.

Cheers,

> Hello,
>
> At this time, I am using step by step crawling. In depth 4 nutch-1.2
> started taking all CPU i.e., command bin/nutch fetch $s4 took all CPU
> after fetching for about 1 day.
>
> Thanks.
> Alex.
-
Re: nutch crawl command takes 98% of cpu
alxsss@... 2011-03-14, 18:39
Hello,

At this time, I am using step by step crawling. In depth 4 nutch-1.2 started taking all CPU i.e., command bin/nutch fetch $s4 took all CPU after fetching for about 1 day.

Thanks.
Alex.