
Re: nutch crawl command takes 98% of cpu
On 2/1/11 1:39 AM, Kirby Bohling wrote:
> On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler
> <[EMAIL PROTECTED]>  wrote:
>> Some comments below.
>> On Jan 29, 2011, at 5:55am, Julien Nioche wrote:
>>> Hi,
>>> This shows the state of the various threads within a Java process. Most of
>>> them seem to be busy parsing zip archives with Tika. The interesting part
>>> is that the main thread is at the Generation step:
>>>   at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
>>>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>>> with the "Thread-415331" normalizing the URLs as part of the generation.
>>> So why do we see threads busy at parsing these archives? I think this is a
>>> result of the Timeout mechanism (
>>> https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.
>>> Before it, we used to have the parsing step loop on a single document and
>>> never complete. Thanks to Andrzej's patch, the parsing is done in separate
>>> threads which are abandoned if more than X seconds have passed (default
>>> 30, I think). Obviously these threads are still lurking around in the
>>> background and consuming CPU.
>>> This is an issue when calling the Crawl command only. When using the
>>> separate commands for the various steps, the runaway threads die with the
>>> main process, however since the Crawl uses a single process, these timeout
>>> threads keep going.
>>> I am not an expert in multithreading and don't have an idea of whether these
>>> threads could be killed somehow. Andrzej, any clue?
>> This is a fundamental problem with run-away threads - there is no safe,
>> reliable way to kill them off.
>> And if you parse enough documents, you will run into some that currently
>> cause Tika to hang. Zip files for sure, but we ran into the same issue with
>> FLV files.
>> Over in Tika-land, Jukka has a patch that fires up a child JVM and runs
>> parsers there. See https://issues.apache.org/jira/browse/TIKA-416
>> -- Ken
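
For readers following the thread, here is a minimal sketch of the behaviour described above: a task run under a timeout is abandoned, but its thread keeps running. This is not the NUTCH-696 code. The class and method names are hypothetical, and the spinning loop is only a stand-in for a parser that never checks its interrupt flag.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ParseTimeoutSketch {

    /**
     * Runs a task that ignores interruption under a timeout and reports
     * whether its worker thread is still running after being abandoned.
     */
    public static boolean workerSurvivesTimeout(long timeoutMillis) throws Exception {
        // Escape hatch so this demo can end the loop; a real hung parser has none.
        AtomicBoolean stop = new AtomicBoolean(false);
        AtomicBoolean running = new AtomicBoolean(false);

        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> parse = pool.submit(() -> {
            running.set(true);
            long spins = 0;
            // Hypothetical hung parser: spins without checking the interrupt flag.
            while (!stop.get()) {
                spins++;
            }
            running.set(false);
        });

        boolean survived = false;
        try {
            parse.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            parse.cancel(true);        // only sets the worker's interrupt flag
            Thread.sleep(100);         // give the worker time to react (it will not)
            survived = running.get();  // true: the thread is still consuming CPU
        } finally {
            stop.set(true);            // demo-only cleanup
            pool.shutdown();
        }
        return survived;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker still running after timeout: "
                + workerSurvivesTimeout(200));
    }
}
```

In a single long-lived process, each such timeout would leave one more busy thread behind, which matches the CPU usage reported in this thread.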
> All,
>    Just an observation, but the general approach to this problem is to
> use Thread.interrupt().  Virtually all code in the JDK treats the
> thread being interrupted as a request to cancel.  Java Concurrency in
> Practice (JCIP) has a whole chapter on this topic (Chapter 7).  IMHO,
> any general purpose library code that swallows "InterruptedException"
> and isn't implementing the Thread cancellation policy has a bug in it
> (the cancellation policy can only be implemented by the owner of the
> thread, unless the library is a task/thread library it cannot be
> implementing the cancellation policy).  Any place you see:
> catch (InterruptedException ex) {
>     // Ignore
> }
> Just plan on having a hard to track down bug at some point in the
> future.  At the very least, just reset the interruption status like
> so:
> catch (InterruptedException ex) {
>     // Resetting the interruption to avoid losing the cancellation request.
>     Thread.currentThread().interrupt();
>     // Twiddle any state necessary to bail out in a timely manner...
> }
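
A small self-contained sketch of the difference between the two catch blocks quoted above. The helper names (pauseRestoring, pauseSwallowing, flagVisibleAfter) are hypothetical and exist only for this illustration. The worker is interrupted during a sleep, then checks whether the cancellation request is still visible to the code that called the helper.

```java
public class RestoreInterruptSketch {

    // Hypothetical library helper that restores the interrupt status.
    public static void pauseRestoring(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            // Keep the cancellation request visible to the caller.
            Thread.currentThread().interrupt();
        }
    }

    // The same helper, but swallowing the exception.
    public static void pauseSwallowing(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            // Ignore: the cancellation request is lost here.
        }
    }

    /** Returns whether the worker still sees its interrupt flag after the helper returns. */
    public static boolean flagVisibleAfter(boolean restore) throws InterruptedException {
        boolean[] seen = new boolean[1];
        Thread worker = new Thread(() -> {
            if (restore) {
                pauseRestoring(10_000);
            } else {
                pauseSwallowing(10_000);
            }
            seen[0] = Thread.currentThread().isInterrupted();
        });
        worker.start();
        worker.interrupt();   // request cancellation
        worker.join();        // join makes the write to seen[0] visible here
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("restoring:  " + flagVisibleAfter(true));
        System.out.println("swallowing: " + flagVisibleAfter(false));
    }
}
```

With the restoring helper the caller can still observe the request and stop; with the swallowing helper it has no way to know that cancellation was ever asked for.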
>    The problem with using the interruption status as the cancellation
> approach is that it fails if there is a bug anywhere in any library
> that swallows the InterruptedException (in many ways it is similar to
> a data race).  It is a fundamental problem with threading: there is no
> way to share memory space and have a reliable cancel that a bug can't
> subvert.  An infinite loop while holding a lock is the canonical
> example of the problem, since killing the thread could leave an
> invariant invalid.
>     One simple way, if you control the creation of Threads, is to
> override "Thread.interrupt" and record that the interrupt method was
> called (and thus that cancellation of the thread/work was requested).
> Then, at the top of the outermost loop, check whether the cancel was
> set and, if so, bail out.  That assumes at some point you do in fact get
> back to the top of the loop.  If you're stuck in an inner loop, fix the
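
The Thread.interrupt override described in the quoted message could look roughly like the sketch below. This is an illustration under assumptions, not code from Nutch or Tika: CancellableThread, isCancelRequested and runUntilCancelled are hypothetical names, and the catch block that ignores the exception stands in for a buggy library.

```java
public class CancellableThreadSketch {

    /** Thread that remembers interrupt() was called, even if the flag is later cleared. */
    static class CancellableThread extends Thread {
        private volatile boolean cancelRequested = false;

        CancellableThread(Runnable body) {
            super(body);
        }

        @Override
        public void interrupt() {
            cancelRequested = true;  // record the cancellation request
            super.interrupt();       // then interrupt as usual
        }

        boolean isCancelRequested() {
            return cancelRequested;
        }
    }

    /** Returns the number of loop iterations, or -1 if the worker failed to stop. */
    public static int runUntilCancelled() throws InterruptedException {
        int[] iterations = new int[1];
        CancellableThread worker = new CancellableThread(() -> {
            CancellableThread self = (CancellableThread) Thread.currentThread();
            // Cancellation is checked at the top of the outermost loop.
            while (!self.isCancelRequested()) {
                iterations[0]++;
                try {
                    Thread.sleep(5);
                } catch (InterruptedException ignored) {
                    // Simulates library code that swallows the interrupt.
                }
            }
        });
        worker.start();
        Thread.sleep(50);
        worker.interrupt();
        worker.join(2000);
        return worker.isAlive() ? -1 : iterations[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("iterations before cancel: " + runUntilCancelled());
    }
}
```

The recorded flag survives a swallowed InterruptedException, but as the quoted message notes, it only helps if control actually returns to the top of the loop.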

That was very informative and useful, thanks for explaining it.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com