Re: nutch crawl command takes 98% of cpu
On 2/1/11 1:39 AM, Kirby Bohling wrote:
> On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler
> <[EMAIL PROTECTED]>  wrote:
>> Some comments below.
>>
>> On Jan 29, 2011, at 5:55am, Julien Nioche wrote:
>>
>>> Hi,
>>>
>>> This shows the state of the various threads within a Java process. Most of
>>> them seem to be busy parsing zip archives with Tika. The interesting part
>>> is that the main thread is at the Generation step:
>>>
>>>   at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
>>>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>>>
>>> with the "Thread-415331" normalizing the URLs as part of the generation.
>>>
>>> So why do we see threads busy parsing these archives? I think this is a
>>> result of the timeout mechanism
>>> (https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.
>>> Before it, we used to have the parsing step loop on a single document and
>>> never complete. Thanks to Andrzej's patch, the parsing is done in separate
>>> threads which are abandoned if more than X seconds have passed (default
>>> 30, I think). Obviously these threads are still lurking around in the
>>> background and consuming CPU.
>>>
>>> This is an issue only when calling the Crawl command. When using the
>>> separate commands for the various steps, the runaway threads die with the
>>> main process; however, since the Crawl command uses a single process,
>>> these timed-out threads keep going.
>>>
>>> I am not an expert in multithreading and don't know whether these
>>> threads could be killed somehow. Andrzej, any clue?
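
A minimal sketch of the timeout pattern described above, written for
illustration and not taken from Nutch or NUTCH-696. A task is submitted to an
executor, the caller waits a bounded time, and on timeout the worker is only
asked to stop through an interrupt. The names "ParseTimeoutSketch",
"runWithTimeout", "cooperativeTask" and "workerStopsAfterTimeout" are
hypothetical, and the counting loop merely stands in for a parser.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ParseTimeoutSketch {

    /** Runs the task; returns its result, or null if it did not finish in time. */
    static <T> T runWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker thread; it cannot force it to stop.
            future.cancel(true);
            return null;
        }
    }

    /** Hypothetical stand-in for a long parse that cooperates by polling the interrupt flag. */
    static Callable<Long> cooperativeTask() {
        return () -> {
            long iterations = 0;
            while (!Thread.currentThread().isInterrupted()) {
                iterations++;
            }
            return iterations;
        };
    }

    /** Times out a cooperative task, then checks that its worker thread really exits. */
    static boolean workerStopsAfterTimeout() {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-sketch");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        try {
            Long result = runWithTimeout(pool, cooperativeTask(), 100);
            pool.shutdown();
            return result == null && pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the cancellation request visible
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + workerStopsAfterTimeout());
    }
}
```

The worker exits here only because the task polls its own interrupt flag. A
parser that never checks the flag, and never calls a blocking method that
does, would keep running after cancel(true), which matches the lingering
threads described above.
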
>>
>> This is a fundamental problem with run-away threads - there is no safe,
>> reliable way to kill them off.
>>
>> And if you parse enough documents, you will run into some that currently
>> cause Tika to hang. Zip files for sure, but we ran into the same issue with
>> FLV files.
>>
>> Over in Tika-land, Jukka has a patch that fires up a child JVM and runs
>> parsers there. See https://issues.apache.org/jira/browse/TIKA-416
>>
>> -- Ken
>>
>
> All,
>
>    Just an observation, but the general approach to this problem is to
> use Thread.interrupt().  Virtually all code in the JDK treats the
> thread being interrupted as a request to cancel.  Java Concurrency in
> Practice (JCIP) has a whole chapter on this topic (Chapter 7).  IMHO,
> any general purpose library code that swallows "InterruptedException"
> and isn't implementing the Thread cancellation policy has a bug in it
> (the cancellation policy can only be implemented by the owner of the
> thread; unless the library is a task/thread library, it cannot be
> implementing the cancellation policy).  Any place you see:
>
> catch (InterruptedException ex) {
> // Ignore
> }
>
> Just plan on having a hard-to-track-down bug at some point in the
> future.  At the very least, just reset the interruption status like
> so:
>
> catch (InterruptedException ex) {
>     // Resetting the interruption to avoid losing the cancellation request.
>     Thread.currentThread().interrupt();
>     // Twiddle any state necessary to get a bail out in a timely manner...
> }
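
The difference between the two catch blocks above can be sketched as a
runnable comparison. This is an illustrative example, not code from the
thread: "InterruptRestoreSketch" and its methods are hypothetical names, and
Thread.sleep stands in for any blocking call inside a library.

```java
final class InterruptRestoreSketch {

    /** Swallows the exception and, with it, the cancellation request. */
    static void sleepSwallowing(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            // Ignore: callers can no longer tell that cancellation was requested.
        }
    }

    /** Restores the interrupt status so code further up the stack can still see it. */
    static void sleepRestoring(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    /** Interrupts the current thread, runs the action, and reports whether the flag survived. */
    static boolean flagSurvives(Runnable action) {
        Thread.currentThread().interrupt();
        action.run();
        return Thread.interrupted(); // reads and clears the flag
    }

    public static void main(String[] args) {
        System.out.println("swallowing keeps flag: " + flagSurvives(() -> sleepSwallowing(10)));
        System.out.println("restoring keeps flag:  " + flagSurvives(() -> sleepRestoring(10)));
    }
}
```

Thread.sleep throws InterruptedException and clears the interrupt status when
the flag is already set, so only the restoring variant leaves the request
visible to the code that owns the thread.
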
>
>    The problem with using the interruption status as the cancellation
> approach is that it fails if there is a bug anywhere in any library
> that swallows the InterruptedException (in many ways it is similar to
> a data race).  It is a fundamental problem with threading: there is no
> way to share memory space and have a reliable cancel that a bug can't
> subvert.  An infinite loop while holding a lock is the canonical
> example of the problem; killing the thread could leave an invariant
> invalid.
>
>     One trivial and simple way, if you control the creation of Threads,
> is to override "Thread.interrupt" and record that the interrupt
> method was called (and thus cancellation of the thread/work was
> requested).  Then, at the top of the outermost loop, check whether the
> cancel was set and bail out if so.  That assumes at some point you do in
> fact get back to the top of the loop.  If you're stuck in an inner loop,
> fix the

That was very informative and useful, thanks for explaining it.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com