Nutch, mail # user - nutch crawl command takes 98% of cpu


Re: nutch crawl command takes 98% of cpu
Andrzej Bialecki 2011-02-01, 11:39
On 2/1/11 1:39 AM, Kirby Bohling wrote:
> On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler
> <[EMAIL PROTECTED]>  wrote:
>> Some comments below.
>>
>> On Jan 29, 2011, at 5:55am, Julien Nioche wrote:
>>
>>> Hi,
>>>
>>> This shows the state of the various threads within a Java process. Most of
>>> them seem to be busy parsing zip archives with Tika. The interesting part is
>>> that the main thread is at the Generation step:
>>>
>>>   at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
>>>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>>>
>>> with the "Thread-415331" normalizing the URLs as part of the generation.
>>>
>>> So why do we see threads busy parsing these archives? I think this is a
>>> result of the timeout mechanism
>>> (https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.
>>> Before it, the parsing step could loop on a single document and never
>>> complete. Thanks to Andrzej's patch, the parsing is done in separate
>>> threads which are abandoned if more than X seconds have passed (default 30,
>>> I think). Obviously these threads are still lurking around in the
>>> background and consuming CPU.
>>>
>>> This is only an issue when using the Crawl command. When using the
>>> separate commands for the various steps, the runaway threads die with the
>>> main process. However, since the Crawl command uses a single process, these
>>> timed-out threads keep going.
>>>
>>> I am not an expert in multithreading and don't know whether these
>>> threads could be killed somehow. Andrzej, any clue?
>>
>> This is a fundamental problem with run-away threads - there is no safe,
>> reliable way to kill them off.
>>
>> And if you parse enough documents, you will run into some that currently
>> cause Tika to hang. Zip files for sure, but we ran into the same issue with
>> FLV files.
>>
>> Over in Tika-land, Jukka has a patch that fires up a child JVM and runs
>> parsers there. See https://issues.apache.org/jira/browse/TIKA-416
>>
>> -- Ken
>>
>
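
The timeout-and-abandon behaviour discussed above can be illustrated with a
minimal sketch. This is not the NUTCH-696 code; the class and method names
(ParseTimeoutSketch, hangingButInterruptibleParse, workerStopsAfterTimeout)
are hypothetical. It assumes the stuck "parser" at least polls the interrupt
flag, which is exactly what a truly hung Tika parser may not do.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParseTimeoutSketch {

    /** Hypothetical parser that never finishes, but does poll the interrupt flag. */
    static String hangingButInterruptibleParse() throws InterruptedException {
        while (true) {
            if (Thread.interrupted()) {
                throw new InterruptedException("parse cancelled");
            }
        }
    }

    /**
     * Runs the parser on a worker thread and waits at most timeoutMs for it.
     * Returns true if the worker thread has exited within two seconds of
     * shutdown, false if it is still running in the background.
     */
    static boolean workerStopsAfterTimeout(Callable<String> parser, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> result = pool.submit(parser);
        try {
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Giving up on the result does not stop the worker. cancel(true)
            // interrupts it, which only helps if the task honours interrupts.
            result.cancel(true);
        } catch (ExecutionException e) {
            // The parser failed on its own; nothing is left running.
        }
        pool.shutdown();
        return pool.awaitTermination(2, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        boolean stopped = workerStopsAfterTimeout(
                ParseTimeoutSketch::hangingButInterruptibleParse, 100);
        System.out.println("worker stopped: " + stopped);
    }
}
```

If the task never checks its interrupt status, cancel(true) has no effect and
the worker keeps consuming CPU until the JVM exits, which matches the symptom
reported for the single-process Crawl command.
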
> All,
>
>    Just an observation, but the general approach to this problem is to
> use Thread.interrupt().  Virtually all code in the JDK treats the
> thread being interrupted as a request to cancel.  Java Concurrency in
> Practice (JCIP) has a whole chapter on this topic (Chapter 7).  IMHO,
> any general purpose library code that swallows "InterruptedException"
> and isn't implementing the Thread cancellation policy has a bug in it
> (the cancellation policy can only be implemented by the owner of the
> thread, unless the library is a task/thread library it cannot be
> implementing the cancellation policy).  Any place you see:
>
> catch (InterruptedException ex) {
> // Ignore
> }
>
> Just plan on having a hard to track down bug at some point in the
> future.  At the very least, just reset the interruption status like
> so:
>
> catch (InterruptedException ex) {
>     // Reset the interruption status to avoid losing the cancellation request.
>     Thread.currentThread().interrupt();
>     // Twiddle any state necessary to get a bail out in a timely manner...
> }
>
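
As a rough, self-contained illustration of the catch block above, the sketch
below shows a helper that restores the interruption status and a loop that
uses it as its exit condition. All names (InterruptRestoreSketch, pause,
runUntilInterrupted, demo) are made up for the example.

```java
class InterruptRestoreSketch {

    /** Hypothetical library-style helper: sleeps, and restores the flag if interrupted. */
    static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException ex) {
            // Thread.sleep clears the flag when it throws, so set it again
            // to keep the cancellation request visible to the caller.
            Thread.currentThread().interrupt();
        }
    }

    /** Hypothetical task loop; returns how many iterations ran before cancellation. */
    static int runUntilInterrupted() {
        int iterations = 0;
        while (!Thread.currentThread().isInterrupted()) {
            pause(10);
            iterations++;
        }
        return iterations;
    }

    /** Starts the loop, interrupts it, and reports whether the thread exited. */
    static boolean demo() throws InterruptedException {
        Thread worker = new Thread(InterruptRestoreSketch::runUntilInterrupted);
        worker.start();
        Thread.sleep(50);
        worker.interrupt();
        worker.join(2000);
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker exited: " + demo());
    }
}
```

If pause() swallowed the exception instead, the loop condition would never
see the interrupt and the worker would run until the JVM exits.
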
>    The problem with using the interruption status as the cancellation
> mechanism is that it fails if there is a bug anywhere in any library
> that swallows the InterruptedException (in many ways it is similar to
> a data race).  It is a fundamental problem with threading: there is no
> way to share a memory space and have a reliable cancel that a bug can't
> subvert.  An infinite loop while holding a lock is the canonical
> example of the problem, since killing the thread could leave an
> invariant invalid.
>
>     One simple way, if you control the creation of Threads, is to
> override "Thread.interrupt" and record that the interrupt method was
> called (and thus that cancellation of the thread/work was requested).
> Then, at the top of the outermost loop, check whether the cancel was
> set and bail out if so.  That assumes at some point you do in fact get
> back to the top of the loop.  If you're stuck in an inner loop, fix the
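
The Thread.interrupt override described in the quoted text could look roughly
like the sketch below. The class names (CancellableThread, CancelFlagSketch)
and the deliberately buggy helper are assumptions made for illustration only.

```java
/** Hypothetical Thread subclass that remembers every cancellation request. */
class CancellableThread extends Thread {
    private volatile boolean cancelRequested = false;

    CancellableThread(Runnable task) {
        super(task);
    }

    @Override
    public void interrupt() {
        cancelRequested = true; // record the request before delegating
        super.interrupt();
    }

    boolean isCancelRequested() {
        return cancelRequested;
    }
}

class CancelFlagSketch {

    /** Simulates a buggy library call that swallows the interrupt. */
    static void buggyLibraryCall() {
        try {
            Thread.sleep(10);
        } catch (InterruptedException ignored) {
            // Interruption status is lost here; only the recorded flag survives.
        }
    }

    /** Starts the worker, interrupts it, and reports whether it exited. */
    static boolean demo() throws InterruptedException {
        CancellableThread worker = new CancellableThread(() -> {
            CancellableThread self = (CancellableThread) Thread.currentThread();
            // Check the recorded flag at the top of the outermost loop.
            while (!self.isCancelRequested()) {
                buggyLibraryCall();
            }
        });
        worker.start();
        Thread.sleep(50);
        worker.interrupt();
        worker.join(2000);
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker exited: " + demo());
    }
}
```

The recorded flag survives even though the library call discards the
interruption status, but the approach still depends on control returning to
the top of the outer loop.
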

That was very informative and useful, thanks for explaining it.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com