Re: nutch crawl command takes 98% of cpu

Thanks for all the feedback. It looks like there is not much you can
do if you give the FLV parser corrupted data. From a practical point
of view, this is extremely annoying: it takes up all the CPU resources
and prevents other threads from performing their tasks properly until
the timeout occurs, kills the thread, and frees up the CPU.

We notice that this happens when an FLV file is truncated (due to an
http.content.limit property lower than its Content-Length, in bytes).
So the suggestion is to warn the parser that it is likely to get
stuck, and to skip parsing when the downloaded content size does not
match the Content-Length header.

Besides, I often see errors in the HTML parser when the content is
truncated (https://issues.apache.org/jira/browse/TIKA-307), so
skipping the parse both saves time and avoids those errors.
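The proposed check can be sketched as follows. This is only an illustrative helper (the class and method names are mine, not the actual NUTCH-965 patch): skip parsing when the fetched byte count is smaller than the Content-Length the server declared, since that means the download was truncated (e.g. by http.content.limit) and the parser may hang.

```java
public class TruncationCheck {

    /**
     * Returns true when the downloaded content is shorter than the
     * Content-Length the server declared, meaning the fetch was
     * truncated and the parser is likely to get stuck on it.
     */
    static boolean isTruncated(long declaredContentLength, byte[] content) {
        // A missing or unparseable Content-Length header is conventionally
        // represented as -1; in that case we cannot tell, so assume the
        // content is complete and let the parser try.
        if (declaredContentLength < 0) {
            return false;
        }
        return content.length < declaredContentLength;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        System.out.println(isTruncated(4096, data)); // truncated
        System.out.println(isTruncated(1024, data)); // complete
        System.out.println(isTruncated(-1, data));   // unknown header
    }
}
```

A caller would run this check between the fetch and parse steps and record a parse failure instead of invoking the parser when it returns true.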

I created the issue here: https://issues.apache.org/jira/browse/NUTCH-965
See attached patch.


On Mon, Feb 7, 2011 at 12:00 PM, Ken Krugler wrote:
> Hi Kirby & others,
> On Jan 31, 2011, at 4:39pm, Kirby Bohling wrote:
>> On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler
>> <[EMAIL PROTECTED]> wrote:
>>> Some comments below.
>>> On Jan 29, 2011, at 5:55am, Julien Nioche wrote:
>>>> Hi,
>>>> This shows the state of the various threads within a Java process.
>>>> Most of them seem to be busy parsing zip archives with Tika. The
>>>> interesting part is that the main thread is at the Generation step:
>>>>   at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
>>>>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>>>> with "Thread-415331" normalizing the URLs as part of the generation.
>>>> So why do we see threads busy parsing these archives? I think this
>>>> is a result of the timeout mechanism
>>>> (https://issues.apache.org/jira/browse/NUTCH-696) used for the
>>>> parsing. Before it, the parsing step used to loop on a single
>>>> document and never complete. Thanks to Andrzej's patch, the parsing
>>>> is done in separate threads which are abandoned if more than X
>>>> seconds have passed (default 30, I think). Obviously these threads
>>>> are still lurking around in the background and consuming CPU.
>>>> This is an issue when calling the Crawl command only. When using the
>>>> separate commands for the various steps, the runaway threads die
>>>> with the main process; however, since Crawl uses a single process,
>>>> these timeout threads keep going.
>>>> I'm not an expert in multithreading and don't know whether these
>>>> threads could be killed somehow. Andrzej, any clue?
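The timeout mechanism described above can be sketched with `java.util.concurrent` (an illustrative example, not the actual NUTCH-696 code): the parse runs in a worker thread and the caller gives up after a timeout. Note that `cancel(true)` only *interrupts* the worker; a parser that never checks its interrupt status keeps spinning in the background, which is exactly the runaway-CPU symptom in this thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {

    /** Runs parseTask in a worker thread, giving up after the timeout. */
    static String parseWithTimeout(Callable<String> parseTask,
                                   long timeoutSeconds) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Best effort: this interrupts the worker but cannot kill it.
            // A task that ignores interruption survives and burns CPU.
            future.cancel(true);
            return null; // treat as a failed parse
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithTimeout(() -> "parsed", 30));
    }
}
```

When the main process exits (the separate-commands case mentioned above), abandoned workers die with the JVM; in a long-lived single process like Crawl, they linger.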
>>> This is a fundamental problem with runaway threads - there is no
>>> safe, reliable way to kill them off.
>>> And if you parse enough documents, you will run into a number that
>>> currently cause Tika to hang. Zip files for sure, but we ran into
>>> the same issue with FLV files.
>>> Over in Tika-land, Jukka has a patch that fires up a child JVM and
>>> runs parsers there. See https://issues.apache.org/jira/browse/TIKA-416
>>> -- Ken
>> All,
>>  Just an observation, but the general approach to this problem is to
>> use Thread.interrupt(). Virtually all code in the JDK treats the
>> thread being interrupted as a request to cancel. Java Concurrency in
>> Practice (JCIP) has a whole chapter on this topic (Chapter 7). IMHO,
>> any general-purpose library code that swallows InterruptedException
>> and isn't implementing the thread cancellation policy has a bug in it
>> (the cancellation policy can only be implemented by the owner of the
>> thread; unless the library is a task/thread library, it cannot be
>> implementing the cancellation policy). Any place you see:
> [snip]
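The JCIP guidance quoted above can be sketched as follows (an illustrative example of mine, not the snipped code from the original message): library code that catches InterruptedException but does not own the thread should restore the interrupt status and return, so the thread's owner can still observe the cancellation request, rather than silently swallowing it.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class InterruptFriendly {

    /** Polls the queue, sleeping between attempts until an item appears. */
    static int pollWithBackoff(Queue<Integer> queue) {
        while (true) {
            Integer item = queue.poll();
            if (item != null) {
                return item;
            }
            try {
                Thread.sleep(10); // back off before retrying
            } catch (InterruptedException e) {
                // Not our thread to cancel: re-assert the interrupt flag
                // and bail out so the owner can act on the request.
                Thread.currentThread().interrupt();
                return -1;
            }
        }
    }

    public static void main(String[] args) {
        Queue<Integer> q = new ArrayDeque<>();
        q.add(42);
        System.out.println(pollWithBackoff(q)); // prints 42
    }
}
```

Swallowing the exception and looping again is the bug pattern being described: the interrupt is lost and the owner's cancellation request has no effect.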
>> One exception is that
>> socket read/write operations don't work this way; the socket must
>> be closed to interrupt a read/write, and the approach JCIP suggests is to