Nutch, mail # user - Large website not fully crawled


Re: Large website not fully crawled
Tolga 2012-05-24, 07:35
- I don't fully understand the use of the topN parameter. Should I increase it?
- You mean the parse-pdf thing? I've got that in my nutch-default.xml.
- I looked for the link, it was there. Besides, that was for another
website I was experimenting on.
- How do I check segments?
- I didn't check filenames, but I've tried searching for a word in that
PDF file.
- I've got more than 50 GB free.
- I'm not sure about the webserver kicking me off; I'll have to check that
with the sysadmin.

Regards,

On 5/24/12 10:25 AM, Piet van Remortel wrote:
> - your topN parameter limited the crawl : see the info at
> http://wiki.apache.org/nutch/NutchTutorial
>
> or :
>
> - file filters
> - there is no link to the files (as you suggested yourself already)
> - did you check the correct/all segments ?
> - did you check the fully correct filenames ? wildcards don't work on all
> segmentreader approaches
> - size limits of the crawler (see previous discussion)
> - did you check file presence in the segment, or parse result ?  i.e.
> parsing could have failed (cfr the previous discussion of the last few days)
> - your disk got full and crawling stopped
> - the webserver(s) kicked you off
> - your hadoop logs have overrun the local disk on which the crawler was
> running (i.e. disk full)
>
> Piet
>
>
> On Thu, May 24, 2012 at 9:17 AM, Tolga<[EMAIL PROTECTED]>  wrote:
>
>> Hi,
>>
>> I am crawling a large website, which is our university's. From the logs
>> and some grep'ing, I see that some pdf files were not crawled. Why could
>> this happen? I'm crawling with -depth 100 -topN 5.
>>
>> Regards,
>>
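
A note on the topN question in this thread: per the Nutch tutorial linked above, -topN caps the number of pages retrieved at each level, and -depth sets the number of levels. The sketch below is only back-of-the-envelope arithmetic, not Nutch code; the function name max_pages_fetched is made up for illustration. It shows why -depth 100 -topN 5 can leave much of a large site uncrawled.

```python
def max_pages_fetched(depth: int, top_n: int) -> int:
    """Upper bound on pages fetched by a crawl.

    Assumes each of the `depth` rounds fetches at most `top_n` URLs,
    which is how the Nutch tutorial describes -depth and -topN.
    The real total can be lower (failed fetches, filtered URLs,
    or the crawl running out of new links early).
    """
    if depth < 0 or top_n < 0:
        raise ValueError("depth and top_n must be non-negative")
    return depth * top_n


if __name__ == "__main__":
    # Settings from the original question: -depth 100 -topN 5
    print(max_pages_fetched(100, 5))
    # A larger topN for comparison (hypothetical value)
    print(max_pages_fetched(100, 1000))
```

With the settings from the question, the crawl can fetch at most 500 pages in total, so PDFs that never rank among the top 5 URLs of a round are likely never fetched. Raising -topN lifts that ceiling proportionally.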