Nutch, mail # user - parse data directory not found after merge


Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 16:08
Lewis,

Changing the mergesegs -dir argument to use the /* wildcard returns a similar error:

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files

And yes, your assumption was correct - it's a different segment
directory each loop.

Many thanks,

Dean.
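
The "matches 0 files" error above suggests that no directory under segments/ contains a parse_data subdirectory. A minimal sketch for checking that by hand follows. The function name segments_missing_parse_data is hypothetical, and the default path is simply the one quoted in this thread.

```shell
#!/bin/sh
# List segment directories that have no parse_data subdirectory.
# segments_missing_parse_data is a hypothetical helper, not part of Nutch.
segments_missing_parse_data() {
    segments_dir=$1
    for seg in "$segments_dir"/*/; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$seg" ] || continue
        if [ ! -d "${seg}parse_data" ]; then
            printf '%s\n' "${seg%/}"
        fi
    done
}

# Default path is taken from this thread; override with SEGMENTS_DIR.
segments_missing_parse_data "${SEGMENTS_DIR:-/opt/nutch_1_4/data/crawl/segments}"
```

Any segment it prints would make invertlinks fail in the way shown, so it may be worth running both before and after the merge step.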

On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
> Hi Dean,
>
> Without discussing any of your configuration properties can you please try
>
> 6) MERGE SEGMENTS:
> /opt/nutch_1_4/bin/nutch mergesegs
> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>
> paying attention to the wildcard /* in -dir /opt/nutch_1_4/data/crawl/segments/*
>
> Also presumably, when you mention you repeat steps 2-5 another 4
> times, you are not recursively generating, fetching, parsing and
> updating the WebDB with
> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
> with every iteration of the g/f/p/updatedb cycle.
>
> Thanks
>
> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[EMAIL PROTECTED]> wrote:
>> No problem Lewis, I appreciate you looking into it.
>>
>>
>> Firstly I have a seed URL XML document here:
>> http://www.ukcigarforums.com/injectlist.xml
>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>> within it.
>>
>> Nutch's regex-urlfilter.txt contains this:
>>
>> # allow urls in ukcigarforums.com domain
>> +http://([a-z0-9-A-Z]*.)*ukcigarforums.com/
>> # deny anything else
>> -.
>>
>>
>> Here's the procedure:
>>
>>
>> 1) INJECT:
>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/nutch_1_4/data/seed/
>>
>> 2) GENERATE:
>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>
>> 3) FETCH:
>> /opt/nutch_1_4/bin/nutch fetch
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>
>> 4) PARSE:
>> /opt/nutch_1_4/bin/nutch parse
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>
>> 5) UPDATE DB:
>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>
>>
>> Repeat steps 2 to 5 another 4 times, then:
>>
>> 6) MERGE SEGMENTS:
>> /opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/
>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>
>>
>> Interestingly, this prints out:
>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>> crawl_parse parse_data parse_text"
>>
>> MERGEDsegments segment directory then has just two directories, instead of
>> all of those listed in the last output, i.e. just: crawl_generate and
>> crawl_fetch
>>
>> (we then delete the contents of the segments directory and copy the
>> MERGEDsegments results into it)
>>
>>
>> Lastly we run invert links after merge segments:
>>
>> 7) INVERT LINKS:
>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/ -dir
>> /opt/nutch_1_4/data/crawl/segments/
>>
>> Which produces:
>>
>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>
>>
>
>
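
Steps 2 to 5 quoted above have to run against a new segment on every pass. The sketch below shows one way to pick up the segment that generate has just created. The helper names latest_segment and crawl_round are hypothetical, the command options are copied as-is from the commands in this thread, and the sketch assumes generate writes into the same segments directory that the later steps read from.

```shell
#!/bin/sh
# Sketch of steps 2-5 from this thread, run against a fresh segment each time.
# NUTCH and CRAWL default to the paths quoted above; both can be overridden.
NUTCH=${NUTCH:-/opt/nutch_1_4/bin/nutch}
CRAWL=${CRAWL:-/opt/nutch_1_4/data/crawl}

# Print the newest segment directory. Segment names are timestamps, so the
# last entry in sorted glob order is the most recent one.
# Returns non-zero when no segment exists.
latest_segment() {
    newest=
    for seg in "$1"/*/; do
        [ -d "$seg" ] || continue
        newest=${seg%/}
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# One generate/fetch/parse/updatedb round (hypothetical helper).
# Options are copied from the commands quoted in this thread.
crawl_round() {
    "$NUTCH" generate "$CRAWL/crawldb/" "$CRAWL/segments/" -topN 10000 -adddays 26 || return 1
    segment=$(latest_segment "$CRAWL/segments") || return 1
    "$NUTCH" fetch "$segment" -threads 15 || return 1
    "$NUTCH" parse "$segment" -threads 15 || return 1
    "$NUTCH" updatedb "$CRAWL/crawldb/" "$segment" -normalize -filter || return 1
}
```

Five passes, as described in the thread, could then be run with `for i in 1 2 3 4 5; do crawl_round || break; done`. Stopping at the first failed step avoids leaving a segment that was fetched but never parsed.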