Nutch, mail # user - parse data directory not found after merge


Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 16:24
Good spot, because all of that was meant to be removed! No, it wasn't
intentional; I'm afraid that's just a copy/paste problem.

Dean

On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
> Ok then,
>
> How about your generate command:
>
> 2) GENERATE:
> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>
> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
> when everything else being utilised within the crawl cycle points to
> an entirely different <segments_dir> path, which is
> /opt/nutch_1_4/data/crawl/segments/segment_date
>
> Was this intentional?
>
> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <[EMAIL PROTECTED]> wrote:
>> Lewis,
>>
>> Changing the merge to * returns a similar response:
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>
>> And yes, your assumption was correct - it's a different segment directory
>> each loop.
>>
>> Many thanks,
>>
>> Dean.
>>
>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>> Hi Dean,
>>>
>>> Without discussing any of your configuration properties, can you please try:
>>>
>>> 6) MERGE SEGMENTS:
>>> /opt/nutch_1_4/bin/nutch mergesegs
>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>
>>> paying attention to the wildcard /* in -dir
>>> /opt/nutch_1_4/data/crawl/segments/*
>>>
>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>> times, you are not recursively generating, fetching, parsing and
>>> updating the WebDB with
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>> with every iteration of the g/f/p/updatedb cycle.
>>>
>>> Thanks
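
Lewis's point that the segment path should change on every generate/fetch/parse/updatedb iteration can be sketched as a small shell helper. This is only a minimal sketch and not something posted in the thread: the function name latest_segment is hypothetical, and it assumes segment directories are named by timestamp (as in 20120106152527), so that the lexically last directory is the newest one.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the newest segment
# directory under a segments root. Assumes timestamp-named segments,
# e.g. 20120106152527, so glob order is chronological order.
latest_segment() {
    ls_root="${1%/}"                 # tolerate a trailing slash on the root
    ls_latest=""
    for ls_seg in "$ls_root"/*/; do
        [ -d "$ls_seg" ] || continue # glob matched nothing: skip the literal pattern
        ls_latest="${ls_seg%/}"      # the last match in sorted order wins
    done
    if [ -z "$ls_latest" ]; then
        echo "no segments found under $ls_root" >&2
        return 1
    fi
    printf '%s\n' "$ls_latest"
}

# Possible use inside each iteration, right after "nutch generate":
#   SEGMENT=$(latest_segment /opt/nutch_1_4/data/crawl/segments) || exit 1
#   /opt/nutch_1_4/bin/nutch fetch "$SEGMENT" -threads 15
#   /opt/nutch_1_4/bin/nutch parse "$SEGMENT"
#   /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/ "$SEGMENT" -normalize -filter
```

Resolving the segment path after each generate step, rather than hard-coding one timestamp, avoids fetching and parsing the same segment repeatedly.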
>>>
>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[EMAIL PROTECTED]> wrote:
>>>> No problem Lewis, I appreciate you looking into it.
>>>>
>>>>
>>>> Firstly I have a seed URL XML document here:
>>>> http://www.ukcigarforums.com/injectlist.xml
>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>> within it.
>>>>
>>>> Nutch's regex-urlfilter.txt contains this:
>>>>
>>>> # allow urls in ukcigarforums.com domain
>>>> +http://([a-z0-9-A-Z]*.)*ukcigarforums.com/
>>>> # deny anything else
>>>> -.
>>>>
>>>>
>>>> Here's the procedure:
>>>>
>>>>
>>>> 1) INJECT:
>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/nutch_1_4/data/seed/
>>>>
>>>> 2) GENERATE:
>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>>>
>>>> 3) FETCH:
>>>> /opt/nutch_1_4/bin/nutch fetch
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>
>>>> 4) PARSE:
>>>> /opt/nutch_1_4/bin/nutch parse
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>
>>>> 5) UPDATE DB:
>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>
>>>>
>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>
>>>> 6) MERGE SEGMENTS:
>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>
>>>>
>>>> Interestingly, this prints out:
>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>> crawl_parse parse_data parse_text"
>>>>
>>>> The MERGEDsegments segment directory then has just two directories,
>>>> instead of all of those listed in the last output, i.e. just
>>>> crawl_generate and crawl_fetch
>>>>
>>>> (we then delete from the segments directory and copy the MERGEDsegments
>>>> results into it)
>>>>
>>>>
>>>> Lastly we run invert links after merge segments:
>>>>
>>>> 7) INVERT LINKS:
>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>> -dir /opt/nutch_1_4/data/crawl/segments/
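
The InvalidInputException quoted earlier in the thread reports that segments/*/parse_data matches 0 files, and the merged segment is described as holding only crawl_generate and crawl_fetch. One way to check which segments lack parse output before running invertlinks might look like the following. It is a minimal sketch under stated assumptions: missing_parse_data is a hypothetical helper, not a Nutch command, and it only inspects the directory layout rather than the contents of the segment data.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): list every segment directory
# under a segments root that has no parse_data subdirectory.
missing_parse_data() {
    mp_root="${1%/}"                 # tolerate a trailing slash on the root
    for mp_seg in "$mp_root"/*/; do
        [ -d "$mp_seg" ] || continue # no segments at all: nothing to report
        if [ ! -d "${mp_seg}parse_data" ]; then
            printf '%s\n' "${mp_seg%/}"
        fi
    done
}

# Possible use with the path from this thread:
#   missing_parse_data /opt/nutch_1_4/data/crawl/segments
```

If this prints the merged segment, the problem lies in the merge step (or in the segments fed to it) rather than in the invertlinks invocation itself.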