Nutch, mail # user - parse existing segments

Re: parse existing segments
Ferdy Galema 2011-11-03, 13:11
What are you trying to achieve? The crawl command does not invoke any
plugins itself; it merely chains several Nutch jobs together. The
Nutch jobs themselves - or more specifically the mappers and reducers -
make use of the plugin repository.

On 11/03/2011 01:47 PM, Ashish M wrote:
> What method in crawl.java would trigger the invocation of plugins?
>
> Sent from my iPhone. Please ignore the typos.
>
> On Nov 3, 2011, at 5:30 AM, Markus Jelsma <[EMAIL PROTECTED]> wrote:
>
>> remove *parse* in the segment and you're good to go.
>>
>> On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
>>> Hi All,
>>>
>>> I am trying to parse already crawled segments using the method --
>>> ParseSegment.parse(seg);
>>>
>>>
>>> seg is the Path to the existing segment.
>>> This internally fires a new job and the error thrown is --
>>>
>>> Exception in thread "main" java.io.IOException: Segment already parsed!
>>>     at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>>>
>>> What I am trying to do here is parse the already fetched data to test my
>>> HTML Parse Filter. Looks like the above method of ParseSegment gets called
>>> in the normal workflow of crawl, fetch, parse ...
>>>
>>> What I have done is modify org.apache.nutch.crawl.Crawl.run() to call
>>> only ParseSegment, with the injector, generator and fetcher parts
>>> commented out. I call ParseSegment.parse(segment) in the run() method
>>> and pass the segment name on the command line.
>>>
>>> Should I be calling some other method to test my HTML parser filter plugin
>>> without crawling again?
>>>
>>> Any pointers should be helpful.
>>>
>>> Thanks,
>>> Ashish
>> --
>> Markus Jelsma - CTO - Openindex
>> http://www.linkedin.com/in/markus17
>> 050-8536620 / 06-50258350
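
Markus's suggestion in the thread above (remove everything matching *parse* inside the segment before re-running the parse job) could be sketched roughly as follows. This is only an illustration under some assumptions: the segment is assumed to live on the local filesystem (a segment on HDFS would need the Hadoop filesystem tools instead), the class and method names are hypothetical and not part of Nutch, and the glob is taken literally from the message. In a typical Nutch 1.x segment it would match crawl_parse, parse_data and parse_text while leaving the fetched content in place.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: clears earlier parse output
// from a segment stored on the local filesystem.
public class SegmentParseCleaner {

    /**
     * Deletes every entry directly under segmentDir whose name matches
     * the glob *parse* and returns the removed names in sorted order.
     */
    public static List<String> removeParseOutput(Path segmentDir) throws IOException {
        // Collect the matches first so nothing is deleted while the
        // directory stream is still being iterated.
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(segmentDir, "*parse*")) {
            for (Path entry : stream) {
                matches.add(entry);
            }
        }

        List<String> removed = new ArrayList<>();
        for (Path match : matches) {
            deleteRecursively(match);
            removed.add(match.getFileName().toString());
        }
        Collections.sort(removed);
        return removed;
    }

    // Deletes a file or a whole directory tree.
    private static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SegmentParseCleaner <local segment dir>");
            System.exit(1);
        }
        List<String> removed = removeParseOutput(Paths.get(args[0]));
        System.out.println("Removed: " + removed);
    }
}
```

Once the old parse output is gone, the "Segment already parsed!" check should no longer fire, so the same fetched content can be parsed again to exercise an HTML parse filter without re-crawling. Working on a copy of the segment is a sensible precaution, since the deletion is not reversible.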