Nutch, mail # user - parse data directory not found after merge


Re: parse data directory not found after merge
Dean Pullen 2012-01-08, 22:51
Where do we go from here? I can start looking/stepping through the
mergesegs code, but I'm reluctant due to its probable complexity.

Dean.
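
Before stepping through mergesegs, it may help to confirm exactly which segment directories lack parse_data, since that is the path LinkDb complains about in the log quoted below. The following is a minimal sketch, not part of Nutch: the function name `incomplete_segments` and the `EXPECTED_PARTS` list are assumptions made for illustration, based on the subdirectories a fetched and parsed Nutch 1.x segment usually holds (`content` may be legitimately absent if content storing is disabled).

```python
from pathlib import Path

# Assumption: subdirectories a fetched and parsed Nutch 1.x segment usually has.
EXPECTED_PARTS = (
    "content",
    "crawl_fetch",
    "crawl_generate",
    "crawl_parse",
    "parse_data",
    "parse_text",
)


def incomplete_segments(segments_dir):
    """Return {segment name: [missing subdirectories]} for incomplete segments."""
    missing = {}
    for segment in sorted(Path(segments_dir).iterdir()):
        if not segment.is_dir():
            continue  # ignore stray files next to the segment directories
        absent = [part for part in EXPECTED_PARTS
                  if not (segment / part).is_dir()]
        if absent:
            missing[segment.name] = absent
    return missing
```

Calling `incomplete_segments("/opt/nutch_1_4/data/crawl/segments")` once before and once after the merge should show whether the missing parse_data is introduced by mergesegs or was already absent from one of the input segments.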
On 08/01/2012 14:26, Dean Pullen wrote:
> No Lewis, -linkdb was already being used for the solrindex command, so
> we still have the same problem.
>
> Many thanks,
>
> Dean
>
> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>> Hi Dean, is this sorted?
>>
>> On Saturday, January 7, 2012, Dean Pullen <[EMAIL PROTECTED]> wrote:
>>> Sorry, you did mean on solrindex - which I already do...
>>>
>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>
>>> The -linkdb param isn't in the invertlinks docs
>>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>> (However it is in the solrindex docs)
>>>
>>> Adding it makes no difference to invertlinks.
>>>
>>> I think the problem is definitely with mergesegs, as opposed to
>>> invertlinks etc.
>>> Thanks again,
>>>
>>> Dean.
>>>
>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>
>>> OK so now I think we're at the bottom of it. If you wish to create a
>>> linkdb in Nutch >= 1.4 you need to specifically pass the linkdb
>>> parameter. This was implemented as not everyone wishes to create a
>>> linkdb.
>>>
>>> Your invertlinks command should be passed as follows
>>>
>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>> /path/to/segment/dirs
>>> then
>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>> path/to/linkdb -dir path/to/segment/dirs
>>>
>>> If you do not pass -linkdb path/to/linkdb explicitly, an exception
>>> will be thrown, as the linkdb is now treated as a segment directory.
>>>
>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen <[EMAIL PROTECTED]> wrote:
>>> Only this:
>>>
>>> 2012-01-06 17:15:47,972 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>> 2012-01-06 17:15:48,692 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2012-01-06 17:15:51,566 INFO  crawl.LinkDb - LinkDb: starting at 2012-01-06 17:15:51
>>> 2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>> 2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: URL normalize: true
>>> 2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: URL filter: true
>>> 2012-01-06 17:15:51,576 INFO  crawl.LinkDb - LinkDb: adding segment: file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>
>>> 2012-01-06 17:15:52,714 INFO  solr.SolrIndexer - SolrIndexer:
>>> starting at
>>> 2012-01-06 17:15:52
>>> 2012-01-06 17:15:52,782 INFO  indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>> crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>> 2012-01-06 17:15:52,782 INFO  indexer.IndexerMapReduce -
>>> IndexerMapReduce: