I just tried it and confirmed that index-basic has no bearing on the crawl db.
Below is the log output from the injector and the crawl db reader. Apart from
the core extension points, only two plugins are registered: protocol-http and
lib-http. After injection the crawldb has one entry, which is the same URL as
in my seed list.
2011-10-14 15:30:03,683 INFO crawl.Injector - Injector: starting at 2011-10-14 15:30:03
2011-10-14 15:30:03,684 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2011-10-14 15:30:03,684 INFO crawl.Injector - Injector: urlDir: urls
2011-10-14 15:30:03,684 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2011-10-14 15:30:04,041 INFO plugin.PluginRepository - Plugins: looking in: /home/markus/projects/apache/nutch/trunk/runtime/local/plugins
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Registered Plugins:
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Registered Extension-Points:
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-10-14 15:30:04,132 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-10-14 15:30:04,132 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-10-14 15:30:04,132 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-10-14 15:30:04,946 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2011-10-14 15:30:05,160 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-10-14 15:30:06,104 INFO crawl.Injector - Injector: finished at 2011-10-14 15:30:06, elapsed: 00:00:02
2011-10-14 15:30:08,727 INFO crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb/
2011-10-14 15:30:08,836 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb/
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - TOTAL urls: 1
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - retry 0: 1
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - min score: 1.0
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - avg score: 1.0
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - max score: 1.0
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 1
2011-10-14 15:30:10,053 INFO crawl.CrawlDbReader - CrawlDb statistics: done
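
For anyone who wants to repeat this, here is a minimal sketch of a plugin
setting that should produce the registered plugin list shown in the log. It is
an assumption, not a copy of the configuration used for this run: it supposes
that plugin.includes is overridden in conf/nutch-site.xml and that lib-http is
pulled in automatically as a dependency of protocol-http (auto-activation mode
is reported as true in the log).

```xml
<!-- conf/nutch-site.xml (illustrative; the exact value used for the run above is not shown in this mail) -->
<property>
  <name>plugin.includes</name>
  <!-- index-basic is deliberately left out; only the HTTP protocol plugin is listed -->
  <value>protocol-http</value>
</property>
```

With that in place, running the injector against crawl/crawldb with the urls
directory, followed by the crawl db reader with -stats, should report one
db_unfetched entry for a seed list containing a single URL.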
On Friday 14 October 2011 15:23:00 Radim Kolar wrote:
> try it yourself. in 1.4 remove index-basic from list of included
> plugins, then run nutch inject in hadoop mode and you will get 0 rows on
> first map output.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17050-8536620 / 06-50258350