Re: Nutch as crawler for text analysis: setup? version?
Markus Jelsma 2012-03-09, 15:33
Behemoth consumes Nutch 1.x segments and can push them, among others, to GATE. Nutch
comes with its own Tika parser.
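For reference, the segment-to-GATE route via Behemoth can be sketched roughly as below. The jar names, driver class names, and argument order are assumptions recalled from the Behemoth project and should be checked against its documentation:

```shell
# Sketch only: jar names, class names and argument order are assumptions
# taken from the Behemoth project -- verify against its own docs.

# 1. Convert a Nutch 1.x segment into a Behemoth corpus
hadoop jar behemoth-io-job.jar \
  com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob \
  crawl/segments/20120309123456 behemoth-corpus

# 2. Run a GATE application over the corpus
hadoop jar behemoth-gate-job.jar \
  com.digitalpebble.behemoth.gate.GATEDriver \
  behemoth-corpus gate-output my-gate-app.zip
```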
On Friday 09 March 2012 16:19:03 Piet van Remortel wrote:
> Hi all,
> Pretty new to Nutch. We are trying to create a setup where Nutch repeatedly
> crawls a selected set of web pages and feeds the content into a pipeline for
> text analysis etc. (e.g. Nutch, Tika, GATE, ...)
> We are unclear about what setup/version/approach to use for this. To be
> honest, the plethora of snippets of (outdated?) docs doesn't help in getting
> a clear view of things.
> The major hurdle seems to be flexible access to the crawled content, both
> from a search perspective (finding mentions of certain words) and from a
> systematic one (e.g. database queries to process pages in batch).
> Apart from Solr queries, the only way seems to be dumping the segments with
> the SegmentReader and processing those.
> But access to the segments seems cumbersome, slow, and not very flexible to
> integrate into a larger setup.
> I was happy to see the GORA access to e.g. MySQL in Nutch 2.0, but now that
> seems to all have been side-tracked. I got crawled pages into MySQL in 15
> minutes, which is great! I don't see what the alternative for a setup like
> that is in Nutch 1.4.
> Alternatives for writing to MySQL from Nutch 1.4 seem less straightforward,
> as mentioned (extending Nutch so that the NutchPage gets written to Solr and
> diverting it to MySQL .. ? There must be a better way.)
> Could somebody with experience in these kinds of setups advise on what
> direction we should consider going?
> I would like a flexible setup where Nutch can run continuously, being fed
> new seed URLs over time, with flexible and efficient access to the crawled
> results so we can integrate it into a larger setup.
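The continuous setup described above is typically scripted as the classic Nutch 1.x inject/generate/fetch/parse/updatedb cycle; a minimal sketch, where the crawldb, segments, and seed paths are examples:

```shell
# Minimal Nutch 1.x crawl cycle; crawldb/segments/urls paths are examples.
# New seed URLs can be injected at any time from the urls/ directory.
bin/nutch inject crawl/crawldb urls/

while true; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"   # fold results back into the crawldb
done
```

At the end of each cycle the freshly parsed segment can be handed to the downstream pipeline (e.g. dumped with SegmentReader or converted for GATE) before the next generate/fetch round starts.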
> thanks !
Markus Jelsma - CTO - Openindex