Nutch, mail # user - Nutch as crawler for text analysis: setup ? version ?


Re: Nutch as crawler for text analysis: setup ? version ?
Markus Jelsma 2012-03-09, 15:33
Behemoth [1] can ingest Nutch 1.x segments and push them to GATE, among
other tools. Nutch comes with its own Tika parser.

[1]: https://github.com/jnioche/behemoth
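
If you stay with plain segment dumps for now, the text output of
`bin/nutch readseg -dump` can be split into per-page records with a few lines
of Python. This is only a minimal sketch: it assumes each record starts with a
"Recno:: <n>" line followed by "URL:: <url>", and that the parsed text sits
under a "ParseText::" header. The function name iter_records is made up for
illustration, so check the layout against a dump from your own Nutch version.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional

# Assumed section headers in a readseg text dump (verify against your dump).
SECTION_RE = re.compile(r"^(CrawlDatum|Content|ParseData|ParseText)::\s*$")


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {'url': ..., 'text': ...} for each record in a segment dump."""
    url: Optional[str] = None       # URL of the record being read
    section: Optional[str] = None   # name of the section being read
    text: List[str] = []            # lines collected under ParseText::
    in_record = False

    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("Recno:: "):
            # A new record begins: emit the previous one, if any.
            if in_record:
                yield {"url": url or "", "text": "\n".join(text).strip()}
            url, section, text, in_record = None, None, [], True
        elif not in_record:
            continue  # skip anything before the first record
        elif section is None and line.startswith("URL:: "):
            url = line[len("URL:: "):].strip()
        else:
            match = SECTION_RE.match(line)
            if match:
                section = match.group(1)
            elif section == "ParseText":
                text.append(line)

    if in_record:
        yield {"url": url or "", "text": "\n".join(text).strip()}
```

Something like `for page in iter_records(open("dump")): ...` would then hand
each page to the next step of the pipeline, where "dump" stands for whatever
file readseg wrote. It does nothing about the speed of dumping, so Behemoth
is still the more direct route to GATE.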

cheers

On Friday 09 March 2012 16:19:03 Piet van Remortel wrote:
> Hi all,
>
> I am pretty new to Nutch and am trying to create a setup where Nutch
> repeatedly crawls a selected set of webpages, to feed the content into a
> pipeline for text analysis etc. (e.g. Nutch, Tika, GATE, ...)
>
> We are unclear about what setup/version/approach to use for this.   To be
> honest, the plethora of snippets of (outdated?) docs don't help in getting
> a clear view on things.
>
> The major hurdle seems to be flexible access to the crawled content,
> both from a search point of view (mentions of certain words) and from a
> systematic one (e.g. database queries to process pages in batch).
> Besides Solr queries, the only way seems to be dumping the segments with
> the SegmentReader and processing those.
> But access to the segments seems cumbersome, slow, and not very flexible
> to integrate into a larger setup.
>
> I was happy to see the Gora access to e.g. MySQL in Nutch 2.0, but now that
> seems to have been side-tracked.  I got crawled pages in MySQL in 15
> minutes, which is great!  I don't see what the alternative for a setup
> like that is in Nutch 1.4?
>
> Alternatives for writing to MySQL from Nutch 1.4 seem less straightforward
> (extending Nutch at the point where the NutchPage gets written to Solr and
> diverting to MySQL?  There must be a better way.)
>
> Could somebody with some experience in these kinds of setups advise in what
> direction we should consider going?
>
> I would like a flexible setup where Nutch can run continuously, being fed
> new seed URLs over time, with flexible and efficient access to the
> crawled results so that it can be integrated into a larger setup.
>
> thanks !
>
> pvremort

--
Markus Jelsma - CTO - Openindex