Nutch, mail # dev - get rid of outlink code for Tika


Re: get rid of outlink code for Tika
Mattmann, Chris A 2011-12-21, 15:42
+1 from me -- those 3 Tika content handlers should take care of it...

Cheers,
Chris

On Dec 21, 2011, at 6:51 AM, Markus Jelsma wrote:

> Hi,
>
> To use Boilerpipe we need LinkCH, BoilerpipeCH and TeeCH in Tika. LinkCH
> returns all URLs with some metadata such as the title. Fixes for old parsers
> such as Neko are then obsolete.
>
> I propose to rely on Tika for all outlinks. Right now this means not all link
> types are returned: area, form and others are missing. Is this a big problem?
> Rel is also not returned, but I patched Tika to do that so we can still do
> something with nofollow, which is important.
>
> Thanks
>
> --
> Markus Jelsma - CTO - Openindex
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
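
A rough sketch of the outlink handling discussed in this thread, for illustration only. It does not use Tika or Nutch: it relies on the JDK's SAX parser, and the names OutlinkSketch, Outlink, AnchorCollector, extract and followable are all hypothetical. It assumes well-formed XHTML input, collects only <a href> links with their title and rel attributes (so area and form targets are skipped, as described above), and then drops links marked nofollow.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OutlinkSketch {

    /** Hypothetical link record: target URI plus the title and rel metadata. */
    public static final class Outlink {
        public final String uri;
        public final String title;
        public final String rel;

        Outlink(String uri, String title, String rel) {
            this.uri = uri;
            this.title = title;
            this.rel = rel;
        }

        /** True when the whitespace-separated rel value contains "nofollow". */
        public boolean isNoFollow() {
            if (rel == null) {
                return false;
            }
            String[] tokens = rel.trim().toLowerCase(Locale.ROOT).split("\\s+");
            return Arrays.asList(tokens).contains("nofollow");
        }
    }

    /** SAX handler that records every <a> element carrying an href attribute. */
    static final class AnchorCollector extends DefaultHandler {
        final List<Outlink> links = new ArrayList<>();

        @Override
        public void startElement(String ns, String local, String qName, Attributes atts) {
            String href = atts.getValue("href");
            // Only anchors are collected; area, form etc. are ignored on purpose.
            if ("a".equalsIgnoreCase(qName) && href != null) {
                links.add(new Outlink(href, atts.getValue("title"), atts.getValue("rel")));
            }
        }
    }

    /** Parses well-formed XHTML and returns all anchor links found. */
    public static List<Outlink> extract(String xhtml) throws Exception {
        AnchorCollector collector = new AnchorCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.links;
    }

    /** Returns only the links that are not marked rel="nofollow". */
    public static List<Outlink> followable(List<Outlink> links) {
        List<Outlink> kept = new ArrayList<>();
        for (Outlink link : links) {
            if (!link.isNoFollow()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href=\"http://example.org/a\" title=\"A\">a</a>"
                + "<a href=\"http://example.org/b\" rel=\"nofollow\">b</a>"
                + "<area href=\"http://example.org/c\"/>"
                + "</body></html>";
        for (Outlink link : followable(extract(page))) {
            System.out.println(link.uri + " title=" + link.title);
        }
    }
}
```

In a real Nutch parse plugin the link list would presumably come from Tika's link content handler rather than a hand-written SAX handler; only the nofollow filtering step would carry over more or less as written.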