Nutch, mail # user - publishedDate and feed plugin


Re: publishedDate and feed plugin
Shameema Umer 2012-06-08, 17:32
Hi Lewis, things are clear now, but I am upset that I cannot find a way to
determine the age of a web page with Nutch. I thought publishedDate from the
feed plugin would help. If I change the field name from publishedDate to
*pubDate*, will this help?

Thanks
Shameema
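
For anyone following the thread: before renaming fields, it may help to check whether the feed's items carry a publication date at all. Below is a minimal sketch outside Nutch, assuming an RSS 2.0 feed whose items use the standard `pubDate` element in RFC 822 format. The sample feed and the function names (`item_pub_dates`, `age_in_days`) are hypothetical illustrations, not part of Nutch or the feed plugin.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed (not from the thread): one item has pubDate, one does not.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Dated item</title>
      <link>http://example.com/a</link>
      <pubDate>Fri, 08 Jun 2012 13:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Undated item</title>
      <link>http://example.com/b</link>
    </item>
  </channel>
</rss>"""


def item_pub_dates(rss_text):
    """Return {link: datetime or None} for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    dates = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        raw = item.findtext("pubDate")
        # Items without pubDate map to None instead of raising an error.
        dates[link] = parsedate_to_datetime(raw.strip()) if raw else None
    return dates


def age_in_days(published, now):
    """Whole days elapsed between the publication time and `now`."""
    return (now - published).days


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for link, published in item_pub_dates(SAMPLE_RSS).items():
        if published is None:
            print(link, "has no pubDate")
        else:
            print(link, "is", age_in_days(published, now), "days old")
```

If every item comes back as `None`, renaming the index field is unlikely to help, because there is no date in the feed to extract in the first place.
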
On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <
[EMAIL PROTECTED]> wrote:

> Hi,
>
> No, this should not be necessary. The feed parser and accompanying
> indexing filter should extract and send (to be indexed) the following
> metadata items:
> Author, Tags, Published date, Updated date and Feed.
>
> There is a problem though...
>
> With many feeds, including the bbci one you provided in another
> thread, many of these fields are absent, so the parser and indexing
> filter have nothing to work with and subsequently leave these
> fields out.
>
> It is also important to note that in parse-plugins.xml we first try to
> parse the application/rss+xml mimetype with parse-tika before feed...
> I can only assume this is because parse-tika produces slightly better
> results for this mimetype. Let me explain.
>
> With the language identifier included and parse-plugins overridden to
> parse rss+xml solely with the feed plugin, I get:
>
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> contentType: application/rss+xml
> content :
> host :  feeds.feedburner.com
> tstamp :        Fri Jun 08 14:04:04 BST 2012
> lang :  unknown
> url :   http://feeds.feedburner.com/gov/GCC?format=xml
>
> However, with parse-tika used for the same fetch, I get:
>
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> contentType: application/rss+xml
> content :       Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
> title : Glasgow City Council - News Feed
> host :  feeds.feedburner.com
> tstamp :        Fri Jun 08 14:04:25 BST 2012
> lang :  en
> url :   http://feeds.feedburner.com/gov/GCC?format=xml
>
> Please note that this feed does not include info like publishedDate,
> updatedDate etc., instead offering other means of expressing (some of)
> this information. In the above case, as the parse data is not present
> for the required feed fields (or, for argument's sake, for parse-tika),
> these fields are not included in our subsequent index fields.
>
> I hope this clears things up a bit.
>
> On a side note, there are also some things to pick up from the above
> excerpts from some tests:
> 1) The feed plugin fails to recognize the content, title and lang fields,
> whereas parse-tika does this successfully.
> 2) Even though parse-tika DOES utilise the language-identifier to
> recognize the lang field and provide a value, it fails to include the
> full value, which should be lang="en-GB" as opposed to lang="en".
>
> Can anyone chime in on what the current state of affairs is with
> delegation of language detection to parse-tika, or whether this is
> already the case but needs to be patched to accommodate the scenario I
> provide above?
>
> Thanks
>
> Lewis
>
> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[EMAIL PROTECTED]> wrote:
> > Hi Lewis,
> >
> > My solrindex-mapping contains
> > <mapping>
> >        <!-- Simple mapping of fields created by Nutch IndexingFilters
> >             to fields defined (and expected) in Solr schema.xml.
> >
> >             Any fields in NutchDocument that match a name defined
> >             in field/@source will be renamed to the corresponding
> >             field/@dest.
> >             Additionally, if a field name (before mapping) matches
> >             a copyField/@source then its values will be copied to
> >             the corresponding copyField/@dest.