Shameema Umer
2012-06-07, 10:41
Lewis John Mcgibbney
2012-06-07, 11:02
Shameema Umer
2012-06-08, 04:07
Lewis John Mcgibbney
2012-06-08, 13:18
Shameema Umer
2012-06-08, 17:32
Lewis John Mcgibbney
2012-06-09, 08:04
Shameema Umer
2012-06-09, 10:43
Shameema Umer
2012-06-13, 12:52
Shameema Umer
2012-06-13, 12:58
Shameema Umer
2012-06-14, 06:04
Lewis John Mcgibbney
2012-06-14, 12:41
Shameema Umer
2012-06-16, 10:11
-
publishedDate and feed plugin
Shameema Umer
2012-06-07, 10:41
In my schema there are certain fields used for the feed plugin:

  <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
  <field name="author" type="string" stored="true" indexed="true"/>
  <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
  <field name="feed" type="string" stored="true" indexed="true"/>
  <field name="publishedDate" type="date" stored="true" indexed="true"/>
  <field name="updatedDate" type="date" stored="true" indexed="true"/>

I have included the feed plugin in nutch-site.xml. The feed file is fetched and parsed, and the links in it work properly, but I cannot get publishedDate working: I cannot retrieve the publishedDate field or sort by it.

Please help.
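For context, "including the feed plugin" normally means adding it to the plugin.includes property. The following is a minimal sketch, assuming an otherwise stock Nutch 1.x plugin list; every plugin name other than feed is an assumption and is not taken from the poster's configuration.

```xml
<!-- conf/nutch-site.xml (illustrative fragment).
     Assumption: the value is the stock Nutch 1.x default with "feed" added;
     adapt it to the plugins actually in use. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|feed|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The feed plugin bundles both the parser and the indexing filter, so a single entry should be enough. Whether the feed parser is actually chosen for a given mimetype is decided separately in parse-plugins.xml.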
-
Re: publishedDate and feed plugin
Lewis John Mcgibbney
2012-06-07, 11:02
The best way to test this is by doing ad-hoc parsechecker fetches. Also try including this value in your Solr mapping file.

-- Lewis
-
Re: publishedDate and feed plugin
Shameema Umer
2012-06-08, 04:07
Hi Lewis,

My solrindex-mapping contains:

  <mapping>
    <!-- Simple mapping of fields created by Nutch IndexingFilters
         to fields defined (and expected) in Solr schema.xml.

         Any fields in NutchDocument that match a name defined
         in field/@source will be renamed to the corresponding
         field/@dest.
         Additionally, if a field name (before mapping) matches
         a copyField/@source then its values will be copied to
         the corresponding copyField/@dest.

         uniqueKey has the same meaning as in Solr schema.xml
         and defaults to "id" if not defined.
    -->
    <fields>
      <field dest="content" source="content"/>
      <field dest="site" source="site"/>
      <field dest="title" source="title"/>
      <field dest="host" source="host"/>
      <field dest="segment" source="segment"/>
      <field dest="boost" source="boost"/>
      <field dest="digest" source="digest"/>
      <field dest="tstamp" source="tstamp"/>
      <field dest="publishedDate" source="publishedDate"/>
      <field dest="id" source="url"/>
      <copyField source="url" dest="url"/>
    </fields>
    <uniqueKey>id</uniqueKey>
  </mapping>

Do I need to edit any source code of the feed plugin to make this publishedDate available?

Thanks
Shameema
-
Re: publishedDate and feed plugin
Lewis John Mcgibbney
2012-06-08, 13:18
Hi,

No, this should not be necessary. The feed parser and accompanying indexing filter should extract and send (to be indexed) the following metadata items: author, tags, published date, updated date and feed.

There is a problem though...

With many feeds, including the bbci one you provided in another thread, many of these fields are absent, so the parser and indexing filter cannot fill them in on our behalf and subsequently leave these fields out.

It is also important to note that in parse-plugins.xml we first try to parse the application/rss+xml mimetype with parse-tika before feed... I can only assume this is because parse-tika produces slightly better results for this mimetype. Let me explain.

With the language identifier included and parse-plugins overridden to parse rss+xml solely with the feed plugin I get

  lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
  fetching: http://feeds.feedburner.com/gov/GCC?format=xml
  parsing: http://feeds.feedburner.com/gov/GCC?format=xml
  contentType: application/rss+xml
  content :
  host : feeds.feedburner.com
  tstamp : Fri Jun 08 14:04:04 BST 2012
  lang : unknown
  url : http://feeds.feedburner.com/gov/GCC?format=xml

however with parse-tika initiated and the same fetch I get

  lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
  fetching: http://feeds.feedburner.com/gov/GCC?format=xml
  parsing: http://feeds.feedburner.com/gov/GCC?format=xml
  contentType: application/rss+xml
  content : Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
  title : Glasgow City Council - News Feed
  host : feeds.feedburner.com
  tstamp : Fri Jun 08 14:04:25 BST 2012
  lang : en
  url : http://feeds.feedburner.com/gov/GCC?format=xml

Please note that this feed does not include info like publishedDate, updatedDate etc., instead offering other means of expressing (some of) this information. In the above case, as the parse data is not present for the required feed fields (or, for argument's sake, for parse-tika), these fields are not included in our subsequent index fields.

I hope this clears things up a bit.

On a side note, some things to pick up from the above excerpts from some tests:
1) The feed plugin fails to recognize the content, title and lang fields where parse-tika does this successfully.
2) Even though parse-tika DOES utilise the language-identifier to recognize the lang field and provide a value, it fails to include the full value, which should be lang="en-GB" as opposed to lang="en".

Can anyone chime in on what the current state of affairs is with delegation of language detection to parse-tika, or whether this is already the case but needs to be patched to accommodate the scenario I provide above?

Thanks

Lewis
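The parse-plugins override mentioned in this message could look roughly like the fragment below. It is a sketch only, assuming the stock conf/parse-plugins.xml layout; the alias extension-id is quoted from memory of the stock file and should be checked against your own copy, and a real file contains many more mimeType entries.

```xml
<!-- conf/parse-plugins.xml (illustrative fragment, not a complete file) -->
<parse-plugins>

  <!-- The stock configuration lists parse-tika first and feed second for
       this mimetype; listing only feed forces the feed parser to be used. -->
  <mimeType name="application/rss+xml">
    <plugin id="feed" />
  </mimeType>

  <!-- Assumption: this alias already exists in the stock file. It maps the
       plugin id to the parser extension that implements it. -->
  <aliases>
    <alias name="feed" extension-id="org.apache.nutch.parse.feed.FeedParser" />
  </aliases>

</parse-plugins>
```

Restoring the parse-tika line above the feed line gives back the default ordering, which is the second indexchecker run shown in the message.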
-
Re: publishedDate and feed plugin
Shameema Umer
2012-06-08, 17:32
Hi Lewis, things are clear now. I am upset that I cannot find a means to determine the age of a web page with Nutch. I thought publishedDate from the feed plugin would help. If I change the field name from publishedDate to pubDate, will this help?

Thanks
Shameema
-
Re: publishedDate and feed plugin
Lewis John Mcgibbney
2012-06-09, 08:04
Hi Shameema,

I think this depends directly on what tags/elements are within the feed(s). From the feeds I looked at yesterday the relevant tags appeared to be missing. I was surprised that Tika didn't pick up more, so I think I'll head over and see exactly what the Tika 1.1 source looks like for the rss+xml parser.

In the meantime the feed plugin packaged with Nutch WILL parse and index these additional fields if they are present, but will not if they are absent.

Lewis
-
Re: publishedDate and feed plugin
Shameema Umer
2012-06-09, 10:43
Thanks Lewis.
-
Re: publishedDate and feed plugin
Shameema Umer
2012-06-13, 12:52
Hi,

I have been trying for days to find a way to retrieve the <pubDate> value of a feed. Even when the value is present in a feed, Nutch is not parsing it and sending it along with the outlinks.

The feed plugin is included, but it is not populating a value in the publishedDate field. Could somebody please give me hints about where I went wrong?

Or please let me know if it is not possible.

Thanks
Shameema
-
Re: publishedDate and feed plugin
Shameema Umer
2012-06-13, 12:58
I tried parsechecker and confirmed that no value is retrieved for publishedDate.
-
Re: publishedDate and feed plugin
Shameema Umer
2012-06-14, 06:04
Hi Lewis,

The feed you provided, http://feeds.feedburner.com/gov/GCC?format=xml, has the pubDate tag. Then why is it not parsed? Please explain.

What I need is the value of pubDate pulled into any of our date fields.

Thanks
Shameema
-
Re: publishedDate and feed plugin Lewis John Mcgibbney 2012-06-14, 12:41
Depending on what the tag looks like, it will be interpreted accordingly by the feed parser.
My instinct is that there is a difference between how pubDate and publishedDate are parsed and identified by the parser; however, the question then arises as to how/why the field is not identified as a tag.
I will try to do more digging. It might be worth looking at the feed source as well.
Best
Lewis
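For looking at the feed source, a small standalone check along these lines might help. It is only a sketch run outside Nutch: it does not reproduce what the feed plugin or parse-tika do internally. It assumes an RSS 2.0 layout, the embedded sample feed and the function names are made up for illustration, and the target format is the UTC form that a Solr date field accepts. It reports whether pubDate appears at channel level, at item level, or not at all.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; substitute the XML saved from the feed being crawled.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <pubDate>Thu, 14 Jun 2012 06:04:00 GMT</pubDate>
    <item>
      <title>First entry</title>
      <link>http://example.com/first</link>
      <pubDate>Wed, 13 Jun 2012 12:52:00 +0530</pubDate>
    </item>
    <item>
      <title>Entry without a date</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def to_solr_date(rfc822_text):
    """Convert an RSS (RFC 822 style) date string to UTC as YYYY-MM-DDThh:mm:ssZ."""
    parsed = parsedate_to_datetime(rfc822_text.strip())
    if parsed.tzinfo is None:
        # No usable offset in the source string: treat the value as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def find_pub_dates(rss_text):
    """Return (channel_date, [(item_link, item_date), ...]); a date is None if absent."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    if channel is None:
        return None, []
    channel_raw = channel.findtext("pubDate")
    channel_date = to_solr_date(channel_raw) if channel_raw else None
    items = []
    for item in channel.findall("item"):
        raw = item.findtext("pubDate")
        items.append((item.findtext("link"), to_solr_date(raw) if raw else None))
    return channel_date, items


if __name__ == "__main__":
    channel_date, items = find_pub_dates(SAMPLE_RSS)
    print("channel pubDate:", channel_date)
    for link, date in items:
        print(link, "->", date)
```

If the items in the real feed come back with a date here while publishedDate stays empty in parsechecker, that would point at the parsing or indexing side rather than at the feed itself.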
-
Re: publishedDate and feed plugin Shameema Umer 2012-06-16, 10:11
Yes, please. Please explore why the pubDate tag is not parsed and indexed.
Thanks
Shameema
|