>
>
> Well, in o.a.n.metadata.Nutch some brief Javadocs for the caching
> fields mention the following:
>
> static String CACHING_FORBIDDEN_ALL
> Don't show either original forbidden content or summaries.
> static String CACHING_FORBIDDEN_CONTENT
> Don't show original forbidden content, but show summaries.
> static String CACHING_FORBIDDEN_KEY
> Sites may request that search engines don't provide access
> to cached documents.
> static org.apache.avro.util.Utf8 CACHING_FORBIDDEN_KEY_UTF8
>
> static String CACHING_FORBIDDEN_NONE
> Show both original forbidden content and summaries (default).
>
> I understand that caching data is held within and concerns metadata
> (in trunk it is parse.getData().getMeta())
It does not concern metadata as such. We store as metadata the caching
policies that are specified in the HTML pages
(http://www.i18nguy.com/markup/metatags.html), then store the policy in the
"cache" field:
    // add cached content/summary display policy, if available
    String caching = parse.getData().getMeta(Nutch.CACHING_FORBIDDEN_KEY);
    if (caching != null && !caching.equals(Nutch.CACHING_FORBIDDEN_NONE)) {
      doc.add("cache", caching);
    }
I expect that this was then used by our search web app to determine whether
we could display the cached content or not.
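As a rough illustration, a front end could interpret the indexed "cache"
value along these lines. This is only a sketch: the CachePolicy class and its
method names are hypothetical, and the literal values "none", "content" and
"all" are assumed to match the Nutch.CACHING_FORBIDDEN_* constants.

```java
// Hypothetical helper, not part of Nutch: interprets the value of the
// indexed "cache" field on the search side.
final class CachePolicy {
  // Assumed to mirror Nutch.CACHING_FORBIDDEN_NONE / _CONTENT / _ALL.
  static final String FORBIDDEN_NONE = "none";
  static final String FORBIDDEN_CONTENT = "content";
  static final String FORBIDDEN_ALL = "all";

  private CachePolicy() {}

  // The indexing code only adds the field when a restriction exists,
  // so a missing (null) value is treated as "no restriction".
  static boolean mayShowCachedContent(String cacheField) {
    return cacheField == null || cacheField.equals(FORBIDDEN_NONE);
  }

  // Summaries are hidden only when the policy forbids everything.
  static boolean mayShowSummary(String cacheField) {
    return !FORBIDDEN_ALL.equals(cacheField);
  }

  public static void main(String[] args) {
    for (String policy : new String[] {null, "none", "content", "all"}) {
      System.out.println(policy + " -> cached=" + mayShowCachedContent(policy)
          + ", summary=" + mayShowSummary(policy));
    }
  }
}
```

With this reading, "content" hides the cached copy but still allows
summaries, while "all" hides both.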
> but I still have no idea the
> characteristics of the cache data, why this would be valuable for an
> index. I personally have never queried for it before in my index.
>
We do not store the cached content as a field, just the policy. Caching can
be useful for an index, e.g. when the target server is down and you want to
have a peek at the content of the page.
Indexing the policy instead of the actual cached content is probably not so
relevant now that we've delegated indexing and search to Solr and ES. We
could of course add a binary field with the content so that web apps
querying the search backends could provide the cached copy if needed. We'd
need to enforce the caching policy at the indexing level and put some
restrictions on length etc.
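To make that last idea a bit more concrete, here is a minimal sketch of what
the check at indexing time might look like. Everything in it is an
assumption: the CacheFieldFilter class, the cacheableContent method and the
1 MB limit are hypothetical, and skipping oversized pages (rather than
truncating them) is just one possible choice.

```java
import java.util.Optional;

// Hypothetical indexing-side filter, not existing Nutch code: decides
// whether the raw page content may be stored in a binary cache field.
final class CacheFieldFilter {
  // Assumed to mirror Nutch.CACHING_FORBIDDEN_NONE.
  static final String FORBIDDEN_NONE = "none";
  // Illustrative default limit only.
  static final int DEFAULT_MAX_BYTES = 1024 * 1024;

  private CacheFieldFilter() {}

  // Returns the bytes to index, or empty if the policy forbids caching,
  // there is no content, or the content exceeds maxBytes.
  static Optional<byte[]> cacheableContent(String policy, byte[] raw, int maxBytes) {
    boolean allowed = policy == null || policy.equals(FORBIDDEN_NONE);
    if (!allowed || raw == null || raw.length == 0 || raw.length > maxBytes) {
      return Optional.empty();
    }
    // Defensive copy so later changes to the caller's array do not leak in.
    return Optional.of(raw.clone());
  }

  public static void main(String[] args) {
    byte[] page = new byte[2048];
    System.out.println(cacheableContent(null, page, DEFAULT_MAX_BYTES).isPresent());
    System.out.println(cacheableContent("all", page, DEFAULT_MAX_BYTES).isPresent());
  }
}
```

The indexer would then add the binary field only when the returned Optional
is present, so forbidden or oversized pages never reach the search backend.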
Makes sense?
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble