Tika, mail # dev - Extracting dublin core metadata in HtmlParser?


Re: Extracting dublin core metadata in HtmlParser?
Ken Krugler 2010-01-19, 14:01
Hi Nick,

On Jan 19, 2010, at 5:41am, Nick Burch wrote:

> Hi All
>
> I've been taking a look at the HtmlParser, and I can't spot anything  
> in there that extracts any of the dublin core metadata that could be  
> there. It seems that it's only things like location and encoding  
> that get set onto the metadata object. Nothing like description,  
> author etc seems to get set.

Only location & encoding are explicitly looked for, but all meta tag  
values get put into the metadata map.

See HtmlHandler.startElement(), where it has:

         if (bodyLevel == 0 && discardLevel == 0) {
             if ("META".equals(name) && atts.getValue("content") != null) {
                 if (atts.getValue("http-equiv") != null) {
                     metadata.set(
                             atts.getValue("http-equiv"),
                             atts.getValue("content"));
                 }
                 if (atts.getValue("name") != null) {
                     metadata.set(
                             atts.getValue("name"),
                             atts.getValue("content"));
                 }
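
Note, though, that the names defined in Tika's DublinCore enum seem to be
missing the "dc." prefix, so a lookup by those names presumably won't match
meta tags stored under keys such as "dc.title".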
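
To make the key mismatch concrete, here is a rough, dependency-free sketch of
the logic quoted above. It is not Tika code: a plain Map stands in for Tika's
Metadata object, and the class MetaTagSketch and the helpers metaAtts() and
lookupDublinCore() are hypothetical names made up for illustration. Only the
SAX Attributes/AttributesImpl classes from the JDK are real APIs here.

```java
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

class MetaTagSketch {

    // Mirrors the quoted logic: every <meta> with a content attribute is
    // stored under its http-equiv and/or name attribute. A plain Map is an
    // assumed stand-in for Tika's Metadata object.
    static void recordMeta(String name, Attributes atts, Map<String, String> metadata) {
        if ("META".equals(name) && atts.getValue("content") != null) {
            if (atts.getValue("http-equiv") != null) {
                metadata.put(atts.getValue("http-equiv"), atts.getValue("content"));
            }
            if (atts.getValue("name") != null) {
                metadata.put(atts.getValue("name"), atts.getValue("content"));
            }
        }
    }

    // Hypothetical helper: try the "dc."-prefixed key first, then the bare name.
    static String lookupDublinCore(Map<String, String> metadata, String element) {
        String value = metadata.get("dc." + element);
        return value != null ? value : metadata.get(element);
    }

    // Hypothetical helper: attributes of <meta name="..." content="...">.
    static Attributes metaAtts(String name, String content) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "name", "name", "CDATA", name);
        atts.addAttribute("", "content", "content", "CDATA", content);
        return atts;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        recordMeta("META", metaAtts("dc.title", "Example page"), metadata);

        // The value is stored under the tag's own name, prefix included.
        System.out.println(metadata.get("title"));               // prints "null"
        System.out.println(metadata.get("dc.title"));            // prints "Example page"
        System.out.println(lookupDublinCore(metadata, "title")); // prints "Example page"
    }
}
```

The fallback lookup is just one possible workaround; whether the prefix
belongs in the constants or in the lookup is a design question for the list.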

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g