Re: Extracting dublin core metadata in HtmlParser?
Ken Krugler 2010-01-19, 14:01
Hi Nick,
On Jan 19, 2010, at 5:41am, Nick Burch wrote:

> Hi All
>
> I've been taking a look at the HtmlParser, and I can't spot anything
> in there that extracts any of the dublin core metadata that could be
> there. It seems that it's only things like location and encoding
> that get set onto the metadata object. Nothing like description,
> author etc seems to get set.

Only location & encoding are explicitly looked for, but all meta tag values get put into the metadata map. See HtmlHandler.startElement(), where it has:

    if (bodyLevel == 0 && discardLevel == 0) {
        if ("META".equals(name) && atts.getValue("content") != null) {
            if (atts.getValue("http-equiv") != null) {
                metadata.set(
                    atts.getValue("http-equiv"),
                    atts.getValue("content"));
            }
            if (atts.getValue("name") != null) {
                metadata.set(
                    atts.getValue("name"),
                    atts.getValue("content"));
            }

Though the names defined in Tika's DublinCore enum seem to be missing the "dc." prefix.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
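For illustration, here is a rough sketch of how the two points above fit together: meta tags are stored under whatever name the page uses, so a lookup may need to try both the bare property name and a "dc."-prefixed one. It uses a plain Map as a stand-in for Tika's Metadata object so that it runs without Tika on the classpath. The class MetaTagSketch and its methods collect and lookupDublinCore are hypothetical names made up for this example, not Tika API, and the case-insensitive matching is an assumption about how pages spell the prefix.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
class MetaTagSketch {

    // Mirrors the quoted startElement() logic for one <meta> tag:
    // when content is present, it is stored under the http-equiv
    // value and/or the name value, whichever attributes exist.
    static void collect(Map<String, String> metadata,
                        String httpEquiv, String name, String content) {
        if (content == null) {
            return; // tags without a content attribute are skipped
        }
        if (httpEquiv != null) {
            metadata.put(httpEquiv, content);
        }
        if (name != null) {
            metadata.put(name, content);
        }
    }

    // Returns the first entry whose key matches either the bare
    // property name or "dc." + property, ignoring case (assumption).
    static String lookupDublinCore(Map<String, String> metadata, String property) {
        String bare = property.toLowerCase(Locale.ROOT);
        String prefixed = "dc." + bare;
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String key = entry.getKey().toLowerCase(Locale.ROOT);
            if (key.equals(prefixed) || key.equals(bare)) {
                return entry.getValue();
            }
        }
        return null; // nothing stored under either spelling
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        collect(metadata, null, "DC.creator", "Nick Burch");
        collect(metadata, "Content-Type", null, "text/html; charset=UTF-8");
        System.out.println(lookupDublinCore(metadata, "creator"));
    }
}
```

With a real Tika Metadata object the same idea would presumably be a pair of get() calls, one per spelling, but that is untested here.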