Hello again,
I want to extract specific meta tags from HTML pages, like:
<meta name="uniks-fb" value="fb16" />
But it seems that they aren't extracted by the parser. I dumped the
segment of a page (Since the readseg doesn't work for me :-/ ) and
inspected the values for this example page:
http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
This page contains these meta tags:
<meta name="uniks-fb" content="default" />
<meta name="keywords" content="Universität,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="robots" content="index" />
<meta name="DC.Description" content="Der Internetauftritt der Universität Kassel" />
<meta name="DC.Subject" content="Universität,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="generator" content="TYPO3 4.2 CMS" />
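As a point of comparison, here is a minimal Python sketch (standard library only, unrelated to Nutch's own parsers) of the name/content pairs one would expect a parser to pull out of these tags. The MetaCollector class is a hypothetical helper, and the fallback to a "value" attribute is an assumption made only because the first example above uses "value" instead of "content".

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Hypothetical helper: collects <meta name="..."> tags as name -> content."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            # Assumption: prefer the standard "content" attribute and fall
            # back to "value" when "content" is absent.
            self.meta[name] = attributes.get("content", attributes.get("value"))


HEAD = """
<meta name="uniks-fb" content="default" />
<meta name="keywords" content="Universität,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="robots" content="index" />
<meta name="DC.Description" content="Der Internetauftritt der Universität Kassel" />
<meta name="DC.Subject" content="Universität,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="generator" content="TYPO3 4.2 CMS" />
"""

collector = MetaCollector()
collector.feed(HEAD)
for name, content in collector.meta.items():
    print(f"{name} = {content}")
```

Six name/content pairs come out of this, which is the set one would hope to see somewhere in the parse metadata.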
But these tags don't appear in the segment dump shown below. I thought
I would find them in "Parse Metadata", but there are only these two values:
"CharEncodingForConversion=utf-8" "OriginalCharEncoding=utf-8"
My plugin.includes contains parse-(html|tika) as well as urlmeta.
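One way to confirm which keys actually reached "Parse Metadata" is to scan the dump text for that line. The following is a rough sketch, assuming the line holds space-separated key=value pairs with no spaces inside the values (true for the dump below, not guaranteed in general). parse_metadata_keys is a hypothetical helper, not a Nutch tool.

```python
def parse_metadata_keys(dump_text):
    """Return key -> value for every 'Parse Metadata:' line in a segment dump."""
    prefix = "Parse Metadata:"
    found = {}
    for line in dump_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Assumption: pairs are separated by whitespace and values contain none.
        for pair in line[len(prefix):].split():
            key, sep, value = pair.partition("=")
            if sep:
                found[key] = value
    return found


sample = "Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8"
keys = parse_metadata_keys(sample)
print(sorted(keys))
print("uniks-fb" in keys)
```

Run against the "Parse Metadata" line from the dump, this finds only the two encoding keys and no "uniks-fb" entry.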
Any suggestions what I am doing wrong?
THANK YOU!
Snippet from segment dump:
Recno:: 97
URL:: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:44:49 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.0
Signature: 7260839eaf4927f64b03dd86dcd0918a
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:51 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Dec 17 14:45:49 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1324289811219_pst_: exception(16), lastModified=0:
java.net.SocketTimeoutException: Read timed out
CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Mon Dec 19 12:25:59 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1324289811219_pst_: success(1), lastModified=0
Content::
Version: -1
url: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
base: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
contentType: application/xhtml+xml
metadata: Date=Mon, 19 Dec 2011 10:24:09 GMT Vary=Accept-Encoding
Content-Length=3886 Content-Encoding=gzip Via=1.0 cms.uni-kassel.de
_fst_=33 Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb;
path=/unicms/ nutch.segment.name=20111219111925
Content-Type=text/html;charset=utf-8 Connection=close
Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
X-Powered-By=PHP/5.2.0-8+etch16
Content:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="de">
(...)
</html>
ParseData::
Version: 5
Status: success(1,0)
Title: 2004 - Universität Kassel
Outlinks: 35
outlink: toUrl: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0#navigation anchor: Zur Hauptnavigation (Nutzergruppen-Navigation)
(...)
Content Metadata: Content-Length=3886 _fst_=33
Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb; path=/unicms/
nutch.segment.name=20111219111925 Connection=close Server=Apache/2.2.3
(Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c X-Powered-By=PHP/5.2.0-8+etch16
nutch.content.digest=7260839eaf4927f64b03dd86dcd0918a Date=Mon, 19 Dec
2011 10:24:09 GMT Vary=Accept-Encoding Content-Encoding=gzip Via=1.0
cms.uni-kassel.de Content-Type=text/html;charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
ParseText::
2004 - Universität Kassel Zur Hauptnavigation (Nutzergruppen-Navigation)
. Zur Unternavigation . Zum Inhalt . Zu verwandten Links und
Informationen . Infos für: Universität Studium Forschung Fachbereiche
Einrichtungen International students and scholars Sie befinden sich
hier: HFK > Ehemalige Mitarbeiter > Früchting > Liste der
Veröffentlichungen > 2004 Veröffentlichungen im Fachgebiet
Hochfrequenztechnik/Kommunikationssysteme 2004: [113] Semmelrodt, S.;
Kattenbach, R.; Früchting, H.: Toolbox for Spectral Analysis and Linear
Prediction of Stationary and Non-Stationary Signals, COST 273 TD(04)019,
Athen, Greece, January 26-28, 2004 [114] Semmelrodt, S.: Maximum
Likelihood Based Parameter Estimation of Stationary and Non-Stationary
Multi-Component Signals, FREQUENZ 58 (2004) 1-2, S. 20-24. [115]
Semmelrodt, S.: Methoden zur prädiktiven Kan