RE: Exclude certain mime-types
Markus Jelsma 2012-05-18, 13:19
> From:Matthias Paul <[EMAIL PROTECTED]>
> Sent: Fri 18-May-2012 14:57
> To: [EMAIL PROTECTED]
> Subject: Exclude certain mime-types
> How can I exclude certain mime-types from crawling, for example Word documents?
> If I have parse-tika in plugin.includes it will parse them. Do I have
> to change parse-plugins.xml?
In parse-plugins.xml, remove the wildcard MIME type entry that is mapped to Tika, then explicitly map each MIME type you do want parsed to the appropriate parser, which is usually Tika.
Keep in mind that you have to map both text/html and application/xhtml+xml there if you need to parse HTML.
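
A minimal sketch of what that could look like in conf/parse-plugins.xml. The list of MIME types here (HTML, XHTML, PDF) is only an assumption for illustration, so substitute the types you actually want parsed, and check the element names against the file shipped with your Nutch version:

```xml
<parse-plugins>
  <!-- The wildcard entry is removed (or commented out) so that
       Tika is no longer the catch-all parser:
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  -->

  <!-- Map only the types you want parsed. Both HTML types are needed. -->
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Illustrative assumption: PDF is kept, Word is not. -->
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- No entry for application/msword, so it is not mapped to any parser. -->

  <!-- Leave the existing <aliases> section of your file unchanged. -->
</parse-plugins>
```

If you keep parse-html in plugin.includes, the two HTML entries can point to parse-html instead of parse-tika.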
> I can't exclude them in regex-urlfilter as the .doc extension is not
> present in the urls.