Tika, mail # dev - Getting started


Re: Getting started
Ken Krugler 2010-06-17, 16:25
Hi Arturo,

> Some of you already know that I'm working on a new parser
> (https://issues.apache.org/jira/browse/TIKA-443). After a whole day
> trying to set up a workspace for Eclipse, I implemented the typical
> "hello world" class, in the Tika Parser version. My problem now is how
> to configure Tika to call my new parser when a file with a specific
> extension (e.g. *.shp) is found. I read something about a
> configuration file (tika-config.xml) but I couldn't find it in the
> source code.

You first need to modify
tika-core/src/main/resources/tika-mimetypes.xml.

E.g. this is the kind of entry that was added for mailbox files:

   <mime-type type="application/mbox">
     <sub-class-of type="text/plain"/>
     <glob pattern="*.mbox"/>
   </mime-type>

That maps the suffix to the mime-type.
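
To make the lookup concrete, here is a minimal sketch in plain Java of what a glob entry like the one above amounts to. It is only a simplified model, not Tika's actual code: the class GlobMimeLookup and its detect method are hypothetical names, and it assumes every glob has the simple "*.suffix" form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of suffix-to-type lookup; NOT Tika's actual implementation.
class GlobMimeLookup {

    // Hypothetical registry: glob pattern -> mime-type, mirroring the XML entry.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        GLOBS.put("*.mbox", "application/mbox");
    }

    // Returns the type of the first "*.suffix" glob that matches the file name,
    // or application/octet-stream when no glob matches.
    static String detect(String fileName) {
        for (Map.Entry<String, String> entry : GLOBS.entrySet()) {
            String suffix = entry.getKey().substring(1); // "*.mbox" -> ".mbox"
            if (fileName.endsWith(suffix)) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(detect("inbox.mbox"));
        System.out.println(detect("roads.shp"));
    }
}
```

In this model a *.shp file falls through to the generic type until a matching glob entry is added, which is why the mimetypes file has to be edited first.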

Then you define a static SUPPORTED_TYPES field in your parser class
that lists the mime-types it supports.

E.g. for MboxParser:

public class MboxParser implements Parser {

     private static final Set<MediaType> SUPPORTED_TYPES =
             Collections.singleton(MediaType.application("mbox"));
     // ...
}
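
As a rough illustration of why the supported-types set matters, here is a sketch of how a parser could be picked by mime-type once each parser declares what it supports. This is an assumption-laden model in plain Java, not Tika's real classes: SketchParser, SketchMboxParser and ParserRegistry are hypothetical names, and mime-types are plain strings rather than MediaType objects.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parser; Tika's real Parser interface differs.
interface SketchParser {
    Set<String> supportedTypes();
}

// Mirrors the SUPPORTED_TYPES idea from MboxParser, using plain strings.
class SketchMboxParser implements SketchParser {
    private static final Set<String> SUPPORTED_TYPES =
            Collections.singleton("application/mbox");

    @Override
    public Set<String> supportedTypes() {
        return SUPPORTED_TYPES;
    }
}

// Hypothetical registry that indexes parsers by the types they declare.
class ParserRegistry {
    private final Map<String, SketchParser> byType = new HashMap<>();

    void register(SketchParser parser) {
        for (String type : parser.supportedTypes()) {
            byType.put(type, parser);
        }
    }

    // Returns the parser registered for the type, or null if there is none.
    SketchParser forType(String mimeType) {
        return byType.get(mimeType);
    }
}

class ParserRegistryDemo {
    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register(new SketchMboxParser());
        System.out.println(registry.forType("application/mbox") != null);
        System.out.println(registry.forType("application/unknown") != null);
    }
}
```

The design point is that the parser itself announces its types, so the suffix mapping and the parser selection stay decoupled.
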
-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g