Re: Tika command line performance
Ken Krugler 2010-01-15, 19:37
On Jan 15, 2010, at 11:27am, Doug Carter wrote:
> On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
>> On Jan 15, 2010, at 11:07am, Doug Carter wrote:
>>> Hi all,
>>> This may be off-topic for this list, but I need to start somewhere.
>>> I need a command line utility to do document format conversion, in a
>>> batch mode environment. The batch process is a combination of steps,
>>> one of which is the actual format conversion, which is currently being
>>> done by a collection of Linux binary converters like wvWare and pdftohtml.
>>> I've put a shell script wrapper around the tika jar:
>>> java -jar tika-app.jar [infile] > [outfile]
>>> This works OK, but as you would imagine, it is much slower than
>>> a Linux binary.
>>> Does anyone know of a way to improve the performance in a setup like
>>> this? I know it goes against the whole philosophy of Java, but is there
>>> a way to compile the Tika jar byte code into a native Linux binary? I've
>>> taken a look at gcj, but it doesn't look like a simple re-compile.
>>> Any ideas would be greatly appreciated.
>> If you have a set of documents, easiest would be to pass in a
>> directory to tika-app (extend it a bit) so that one invocation of the
>> JVM processes many documents.
> Hi Ken,
> I've considered something like this (for the exact reason you stated)
> but I don't have that flexibility with my current setup. Each document
> needs to go through a series of processing steps, one of which is the
> format conversion.
In that case, another cheesy solution is to have the Java process
watch a specific directory. Whenever a new file (with the appropriate
name format) appears, it gets processed. This Java process then
continues to run indefinitely as a kind of processing daemon.
You can avoid hand-off problems by using a name pattern, and renaming
the file when it's really ready for processing.
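A rough sketch of that watch-and-rename idea follows, under some assumptions
that are not part of the setup described above: producers write a file under
a temporary name and rename it to end in ".ready" once it is complete, and
the conversion step sits behind a hypothetical Converter hook where a Tika
call would be plugged in. The class name, the suffixes, and the one-second
polling interval are all made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

class ConversionDaemon {

    /** Hypothetical hook for the conversion step; a Tika call would go here. */
    interface Converter {
        String convert(Path input) throws Exception;
    }

    // Assumed naming convention: only files renamed to *.ready are picked up.
    static final String READY_SUFFIX = ".ready";

    /** Converts every file currently marked ready; returns how many succeeded. */
    static int processReadyFiles(Path inbox, Path outbox, Converter converter)
            throws IOException {
        // Snapshot the listing first so we do not modify the directory mid-iteration.
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(inbox, "*" + READY_SUFFIX)) {
            for (Path p : stream) {
                ready.add(p);
            }
        }
        int converted = 0;
        for (Path file : ready) {
            String name = file.getFileName().toString();
            String base = name.substring(0, name.length() - READY_SUFFIX.length());
            try {
                String text = converter.convert(file);
                Files.write(outbox.resolve(base + ".txt"),
                            text.getBytes(StandardCharsets.UTF_8));
                Files.delete(file);
                converted++;
            } catch (Exception e) {
                // Rename failures aside so they are not retried on every pass.
                Files.move(file, inbox.resolve(base + ".failed"),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return converted;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ConversionDaemon <inbox-dir> <outbox-dir>");
            return;
        }
        Path inbox = Paths.get(args[0]);
        Path outbox = Paths.get(args[1]);
        // Placeholder converter that just reads the file as UTF-8 text;
        // swap in the real extraction call here.
        Converter converter =
            p -> new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        while (true) {                 // runs until the process is killed
            processReadyFiles(inbox, outbox, converter);
            Thread.sleep(1000);        // assumed polling interval
        }
    }
}
```

The producer side would write to something like report.doc.tmp and then
rename it to report.doc.ready, so the daemon never sees a half-written file.
The JVM startup cost is paid once, when the daemon starts, rather than once
per document.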
There are lots of cleaner, more sophisticated systems involving
notification systems, queues, RESTful services, etc. which might be
more appropriate, depending on your needs.