Tika >> mail # dev >> Parser stability and ForkParser


Re: Parser stability and ForkParser
On Thu, Dec 22, 2011 at 5:18 PM, Jerome Lacoste
<[EMAIL PROTECTED]> wrote:
> Hei,
>
> I opened a couple of issues to note some parser instability:
>
> https://issues.apache.org/jira/browse/TIKA-815
> https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
> https://issues.apache.org/bugzilla/show_bug.cgi?id=52373
> https://issues.apache.org/jira/browse/COMPRESS-169
>
> TIKA-815 is the overall one; it points out that Tika could have a
> few more tests to ensure that the underlying parsers are more
> robust. Because Tika has a general interface, such stress tests
> can be applied to all parsers, which may be a good idea.
> The code is simple and available on github. Feedback appreciated.
>
> Now a question that pertains more to the user list. In TIKA-815, Nick
> pointed out that one could use ForkParser to improve stability. I
> didn't manage to get it to work.
>
> When I use the command line tika app, e.g. with
>
> java -jar /tmp/tika-app-1.0.jar -v -t -f  brokenFile.doc
>
> then tika reports nothing.
>
> But if I try to reproduce something similar programmatically I run into
> strange errors:

[...]

> This solves the exception but causes tika to not report any error when
> parsing. It just doesn't parse anything and returns gracefully.

A bit more information (and another bug):

If I change my javaCommand to allow debugging:

        parser.setJavaCommand("java -Xmx32m -Xdebug -Xrunjdwp:transport=dt_socket,address=54321,server=y,suspend=y");

Then the forked process will write

    Listening for transport dt_socket at address: 54321

to its output. This confuses the client when it tries to ping the
forked process:

    public synchronized boolean ping() {
        try {
            output.writeByte(ForkServer.PING);
            output.flush();
            while (true) {
                consumeErrorStream();
                int type = input.read();
                if (type == ForkServer.PING) {
                    consumeErrorStream();
                    return true;
                } else {
                    return false;
                }
            }
        } catch (IOException e) {
            return false;
        }
    }

Because the input now starts with the "Listening for..." message, the
first byte read is not ForkServer.PING, so ping() returns false and the
client closes the communication.
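
The effect can be reproduced without Tika. This is only a sketch: the class name PingDemo, the helper names and the value given to PING are all made up for illustration (the real ForkServer.PING may have a different value). It only mirrors the first-byte check that ping() performs on the child's output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PingDemo {
    // Assumption: placeholder value, the real ForkServer.PING may differ.
    static final byte PING = 2;

    // Mirrors the check in ping(): only the first byte read is examined.
    static boolean firstByteIsPing(byte[] childOutput) {
        ByteArrayInputStream input = new ByteArrayInputStream(childOutput);
        int type = input.read();
        return type == PING;
    }

    // Joins two byte arrays, simulating text written ahead of the reply.
    static byte[] concat(byte[] first, byte[] second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(first, 0, first.length);
        out.write(second, 0, second.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] reply = { PING };
        byte[] banner = "Listening for transport dt_socket at address: 54321\n"
                .getBytes(StandardCharsets.US_ASCII);

        // Clean stream: the reply is the first byte, so the check passes.
        System.out.println(firstByteIsPing(reply));
        // Banner first: the check sees 'L' instead of PING and fails.
        System.out.println(firstByteIsPing(concat(banner, reply)));
    }
}
```

The reply byte is still in the stream in the second case; it is just never reached because the check gives up on the first unexpected byte.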

The ping method assumes there is nothing else to read, which is wrong.
(This is reproducible given the aforementioned context/parser fix.)

Now, to get back to my problem: the reported OOM never gets read by the
client. This is caused by ForkClient#waitForResponse. This method has a
switch with 2 identical cases, because the tests type == -1 and
type == ForkServer.ERROR are identical.

So replace
            } else if (type == ForkServer.ERROR) {
with
            } else if ((byte) type == ForkServer.ERROR) {
and the error is properly reported.
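
The mismatch can be shown in isolation. The sketch below assumes, as the description above implies, that the error marker is a byte constant equal to -1; SignedByteDemo and its ERROR constant are illustrative stand-ins, not Tika code. InputStream.read() returns each byte as an int between 0 and 255 (or -1 at end of stream), so a marker byte of -1 arrives as 255 and only matches once it is cast back to byte.

```java
import java.io.ByteArrayInputStream;

class SignedByteDemo {
    // Assumption: stands in for ForkServer.ERROR, taken to be a byte equal to -1.
    static final byte ERROR = -1;

    // Reads the first byte of the data as an int, like InputStream.read() does.
    static int readType(byte[] wire) {
        ByteArrayInputStream input = new ByteArrayInputStream(wire);
        return input.read();
    }

    // Original comparison: ERROR is widened to the int -1.
    static boolean matchesWithoutCast(int type) {
        return type == ERROR;
    }

    // Proposed comparison: the int 255 is narrowed to the byte -1.
    static boolean matchesWithCast(int type) {
        return (byte) type == ERROR;
    }

    public static void main(String[] args) {
        int type = readType(new byte[] { ERROR });
        System.out.println("read() returned " + type);                     // 255
        System.out.println("without cast: " + matchesWithoutCast(type));   // false
        System.out.println("with cast:    " + matchesWithCast(type));      // true
    }
}
```

With the cast in place, an end-of-stream value of -1 would also compare equal to the marker, so the end-of-stream test has to stay ahead of the ERROR test.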

To summarize, I think Tika has 3 issues:
* Tika#parseString() sends the wrong parser in the context when we fork
* waitForResponse doesn't properly handle ERROR because of the broken switch
* forking doesn't work. This one I have no fix for right now.

If needed, I will update my tika-hardener test to provide a full
working test and proper patches to the program.

Jerome