What great news!  Thank you, Sergey!!!

-----Original Message-----
From: Sergey Beryozkin [mailto:[EMAIL PROTECTED]]
Sent: Monday, September 11, 2017 9:18 AM
Subject: Re: Integrating Tika with Apache Beam

Hi Tim, All

It took some time, but the Beam TikaIO component is finally in Beam's 2.2.0-SNAPSHOT master.


I've created a basic project which can help with running it quickly:


One can just build it and run it as suggested in Readme.md. Simply have some files available, PDF files for example, and point to one or all of them.

By default, Beam will output the data to /tmp/tika.

main() can be updated to support more options. They can be collected either from the command line with TikaOptions:


(all options except "--input" are optional)

or directly from the code, some variations are shown in the tests:


By default, TikaReader uses an internal queue to make the SAX events available to the Beam pipeline, which is why you can see options like "queuePollTime", etc. If it's known that a given parser reads the whole text in a single operation, then the process can be optimized with 'parseSynchronously'...
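
To illustrate the queue idea, here is a minimal sketch in plain Java. It is not the TikaIO code: the class name, the END marker and the readAll helper are made up for illustration, and the poll timeout only stands in for what I assume an option like "queuePollTime" controls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueHandoffSketch {
    // Hypothetical end-of-document marker; not part of TikaIO.
    private static final String END = "\u0000END";

    /**
     * Collects text chunks handed over by a background "parser" thread.
     * pollMillis is an assumed stand-in for a "queuePollTime"-style option.
     */
    public static List<String> readAll(List<String> chunks, long pollMillis)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Producer: stands in for a parser pushing text from SAX callbacks.
        Thread parser = new Thread(() -> {
            for (String chunk : chunks) {
                queue.add(chunk);
            }
            queue.add(END); // signal that the document is finished
        });
        parser.setDaemon(true);
        parser.start();

        // Consumer: stands in for the reader feeding the pipeline.
        List<String> out = new ArrayList<>();
        while (true) {
            String next = queue.poll(pollMillis, TimeUnit.MILLISECONDS);
            if (next == null) {
                continue; // nothing arrived within the timeout; poll again
            }
            if (END.equals(next)) {
                break;
            }
            out.add(next);
        }
        parser.join();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> text = readAll(Arrays.asList("Hello, ", "Tika"), 50);
        System.out.println(String.join("", text));
    }
}
```

A synchronous variant would drop the queue and the extra thread and simply collect the text in the calling thread, which is the kind of saving 'parseSynchronously' is aiming at.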

One can also try updating main() in the example to do more interesting things than just print the data :-).

Please give it a try if you get a chance, and help make TikaIO a major part of Beam :-) with PRs, etc.

Thanks, Sergey

On 25/05/17 17:47, Sergey Beryozkin wrote: