Solr, mail # user - missing a directory, can not process pdf files


Re: missing a directory, can not process pdf files
Chris Hostetter 2012-09-19, 18:47

: user:~/solr/example/exampledocs$ java -jar post.jar test.pdf doesnt work

1) You can use post.jar to send PDFs, but you have to use the option that
tells Solr you are sending a PDF file, because by default it assumes you
are posting XML.  You can see the problem by looking at the output from
post.jar and the Solr logs...

hossman@frisbee:~/tmp/solr-4.0-BETA/bin-zip/apache-solr-4.0.0-BETA/example/exampledocs$ java -jar post.jar /tmp/test.pdf
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
...

And in the Solr logs...

...
SEVERE: org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:159)
...

...if you specify the type, things should work fine on the client side.

As for the Server side...

2) By default Solr's "/update" handler supports Solr documents in XML,
JSON, CSV, and JavaBin.  If you want to use the "ExtractingRequestHandler"
to parse rich documents, you just have to change the URL exactly as noted
in the wiki you mentioned ("-Durl=http://localhost:8983/solr/update/extract").
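
Putting both points together, a rough sketch of the command line might look
like the following. It is a dry run that only prints the command. The file
name, the document id, and the use of -Dparams to pass literal.id are
assumptions for illustration (check the usage output of your own post.jar
version); only the -Durl value comes from the message above.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints a post.jar command, does not run it.
# PDF_FILE and DOC_ID are illustrative placeholders, not from the thread.
SOLR_URL="http://localhost:8983/solr/update/extract"
PDF_FILE="test.pdf"
DOC_ID="test-pdf-1"   # hypothetical id, passed as literal.id

# -Dtype replaces the default application/xml content type (point 1).
# -Durl sends the file to the ExtractingRequestHandler (point 2).
# -Dparams is assumed to be supported by your post.jar version.
CMD="java -Durl=$SOLR_URL -Dtype=application/pdf -Dparams=literal.id=$DOC_ID -jar post.jar $PDF_FILE"

echo "$CMD"
```

To actually send the file, run the printed command from the
example/exampledocs directory with Solr running, then check that post.jar
reports the application/pdf content type instead of application/xml.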
-Hoss