-
missing a directory, can not process pdf files
xxxx xxxx 2012-09-19, 07:46
It seems the /update directory is missing? I use Solr 4.0.0 beta and cannot process PDF files because of it.
Also, when will the final version be released? I thought it was 30 days after beta?
How can we get the files which contain the searched queries / content?
-
Re: missing a directory, can not process pdf files
Erick Erickson 2012-09-19, 12:33
Please review:
http://wiki.apache.org/solr/UsingMailingLists

There's nothing in your problem statement that's diagnosable. What did you try? What were the results? Details matter.

4.0 is in process of being prepped for release. 30 days was a straw-man proposal.

Best
Erick
-
Re: missing a directory, can not process pdf files
xxxx xxxx 2012-09-19, 14:40
I want to process a PDF file; see "Indexing Data" in http://lucene.apache.org/solr/api-4_0_0-BETA/doc-files/tutorial.html

The directory "update" doesn't even exist:
SimplePostTool: POSTing files to http://localhost:8983/solr/update..

This fails because the /update directory is not there and has no contents (it is also missing from the repos on GitHub and so on).

How can we retrieve the files that contain the searched query when we run a query?
-
Re: missing a directory, can not process pdf files
Erik Hatcher 2012-09-19, 14:51
There's nothing in that tutorial that mentions an update "directory". /update is a URL endpoint that requires Solr be up and running.

Please post the entire set of steps that you're trying and the exact (copy/pasted) error messages you're receiving.

And once you index a PDF file, you don't retrieve the file back from Solr, you retrieve search results. The original file is where it was indexed from, not inside Solr. What you'll get back is the file name (if you stored it, that is).

Erik
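To illustrate the point about getting search results rather than files, here is a minimal Java sketch. It assumes the document id was set to the file's name or path at index time (for instance via the literal.id parameter that appears later in this thread) and that the example server answers queries through the standard /select handler under http://localhost:8983/solr. The class and method names are hypothetical. The code only builds the query URL; the response would list the stored id of each match, and locating the original file from that id is up to you.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr: builds a query URL that asks
// only for the stored "id" field of each matching document.
final class SelectUrl {

    /** Returns e.g. http://localhost:8983/solr/select?q=solr+pdf&fl=id&wt=json */
    static String build(String solrBase, String query, String fields) {
        return solrBase + "/select"
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)   // search terms
                + "&fl=" + URLEncoder.encode(fields, StandardCharsets.UTF_8) // stored fields to return
                + "&wt=json";                                                // response format
    }

    public static void main(String[] args) {
        // Assumed base URL of the example server used throughout this thread.
        System.out.println(build("http://localhost:8983/solr", "solr pdf", "id"));
    }
}
```

Opening the printed URL in a browser while Solr is running should be enough to see the ids of matching documents; no extra tooling is needed for queries.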
-
Re: missing a directory, can not process pdf files
Gora Mohanty 2012-09-19, 14:51
On 19 September 2012 20:10, xxxx xxxx <[EMAIL PROTECTED]> wrote:
> I want to process a pdf file see "Indexing Data" from
> http://lucene.apache.org/solr/api-4_0_0-BETA/doc-files/tutorial.html
>
> the directory "update" doesnt even exist:
> SimplePostTool: POSTing files to http://localhost:8983/solr/update..

Sorry, what directory are you referring to? The update above is a URL component, and there is a handler that responds to that. Are you by any chance looking for a PHP-style file layout? That is not how things work here.

Otherwise, please expand further on how exactly you are trying to index the PDF files, and what errors you see in the logs.

Regards,
Gora
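The difference between a URL component and a directory can be seen in a toy server. The sketch below is not Solr code and makes no claim about how Solr is implemented; it uses the JDK's built-in com.sun.net.httpserver package to register a handler for the path /solr/update on an ephemeral port, with no directory of that name anywhere on disk. All class and method names are hypothetical, and Java 9 or later is assumed.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Toy illustration only: a path like /solr/update is answered by code
// registered for that path, not by a folder on the file system.
final class ToyUpdateEndpoint {

    /** Starts a server on a free local port with one handler for /solr/update. */
    static HttpServer start() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/solr/update", exchange -> {
            byte[] body = exchange.getRequestBody().readAllBytes();
            byte[] reply = ("received " + body.length + " bytes").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(reply);
            }
        });
        server.start();
        return server;
    }

    /** POSTs the payload and returns the HTTP status code. */
    static int post(String url, byte[] payload) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload);
        }
        return conn.getResponseCode();
    }

    /** Returns the status codes for a registered path and an unregistered one. */
    static int[] demo() {
        try {
            HttpServer server = start();
            try {
                String base = "http://localhost:" + server.getAddress().getPort();
                byte[] payload = "<add/>".getBytes(StandardCharsets.UTF_8);
                return new int[] {
                    post(base + "/solr/update", payload),   // handled by the registered code
                    post(base + "/solr/missing", payload)   // no handler registered for this path
                };
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] codes = demo();
        System.out.println("/solr/update -> " + codes[0] + ", /solr/missing -> " + codes[1]);
    }
}
```

If the server process is not running, the same POST fails with a connection error instead of an HTTP status, which is why the endpoint only "exists" while the server is up.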
-
Re: missing a directory, can not process pdf files
xxxx xxxx 2012-09-19, 17:53
user:~/solr/example/exampledocs$ java -jar post.jar test.pdf
doesn't work.

Index binary documents such as Word and PDF with Solr Cell (ExtractingRequestHandler).
How do I do this?

http://lucene.apache.org/solr/api-4_0_0-BETA/doc-files/tutorial.html
http://wiki.apache.org/solr/ExtractingRequestHandler

The wiki page says Solr 1.4? curl is not normally installed, so how do we do this with something like post.jar? Also, the docs dir does not exist; the page seems very outdated?

"using "curl" or other command line tools to post documents to Solr is nice for testing, but not the recommended update method for best performance."

What then?

Far below on that page:
java -Durl=http://localhost:8983/solr/update/extract -Dparams=literal.id=doc5 -Dtype=text/html -jar post.jar tutorial.html

Is this the right way?
java -Dauto -jar post.jar tutorial.html
java -Dauto -Drecursive -jar post.jar .

"NOTE: The post.jar utility is not meant for production use"
So how do we normally do this, or how should we do it?
-
Re: missing a directory, can not process pdf files
Ahmet Arslan 2012-09-19, 18:23
> "NOTE: The post.jar utility is not meant for production use"
> so how do we normally do this or should do this?

I haven't used post.jar to index rich documents. This is a new feature of Solr 4.0. To index rich documents you can use one of these:
http://wiki.apache.org/solr/ContentStreamUpdateRequestExample
http://wiki.apache.org/solr/TikaEntityProcessor
http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
-
Re: missing a directory, can not process pdf files
xxxx xxxx 2012-09-19, 18:39
So I have to create a Java file and compile it just for this purpose, like http://wiki.apache.org/solr/ContentStreamUpdateRequestExample?

Is there no way to do this via post.jar (and without curl), or with another existing command-line implementation?

Also, no way is mentioned to do it without curl, even though they say we should not use curl?
-
Re: missing a directory, can not process pdf files
Chris Hostetter 2012-09-19, 18:47
: user:~/solr/example/exampledocs$ java -jar post.jar test.pdf doesnt work

1) You can use post.jar to send PDFs, but you have to use the option to tell Solr you are sending a PDF file, because by default it assumes you are posting XML. You can see the problem by looking at the output from post.jar and the Solr logs...

hossman@frisbee:~/tmp/solr-4.0-BETA/bin-zip/apache-solr-4.0.0-BETA/example/exampledocs$ java -jar post.jar /tmp/test.pdf
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
...

And in the Solr logs...

...
SEVERE: org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:159)
...

...if you specify the type, things should work fine on the client side.

As for the server side...

2) By default Solr's "/update" handler supports Solr documents in XML, JSON, CSV, and JavaBin. If you want to use the "ExtractingRequestHandler" to parse rich documents you just have to change the URL exactly as noted in the wiki you mentioned ("-Durl=http://localhost:8983/solr/update/extract")

-Hoss
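Both points can be combined without curl. Adapting the wiki command quoted earlier in the thread, the post.jar form would presumably be `java -Durl=http://localhost:8983/solr/update/extract -Dparams=literal.id=doc1 -Dtype=application/pdf -jar post.jar test.pdf` (untested here; only the content type and id differ from the quoted command). The same request can be sketched in plain Java with nothing but the JDK. This is an illustration under assumptions, not the recommended client: the class and method names are hypothetical, the base URL is the one from this thread, the id is simply taken from the file name, and commit=true is added so the document becomes searchable right away. Java 11 or later is assumed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for curl/post.jar: sends one PDF to the extract handler.
final class PostPdf {

    /** Builds e.g. http://localhost:8983/solr/update/extract?literal.id=test.pdf&commit=true */
    static String extractUrl(String solrBase, String docId) {
        return solrBase + "/update/extract"
                + "?literal.id=" + URLEncoder.encode(docId, StandardCharsets.UTF_8)
                + "&commit=true";
    }

    /** Streams the file as the request body and returns the HTTP status code. */
    static int post(String url, Path file, String contentType) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Declares the body as a PDF rather than XML.
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        Path pdf = Path.of(args.length > 0 ? args[0] : "test.pdf");
        String url = extractUrl("http://localhost:8983/solr", pdf.getFileName().toString());
        System.out.println("POST " + url + " -> HTTP " + post(url, pdf, "application/pdf"));
    }
}
```

With Solr running, `java PostPdf.java test.pdf` would send the file; a connection error means the server is not up, and an error status means the Solr log is the place to look. For larger or repeated indexing jobs, the SolrJ links given earlier in the thread are the better route.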
-
Re: missing a directory, can not process pdf files
xxxx xxxx 2012-09-19, 19:00
1) So what does this look like, for example?
2) Without curl? What does this look like? I am very confused because they use curl in the example but at the same time say that we should not use curl. Also, I have not installed curl.