Re: How to index pdf's content with SolrJ?
This might help:
http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

The key point is that you have Tika parse the file on the client side,
then send the extracted content to Solr as an ordinary document...
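
A minimal sketch of the "send the extracted content to Solr" half, assuming the text has already been extracted. It uses plain Java only, so it runs without the Tika or SolrJ jars: it builds the XML `<add><doc>` message that Solr's /update handler accepts. The class and method names are hypothetical, and the extracted text is hard-coded where a Tika call (for example AutoDetectParser with a BodyContentHandler) would normally supply it. With SolrJ you would more likely fill a SolrInputDocument via addField() and pass it to server.add() instead of building XML by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of SolrJ or Tika.
final class SolrAddXml {

    // Escape the characters that are special in XML text and attribute values.
    // (Control characters that are illegal in XML are not handled here.)
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build an <add><doc>...</doc></add> update message from
    // field name -> value pairs, preserving the map's iteration order.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Assumption: in real code this string comes from Tika.
        String extractedText = "Text that Tika pulled out of lsp.pdf";

        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "lsp.pdf");
        doc.put("attr_location", "/home/alex/Documents/lsp.pdf");
        doc.put("attr_content", extractedText);

        // The resulting string would be POSTed to <solr-url>/update.
        System.out.println(toAddXml(doc));
    }
}
```

The field names are the ones from the question below; since the client controls exactly what goes into each field, an empty content field becomes easy to spot before anything is sent.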

Best
Erick

On Fri, Apr 20, 2012 at 7:36 PM, vasuj <[EMAIL PROTECTED]> wrote:
>
> I'm trying to index a few PDF documents using SolrJ, as described at
> http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. Here is
> the code:
>
> import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
> import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
> import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;
>
> import java.io.File;
> import java.io.IOException;
> import java.util.Map.Entry;
>
> import org.apache.solr.client.solrj.SolrServer;
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
> import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
> import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
> import org.apache.solr.common.util.NamedList;
> ...
> public static void indexFilesSolrCell(String fileName)
>         throws IOException, SolrServerException {
>
>     String urlString = "http://localhost:8080/solr";
>     SolrServer server = new CommonsHttpSolrServer(urlString);
>
>     ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
>     up.addFile(new File(fileName));
>     // Use the file name (without its directory) as the document id.
>     String id = fileName.substring(fileName.lastIndexOf('/') + 1);
>     System.out.println(id);
>
>     up.setParam(LITERALS_PREFIX + "id", id);
>     // This field doesn't exist in schema.xml; it will be created as attr_location.
>     up.setParam(LITERALS_PREFIX + "location", fileName);
>     up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
>     up.setParam(MAP_PREFIX + "content", "attr_content");
>     up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>
>     // Print every entry of the response returned by Solr.
>     NamedList<Object> response = server.request(up);
>     for (Entry<String, Object> entry : response) {
>         System.out.println(entry.getKey());
>         System.out.println(entry.getValue());
>     }
> }
>
> Unfortunately, when I query for *:* I get the list of indexed documents, but
> the content field is empty. How can I change the code above so that it also
> extracts the document's content?
>
> Below is the XML fragment that describes this document:
>
> <doc>
>  <arr name="attr_content">
>    <str>            </str>
>  </arr>
>  <arr name="attr_location">
>    <str>/home/alex/Documents/lsp.pdf</str>
>  </arr>
>  <arr name="attr_meta">
>    <str>stream_size</str>
>    <str>31203</str>
>    <str>Content-Type</str>
>    <str>application/pdf</str>
>  </arr>
>  <arr name="attr_stream_size">
>    <str>31203</str>
>  </arr>
>  <arr name="content_type">
>    <str>application/pdf</str>
>  </arr>
>  <str name="id">lsp.pdf</str>
> </doc>
>
> I don't think this problem is caused by an incorrect installation of
> Apache Tika: I previously got a few ServerExceptions, but I have since
> installed the required jars in the correct path. Moreover, I tried indexing
> a txt file with the same class, and the attr_content field is still empty.
>
> I also tried setting stored="true" on the content field in schema.xml:
>
> <field name="text" type="textgen" indexed="true" stored="true"
> required="false" multiValued="true"/>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-index-pdf-s-content-with-SolrJ-tp3927284p3927284.html
> Sent from the Solr - User mailing list archive at Nabble.com.
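
For reference, the ExtractingParams constants used in the quoted code are thin wrappers around plain request parameters on /update/extract. The sketch below, in plain Java with no SolrJ dependency, shows the parameters that code ends up sending for a given file. The class and method names are hypothetical; the three constant values mirror LITERALS_PREFIX, MAP_PREFIX and UNKNOWN_FIELD_PREFIX. Only commit=true is included for the setAction() call; the wait flags it also sets are left out here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of SolrJ.
final class ExtractParamsSketch {

    // Values mirroring the ExtractingParams constants used in the quoted code.
    static final String LITERALS_PREFIX = "literal.";
    static final String MAP_PREFIX = "fmap.";
    static final String UNKNOWN_FIELD_PREFIX = "uprefix";

    // Collect the same parameters the quoted indexFilesSolrCell() sets.
    static Map<String, String> paramsFor(String fileName) {
        // File name without its directory, used as the document id.
        String id = fileName.substring(fileName.lastIndexOf('/') + 1);

        Map<String, String> p = new LinkedHashMap<>();
        p.put(LITERALS_PREFIX + "id", id);              // fixed value for the id field
        p.put(LITERALS_PREFIX + "location", fileName);  // fixed value for a "location" field
        p.put(UNKNOWN_FIELD_PREFIX, "attr_");           // prefix for fields missing from the schema
        p.put(MAP_PREFIX + "content", "attr_content");  // rename Tika's "content" field
        p.put("commit", "true");
        return p;
    }

    // Render the parameters as a URL query string.
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(paramsFor("/home/alex/Documents/lsp.pdf")));
    }
}
```

Seeing the parameters spelled out makes it easier to reproduce the same request with another HTTP client and check whether the extracted content is empty there as well.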