Re: How to index pdf's content with SolrJ?
This might help:
http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

The key point is that you have to have Tika parse your file on the
client side and then send the extracted content to Solr...
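
A minimal sketch of that approach, assuming a Solr 3.x-era SolrJ client and the Tika parser jars on the client classpath. The class name TikaClientSideIndexer and its helper methods are hypothetical, and the field names (id, attr_location, attr_content) are taken from the question below and assume the schema accepts them, for example through an attr_* dynamic field. It is not necessarily the code from the linked post.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name; parses files locally and sends plain fields to Solr.
public class TikaClientSideIndexer {

    // Use the last path segment as the document id, as the question's code does.
    static String idFor(String fileName) {
        return fileName.substring(fileName.lastIndexOf('/') + 1);
    }

    // Parse the file on the client with Tika and return its plain-text body.
    static String extractText(File file)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        InputStream in = new FileInputStream(file);
        try {
            parser.parse(in, handler, new Metadata(), new ParseContext());
        } finally {
            in.close();
        }
        return handler.toString();
    }

    // Build an ordinary Solr document from the extracted text and add it.
    static void indexFile(SolrServer server, String fileName)
            throws IOException, SAXException, TikaException, SolrServerException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", idFor(fileName));
        doc.addField("attr_location", fileName);                       // assumed field name
        doc.addField("attr_content", extractText(new File(fileName))); // assumed field name
        server.add(doc);
        server.commit();
    }

    public static void main(String[] args) throws Exception {
        // Same URL as in the question; adjust for your installation.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");
        indexFile(server, args[0]);
    }
}
```

With this arrangement the /update/extract handler is not involved at all, so printing the result of extractText before indexing shows whether Tika itself produces any text for a given file.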

Best
Erick

On Fri, Apr 20, 2012 at 7:36 PM, vasuj <[EMAIL PROTECTED]> wrote:
>
> I'm trying to index a few pdf documents using SolrJ as described at
> http://wiki.apache.org/solr/ContentStreamUpdateRequestExample, below there's
> the code:
>
> import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
> import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
> import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;
>
> import java.io.File;
> import java.io.IOException;
> import java.util.Map.Entry;
>
> import org.apache.solr.client.solrj.SolrServer;
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
> import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
> import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
> import org.apache.solr.common.util.NamedList;
> ...
> public static void indexFilesSolrCell(String fileName) throws IOException, SolrServerException {
>
>  String urlString = "http://localhost:8080/solr";
>  SolrServer server = new CommonsHttpSolrServer(urlString);
>
>  // Send the raw file to the extracting request handler
>  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
>  up.addFile(new File(fileName));
>  // Use the last path segment as the document id
>  String id = fileName.substring(fileName.lastIndexOf('/') + 1);
>  System.out.println(id);
>
>  up.setParam(LITERALS_PREFIX + "id", id);
>  // this field doesn't exist in schema.xml, it'll be created as attr_location
>  up.setParam(LITERALS_PREFIX + "location", fileName);
>  up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
>  up.setParam(MAP_PREFIX + "content", "attr_content");
>  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>
>  // Print the handler's response entries
>  NamedList<Object> response = server.request(up);
>  for (Entry<String, Object> entry : response) {
>    System.out.println(entry.getKey());
>    System.out.println(entry.getValue());
>  }
> }
> Unfortunately, when querying for *:* I get the list of indexed documents, but
> the content field is empty. How can I change the code above to also extract
> the document's content?
>
> Below is the XML fragment that describes this document:
>
> <doc>
>  <arr name="attr_content">
>    <str>            </str>
>  </arr>
>  <arr name="attr_location">
>    <str>/home/alex/Documents/lsp.pdf</str>
>  </arr>
>  <arr name="attr_meta">
>    <str>stream_size</str>
>    <str>31203</str>
>    <str>Content-Type</str>
>    <str>application/pdf</str>
>  </arr>
>  <arr name="attr_stream_size">
>    <str>31203</str>
>  </arr>
>  <arr name="content_type">
>    <str>application/pdf</str>
>  </arr>
>  <str name="id">lsp.pdf</str>
> </doc>
> I don't think this problem is caused by an incorrect installation of
> Apache Tika: previously I got a few ServerException errors, but I have since
> installed the required jars in the correct path. Moreover, I tried to
> index a txt file using the same class, and the attr_content field is still
> empty.
>
> I also tried setting stored="true" on the content field in schema.xml:
>
> <field name="text" type="textgen" indexed="true" stored="true"
> required="false" multiValued="true"/>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-index-pdf-s-content-with-SolrJ-tp3927284p3927284.html
> Sent from the Solr - User mailing list archive at Nabble.com.