Tika, mail # user - Return raw text from document


Re: Return raw text from document
Dave Meikle 2012-08-18, 07:56
Hi Alex,

On 17 Aug 2012, at 08:37, Alexander Cougarman <[EMAIL PROTECTED]> wrote:

> I'm using this C# code to call the parser directly via its URL; it returns JSON:
>
> var url = @"http://localhost:8983/solr/update/extract";
>
> var client = new WebClient();
> client.QueryString.Add("extractOnly","true");
> client.QueryString.Add("wt","json");
> var data = client.UploadFile(url, "input.txt");
> var json = ASCIIEncoding.ASCII.GetString(data);
>
> Sincerely,
> Alex

There is a parameter called extractFormat that you can use in extractOnly mode. Setting it to text gives you the serialised content back as plain text, within a <str> element in the full XML response.
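
As a rough sketch, this is how your snippet might look with that parameter added. It assumes the same Solr URL and input.txt from your message and that extractFormat is set to text; the class name and the BuildExtractQuery helper are hypothetical, and I have not checked the exact shape of the JSON you get back with wt=json:

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

public class ExtractPlainText
{
    // Hypothetical helper: collects the query parameters for an extract-only call.
    public static NameValueCollection BuildExtractQuery()
    {
        var query = new NameValueCollection();
        query.Add("extractOnly", "true");    // parse only, do not index the document
        query.Add("extractFormat", "text");  // assumption: "text" asks for plain text rather than the default markup
        query.Add("wt", "json");             // keep the JSON response writer from the original snippet
        return query;
    }

    public static void Main()
    {
        var url = "http://localhost:8983/solr/update/extract";

        using (var client = new WebClient())
        {
            client.QueryString = BuildExtractQuery();

            // Upload the file and read the raw response body.
            byte[] data = client.UploadFile(url, "input.txt");

            // Decode as UTF-8 so non-ASCII characters in the extracted text survive.
            Console.WriteLine(Encoding.UTF8.GetString(data));
        }
    }
}
```

The extracted text should then sit inside the response as a string value rather than as XHTML markup, so you would still need to pull it out of the JSON on the client side.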

I'm not sure of your full use case, or whether you will be using the Solr instance for other features later, but you could also use the JSR-311 Tika Server to do this extraction for you: http://wiki.apache.org/tika/TikaJAXRS
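
If you go the Tika Server route, a call from C# might look roughly like the sketch below. The port 9998 and the /tika resource path are assumptions about a default Tika Server setup (check the wiki page for your version), and the class and method names are hypothetical:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public class TikaServerExtract
{
    // Assumption: Tika Server running locally on its default port, exposing /tika.
    public const string TikaUrl = "http://localhost:9998/tika";

    // Hypothetical helper: a WebClient that asks the server for plain text.
    public static WebClient CreateClient()
    {
        var client = new WebClient();
        client.Headers.Add("Accept", "text/plain");
        return client;
    }

    public static void Main()
    {
        using (var client = CreateClient())
        {
            // PUT the raw document bytes; the response body is the extracted text.
            byte[] body = File.ReadAllBytes("input.txt");
            byte[] data = client.UploadData(TikaUrl, "PUT", body);

            Console.WriteLine(Encoding.UTF8.GetString(data));
        }
    }
}
```

This avoids wrapping the text in a Solr response at all, which may be simpler if extraction is the only thing you need.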

Cheers,
Dave