|
|
-
Issue in text extraction in Solr / Tika (nirnaydewan, 2011-08-19, 11:49)
I am using Solr 3.3.0 with the Jetty server that ships with it. When I upload MS Word documents or PDF files, the extracted text is not formatted properly:

1. There are no line breaks between sentences. The text is extracted as a single line (one string).
2. Wherever there are boxes in Word documents, some weird characters appear in their place.

How do I keep the formatting of the text just as it is in the document? For example, if there are 3 line breaks, how do I preserve them?

Also, "?" characters appear in the text when uploading Word documents. Where is the issue?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3267810.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
-
Re: Issue in text extraction in Solr / Tika (Michael McCandless, 2011-08-19, 15:21)
Can you post some example docs that don't extract correctly? Or, better, open one or more Jira issues and attach the documents there?

Thanks,

Mike McCandless
http://blog.mikemccandless.com
-
Re: Issue in text extraction in Solr / Tika (nirnaydewan, 2011-08-19, 19:32)
Thanks for your suggestion, Mike. Attached is the MS Word file.

What happens is that I get a single line of text, but I want it to keep its original formatting so that I can display it in highlighting.

Thanks

2011-01-23-7-22-09_sample.doc: http://lucene.472066.n3.nabble.com/file/n3269071/2011-01-23-7-22-09_sample.doc
-
Re: Issue in text extraction in Solr / Tika (Michael McCandless, 2011-08-19, 23:44)
I ran Tika to get the text:

> java -jar ./tika-app/target/tika-app-1.0-SNAPSHOT.jar -T 2011-01-23-7-22-09_sample.doc

And it produces this output for me:

+9245114107060 (M)
E-Mail: [EMAIL PROTECTED]
To enhance the organizational development by self development and motivation from the organizational atmosphere. Hence to involve myself as an effective personnel in this field with my skill, potential, talents with dedication.
Working Experience
DESIGNATION : Relationship Manager
Computer Awareness
Office Packages : MS-OFFICE.
ACADEMIC CREDENTIALS
Completed MBA in the year 2010 in MAKETING & RETAIL as major under RAI BUSINESS SCHOOL (67% Marks, overall).
Completed GRADUATION in BSc with 50% marks under WBCHSE
Extra Curriculum Activities
In my Graduation level I was leading my College Cricket team
PERSONAL DETAILS
Declaration :

It looks like some text is missing. The Word doc starts with NAMITGOP SAHAD, but that is not in the above text (strangely, if I get the XHTML output instead, I do see that text); various other text seems to be missing too. Do you see that?

On the formatting, it seems to have retained some of it (I don't get only a single line). But how are you trying to highlight? Are you displaying the text filtered by Tika to the user? Can you try the XHTML output?

Mike McCandless
-
Re: Issue in text extraction in Solr / Tika (nirnaydewan, 2011-08-20, 05:07)
First of all, thanks again, Mike, for helping me out.

Yes, I have seen that: some text does get stripped out sometimes. Any idea why this could be happening?

I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move to 0.9? If so, how?

Also, the text I am storing is the same text I am trying to display. If the XHTML output produces the correct text, how do I store that instead?

Thanks
-
Re: Issue in text extraction in Solr / Tika (Michael McCandless, 2011-08-20, 10:40)
OK, one correction: I ran the TikaCLI tool with the -T option, which extracts "main content only". When I re-ran it with the -t (lowercase) option, which outputs all plain text, all of the text appears correctly (phew!).

On moving to 0.9, that's your call. I'm not sure what has changed since 0.8, but presumably 0.9 is better!

Displaying the equivalent of "-t" from the TikaCLI tool seems like a good approach. Especially because the XHTML output incorrectly breaks up the SAHAD from your document.

Mike McCandless
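For the display side of the question, a minimal sketch (not part of Solr or Tika) of how stored plain text could be shown in an HTML page without losing its line breaks. It assumes the stored field still contains "\n" characters, as the "-t" output does; the function name to_display_html is hypothetical.

```python
import html

def to_display_html(plain_text: str) -> str:
    """Escape plain text for HTML and keep every line break visible.

    Assumption: plain_text is the extracted text with "\n" line breaks intact.
    """
    escaped = html.escape(plain_text)          # protect &, <, > and quotes
    return escaped.replace("\n", "<br/>\n")    # one <br/> per original line break

if __name__ == "__main__":
    sample = "Working Experience\n\n\nDESIGNATION : Relationship Manager"
    print(to_display_html(sample))
```

Wrapping the text in a <pre> element instead would also preserve the breaks; the replacement above is only one option. Highlight markup added by the search layer would have to be inserted after escaping, otherwise it would be escaped as well.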
-
Re: Issue in text extraction in Solr / Tika (Michael McCandless, 2011-08-20, 12:35)
One thing I still don't like: with the XML (-x) or XHTML (-h) output, the resulting filtered output incorrectly splits up a word. The doc has:

NAMITGOP SAHAD

But in the XML/XHTML it looks like this:

<p>
<b>NAMITGOP</b>
<b> SAHA</b>
<b>D</b>
</p>

I.e., SAHAD became SAHA and D, separated.

I think this is a bug, and I think I know why it's happening. I'll open an issue.

Mike McCandless
-
Re: Issue in text extraction in Solr / Tika (Uwe Schindler, 2011-08-20, 12:39)
From the XML point of view, it's not separated. The text is just in two elements, with no whitespace in between, according to parsing standards (see the XML whitespace rules).

Uwe
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de
-
Re: Issue in text extraction in Solr / Tika (Michael McCandless, 2011-08-20, 13:25)
Ahhh... what threw me off was the browser rendering, which turns that newline into a space, so I see "SAHA D".

Hmm, actually: the <p> element allows text in addition to child elements? So shouldn't any whitespace within the <p>...</p> be treated as significant (part of the content)?

I need to go learn XML's whitespace rules :)

Mike McCandless
-
RE: Issue in text extraction in Solr / Tika (Uwe Schindler, 2011-08-20, 14:16)
Yes, the text-only output handler uses exactly those whitespace processing guidelines and also inserts newlines at the correct places according to block elements like <p/>. The code was partially written by me, especially the block element parts :-)

So if the text-only output is formatted correctly, then the HTML should be fine too. Of course, this useless splitting of formatting is mostly caused by the original Word document. It usually comes from the Word editor: for example, you click on "bold", then think "oh, I missed a character", and then make the rest bold as well. Depending on the order of actions, these sections of bold text are not merged together. Tika is doing nothing wrong; it just translates the formatting of the Word/PDF document to XHTML.

Uwe
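The behaviour Uwe describes can be sketched roughly as follows. This is an illustration using Python's standard xml.sax module, not Tika's actual handler code; the class name TextOnlyHandler and the BLOCK_ELEMENTS set are assumptions made for the example. Inline elements such as <b> are ignored, so split bold runs rejoin, and a newline is emitted when a block element ends.

```python
import xml.sax

# Illustrative subset of block-level XHTML elements (an assumption, not Tika's list).
BLOCK_ELEMENTS = {"p", "div", "h1", "h2", "h3", "li", "tr"}

class TextOnlyHandler(xml.sax.ContentHandler):
    """Collects character data and appends a newline after each block element."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        # Text inside inline elements (<b>, <i>, ...) is appended as-is,
        # so adjacent runs like <b> SAHA</b><b>D</b> become " SAHAD".
        self.parts.append(content)

    def endElement(self, name):
        if name in BLOCK_ELEMENTS:
            self.parts.append("\n")

    def text(self):
        return "".join(self.parts)

def to_text(xhtml: str) -> str:
    handler = TextOnlyHandler()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.text()

if __name__ == "__main__":
    doc = ("<body><p><b>NAMITGOP</b><b> SAHA</b><b>D</b></p>"
           "<p>Working Experience</p></body>")
    print(to_text(doc))
```

Note that the sample input has no whitespace between the <b> elements. If a serializer writes a newline between them, that newline is reported as character data too and ends up inside the word.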
-
RE: Issue in text extraction in Solr / Tika (Uwe Schindler, 2011-08-20, 14:19)
> Hmm, actually: the <p> element allows text in addition to child elements? So
> shouldn't any whitespace within the <p>...</p> be treated as significant
> (part of the content)?

This is indeed very complicated. For mixed-content elements, the whitespace inside is preserved, but not next to child elements. Very stupid rules. If you have ever coded HTML, you know this :-)

Uwe
-
Re: Issue in text extraction in Solr / Tika (Michael McCandless, 2011-08-20, 15:32)
On Sat, Aug 20, 2011 at 10:19 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
> This is indeed very complicated. For mixed-content elements, the
> whitespace inside is preserved, but not next to child elements. Very stupid
> rules. If you have ever coded HTML, you know this :-)

Hmm... are you sure? :)

I've tried Firefox, Chrome and Safari on the XML file, and all of them insert a space when rendering.

I also tried Tika itself (feeding back the .xml it had created, to produce text), and it also inserts a space.

I also tried JTidy, and it inserts the space too, though it thinks it's parsing HTML, so that may be an invalid test.

Anyway, even if the strict XML whitespace rules state that this newline should not be counted as whitespace in the content, so many tools seem not to handle it that way that I think it's worth trying to fix Tika to not add this newline.

Mike McCandless
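The effect Mike reports can be reproduced without Tika. The sketch below uses Python's standard xml.etree.ElementTree on the fragment quoted earlier in the thread; the helper names raw_text and rendered_text are made up for the example, and the whitespace collapsing is only a rough approximation of what a browser does when rendering.

```python
import re
import xml.etree.ElementTree as ET

# The fragment from the thread, with a newline after each element.
XHTML = "<p>\n<b>NAMITGOP</b>\n<b> SAHA</b>\n<b>D</b>\n</p>"

def raw_text(xml: str) -> str:
    """Concatenate every text node exactly as the parser reports it."""
    return "".join(ET.fromstring(xml).itertext())

def rendered_text(xml: str) -> str:
    """Approximate HTML rendering: collapse each whitespace run to one space."""
    return re.sub(r"\s+", " ", raw_text(xml)).strip()

if __name__ == "__main__":
    print(repr(raw_text(XHTML)))   # newlines between the <b> elements are kept
    print(rendered_text(XHTML))    # the newline before "D" shows up as a space
    # Without the newlines between elements, the word stays intact:
    print(rendered_text("<p><b>NAMITGOP</b><b> SAHA</b><b>D</b></p>"))
```

With this parser, the newline between </b> and <b> is delivered as ordinary character data, which is consistent with the space seen in the browsers. Whether a consumer then keeps or drops it is up to that consumer.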
-
RE: Issue in text extraction in Solr / Tika (Uwe Schindler, 2011-08-20, 16:11)
Does it really add this newline? That would be strange. If you look at XHTMLContentHandler, it does not. So the newline must come from somewhere else.

Uwe
-
Re: Issue in text extraction in Solr / Tika (Michael McCandless, 2011-08-20, 16:25)
I found the source of the newline, and opened this issue:

https://issues.apache.org/jira/browse/TIKA-692

Let's continue talking over there...

Mike McCandless
|