-
Problems Indexing/Parsing Tibetan Text
Denis Brodeur 2012-03-30, 16:46
Hello, I'm currently working out some problems when searching for Tibetan characters, specifically U+0F10 through U+0F19. We are using the StandardAnalyzer (3.4), and I've narrowed the problem down to StandardTokenizerImpl throwing these characters away: in getNextToken(), they fall through to case 1: /* Not numeric, word, ideographic, hiragana, or SE Asian -- ignore it */. So the question is: is this the expected behaviour, and if it is, what would be the best way to support, in a general way, code points that the StandardAnalyzer does not recognize?
-
Re: Problems Indexing/Parsing Tibetan Text
Robert Muir 2012-03-30, 16:57
On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur <[EMAIL PROTECTED]> wrote:
> Hello, I'm currently working out some problems when searching for Tibetan
> Characters. More specifically: U+0F10-U+0F19. We are using the
Unicode doesn't consider most of these characters part of a word: most are punctuation and symbols (except U+0F18 and U+0F19, which are combining characters that combine with digits).
For example, U+0F14 is a text delimiter.
In general, StandardTokenizer discards punctuation and is geared toward word boundaries, so you would have the same trouble searching on characters like '(' in English. So I think it's totally expected.
-- lucidimagination.com
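The category claims above can be checked locally with a small sketch like the one below. It assumes only the JDK's built-in Unicode tables, which may track a different Unicode version than the one a given Lucene release was generated from, so the exact punctuation/symbol split can vary by JDK. The class and method names (TibetanCategories, categoryName, describeRange) are hypothetical and exist only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TibetanCategories {

    // Turns the java.lang.Character type constant for a code point into a short label.
    // Only the categories expected in this range are named; anything else is shown numerically.
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.OTHER_PUNCTUATION: return "Po (other punctuation)";
            case Character.OTHER_SYMBOL:      return "So (other symbol)";
            case Character.NON_SPACING_MARK:  return "Mn (non-spacing mark)";
            default:                          return "type " + Character.getType(codePoint);
        }
    }

    // Builds an ordered map from "U+XXXX" to the category label for every code point in [from, to].
    static Map<String, String> describeRange(int from, int to) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int cp = from; cp <= to; cp++) {
            result.put(String.format("U+%04X", cp), categoryName(cp));
        }
        return result;
    }

    public static void main(String[] args) {
        // Print the general category and letter/digit status of U+0F10..U+0F19.
        for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
            System.out.printf("U+%04X  %-24s letterOrDigit=%b%n",
                    cp, categoryName(cp), Character.isLetterOrDigit(cp));
        }
    }
}
```

None of the code points in this range is a letter or digit, which is consistent with a word-oriented tokenizer skipping them.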
-
Re: Problems Indexing/Parsing Tibetan Text
Denis Brodeur 2012-03-30, 17:03
Thanks Robert. That makes sense. Do you have a link handy where I can find this information, i.e. word boundary/punctuation for any Unicode character set?
-
Re: Problems Indexing/Parsing Tibetan Text
Benson Margulies 2012-03-30, 17:07
fileformat.info
-
Re: Problems Indexing/Parsing Tibetan Text
Robert Muir 2012-03-30, 17:09
On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur <[EMAIL PROTECTED]> wrote:
> Thanks Robert. That makes sense. Do you have a link handy where I can
> find this information? i.e. word boundary/punctuation for any unicode
> character set?

Yeah, usually I use http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[u0f10-u0f19]&g
You can then click on a character and see all of its properties easily.
(The site seems to have some issues today.)
-- lucidimagination.com
-
Re: Problems Indexing/Parsing Tibetan Text
Brandon Mintern 2012-03-30, 18:11
Another good reference is this one: http://unicode.org/reports/tr29/
Since the latest Lucene uses this as the basis of its text segmentation, it's worth getting familiar with it.
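To get a feel for word-boundary segmentation without touching Lucene, here is a minimal sketch using the JDK's own java.text.BreakIterator. This is not Lucene's StandardTokenizer and may not follow the same rules in every case; it only illustrates the general idea of splitting at word boundaries and then keeping the segments that contain a letter or digit. The class name WordBoundaryDemo and the method words are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {

    // Splits text at the JDK's word boundaries and keeps only the segments
    // that contain at least one letter or digit; punctuation-only and
    // whitespace-only segments are dropped.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // English text with parentheses, then a string of Tibetan marks only.
        System.out.println(words("Hello (world)"));
        System.out.println(words("\u0F11\u0F14"));
    }
}
```

With this filter, a segment made up only of marks such as U+0F11 or U+0F14 never becomes a token, much like '(' in the English example. If such characters do need to be searchable, a different tokenization strategy would be required.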