Lucene, mail # user - Problems Indexing/Parsing Tibetan Text


Denis Brodeur 2012-03-30, 16:46
Robert Muir 2012-03-30, 16:57
Denis Brodeur 2012-03-30, 17:03
Re: Problems Indexing/Parsing Tibetan Text
Benson Margulies 2012-03-30, 17:07
fileformat.info
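
The same per-character properties can also be checked locally. Below is a minimal sketch in Python using the standard `unicodedata` module; the helper name `describe_range` is made up for illustration, and the exact categories reported depend on the Unicode version bundled with your interpreter. Categories starting with `P` are punctuation, `S` are symbols, and `M` are combining marks, which is roughly the distinction a word-boundary tokenizer cares about.

```python
import unicodedata


def describe_range(first, last):
    """Return (code point, general category, name) for each character
    in the inclusive range first..last.

    Hypothetical helper for illustration only; it is not part of Lucene.
    """
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        rows.append((
            f"U+{cp:04X}",
            unicodedata.category(ch),            # e.g. a two-letter code such as "Po"
            unicodedata.name(ch, "<unnamed>"),   # fall back if the code point is unassigned
        ))
    return rows


if __name__ == "__main__":
    # The Tibetan range discussed in this thread.
    for code, category, name in describe_range(0x0F10, 0x0F19):
        print(code, category, name)
```

If a character in the range reports a punctuation or symbol category, a tokenizer built around word boundaries will most likely drop it, so it may be worth checking the characters you need to search on this way before choosing an analyzer.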

On Mar 30, 2012, at 1:04 PM, Denis Brodeur <[EMAIL PROTECTED]> wrote:

> Thanks, Robert.  That makes sense.  Do you have a link handy where I can
> find this information, i.e. the word-boundary and punctuation properties
> for any Unicode character?
>
> On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>
>> On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur <[EMAIL PROTECTED]>
>> wrote:
>>> Hello, I'm currently working out some problems when searching for Tibetan
>>> characters.  More specifically: \u0f10-\u0f19.  We are using the
>>
>> Unicode doesn't consider most of these characters part of a word: most
>> are punctuation and symbols
>> (except U+0F18 and U+0F19, which are combining characters that combine with
>> digits).
>>
>> For example, U+0F14 is a text delimiter.
>>
>> In general, StandardTokenizer discards punctuation and is geared toward
>> word boundaries, just as
>> you would have trouble searching on characters like '(' in
>> English. So I think it's totally expected.
>>
>> --
>> lucidimagination.com
Robert Muir 2012-03-30, 17:09
Brandon Mintern 2012-03-30, 18:11