David King 2008-03-19, 19:06
This has probably been asked before, but I'm having trouble finding it. Basically, we want to be able to search for content across several languages, given that we know what language a datum and a query are in. Is there an obvious way to do this?
Here's the longer version: I am trying to index content that occurs in multiple languages, including Asian languages. I'm in the process of moving from PyLucene to Solr. In PyLucene, I would have a list of analysers:
analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
                 cs = pyluc.CzechAnalyzer(),
                 pt = pyluc.SnowballAnalyzer("Portuguese"),
                 ...
Then when I want to index something, I do
writer = pyluc.IndexWriter(store, analyzer, create)
writer.addDocument(d.doc)
That is, I tell Lucene the language of every datum, and the analyser to use when writing out the field. Then when I want to search against it, I do
analyzer = LanguageAnalyzer.getanal(lang)
q = pyluc.QueryParser(field, analyzer).parse(value)
And use that QueryParser to parse the query in the given language before sending it off to PyLucene. (off-topic: getanal() is perhaps my favourite function-name ever). So the language of a given datum is attached to the datum itself. In Solr, however, this appears to be attached to the field, not to the individual data in it:
<fieldType name="text_greek" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>
Does this mean that there's no way to have a single "contents" field that has content in multiple languages, and still have the queries be parsed and stemmed correctly? How are other people handling this? Does it make sense to write a tokeniser factory and a query factory that look at, say, the 'lang' field and return the correct tokenisers? Does this already exist?
The other alternative is to have a text_zh field, a text_en field, etc, and to modify the query to search on that field depending on the language of the query, but that seems kind of hacky to me, especially if a query may be against more than one language. Is this the accepted way to go about it? Is there a benefit to this method over writing a detecting tokeniser factory?
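For illustration, the per-language-field alternative can be sketched in a few lines of Python. This is only a sketch under assumptions: the `text_<lang>` naming follows the text_zh / text_en fields mentioned above, `field_for` and `build_query` are hypothetical helpers (not part of Solr or PyLucene), and query escaping is ignored.

```python
def field_for(lang):
    """Return the (assumed) name of the field holding text in this language."""
    return "text_%s" % lang


def build_query(value, langs):
    """Build a query string that searches one field per query language."""
    clauses = ["%s:(%s)" % (field_for(lang), value) for lang in langs]
    # A query against several languages becomes an OR over the matching fields.
    return " OR ".join(clauses)
```

With this shape, a query that may be in more than one language does not need special handling; it simply names more than one field.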
nicolas.dessaigne@... 2008-03-20, 10:05
You may be interested in a recent discussion that took place on a similar subject: http://www.mail-archive.com/[EMAIL PROTECTED]/msg09332.html
Nicolas
David King 2008-03-20, 16:17
> You may be interested in a recent discussion that took place on a similar subject:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg09332.html
Interesting, yes. But since it doesn't actually exist, it's not much help.
I guess what I'm asking is, if my approach seems convoluted, I'm probably doing it wrong, so how *are* people solving the problem of searching over multiple languages? What is the canonical way to do this?
Benson Margulies 2008-03-20, 16:20
Unless you can come up with language-neutral tokenization and stemming, you need to:
a) know the language of each document.
b) run a different analyzer depending on the language.
c) force the user to tell you the language of the query.
d) run the query through the same analyzer.
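Steps (a) through (d) can be sketched without any search library at all. The two "analyzers" below are toy stand-ins invented for this example (they are not real stemmers and not Lucene classes); the point is only that the same language-keyed lookup serves both the document and the query.

```python
def english_analyzer(text):
    # Toy stand-in for a real stemmer: lowercase and strip a trailing "s".
    return [t[:-1] if t.endswith("s") else t for t in text.lower().split()]


def german_analyzer(text):
    # Toy stand-in: lowercase and strip a trailing "en".
    return [t[:-2] if t.endswith("en") else t for t in text.lower().split()]


# Step (b): one analyzer per language, looked up by language code.
ANALYZERS = {"en": english_analyzer, "de": german_analyzer}


def analyze(text, lang):
    """Run text through the analyzer registered for its language."""
    return ANALYZERS[lang](text)


def matches(doc_text, doc_lang, query, query_lang):
    """Steps (a), (c), (d): both languages are known, and the query goes
    through the same analyzer that was used for the document."""
    if doc_lang != query_lang:
        return False
    doc_terms = set(analyze(doc_text, doc_lang))
    return all(term in doc_terms for term in analyze(query, query_lang))
```

Because the query and the document share an analyzer, "boot" finds a document containing "boots", and a German query is never compared against English stems.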
David King 2008-03-20, 16:39
> Unless you can come up with language-neutral tokenization and stemming, you need to:
> a) know the language of each document.
> b) run a different analyzer depending on the language.
> c) force the user to tell you the language of the query.
> d) run the query through the same analyzer.
I can do all of those. This implies storing all of the different languages in different fields, right? Then changing the default search field to the language of the query for every query?
Walter Underwood 2008-03-20, 16:42
Nice list.
You may still need to mark the language of each document. There are plenty of cross-language collisions: "die" and "boot" have different meanings in German and English. Proper nouns ("Laserjet") may be the same in all languages, a different problem if you are trying to get answers in one language.
At one point, I considered using Unicode language tagging on each token to keep it all straight. Effectively, index "de/Boot" or "en/Laserjet".
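A rough sketch of the per-token tagging idea, assuming a plain "lang/token" string prefix as in the "de/Boot" example rather than real Unicode language tags. `tag_tokens` and `build_index` are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
from collections import defaultdict


def tag_tokens(tokens, lang):
    """Prefix every token with its language code, e.g. "de/Boot"."""
    return ["%s/%s" % (lang, token) for token in tokens]


def build_index(docs):
    """Build a tiny inverted index from (doc_id, lang, text) triples."""
    index = defaultdict(set)
    for doc_id, lang, text in docs:
        for term in tag_tokens(text.split(), lang):
            index[term].add(doc_id)
    return index
```

Since the language is part of the term itself, German "Boot" and English "Boot" end up as separate index entries and never collide.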
wunder
Benson Margulies 2008-03-20, 16:43
You can store in one field if you manage to hide a language code with the text. XML is overkill but effective for this. At one point, we'd investigated how to allow a Lucene analyzer to see more than one field (the language code as well as the text) but I don't think we came up with anything.
Benson Margulies 2008-03-20, 16:45
Token/by/token seems a bit extreme. Are you concerned with macaronic documents?
Walter Underwood 2008-03-20, 17:00
Extreme, but guaranteed to work and it avoids bad IDF when there are inter-language collisions. In Ultraseek, we only stored the hash, so the size of the source token didn't matter.
Trademarks are a bad source of collisions and anomalous IDF. If you have LaserJet support docs in 20 languages, the term "LaserJet" will have a document frequency 20X higher than the terms in a single language and will score too low.
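To make the arithmetic concrete, here is a small sketch using the textbook formula idf = log(N / df). That formula is an assumption for illustration (it is not necessarily what Ultraseek or Lucene computes), and the collection sizes are made up.

```python
import math


def idf(total_docs, doc_freq):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(total_docs / doc_freq)


# Hypothetical collection: 1000 support docs in each of 20 languages.
docs_per_language = 1000
languages = 20
total_docs = docs_per_language * languages

# An ordinary term that occurs in 100 documents of one language only.
single_language_idf = idf(total_docs, 100)

# "LaserJet" occurs in the corresponding 100 documents of every language,
# so its document frequency is 20 times higher.
shared_term_idf = idf(total_docs, 100 * languages)

# The trademark loses log(20) of IDF purely from cross-language collisions.
idf_gap = single_language_idf - shared_term_idf
```

Under these assumptions the gap is log(20), roughly 3.0 in natural-log units, regardless of how many documents each language has.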
Ultraseek handles macaronic documents when the script makes it possible. For example, Roman text in a Japanese document is sent to the English stemmer, and Hangul always goes to the Korean segmenter/stemmer.
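The script-based routing can be approximated with the standard `unicodedata` module. This is a sketch of the general idea, not Ultraseek's implementation: `script_of` and `split_by_script` are hypothetical helpers, and classifying a character by the first word of its Unicode name is a crude heuristic that only distinguishes Latin and Hangul here.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Crudely classify a character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def split_by_script(text):
    """Split text into (script, run) pairs of consecutive same-script characters."""
    runs = []
    for script, chars in groupby(text, key=script_of):
        run = "".join(chars).strip()
        if run:  # drop runs that were only whitespace
            runs.append((script, run))
    return runs
```

Each run could then be handed to the analyzer for its script, for instance Latin runs to an English stemmer and Hangul runs to a Korean segmenter.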
A simpler approach is to tag each document with a language, like "lang:de", then use a filter query to restrict the documents to the query language.
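As a sketch of the filter-query approach: Solr accepts the main query in the `q` parameter and filter queries in `fq`. The field name `lang` is taken from the "lang:de" example above, `solr_params` is a hypothetical helper, and only the query string is built here, with no request sent.

```python
from urllib.parse import urlencode


def solr_params(query, lang):
    """Build a URL query string that restricts results to one language."""
    return urlencode([
        ("q", query),               # the user's query
        ("fq", "lang:%s" % lang),   # filter: only documents tagged with this language
    ])
```

The filter narrows the result set without contributing to the relevance score, so the language restriction does not distort ranking within the chosen language.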
Per-token tagging still strikes me as the "right" approach. It makes all sorts of things work, like keeping fuzzy matches within the same language. We didn't do it in Ultraseek because it would have been an incompatible index change and the benefit didn't justify that.
wunder
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department
Benson Margulies 2008-03-20, 18:05
Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis. All that makes sense.