|
Sergey Bartunov
2010-10-22, 16:07
Steven A Rowe
2010-10-22, 16:36
Sergey Bartunov
2010-10-22, 19:18
Steven A Rowe
2010-10-22, 22:43
Sergey Bartunov
2010-10-23, 12:56
Ahmet Arslan
2010-10-23, 13:45
Sergey Bartunov
2010-10-23, 14:45
Ahmet Arslan
2010-10-23, 14:53
Sergey Bartunov
2010-10-23, 15:01
Yonik Seeley
2010-10-23, 14:55
Sergey Bartunov
2010-10-23, 15:00
Ahmet Arslan
2010-10-23, 21:29
Sergey Bartunov
2010-10-24, 14:47
Yonik Seeley
2010-10-24, 15:19
Sergey Bartunov
2010-10-24, 15:29
Yonik Seeley
2010-10-24, 16:02
|
-
How to index long words with StandardTokenizerFactory? Sergey Bartunov 2010-10-22, 16:07
I'm trying to force Solr to index words longer than 255 characters (this limit is the constant DEFAULT_MAX_TOKEN_LENGTH in Lucene's StandardAnalyzer.java) using StandardTokenizerFactory as a 'filter' tag in the schema configuration XML. Specifying the maxTokenLength attribute doesn't work.

I tried a dirty hack: I downloaded the lucene-core-2.9.3 source, changed DEFAULT_MAX_TOKEN_LENGTH to 1000000, built the jar, and replaced the original lucene-core jar in solr /lib. But it seems to have had no effect.
-
RE: How to index long words with StandardTokenizerFactory? Steven A Rowe 2010-10-22, 16:36
Hi Sergey,

I've opened an issue to add a maxTokenLength param to the StandardTokenizerFactory configuration:

https://issues.apache.org/jira/browse/SOLR-2188

I'll work on it this weekend.

Are you using Solr 1.4.1? I ask because of your mention of Lucene 2.9.3. I'm not sure there will ever be a Solr 1.4.2 release. I plan on targeting Solr 3.1 and 4.0 for the SOLR-2188 fix.

I'm not sure why you didn't get the results you wanted with your Lucene hack - is it possible you have other Lucene jars in your Solr classpath?

Steve
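To make the limit under discussion concrete, here is a minimal sketch of what a maximum token length does to a long word. It assumes, per the Lucene javadoc for StandardTokenizer.setMaxTokenLength, that over-long tokens are skipped rather than truncated. The class MaxTokenLengthSketch and its whitespace splitting are illustrative stand-ins, not Lucene's StandardTokenizer grammar.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only: splits on whitespace instead of using
// StandardTokenizer's real grammar, to show the effect of the length limit.
class MaxTokenLengthSketch {

    // Default limit quoted in the thread (DEFAULT_MAX_TOKEN_LENGTH).
    static final int DEFAULT_MAX_TOKEN_LENGTH = 255;

    /** Splits on whitespace and drops (does not truncate) tokens longer than maxTokenLength. */
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && token.length() <= maxTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A 300-character word, longer than the default limit.
        String longWord = new String(new char[300]).replace('\0', 'a');
        String text = "short " + longWord + " tail";

        // With the default limit the long word never reaches the index.
        System.out.println(tokenize(text, DEFAULT_MAX_TOKEN_LENGTH));
        // With a raised limit all three tokens survive.
        System.out.println(tokenize(text, 1000).size());
    }
}
```

If the real tokenizer behaves this way, a document whose only matching term is the over-long word would be indexed without that term, which would explain an empty result for a query on it.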
-
Re: How to index long words with StandardTokenizerFactory? Sergey Bartunov 2010-10-22, 19:18
I'm using Solr 1.4.1. I've now succeeded in replacing the lucene-core jar, but maxTokenValue seems to be used in a very strange way. Currently it's set to 1024*1024 for me, but I couldn't index a field of only ~34 KB. I understand that it's a little odd to index such big data, but I just want to know why it doesn't work.
-
RE: How to index long words with StandardTokenizerFactory? Steven A Rowe 2010-10-22, 22:43
Hi Sergey,

What does your ~34kb field value look like? Does StandardTokenizer think it's just one token?

What doesn't work? What happens?

Steve
-
Re: How to index long words with StandardTokenizerFactory? Sergey Bartunov 2010-10-23, 12:56
Here are all the files: http://rghost.net/3016862

1) StandardAnalyzer.java, StandardTokenizer.java - patched files from lucene-2.9.3
2) I patch these files and build lucene by typing "ant"
3) I replace lucene-core-2.9.3.jar in solr/lib/ with my lucene-core-2.9.3-dev.jar that I'd just compiled
4) then I do "ant compile" and "ant dist" in the solr folder
5) after that I recompile solr/example/webapps/solr.war with my new solr and lucene-core jars
6) I put my schema.xml in solr/example/solr/conf/
7) then I do "java -jar start.jar" in solr/example
8) index big_post.xml
9) try to find this document with "curl http://localhost:8983/solr/select?q=body:big*" (big_post.xml contains a long word bigaaaaa...aaaa)
10) solr returns nothing
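For anyone reproducing step 8 without the linked files, the following is a minimal sketch of generating a comparable test document. The real big_post.xml is only available through the link above, so the class name BigPostSketch, the "id" field and its value, and the exact word length are assumptions. Only the "body" field name and the shape of the word ("big" followed by many "a" characters) come from the steps.

```java
// Hypothetical generator for a test document similar to big_post.xml.
class BigPostSketch {

    /** Builds "big" padded with 'a' up to totalLength characters (never shorter than "big"). */
    static String longWord(int totalLength) {
        StringBuilder sb = new StringBuilder("big");
        while (sb.length() < totalLength) {
            sb.append('a');
        }
        return sb.toString();
    }

    /** Wraps the word in a Solr XML update message; the "id" field and its value are assumed. */
    static String addDocument(String word) {
        return "<add><doc>"
             + "<field name=\"id\">big_post</field>"
             + "<field name=\"body\">" + word + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) {
        // Roughly the ~34 KB field size mentioned earlier in the thread.
        String xml = addDocument(longWord(34 * 1024));
        System.out.println(xml.length() + " characters");
    }
}
```

Varying the argument to longWord makes it straightforward to find the length at which the document stops being returned.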
-
Re: How to index long words with StandardTokenizerFactory? Ahmet Arslan 2010-10-23, 13:45
Did you delete the folder Jetty_0_0_0_0_8983_solr.war_** under apache-solr-1.4.1\example\work?
-
Re: How to index long words with StandardTokenizerFactory? Sergey Bartunov 2010-10-23, 14:45
Yes, I did. It didn't help.
-
Re: How to index long words with StandardTokenizerFactory? Ahmet Arslan 2010-10-23, 14:53
I think you should put your new lucene-core-2.9.3-dev.jar in \apache-solr-1.4.1\lib, then create a new solr.war under \apache-solr-1.4.1\dist, and copy this new solr.war to solr/example/webapps/solr.war.
-
Re: How to index long words with StandardTokenizerFactory? Sergey Bartunov 2010-10-23, 15:01
This is exactly what I did. Look:

> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by my lucene-core-2.9.3-dev.jar that I'd just compiled
> 4) then I do "ant compile" and "ant dist" in solr folder
> 5) after that I recompile solr/example/webapps/solr.war
-
Re: How to index long words with StandardTokenizerFactory? Yonik Seeley 2010-10-23, 14:55
On Fri, Oct 22, 2010 at 12:07 PM, Sergey Bartunov <[EMAIL PROTECTED]> wrote:
> I'm trying to force solr to index words which length is more than 255

If the field is not a text field, Solr's default analyzer is used, which currently limits the token to 256 bytes.

Out of curiosity, what's your use case that you really need a single 34KB token?

-Yonik
http://www.lucidimagination.com
-
Re: How to index long words with StandardTokenizerFactory? Sergey Bartunov 2010-10-23, 15:00
Look at the schema.xml that I provided. I use my own "text_block" type, which is derived from "TextField", and I force the use of StandardTokenizerFactory with the tokenizer tag. If I use the StrField type, there are no problems indexing big data. The problem is in the tokenizer.
-
Re: How to index long words with StandardTokenizerFactory? Ahmet Arslan 2010-10-23, 21:29
Oops, I am sorry, I thought that solr/lib referred to solrhome/lib.

I just tested this, and it seems that you have successfully increased the max token length. You can verify this on the analysis.jsp page.

Despite analysis.jsp's output, it seems that some other mechanism is preventing this huge token from being indexed. The response of http://localhost:8983/solr/terms?terms.fl=body does not contain that huge token.

If you are interested only in prefix queries, as a workaround you can use <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> at index time. Then the query (without the star) solr/select?q=body:big will return that document.

By the way, for this particular task you don't need to edit the lucene/solr distro. You can use the class below with the standard pre-compiled solr.war by putting its jar into the SolrHome/lib directory.

    package foo.solr.analysis;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.solr.analysis.BaseTokenizerFactory;
    import java.io.Reader;

    public class CustomStandardTokenizerFactory extends BaseTokenizerFactory {
      public StandardTokenizer create(Reader input) {
        final StandardTokenizer tokenizer = new StandardTokenizer(input);
        // raise the limit so that long tokens are not skipped by the tokenizer
        tokenizer.setMaxTokenLength(Integer.MAX_VALUE);
        return tokenizer;
      }
    }

    <fieldType name="text_block" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="foo.solr.analysis.CustomStandardTokenizerFactory" />
      </analyzer>
    </fieldType>
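The prefix-query workaround can be illustrated with a small sketch of what edge n-gram expansion produces for one token. This is not Solr's EdgeNGramFilterFactory implementation. The class EdgeNGramSketch and its method are hypothetical, and the 40000-character word is an assumed stand-in for the one in big_post.xml. Only the values minGramSize=1 and maxGramSize=25 come from the message.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the effect of front edge n-gram expansion on a
// single token. It does not use or reproduce Solr's filter code.
class EdgeNGramSketch {

    /** Returns the leading substrings of token with lengths minGram..maxGram (inclusive). */
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Assumed stand-in for the long word: "big" followed by many 'a's.
        StringBuilder word = new StringBuilder("big");
        for (int i = 0; i < 40000; i++) {
            word.append('a');
        }

        List<String> grams = edgeNGrams(word.toString(), 1, 25);
        System.out.println(grams.size() + " grams, longest has "
                + grams.get(grams.size() - 1).length() + " chars");
        System.out.println("contains \"big\": " + grams.contains("big"));
    }
}
```

Under this reading, the index holds only short prefixes such as "b", "bi" and "big" instead of the huge token, so the plain term query body:big can match without a wildcard. The trade-off is that only prefixes up to maxGramSize characters are searchable.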
-
Re: How to index long words with StandardTokenizerFactory?Sergey Bartunov 2010-10-24, 14:47
I did it just as you recommended. Solr indexes files around 15kb, but
no more. The same effect was with patched constants On 24 October 2010 01:29, Ahmet Arslan <[EMAIL PROTECTED]> wrote: > Ops I am sorry, I thought that solr/lib refers to solrhome/lib. > > I just tested this and it seems that you have successfully increased the max token length. You can verify this by analysis.jsp page. > > Although analysis.jsp's output, it seems that some other mechanism is preventing this huge token to be indexed. Response of http://localhost:8983/solr/terms?terms.fl=body > does not have that huge token. > > If you are interested in only prefix queries, as a workaround, you can use <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> at index time. So the query (without star) > solr/select?q=body:big will return that document. > > By the way for this particular task you don't need to edit lucene/solr disto. You can use this class for this with standard pre-compiled solr.war. > By putting jar into SolrHome/lib directory. > > package foo.solr.analysis; > > import org.apache.lucene.analysis.standard.StandardTokenizer; > import org.apache.solr.analysis.BaseTokenizerFactory; > import java.io.Reader; > > > public class CustomStandardTokenizerFactory extends BaseTokenizerFactory { > public StandardTokenizer create(Reader input) { > final StandardTokenizer tokenizer = new StandardTokenizer(input); > tokenizer.setMaxTokenLength(Integer.MAX_VALUE); > return tokenizer; > } > } > > <fieldType name="text_block" class="solr.TextField" positionIncrementGap="100"> > <analyzer> > <tokenizer class="foo.solr.analysis.CustomStandardTokenizerFactory" /> > </analyzer> > </fieldType> > > --- On Sat, 10/23/10, Sergey Bartunov <[EMAIL PROTECTED]> wrote: > >> From: Sergey Bartunov <[EMAIL PROTECTED]> >> Subject: Re: How to index long words with StandardTokenizerFactory? >> To: [EMAIL PROTECTED] >> Date: Saturday, October 23, 2010, 6:01 PM >> This is exactly what I did. 
>> Look:
>> >> >> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by my
>> >> >> lucene-core-2.9.3-dev.jar that I'd just compiled
>> >> >> 4) than I do "ant compile" and "ant dist" in solr folder
>> >> >> 5) after that I recompile solr/example/webapps/solr.war
>>
>> On 23 October 2010 18:53, Ahmet Arslan <[EMAIL PROTECTED]> wrote:
>> > I think you should replace your new lucene-core-2.9.3-dev.jar in \apache-solr-1.4.1\lib and then create a new solr.war under \apache-solr-1.4.1\dist. And copy this new solr.war to solr/example/webapps/solr.war
>> >
>> > --- On Sat, 10/23/10, Sergey Bartunov <[EMAIL PROTECTED]> wrote:
>> >
>> >> From: Sergey Bartunov <[EMAIL PROTECTED]>
>> >> Subject: Re: How to index long words with StandardTokenizerFactory?
>> >> To: [EMAIL PROTECTED]
>> >> Date: Saturday, October 23, 2010, 5:45 PM
>> >> Yes. I did. Won't help.
>> >>
>> >> On 23 October 2010 17:45, Ahmet Arslan <[EMAIL PROTECTED]> wrote:
>> >> > Did you delete the folder Jetty_0_0_0_0_8983_solr.war_** under apache-solr-1.4.1\example\work?
>> >> >
>> >> > --- On Sat, 10/23/10, Sergey Bartunov <[EMAIL PROTECTED]> wrote:
>> >> >
>> >> >> From: Sergey Bartunov <[EMAIL PROTECTED]>
>> >> >> Subject: Re: How to index long words with StandardTokenizerFactory?
>> >> >> To: [EMAIL PROTECTED]
>> >> >> Date: Saturday, October 23, 2010, 3:56 PM
>> >> >> Here are all the files: http://rghost.net/3016862
>> >> >>
>> >> >> 1) StandardAnalyzer.java, StandardTokenizer.java - patched files from lucene-2.9.3
>> >> >> 2) I patch these files and build lucene by typing "ant"
>> >> >> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by my lucene-core-2.9.3-dev.jar that I'd just compiled
>> >> >> 4) than I do "ant compile" and "ant dist"
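
For readers following the EdgeNGramFilterFactory workaround quoted in this message, here is a minimal stand-alone sketch of what front-edge n-grams look like. It is not Solr's implementation: the class name EdgeNGramSketch and the method edgeNGrams are hypothetical, and the sketch works on a single string rather than a token stream. It only shows why a plain query such as body:big can match a very long token once its prefixes of length minGramSize to maxGramSize have been indexed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Solr's EdgeNGramFilterFactory code.
final class EdgeNGramSketch {

    // Returns the prefixes of `token` whose lengths run from minGram up to
    // maxGram (or up to the token length, whichever is smaller).
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Build a long token that starts with "big".
        StringBuilder sb = new StringBuilder("big");
        for (int i = 0; i < 20000; i++) {
            sb.append('x');
        }
        List<String> grams = edgeNGrams(sb.toString(), 1, 25);
        // Only short prefixes are produced, however long the token is.
        System.out.println("number of grams: " + grams.size());
        System.out.println("contains \"big\": " + grams.contains("big"));
    }
}
```

The trade-off is that only prefixes up to maxGramSize are searchable; the full long token itself is still not indexed as a term.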
-
Re: How to index long words with StandardTokenizerFactory?Yonik Seeley 2010-10-24, 15:19
On Sun, Oct 24, 2010 at 10:47 AM, Sergey Bartunov <[EMAIL PROTECTED]> wrote:
> I did it just as you recommended. Solr indexes files around 15kb, but
> no more. The same effect was with patched constants

Lucene also has max token sizes it can index.
IIRC, lengths used to be stored inline with the char data, and a
single char was used for the length.

The bigger question: Is this a problem for you (do you actually have a
use case)?

-Yonik
http://www.lucidimagination.com
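
To illustrate the point about a length stored inline in a single char, here is a minimal sketch. It is not Lucene's storage code: the class LengthPrefixSketch and its methods are hypothetical and exist only to show why such a layout puts a hard ceiling on term length. A Java char is an unsigned 16-bit value, so one char can describe at most 65535 following chars. The actual limit in a given Lucene version may be lower and should be checked in that version's source.

```java
// Hypothetical illustration only; this is NOT how Lucene actually lays out terms.
final class LengthPrefixSketch {

    // Largest length that fits in one char (unsigned 16 bits): 65535.
    static final int MAX_LENGTH = Character.MAX_VALUE;

    // Encodes a term as [length char][term chars...].
    static char[] encode(String term) {
        if (term.length() > MAX_LENGTH) {
            throw new IllegalArgumentException("term too long: " + term.length());
        }
        char[] out = new char[term.length() + 1];
        out[0] = (char) term.length();           // length stored inline
        term.getChars(0, term.length(), out, 1); // char data follows
        return out;
    }

    // Reads the length from the first char, then that many chars of data.
    static String decode(char[] buf) {
        int len = buf[0];
        return new String(buf, 1, len);
    }

    // Helper: builds a string of `count` copies of `c`.
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(encode("big")));
        try {
            encode(repeat('x', MAX_LENGTH + 1));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Raising the tokenizer's maxTokenLength does not change a limit of this kind, because the tokenizer and the index writer enforce their limits independently.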
-
Re: How to index long words with StandardTokenizerFactory?Sergey Bartunov 2010-10-24, 15:29
It's a kind of research. There is no particular practical use case as
far as I know.
Do you know how to set all these max token lengths?

On 24 October 2010 19:19, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Sun, Oct 24, 2010 at 10:47 AM, Sergey Bartunov <[EMAIL PROTECTED]> wrote:
>> I did it just as you recommended. Solr indexes files around 15kb, but
>> no more. The same effect was with patched constants
>
> Lucene also has max token sizes it can index.
> IIRC, lengths used to be stored inline with the char data, and a
> single char was used for the length.
>
> The bigger question: Is this a problem for you (do you actually have a
> use case)?
>
> -Yonik
> http://www.lucidimagination.com
-
Re: How to index long words with StandardTokenizerFactory?Yonik Seeley 2010-10-24, 16:02
On Sun, Oct 24, 2010 at 11:29 AM, Sergey Bartunov <[EMAIL PROTECTED]> wrote:
> It's a kind of research. There is no particular practical use case as
> far as I know.
> Do you know how to set all these max token lengths?

It's a practical limit given how things are coded, not an arbitrary one.
Given the lack of use cases, it would be a mistake to complicate the
code or make it less performant trying to support a larger limit.

-Yonik
http://www.lucidimagination.com