-
who clears attributes? (Yonik Seeley, 2009-08-10, 16:01)

CharTokenizer.incrementToken() clears *all* attributes in the entire
tokenizer chain.
StandardTokenizer.incrementToken() clears only the term attribute.

So... which is right? Seems like the tokenizer should be responsible?

On a performance related note, CharTokenizer.clearAttributes() could be
more efficient - 2 new objects (the unmodifiable map and the iterator
object) are created for every incrementToken.

-Yonik
http://www.lucidimagination.com
-
RE: who clears attributes? (Uwe Schindler, 2009-08-10, 16:44)

I already removed the unmodifiable iterator, so one new instance is removed
(see the JIRA issue). But you are right, the CharTokenizer should only clear
the TermAttribute, as it is only using this attribute.

-----
Uwe Schindler
http://www.thetaphi.de
-
Re: who clears attributes? (Yonik Seeley, 2009-08-10, 16:55)

On Mon, Aug 10, 2009 at 12:44 PM, Uwe Schindler wrote:
> the CharTokenizer should only clear the TermAttribute, as it is only
> using this attribute.

It's certainly not clear to me - is there an established convention?
Either Tokenizer clears all attributes, or each tokenizer clears those
attributes it cares about. But in the latter case, wouldn't that
potentially cause multiple TokenFilters to clear the same attribute?

-Yonik
http://www.lucidimagination.com
-
Re: who clears attributes? (Yonik Seeley, 2009-08-10, 16:57)

> , or each tokenizer

should read "or each Tokenizer or TokenFilter"
-
RE: who clears attributes? (Uwe Schindler, 2009-08-10, 17:23)

> On Mon, Aug 10, 2009 at 12:44 PM, Uwe Schindler wrote:
> > the CharTokenizer should only clear the TermAttribute, as it is only
> > using this attribute.

I changed this in the latest patch for
https://issues.apache.org/jira/browse/LUCENE-1796

> It's certainly not clear to me - is there an established convention?
> Either Tokenizer clears all attributes, or each tokenizer clears those
> attributes it cares about. But in the latter case, wouldn't that
> potentially cause multiple TokenFilters to clear the same attribute?

Clearing attributes in TokenFilters is not the best approach. The problem
is that calling clear() on an AttributeImpl may clear more than the directly
referenced values: the multi-attribute implementations currently used, like
Token/TokenWrapper, always clear all 6 standard attributes. Because of this,
I would only clear attributes in TokenStream/Tokenizer, but then by default
for all Tokenizers. Maybe we should implement this. The problem with that is
still the iterator creation, but I have no better solution, as Maps only
work with iterators for enumerating values... :(

Uwe
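The point about multi-attribute implementations can be illustrated with a minimal sketch in plain Java. It does not use the Lucene classes: `TermAttr`, `PayloadAttr`, `CombinedToken` and `ClearDemo` are hypothetical stand-ins (for TermAttribute, PayloadAttribute and Token), and only two attributes are modeled instead of six.

```java
// Hypothetical stand-ins, not Lucene's real interfaces.
interface TermAttr { String term(); void setTerm(String t); void clear(); }
interface PayloadAttr { byte[] payload(); void setPayload(byte[] p); void clear(); }

// One object backs both attributes, in the way the thread describes Token
// backing all standard attributes.
class CombinedToken implements TermAttr, PayloadAttr {
    private String term = "";
    private byte[] payload;

    public String term() { return term; }
    public void setTerm(String t) { term = t; }
    public byte[] payload() { return payload; }
    public void setPayload(byte[] p) { payload = p; }

    // clear() cannot tell through which interface it was called,
    // so it resets every value the object holds.
    public void clear() { term = ""; payload = null; }
}

class ClearDemo {
    // A filter that only "owns" the term attribute and clears just that view.
    static void clearTermOnly(TermAttr termAtt) { termAtt.clear(); }

    public static void main(String[] args) {
        CombinedToken token = new CombinedToken();
        PayloadAttr payloadAtt = token;          // same instance, other view
        payloadAtt.setPayload(new byte[] {1});
        clearTermOnly(token);
        System.out.println(payloadAtt.payload() == null); // prints true
    }
}
```

The filter meant to reset only the term, but the payload set through the other view is gone as well, which is why clearing inside filters is hard to reason about.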
-
Re: who clears attributes? (Yonik Seeley, 2009-08-10, 17:42)

Thinking through this a little more, I don't see an alternative to the
tokenizer clearing all attributes at the start of incrementToken().

Consider a DefaultPayloadTokenFilter that only sets a payload if one
isn't already set - it's clear that this filter can't clear the
payload attribute, so it must be cleared by the head of the chain -
the tokenizer. Right?

-Yonik
http://www.lucidimagination.com
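The DefaultPayloadTokenFilter argument can be sketched in plain Java under some loud assumptions: `Attrs`, `Stream`, `SketchTokenizer`, `MarkerFilter`, `DefaultPayloadFilter` and `PayloadDemo` are all hypothetical names, payloads are plain strings, and nothing here is the Lucene API. The `clearAll` flag switches the tokenizer between clearing everything and clearing only the term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shared attribute state for one analysis chain.
class Attrs {
    String term;
    String payload;
    void clearAll() { term = null; payload = null; }
}

interface Stream { boolean incrementToken(); }

// Head of the chain: emits fixed terms, clearing either all state or only the term.
class SketchTokenizer implements Stream {
    private final Attrs attrs;
    private final boolean clearAll;
    private final String[] terms;
    private int pos = 0;

    SketchTokenizer(Attrs attrs, boolean clearAll, String... terms) {
        this.attrs = attrs; this.clearAll = clearAll; this.terms = terms;
    }

    public boolean incrementToken() {
        if (pos >= terms.length) return false;
        if (clearAll) attrs.clearAll(); else attrs.term = null;
        attrs.term = terms[pos++];
        return true;
    }
}

// Sets a payload only for the term "a".
class MarkerFilter implements Stream {
    private final Attrs attrs; private final Stream in;
    MarkerFilter(Attrs attrs, Stream in) { this.attrs = attrs; this.in = in; }
    public boolean incrementToken() {
        if (!in.incrementToken()) return false;
        if ("a".equals(attrs.term)) attrs.payload = "special";
        return true;
    }
}

// Sets a default payload only when none is present, so it cannot clear the payload itself.
class DefaultPayloadFilter implements Stream {
    private final Attrs attrs; private final Stream in;
    DefaultPayloadFilter(Attrs attrs, Stream in) { this.attrs = attrs; this.in = in; }
    public boolean incrementToken() {
        if (!in.incrementToken()) return false;
        if (attrs.payload == null) attrs.payload = "default";
        return true;
    }
}

class PayloadDemo {
    static List<String> run(boolean clearAll) {
        Attrs attrs = new Attrs();
        Stream chain = new DefaultPayloadFilter(attrs,
                new MarkerFilter(attrs, new SketchTokenizer(attrs, clearAll, "a", "b")));
        List<String> out = new ArrayList<String>();
        while (chain.incrementToken()) out.add(attrs.term + "/" + attrs.payload);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // prints [a/special, b/default]
        System.out.println(run(false)); // prints [a/special, b/special]
    }
}
```

With term-only clearing, the payload of "a" leaks into "b", and the default filter has no way to notice.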
-
Re: who clears attributes? (Michael Busch, 2009-08-10, 17:51)

Clearing the attributes should be required in those places where we
cleared (or reinit'ed) Token previously, right?

Michael
-
RE: who clears attributes? (Uwe Schindler, 2009-08-10, 17:50)

Yes. Is there a way to enforce this for all Tokenizers automatically? As
incrementToken() will be abstract in 3.0, there cannot be a default impl. So
all Tokenizers should call clearAttributes() as the first call in
incrementToken().

Then we still have the problem of the slow iterator creation (which was
sped up a little bit by removing the unmodifiable wrapper). This can be
solved by using an additional ArrayList in AttributeSource that gets all
AttributeImpl instances, but this would bring an additional initialization
cost when creating the Tokenizer chain.

-----
Uwe Schindler
http://www.thetaphi.de
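The ArrayList idea can be sketched as follows. This is an assumption-laden illustration, not the AttributeSource code from the patch: `Clearable`, `OffsetAttr`, `SketchAttributeSource` and `ClearAllDemo` are hypothetical names. The map is kept for lookup by type, while a parallel list allows an indexed loop that allocates no iterator per token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an attribute implementation; only clear() matters here.
interface Clearable { void clear(); }

class OffsetAttr implements Clearable {
    int start, end;
    public void clear() { start = 0; end = 0; }
}

class SketchAttributeSource {
    // Lookup by attribute type, as a map-based source would do.
    private final Map<Class<?>, Clearable> byType = new LinkedHashMap<Class<?>, Clearable>();
    // Extra list filled once per registration; this is the added setup cost.
    private final List<Clearable> impls = new ArrayList<Clearable>();

    <T extends Clearable> T addAttribute(Class<T> type, T fresh) {
        Clearable existing = byType.get(type);
        if (existing != null) return type.cast(existing);
        byType.put(type, fresh);
        impls.add(fresh);
        return fresh;
    }

    // Called once per token: indexed access, no Iterator object created.
    void clearAttributes() {
        for (int i = 0, n = impls.size(); i < n; i++) {
            impls.get(i).clear();
        }
    }
}

class ClearAllDemo {
    public static void main(String[] args) {
        SketchAttributeSource source = new SketchAttributeSource();
        OffsetAttr offsets = source.addAttribute(OffsetAttr.class, new OffsetAttr());
        offsets.start = 3;
        offsets.end = 8;
        source.clearAttributes();
        System.out.println(offsets.start + "," + offsets.end); // prints 0,0
    }
}
```

The trade-off is the one named in the message: a little more work when the chain is built, in exchange for less allocation on every incrementToken() call.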
-
Re: who clears attributes? (Earwin Burrfoot, 2009-08-10, 18:00)

I'll deviate from the topic somewhat.
What are exact benefits that new tokenstream API yields? Are we sure we
want it released with 2.9?
By now I only see various elaborate problems, but haven't seen a single
piece of code becoming simpler.

--
Kirill Zakharenko/Кирилл Захаренко
-
Re: who clears attributes? (Grant Ingersoll, 2009-08-10, 18:50)

On Aug 10, 2009, at 2:00 PM, Earwin Burrfoot wrote:
> I'll deviate from the topic somewhat.
> What are exact benefits that new tokenstream API yields? Are we sure
> we want it released with 2.9?
> By now I only see various elaborate problems, but haven't seen a
> single piece of code becoming simpler.

In theory, it sets up for more indexing/searching possibilities in 3.0, but
in the meantime, it is proving to be quite problematic due to back
compatibility restrictions.

I have serious doubts about releasing this new API until these performance
issues are resolved and better proven out from a usability standpoint.
It simply is too much to swallow for most users, as
Analyzers/TokenStreams/etc. are easily the most common place for people to
inject their own capabilities and there is no way we should be taking a 30%
hit in performance for some theoretical speed up and new search capability
1 year from now.

I'm almost thinking we should have a 2.5 release instead of 2.9. I know,
that stinks, because we all want to get onto 3.0, but the fact is, 2.9 was
_SUPPOSED_ to be a deprecation release, when in reality it probably has as
many changes as 2.3 did and it has a lot of back compatibility breakages.
Going to 2.5 would give this token stuff a chance to marinate, as well as
all the per segment changes and the NRT stuff. Just a thought.

-Grant
-
Re: who clears attributes? (Mark Miller, 2009-08-10, 19:00)

Grant Ingersoll wrote:
> 2.9 was _SUPPOSED_ to be a deprecation release,

What's a deprecation release? We deprecate stuff in every release ... does
it make sense to do a release just to deprecate anything we might not have
yet? And if you add deprecations, wouldn't you add features to move to?

I'm not a fan of 3.0 just being 2.9 with deprecations removed either. Why
not add new features as well? Sure, we should be *way* more careful about
breaking back compat there, but who cares if a few features are introduced?
Doing a release takes a lot of project steam - why waste it?

--
- Mark
http://www.lucidimagination.com
-
Re: who clears attributes? (Michael Busch, 2009-08-10, 19:06)

I think we should change the backwards-compatibility policy as proposed
in LUCENE-1698 and remove some deprecated things (including the old
TokenStream API, maybe query parser) in 3.1, not 3.0.
I don't think we should have a 2.5 release - this clearly shows the
disadvantages of our current bw-policy.

Michael
-
Re: who clears attributes? (Mark Miller, 2009-08-10, 19:29)

Michael Busch wrote:
> I think we should change the backwards-compatibility policy as
> proposed in LUCENE-1698 and remove some deprecated things (including
> the old TokenStream API, maybe query parser) in 3.1, not 3.0.
> I don't think we should have a 2.5 release - this clearly shows the
> disadvantages of our current bw-policy.

I think the only advantage to that policy is to save major number space (it
will take us longer to get to Lucene 10) - and the disadvantages are laid
out in the comments. If we find we have a lot we need to remove after 3.0,
jumping to Lucene 4 makes the most sense to me. I still like the idea of at
least *attempting* back compat between major versions - it's much more
intuitive than the every-other-minor stuff.

--
- Mark
http://www.lucidimagination.com
-
Re: who clears attributes? (Earwin Burrfoot, 2009-08-10, 19:47)

On Mon, Aug 10, 2009 at 22:50, Grant Ingersoll wrote:
> In theory, it sets up for more indexing/searching possibilities in 3.0, but
> in the meantime, it is proving to be quite problematic due to back
> compatibility restrictions.

I'm not quite sure which exact indexing/searching possibilities the new API
opens for us.

Some new ways of handling text? Okay, I'd like each token to have one more
number in addition to posIncr, so I can have my 'true multiword synonyms'.
Maybe, just maybe, there will be a couple of other extensions. Use cases
here are really scarce. Plus, if they're successful/useful, they will most
probably be included out of the box, so we don't need much flexibility here.

Something other than text? Numbers, with good range queries. Dates. Spatial
data. Your-type-here. For these, a flexible text-processing stream-oriented
API is totally useless.

> I have serious doubts about releasing this new API until these performance
> issues are resolved and better proven out from a usability standpoint.
> It simply is too much to swallow for most users, as
> Analyzers/TokenStreams/etc. are easily the most common place for people to
> inject their own capabilities and there is no way we should be
> taking a 30% hit in performance for some theoretical speed up and new search
> capability 1 year from now.

I have a feeling that the best idea, before more damage is done, is to roll
back this new API, store the patch, and try rolling it out once again when
we have use cases/more code to justify it.

--
Kirill Zakharenko/Кирилл Захаренко
-
RE: who clears attributes? (Uwe Schindler, 2009-08-10, 19:52)

Hi Grant,

> I have serious doubts about releasing this new API until these
> performance issues are resolved and better proven out from a usability
> standpoint.

I think LUCENE-1796 has fixed the performance problems, which were caused by
a missing reflection cache needed for bw compatibility. I hope to commit
soon!

2.9 may be a little bit slower when you mix old and new API and do not reuse
Tokenizers (but Robert is already adding reusableTokenStream to all contrib
analyzers). When the backwards layer is removed completely or
setOnlyUseNewAPI is enabled, there is no speed impact at all.

Michael: The added cost of TokenWrapper was there in 2.9 before the
TokenStream overhaul, too, as TokenWrapper-like code was implemented
similarly inside DocInverter.

Uwe
-
Re: who clears attributes? (Michael Busch, 2009-08-10, 19:57)

On 8/10/09 12:52 PM, Uwe Schindler wrote:
> Michael: The TokenWrapper added cost was there in 2.9 before the TokenStream
> overhaul, too, as the TokenWrapper-like code was there implemented
> similarly inside DocInverter.

You're right. It will only be more costly in case you mix multiple old
and new TokenStreams in a chain. Then the delegation is done more than
once.

Michael
-
RE: who clears attributes? (Uwe Schindler, 2009-08-10, 20:02)

But TokenWrapper is used there every time; it is not used for delegating,
only for exchanging the inner Token instance. The delegation costs are there
because a Filter implementing the old API in front of a new-API Tokenizer
would need to be wrapped 2 times:

DocInverter
 -> oldAPIFilter.incrementToken() [bw layer]
 -> oldAPIFilter.next(Token) [native old-style impl]
 -> newAPIFilter.next(Token) [bw-layer]
 -> newAPIFilter.incrementToken() [native new-style impl]

If both filters implemented only the new API, there would be direct calls
from the filter to the input TokenStream. If all streams/filters implemented
only the old API, the bw-delegation would only be used for the
incrementToken() calls from DocInverter.

-----
Uwe Schindler
http://www.thetaphi.de
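The call chain described above can be reproduced with a small plain-Java model. It is only a sketch: `Tok`, `SketchStream`, `NewApiTokenizer`, `OldApiFilter` and `DelegationDemo` are hypothetical names, the two default methods simply bridge to each other through ordinary overriding, and the reflection-based detection and TokenWrapper mentioned in the thread are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the token / attribute state shared by the chain.
class Tok { String term; }

abstract class SketchStream {
    final String name;
    final Tok shared;
    final List<String> trace;

    SketchStream(String name, Tok shared, List<String> trace) {
        this.name = name; this.shared = shared; this.trace = trace;
    }

    // New-style entry point; the default bridges to the old API.
    boolean incrementToken() {
        trace.add(name + ".incrementToken() [bw layer]");
        return next(shared) != null;
    }

    // Old-style entry point; the default bridges to the new API.
    Tok next(Tok reusable) {
        trace.add(name + ".next(Token) [bw layer]");
        return incrementToken() ? shared : null;
    }
}

// Implements only the new API and emits a single token.
class NewApiTokenizer extends SketchStream {
    private boolean done = false;
    NewApiTokenizer(Tok shared, List<String> trace) { super("newApiTokenizer", shared, trace); }

    @Override boolean incrementToken() {
        trace.add(name + ".incrementToken() [native]");
        if (done) return false;
        done = true;
        shared.term = "hello";
        return true;
    }
}

// Implements only the old API and upper-cases each term.
class OldApiFilter extends SketchStream {
    private final SketchStream input;
    OldApiFilter(SketchStream input, Tok shared, List<String> trace) {
        super("oldApiFilter", shared, trace);
        this.input = input;
    }

    @Override Tok next(Tok reusable) {
        trace.add(name + ".next(Token) [native]");
        Tok t = input.next(reusable);
        if (t != null) t.term = t.term.toUpperCase();
        return t;
    }
}

class DelegationDemo {
    static List<String> traceOneCall() {
        List<String> trace = new ArrayList<String>();
        Tok shared = new Tok();
        SketchStream chain = new OldApiFilter(new NewApiTokenizer(shared, trace), shared, trace);
        chain.incrementToken(); // the call the indexer would make
        return trace;
    }

    public static void main(String[] args) {
        for (String step : traceOneCall()) System.out.println(step);
    }
}
```

One consumer call produces four recorded steps, two of them bridge hops; if both classes overrode incrementToken(), the filter would call its input directly and the bridge hops would disappear.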
-
Re: who clears attributes? (Michael Busch, 2009-08-10, 20:08)

On 8/10/09 1:02 PM, Uwe Schindler wrote:
> If both filters would only implement new API there would be direct calls
> from the filter to the input TokenStream. If all streams/filters would
> implement only the old API, the bw-delegation would only be used for the
> incrementToken() calls from DocInverter.

True. It also seems like the delegation costs are not very expensive.

Michael
-
RE: who clears attributes? (Uwe Schindler, 2009-08-10, 20:13)

I think they are optimized away by the JRE...

The figure from Mark does not have TokenWrapper hot spots in it; only
TokenWrapper.termLength() is mentioned, but this is because
Token.termLength() is called often and takes the same time (so the
TokenWrapper time is equal to the inner Token call). A lot of code in the
next()/next(Token)/incrementToken() default impls uses final variables, so
the delegation can simply be removed by the compiler. :-)

-----
Uwe Schindler
http://www.thetaphi.de
-
Re: who clears attributes? (Grant Ingersoll, 2009-08-10, 20:39)

On Aug 10, 2009, at 3:52 PM, Uwe Schindler wrote:
> I think LUCENE-1796 has fixed the performance problems, which was
> caused by a missing reflection-cache needed for bw compatibility. I hope
> to commit soon!
>
> 2.9 may be a little bit slower when you mix old and new API and do not
> reuse Tokenizers (but Robert is already adding reusableTokenStream to all
> contrib analyzers). When the backwards layer is removed completely or
> setOnlyUseNewAPI is enabled, there is no speed impact at all.

The Analysis features of Lucene are the single most common place where
people enhance Lucene. Very few add queries, or muck with field caches,
but they do write their own Analyzers and TokenStreams, etc. Within that,
mixing old and new is likely the most common case for everyone who has made
their own customizations, so a "little bit slower" is something I'd rather
not live with just for the sake of some supposed goodness in a year or two.

-Grant
-
RE: who clears attributes? (Uwe Schindler, 2009-08-10, 20:54)

> The Analysis features of Lucene are the single most common place where
> people enhance Lucene. Very few add queries, or muck with field
> caches, but they do write their own Analyzers and TokenStreams,
> etc. Within that, mixing old and new is likely the most common case
> for everyone who has made their own customizations, so a "little bit
> slower" is something I'd rather not live with just for the sake of
> some supposed goodness in a year or two.

But because of this flexibility, we added the backwards layer. The old style
with setUseNewAPI was not flexible at all, and nobody would move their
Tokenizers to the new API without that flexibility (maybe they use external
analyzer packages not yet updated).

With "a little bit" I mean the cost of wrapping the old and new API is
really minimal: it is just an if statement and a method call, hopefully
optimized away by the JVM. In my tests the standard deviation between
different test runs was much higher than the difference between mixing
old/new API (on Win32), so it is not really certain that the cost comes from
the delegation.

The only case that is really slower is when you do not reuse TokenStreams,
because of the creation cost in TokenStream.<init> (now minimized): two
LinkedHashMaps have to be created and set up. But this is not caused by the
backwards layer.

Uwe
-
Re: who clears attributes? (Earwin Burrfoot, 2009-08-10, 21:07)

Uwe, the problems I raised are still here - what is the benefit of moving to
this API right now? I see none. What is the future benefit of moving to this
API? It is very vague. Someone said this API is generic, but there are
different kinds of genericity. Are we sure we abstracted the right thing?
How will it be used? Where are the examples? Right now it is an exercise in
programming, which forces us to do more and more exercises. Very exciting,
very rewarding, but as of now - pointless.

--
Kirill Zakharenko/Кирилл Захаренко
-
Re: who clears attributes? (Shai Erera, 2009-08-10, 21:12)

It sounds like the 'old' API should stay a bit longer than 3.0. We'd like to
give more people a chance to experiment w/ the new API before we claim it is
the new Analysis API in Lucene. And that means that more users will have to
live w/ the "bit of slowness" longer than is assumed in this thread.

I personally worry much about needing to throw away the current API. I'll
have a lot of code to port over and I haven't read anything so far that
convinces me the new API is better. I don't have any problems w/ the current
API today. I feel I have all the flexibility I need w/ indexing fields. I
use payloads, Field.Index constants, write Analyzers, TokenStreams ...
actually I have 0 complaints.

Maybe we should follow what I seem to read from Earwin and Grant - come up
w/ real use cases, try to implement them w/ the current API, then if it's
impossible, discuss how we can make the current API more adaptive. If at the
end of this we'll get back to the new API, then we'll at least feel better
about it, and more convinced it is the way to go. Heck .. maybe we'll be
convinced to base the Lucene analysis on UIMA? :)

Shai
-
Re: who clears attributes? Grant Ingersoll 2009-08-10, 22:19
On Aug 10, 2009, at 5:12 PM, Shai Erera wrote: > > Maybe we should follow what I seem to read from Earwin and Grant - > come up w/ real use cases, try to implement them w/ the current API, > then if it's impossible, discuss how we can make the current API > more adaptive. If at the end of this we'll get back to the new API, > then we'll at least feel better about it, and more convinced it is > the way to go. Well, I have real use cases for it, but all of it is still missing the biggest piece: search side support. It's the 900 lb. elephant in the room. The 500 lb. elephant is the fact that all these attributes, AIUI, require you to hook in your own indexing chain, etc. in order to even be indexed, which is all package private stuff. It's not even clear to me what happens right now if you were to, say have a Token Stream that, say, had only one Attribute on it and none of the existing attributes (term buffer, length, position, etc.) Please correct me if I am wrong, I still don't have a deep understanding of it all. Honestly, though, it really gives you very little over the current, well functioning payloads capability other than stronger typing, the ability to pick only those attributes that you want indexed (in theory) and a byte (or so) of savings per any token that has a payload, and we _HAVE_ right now, search support for payloads. Oh, and now it seems the new QP is dependent on it all. +
Grant Ingersoll 2009-08-10, 22:19
-
Re: who clears attributes? Michael Busch 2009-08-10, 22:25
On 8/10/09 3:19 PM, Grant Ingersoll wrote:
> Oh, and now it seems the new QP is dependent on it all. The new QP uses Attributes for config settings, but doesn't require the TokenStream to be an AttributeSource. --------------------------------------------------------------------- +
Michael Busch 2009-08-10, 22:25
-
Re: who clears attributes? Mark Miller 2009-08-10, 22:28
Grant Ingersoll wrote:
> > On Aug 10, 2009, at 5:12 PM, Shai Erera wrote: >> >> Maybe we should follow what I seem to read from Earwin and Grant - >> come up w/ real use cases, try to implement them w/ the current API, >> then if it's impossible, discuss how we can make the current API more >> adaptive. If at the end of this we'll get back to the new API, then >> we'll at least feel better about it, and more convinced it is the way >> to go. > > Well, I have real use cases for it, but all of it is still missing the > biggest piece: search side support. It's the 900 lb. elephant in the > room. The 500 lb. elephant is the fact that all these attributes, > AIUI, require you to hook in your own indexing chain, etc. in order to > even be indexed, which is all package private stuff. It's not even > clear to me what happens right now if you were to, say have a Token > Stream that, say, had only one Attribute on it and none of the > existing attributes (term buffer, length, position, etc.) Please > correct me if I am wrong, I still don't have a deep understanding of > it all. Michael has always been up front that this new API is in preparation for flexible indexing. It doesn't give us the goodness - he has laid out the reasons for moving before the goodness comes more than once I think. From my understanding, Michael looked at what Mike was doing in one of his flexible indexing patches, wondered how some of the TokenStream stuff was going to work well with it, and came up with this new API as a solution. Yes - it gets us nothing now. But it's a big move, and there is no need to do everything at once - in fact it would probably be harder to do it all at once - the rest has always been on the table. 3.0 has always been convenient to push it before, as deprecations can then be removed. Nothing is forcing us to make that decision now though. 
> > Honestly, though, it really gives you very little over the current, > well functioning payloads capability other than stronger typing, the > ability to pick only those attributes that you want indexed (in > theory) and a byte (or so) of savings per any token that has a > payload, and we _HAVE_ right now, search support for payloads. Payloads give us nothing as developers - you can't use that functionality without taking it from the users - payloads are for users. Flexible indexing will lead to all kinds of little cool things - the likes of which have been discussed a lot in older emails. It will likely lead to things we cannot predict as well. Everything will be more flexible. It also could play a part in CSF, and work on allowing custom files to plug into merging. Plus everything else that's been mentioned (pfor, etc). I've been sold on the long term benefits. I don't think you need these APIs for them, but it's my understanding they help solve part of the equation. A bunch of issues have come up. To my knowledge, they have been addressed with vigor every time. If someone is unhappy with how something has been addressed, and it needs to be addressed further, please speak up. Otherwise, I don't think the sky is falling - I think the new API is being shaken out. > > Oh, and now it seems the new QP is dependent on it all. Dependent how? -- - Mark http://www.lucidimagination.com
Mark Miller 2009-08-10, 22:28
-
Re: who clears attributes? Grant Ingersoll 2009-08-11, 01:36
On Aug 10, 2009, at 6:28 PM, Mark Miller wrote: > Grant Ingersoll wrote: >> >> On Aug 10, 2009, at 5:12 PM, Shai Erera wrote: >>> >>> Maybe we should follow what I seem to read from Earwin and Grant - >>> come up w/ real use cases, try to implement them w/ the current >>> API, then if it's impossible, discuss how we can make the current >>> API more adaptive. If at the end of this we'll get back to the new >>> API, then we'll at least feel better about it, and more convinced >>> it is the way to go. >> >> Well, I have real use cases for it, but all of it is still missing >> the biggest piece: search side support. It's the 900 lb. elephant >> in the room. The 500 lb. elephant is the fact that all these >> attributes, AIUI, require you to hook in your own indexing chain, >> etc. in order to even be indexed, which is all package private >> stuff. It's not even clear to me what happens right now if you >> were to, say have a Token Stream that, say, had only one Attribute >> on it and none of the existing attributes (term buffer, length, >> position, etc.) Please correct me if I am wrong, I still don't >> have a deep understanding of it all. > Michael has always been up front that this new API is in preparation > for flexible indexing. It doesn't give us the goodness - he has laid > out the reasons for moving before the goodness comes more than once > I think. From my understanding, Michael looked at what Mike was > doing in one of his flexible indexing patches, wondered how some of > the TokenStream stuff was going to work well with it, and came up > with this new API as a solution. Yes - it gets us nothing now. But > its a big move, and there is no need to do everything at once - in > fact it would probably be harder to do it all at once - the rest has > always been on the table. 3.0 has always been convenient to push it > before, as deprecations can than be removed. Nothing forcing us to > make that decision now though. 
>> >> Honestly, though, it really gives you very little over the current, >> well functioning payloads capability other than stronger typing, >> the ability to pick only those attributes that you want indexed (in >> theory) and a byte (or so) of savings per any token that has a >> payload, and we _HAVE_ right now, search support for payloads. > Payloads gives us nothing as developers - you can't use that > functionality without taking it from the users - payloads are for > users. > > Flexible indexing will lead to all kinds of little cool things - the > likes of which have been discussed a lot in older emails. It will > likely lead to things we cannot predict as well. Everything will be > more flexible. It also could play a part in CSF, and work on > allowing custom files to plug into merging. Plus everything else > thats been mentioned (pfor, etc) I've been sold on the long term > benefits. I don't think you need these API for them, but its my > understanding it helps solve part of the equation. > > A bunch of issues have come up. To my knowledge, they have been > addressed with vigor every time. If someone is unhappy with how > something has been addressed, and it needs to be addressed further, > please speak up. Um, that's what I've been doing. Vigor is good. I very much appreciate everyone's work. From what I can tell, most devs here are unsure at best what to do with their existing Analyzer capabilities. I've actually implemented a couple of new TokenFilter's using the new APIs. I like that aspect of it. I'm just not sure on the back compat hoops (and yes, I asked for them). But I'm also operating under the assumption that our BC approach isn't going to change anytime soon, such that it is very important that these new capabilities are worked out (and I don't just mean little performance nicks here and there, I mean in terms of usability and performance). Let's put it this way: We expect to release 2.9 within the month (which is very short in Lucene time). 
That will give us a sum total of, what, 2.5 weeks of review by devs for some very major changes? I want 3.0 as much as anyone (I've been pushing for 1.5 support for at least 2 years now), but I don't want us to be in a hole going into it because we felt rushed right when the "finish" was so close. I agree it's not falling. It never is. This is in fact how the process works. People are doing the right thing here by discussing it and working on it. Attribute and a whole slew of AttributeImpls.
Grant Ingersoll 2009-08-11, 01:36
-
Re: who clears attributes? Mark Miller 2009-08-11, 01:49
Grant Ingersoll wrote:
> > On Aug 10, 2009, at 6:28 PM, Mark Miller wrote: > >> Grant Ingersoll wrote: >>> >>> On Aug 10, 2009, at 5:12 PM, Shai Erera wrote: >>>> >>>> Maybe we should follow what I seem to read from Earwin and Grant - >>>> come up w/ real use cases, try to implement them w/ the current >>>> API, then if it's impossible, discuss how we can make the current >>>> API more adaptive. If at the end of this we'll get back to the new >>>> API, then we'll at least feel better about it, and more convinced >>>> it is the way to go. >>> >>> Well, I have real use cases for it, but all of it is still missing >>> the biggest piece: search side support. It's the 900 lb. elephant >>> in the room. The 500 lb. elephant is the fact that all these >>> attributes, AIUI, require you to hook in your own indexing chain, >>> etc. in order to even be indexed, which is all package private >>> stuff. It's not even clear to me what happens right now if you >>> were to, say have a Token Stream that, say, had only one Attribute >>> on it and none of the existing attributes (term buffer, length, >>> position, etc.) Please correct me if I am wrong, I still don't have >>> a deep understanding of it all. >> Michael has always been up front that this new API is in preparation >> for flexible indexing. It doesn't give us the goodness - he has laid >> out the reasons for moving before the goodness comes more than once I >> think. From my understanding, Michael looked at what Mike was doing >> in one of his flexible indexing patches, wondered how some of the >> TokenStream stuff was going to work well with it, and came up with >> this new API as a solution. Yes - it gets us nothing now. But its a >> big move, and there is no need to do everything at once - in fact it >> would probably be harder to do it all at once - the rest has always >> been on the table. 3.0 has always been convenient to push it before, >> as deprecations can than be removed. Nothing forcing us to make that >> decision now though. 
>>> >>> Honestly, though, it really gives you very little over the current, >>> well functioning payloads capability other than stronger typing, the >>> ability to pick only those attributes that you want indexed (in >>> theory) and a byte (or so) of savings per any token that has a >>> payload, and we _HAVE_ right now, search support for payloads. >> Payloads gives us nothing as developers - you can't use that >> functionality without taking it from the users - payloads are for users. >> >> Flexible indexing will lead to all kinds of little cool things - the >> likes of which have been discussed a lot in older emails. It will >> likely lead to things we cannot predict as well. Everything will be >> more flexible. It also could play a part in CSF, and work on allowing >> custom files to plug into merging. Plus everything else thats been >> mentioned (pfor, etc) I've been sold on the long term benefits. I >> don't think you need these API for them, but its my understanding it >> helps solve part of the equation. >> >> A bunch of issues have come up. To my knowledge, they have been >> addressed with vigor every time. If someone is unhappy with how >> something has been addressed, and it needs to be addressed further, >> please speak up. > > Um, that's what I've been doing. Vigor is good. I very much > appreciate everyone's work. From what I can tell, most devs here are > unsure at best what to do with their existing Analyzer capabilities. > I've actually implemented a couple of new TokenFilter's using the new > APIs. I like that aspect of it. I'm just not sure on the back compat > hoops (and yes, I asked for them). But I'm also operating under the > assumption that our BC approach isn't going to change anytime soon, > such that it is very important that these new capabilities are worked > out (and I don't just mean little performance nicks here and there, I > mean in terms of usability and performance). 
I'm not responding just to you there, but more to the growing pack of those speaking against the new API. I don't see specific issues being brought up - the only issues I have seen brought up have been addressed in JIRA issues that have received no comments indicating the fix was not good enough. So we are seeing a lot of general complaints, but specific complaints have been addressed as far as I can tell. As far as back compat - is it really still considered an issue? We have broken back compat in this release wherever it was convenient to do so. I suspect that will continue. I just wish our policy reflected how things actually work (and I think they work as they should, based on the circumstances that led to each decision). That's kind of in response to the ground swell that appeared to be building to roll back or hold off on the new API. To me, we would do that if the sky was falling. As long as specific issues are being addressed (and the number of issues has not been that high), I just don't see a reason to hold off on the current plan. Oh, because it uses the Attributes. I think the new QueryParser is its own kettle of fish. It really shouldn't have a back compat promise while it lives in contrib. It needs to be shaken out before it could possibly replace the current parser. - Mark http://www.lucidimagination.com
Mark Miller 2009-08-11, 01:49
-
Re: who clears attributes? Michael Busch 2009-08-11, 08:28
> I'm not just responding to just you there, but more to the growing > pack of those speaking against the new API. I don't see specific > issues being brought up - the only issues I have seen brought up have > been addressed in JIRA issues that have received no comments > indicating the fix was not good enough. So we are seeing a lot of > general complaints, but specific complaints have been addressed as far > as I can tell. > Thanks Mark. Yeah, I'm really not sure what the problem here actually is now. There was a performance test in Solr that apparently ran much slower after upgrading to the new Lucene jar. This test is testing a rather uncommon scenario: very very short documents. Within one day - thanks to Uwe - we committed a patch that basically brings back the performance to where it was before. That is a pretty good turnaround time. And according to Robert's and Mark's performance tests Lucene trunk is now even a little bit faster than 2.4 was. This was not the first time we found and fixed a bug in Lucene and it won't be the last. > As far as back compat - is it really still considered an issue? We > have broken back compat in this release wherever it was convenient to > do so. I suspect that will continue. I just wish our policy reflected > how things actually work (and I think they work as they should, based > on the circumstances that lead to each decision). All backwards-compatibility problems we could think of were addressed and all possible use cases were tested regarding backwards-compatibility. In LUCENE-1693 you can find the many iterations Uwe and I had about this. All current unit tests pass. All contrib tests pass. All backwards-compatibility tests from the 2.4 tag pass as well. This is probably one of the best-tested additions to Lucene in terms of backwards-compatibility we've had in a while. Michael
Michael Busch 2009-08-11, 08:28
-
Re: who clears attributes? Robert Muir 2009-08-11, 10:50
On Tue, Aug 11, 2009 at 4:28 AM, Michael Busch<[EMAIL PROTECTED]> wrote:
> There was a performance test in Solr that apparently ran much slower > after upgrading to the new Lucene jar. This test is testing a rather > uncommon scenario: very very short documents. Actually, it's more uncommon than that: it's very very short documents, without implementing reusableTokenStream(). This makes it basically a benchmark of ctor cost... it doesn't really benchmark the token API in my opinion. We should do some better benchmarks, but in most cases things appear to be the same to me. It is only in this case, where you have very very short documents but don't implement reuse, that there is any difference, and now it is minor. -- Robert Muir [EMAIL PROTECTED]
Robert Muir 2009-08-11, 10:50
-
Re: who clears attributes? Yonik Seeley 2009-08-11, 11:09
On Tue, Aug 11, 2009 at 6:50 AM, Robert Muir<[EMAIL PROTECTED]> wrote:
> On Tue, Aug 11, 2009 at 4:28 AM, Michael Busch<[EMAIL PROTECTED]> wrote: >> There was a performance test in Solr that apparently ran much slower >> after upgrading to the new Lucene jar. This test is testing a rather >> uncommon scenario: very very short documents. > > Actually, its more uncommon than that: its very very short documents, > without implementing reusableTokenStream() > this makes it basically a benchmark of ctor cost... doesn't really > benchmark the token api in my opinion. You would be surprised... there are quite a few Solr users that have relatively short documents... or even if they are sizeable documents, they have up to hundreds of short metadata-type fields (generally a token or two). Reusing TokenStreams has become a must in Solr IMO since construction costs (hashmap lookups, etc) and GC costs (larger objects) have been growing. I'm focused on that now... Robert's taking a crack at fixing things up so users can actually create reusable analyzers out of our filters: https://issues.apache.org/jira/browse/LUCENE-1794 -Yonik http://www.lucidimagination.com
Yonik Seeley 2009-08-11, 11:09
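Editor's note: the construction-cost point made above for short documents and many short metadata fields can be illustrated with a minimal sketch. This is a toy model, not Lucene code: `ToyTokenizer`, `ToyAnalyzer` and `ReuseDemo` are hypothetical names, and the two analyzer methods only borrow the names `tokenStream` and `reusableTokenStream` from the thread. It assumes the per-stream setup cost can be represented by counting constructor calls.

```java
// Toy model only: these classes are hypothetical and are not Lucene's API.
final class ToyTokenizer {
    static int constructed = 0;   // counts how often the setup cost is paid
    private String input = "";
    private int pos = 0;

    ToyTokenizer() {
        constructed++;            // stands in for allocating maps, buffers, etc.
    }

    // Re-point an existing instance at new input instead of building a new one.
    void reset(String newInput) {
        input = newInput;
        pos = 0;
    }

    // Returns the next whitespace-delimited token, or null at end of input.
    String next() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        if (pos >= input.length()) {
            return null;
        }
        int start = pos;
        while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        return input.substring(start, pos);
    }
}

final class ToyAnalyzer {
    // One cached tokenizer per thread, created lazily on first use.
    private final ThreadLocal<ToyTokenizer> cached = new ThreadLocal<>();

    // Non-reusing path: a fresh tokenizer for every field value.
    ToyTokenizer tokenStream(String text) {
        ToyTokenizer t = new ToyTokenizer();
        t.reset(text);
        return t;
    }

    // Reusing path: construct once per thread, then only reset.
    ToyTokenizer reusableTokenStream(String text) {
        ToyTokenizer t = cached.get();
        if (t == null) {
            t = new ToyTokenizer();
            cached.set(t);
        }
        t.reset(text);
        return t;
    }
}

final class ReuseDemo {
    // Returns { number of tokens seen, number of tokenizers constructed }.
    static int[] run(String[] fields, boolean reuse) {
        ToyAnalyzer analyzer = new ToyAnalyzer();
        int before = ToyTokenizer.constructed;
        int tokens = 0;
        for (String field : fields) {
            ToyTokenizer stream = reuse
                    ? analyzer.reusableTokenStream(field)
                    : analyzer.tokenStream(field);
            while (stream.next() != null) {
                tokens++;
            }
        }
        return new int[] { tokens, ToyTokenizer.constructed - before };
    }

    public static void main(String[] args) {
        String[] fields = { "a b", "c", "d e f" };
        int[] fresh = run(fields, false);
        int[] reused = run(fields, true);
        System.out.println("fresh:  tokens=" + fresh[0] + " constructions=" + fresh[1]);
        System.out.println("reused: tokens=" + reused[0] + " constructions=" + reused[1]);
    }
}
```

With fields of one or two tokens each, the non-reusing path pays one construction per field, while the reusing path pays one per thread; that ratio is why very short fields turn an indexing benchmark into a constructor benchmark.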
-
Re: who clears attributes? Earwin Burrfoot 2009-08-11, 11:21
On Tue, Aug 11, 2009 at 15:09, Yonik Seeley<[EMAIL PROTECTED]> wrote:
> On Tue, Aug 11, 2009 at 6:50 AM, Robert Muir<[EMAIL PROTECTED]> wrote: >> On Tue, Aug 11, 2009 at 4:28 AM, Michael Busch<[EMAIL PROTECTED]> wrote: >>> There was a performance test in Solr that apparently ran much slower >>> after upgrading to the new Lucene jar. This test is testing a rather >>> uncommon scenario: very very short documents. >> >> Actually, its more uncommon than that: its very very short documents, >> without implementing reusableTokenStream() >> this makes it basically a benchmark of ctor cost... doesn't really >> benchmark the token api in my opinion. > > You would be surprized... there are quite a few Solr users that have > relatively short documents... or even if they are sizeable documents, > they have up to hundreds of short metadata-type fields (generally a > token or two). > > Reusing TokenStreams has become a must in Solr IMO since construction > costs (hashmap lookups, etc) and GC costs (larger objects) have been > growing. I'm focused on that now... > > Robert's taking a crack at fixing things up so users can actually > create reusable analyzers out of our filters: > https://issues.apache.org/jira/browse/LUCENE-1794 +1. We don't use Solr, but have quite a bunch of medium and short-sized documents. Plus heaps of metadata fields. I'm yet to read Uwe's example, but I feel I'm a bit misunderstood by some of you. My gripe with new API is not that it brings us troubles (which are solved one way or another), it is that the switch and associated migration costs bring zero benefits in immediate and remote future. The only person that tried to disprove this claim is Uwe. Others either say "the problems are solved, so it's okay to move to the new API", or "this will be usable when flexindexing arrives". Sorry, the last phrase doesn't hold its place, this API is orthogonal to flexindexing, or at least nobody has shown the opposite. 
So, what I'm arguing against is adding some code (and forcing users to migrate) just because we can, with no other reasons. -- Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED]) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 --------------------------------------------------------------------- +
Earwin Burrfoot 2009-08-11, 11:21
-
Re: who clears attributes? Mark Miller 2009-08-11, 11:25
Earwin Burrfoot wrote:
> > The only person that tried to disprove this claim is Uwe. Others > either say "the problems are solved, so it's okay to move to the new > API", or "this will be usable when flexindexing arrives". Others (not me) have spent a lot of time going over this before (more than once I think) - they prob are just sick of retyping. Lots of searchable archives out there though. - Mark --------------------------------------------------------------------- +
Mark Miller 2009-08-11, 11:25
-
Re: who clears attributes? Earwin Burrfoot 2009-08-11, 11:28
>> The only person that tried to disprove this claim is Uwe. Others
>> either say "the problems are solved, so it's okay to move to the new >> API", or "this will be usable when flexindexing arrives". > > Others (not me) have spent a lot of time going over this before (more than > once I think) - they prob are just sick of retyping. Lots of searchable > archives out there though. Okay, I'll dig into them. Sorry for being a bother. -- Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED]) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 --------------------------------------------------------------------- +
Earwin Burrfoot 2009-08-11, 11:28
-
Re: who clears attributes? Mark Miller 2009-08-11, 11:31
Earwin Burrfoot wrote:
>>> The only person that tried to disprove this claim is Uwe. Others >>> either say "the problems are solved, so it's okay to move to the new >>> API", or "this will be usable when flexindexing arrives". >>> >> Others (not me) have spent a lot of time going over this before (more than >> once I think) - they prob are just sick of retyping. Lots of searchable >> archives out there though. >> > > Okay, I'll dig into them. Sorry for being a bother. > > You're not being a bother - sorry if I came off that way. Didn't mean to. I just know a lot of the reasons for the API switch have been discussed before, and much of it has not come up again in this discussion. If you felt the tone of that email was anything but trying to throw out some info, I apologize. Not trying to squash this current debate at all. -- - Mark http://www.lucidimagination.com
Mark Miller 2009-08-11, 11:31
-
Re: who clears attributes? Michael McCandless 2009-08-11, 12:22
I think extensible analysis (the new TokenStream API) is a net
positive: it gives us strongly typed and high performance extensibility to a Token, so apps can now add whatever attrs they want. And, I see it as the first (of 3) big "legs" that we need to reach flexible indexing. We really have to do flexible indexing piece-meal since it's so big. The flexible indexing chain (still package private, but otherwise "done") is the 2nd leg, allowing you to pull whatever app-specific attrs you've created during analysis, and get them into the index in some manner. The final leg is LUCENE-1458, which has seen good progress (eg, I got it to the point where I had a pulsing codec working well, for inlining low-freq terms directly into the terms dict), but I need to get back to it, modernize it, iterate, etc. That API enables you to make your own codecs to write/read stuff in the index. Once we get that finished, I think we finally have the basic full infrastructure in place for flexible indexing. I think what's happening now is people are really starting to dig into the new stuff. I've been drilling into the new QueryParser, and besides a few small issues (mostly different defaults), it looks solid and very configurable/extensible. Solr & others have been digging into the extensible analysis API, and I think of all features in 2.9, the extensible analysis API has received the most hardening. Hoss and Mark have been drilling on the "long tail" of the impact of per-segment searching & collection, uncovering sneaky "explain" challenges and others. I think this is all healthy, to be expected, shakeout... I do still think a longish 2.9 beta is warranted, if we can succeed in getting users outside the dev group to kick the tires and uncover stuff. Mike On Tue, Aug 11, 2009 at 7:31 AM, Mark Miller<[EMAIL PROTECTED]> wrote: > Earwin Burrfoot wrote: >>>> >>>> The only person that tried to disprove this claim is Uwe. 
Others >>>> either say "the problems are solved, so it's okay to move to the new >>>> API", or "this will be usable when flexindexing arrives". >>>> >>> >>> Others (not me) have spent a lot of time going over this before (more >>> than >>> once I think) - they prob are just sick of retyping. Lots of searchable >>> archives out there though. >>> >> >> Okay, I'll dig into them. Sorry for being a bother. >> >> > > Your not being a bother - sorry if I came off that way. Didn't mean to. I > just know a lot of the reasons for the API switch have been discussed > before, and much if it has not come up again in this discussion. > > If you felt the tone of that email was anything but trying to throw out some > info, I apologize. Not trying to squash this current debate at all. > > -- > - Mark > > http://www.lucidimagination.com > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- +
Michael McCandless 2009-08-11, 12:22
-
Re: who clears attributes? Michael Busch 2009-08-12, 07:14
> +1. We don't use Solr, but have quite a bunch of medium and
> short-sized documents. Plus heaps of metadata fields.
>
> I'm yet to read Uwe's example, but I feel I'm a bit misunderstood by
> some of you.

Did you read it yet? What do you think about it?

> My gripe with new API is not that it brings us troubles
> (which are solved one way or another), it is that the switch and
> associated migration costs bring zero benefits in immediate and remote
> future.
> The only person that tried to disprove this claim is Uwe. Others
> either say "the problems are solved, so it's okay to move to the new
> API", or "this will be usable when flexindexing arrives". Sorry, the
> last phrase doesn't hold its place, this API is orthogonal to
> flexindexing, or at least nobody has shown the opposite.

Whether the API is orthogonal to flexible indexing depends on how you define "flexible indexing". I admit the term is vague and probably nowhere clearly defined. I agree that if flexible indexing means only changing the encoding, i.e. *how* data is stored, e.g. PFOR vs. the current posting format, then yes, we don't need the new TokenStream API for it.

But the goals we have with flexible indexing are more than that. We want to allow customizing *what* data is stored in the inverted index. You can find the very first discussion about flexible indexing, which happened several years ago, in the wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing. Already in this very early proposal it was suggested to have the following posting formats as a start:

a. <doc>+
b. <doc, boost>+
c. <doc, freq, <position>+ >+
d. <doc, freq, <position, boost>+ >+

For d. you need to change the TokenStream API. How else can we get the boost from the source to the indexer? Of course you can always serialize the additional data into the payload byte array, but if filters want to do something with it, performance suffers. The new API solves this problem very nicely. When we open the posting format like this, people will want to store different custom things in there. The new TokenStream API is prepared for that - the old one isn't.

Michael

> So, what I'm arguing against is adding some code (and forcing users to
> migrate) just because we can, with no other reasons.
Michael Busch 2009-08-12, 07:14
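Editor's note: the contrast drawn above between serializing a per-position boost into a payload byte array and carrying it as a typed attribute can be sketched as follows. This is a self-contained toy under stated assumptions, not Lucene's implementation: `Attr`, `TermAttr`, `BoostAttr`, `AttrSource` and `BoostDemo` are hypothetical names, the boost is assumed to be a single float, and the factory argument to `addAttribute` is a simplification made for this sketch.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration; none of these are Lucene classes.
interface Attr {
    void clear();
}

final class TermAttr implements Attr {
    String term = "";
    public void clear() { term = ""; }
}

final class BoostAttr implements Attr {
    float boost = 1.0f;
    public void clear() { boost = 1.0f; }
}

// Holds one shared instance per attribute type for a whole filter chain.
final class AttrSource {
    private final Map<Class<? extends Attr>, Attr> attrs = new LinkedHashMap<>();

    // Returns the existing instance for this type, or registers a new one.
    <T extends Attr> T addAttribute(Class<T> type, Supplier<T> factory) {
        Attr existing = attrs.get(type);
        if (existing == null) {
            existing = factory.get();
            attrs.put(type, existing);
        }
        return type.cast(existing);
    }

    // Resets every registered attribute to its default before the next token.
    void clearAttributes() {
        for (Attr a : attrs.values()) {
            a.clear();
        }
    }
}

final class BoostDemo {
    // Payload route: the boost is packed into bytes by the producer...
    static byte[] encodeBoost(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    // ...and must be unpacked again by every filter that wants to read it.
    static float decodeBoost(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    // Attribute route: producer and filter share one typed object, no byte copies.
    static float boostViaAttributes(String term, float initial) {
        AttrSource source = new AttrSource();
        TermAttr termAttr = source.addAttribute(TermAttr.class, TermAttr::new);
        BoostAttr producerView = source.addAttribute(BoostAttr.class, BoostAttr::new);
        BoostAttr filterView = source.addAttribute(BoostAttr.class, BoostAttr::new); // same instance

        source.clearAttributes();      // start of a new token
        termAttr.term = term;
        producerView.boost = initial;  // tokenizer sets the boost
        filterView.boost *= 2.0f;      // a downstream filter adjusts it in place
        return producerView.boost;     // the consumer sees the adjusted value
    }

    public static void main(String[] args) {
        float viaPayload = decodeBoost(encodeBoost(1.5f)) * 2.0f;
        float viaAttributes = boostViaAttributes("lucene", 1.5f);
        System.out.println("payload route:   " + viaPayload);
        System.out.println("attribute route: " + viaAttributes);
    }
}
```

Both routes carry the same value; the difference the message points at is that each filter on the payload route has to decode and re-encode the bytes, whereas the attribute route lets every stage read and write the same typed field.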
-
Re: who clears attributes? Grant Ingersoll 2009-08-11, 11:13
On Aug 11, 2009, at 4:28 AM, Michael Busch wrote: > >> I'm not just responding to just you there, but more to the growing >> pack of those speaking against the new API. I don't see specific >> issues being brought up - the only issues I have seen brought up >> have been addressed in JIRA issues that have received no comments >> indicating the fix was not good enough. So we are seeing a lot of >> general complaints, but specific complaints have been addressed as >> far as I can tell. >> > Thanks Mark. Yeah, I'm really not sure what actually the problem > here is now. There was a performance test in Solr that apparently > ran much slower after upgrading to the new Lucene jar. This test is > testing a rather uncommon scenario: very very short documents. That is not an uncommon scenario. Solr has very, very short fields _ALL THE TIME_. > Within one day - thanks to Uwe - we committed a patch that basically > brings back the performance to where it was before. That is a pretty > good turnaround time. And according to Robert's and Mark's > performance tests Lucene trunk is now even a little bit faster than > 2.4 was. This was not the first time we found and fixed a bug in > Lucene and it won't be the last. Of course. >> As far as back compat - is it really still considered an issue? We >> have broken back compat in this release wherever it was convenient >> to do so. I suspect that will continue. I just wish our policy >> reflected how things actually work (and I think they work as they >> should, based on the circumstances that lead to each decision). > > All backwards-compatibility problems we could think of were > addressed and all possible uses cases were tested regarding > backwards-compatibility. In LUCENE-1693 you can find the many > iterations Uwe and I had about this. All current unit tests pass. > All contrib tests pass. All backwards-compatibility tests from the > 2.4 tag pass as well. 
This is probably one of the best-tested > additions to Lucene in terms of backwards-compatibility we've had in > a while. But AFAICT, it seems like the only choice one really has is to upgrade their code, which can be a lot of work. --------------------------------------------------------------------- +
Re: who clears attributes? Michael Busch 2009-08-11, 19:21
On 8/11/09 4:13 AM, Grant Ingersoll wrote:
>
> On Aug 11, 2009, at 4:28 AM, Michael Busch wrote:
>
>>> I'm not just responding to just you there, but more to the growing
>>> pack of those speaking against the new API. I don't see specific
>>> issues being brought up - the only issues I have seen brought up
>>> have been addressed in JIRA issues that have received no comments
>>> indicating the fix was not good enough. So we are seeing a lot of
>>> general complaints, but specific complaints have been addressed as
>>> far as I can tell.
>>
>> Thanks Mark. Yeah, I'm really not sure what actually the problem here
>> is now. There was a performance test in Solr that apparently ran much
>> slower after upgrading to the new Lucene jar. This test is testing a
>> rather uncommon scenario: very very short documents.
>
> That is not an uncommon scenario. Solr has very, very short fields
> _ALL THE TIME_.

I meant that having documents that only contain very short fields is not as common as having docs with a decent amount of text. Maybe I'm wrong - in either case I didn't try to say it's not an important use case. I think it is important to have good performance here too. The point I was trying to make was that we tested performance more thoroughly for the case we thought would be more common.

According to the numbers posted on LUCENE-1796 it now seems like it's fixed - even for documents with only very short fields and no reusable TokenStreams.

Michael

---------------------------------------------------------------------
Re: who clears attributes? Grant Ingersoll 2009-08-11, 20:56
On Aug 11, 2009, at 3:21 PM, Michael Busch wrote:

> On 8/11/09 4:13 AM, Grant Ingersoll wrote:
>>
>> On Aug 11, 2009, at 4:28 AM, Michael Busch wrote:
>>
>>>> I'm not just responding to just you there, but more to the
>>>> growing pack of those speaking against the new API. I don't see
>>>> specific issues being brought up - the only issues I have seen
>>>> brought up have been addressed in JIRA issues that have received
>>>> no comments indicating the fix was not good enough. So we are
>>>> seeing a lot of general complaints, but specific complaints have
>>>> been addressed as far as I can tell.
>>>
>>> Thanks Mark. Yeah, I'm really not sure what actually the problem
>>> here is now. There was a performance test in Solr that apparently
>>> ran much slower after upgrading to the new Lucene jar. This test
>>> is testing a rather uncommon scenario: very very short documents.
>>
>> That is not an uncommon scenario. Solr has very, very short fields
>> _ALL THE TIME_.
>
> I meant that having documents that only contain very short fields is
> not as common as having docs with a decent amount of text. Maybe
> I'm wrong - in either case I didn't try to say it's not an important
> use case. I think it is important to have good performance here
> too. The point I was trying to make was that we tested performance
> more thoroughly for the case we thought would be more common.

FWIW, I think the most common scenario is: one or two large fields and several (usually in the range of 5-10, but have seen cases with many) small fields, at least that has been my experience. Some of the small fields require analysis, some don't.

> According to the numbers posted on LUCENE-1796 it now seems like
> it's fixed - even for documents with only very short fields and no
> reusable TokenStreams.

Very cool.

---------------------------------------------------------------------
Re: who clears attributes? Earwin Burrfoot 2009-08-10, 22:43
> Well, I have real use cases for it, but all of it is still missing the
> biggest piece: search side support. It's the 900 lb. elephant in the room.
> The 500 lb. elephant is the fact that all these attributes, AIUI, require
> you to hook in your own indexing chain, etc. in order to even be indexed,
> which is all package private stuff. It's not even clear to me what happens
> right now if you were to, say have a Token Stream that, say, had only one
> Attribute on it and none of the existing attributes (term buffer, length,
> position, etc.) Please correct me if I am wrong, I still don't have a deep
> understanding of it all.

Even pseudocode would be good. "Custom indexing chain for abstract attributes" sounds like one of microsoft.com definitions - serious, determined, but vague.
If you take the current Token and start throwing away some of its fields, the resulting index contents are obvious for some combinations and absurd for others. You don't need this new API to handle the obvious ones.

> Oh, and now it seems the new QP is dependent on it all.

That's why I said earlier "before more damage is done".

> Michael has always been up front that this new API is in preparation for flexible indexing. It doesn't give us the goodness - he has laid out the reasons for moving before the goodness comes more than once I think.

My problem is not waiting for 'goodness'. It is that I don't currently see what goodness will come from this API even in the remote future. That's why I am asking! :)

> Flexible indexing will lead to all kinds of little cool things - the likes of which have been discussed a lot in older emails. It will likely lead to things we cannot predict as well.
> Everything will be more flexible. It also could play a part in CSF, and work on allowing custom files to plug into merging. Plus everything else thats been mentioned (pfor, etc)
> I've been sold on the long term benefits. I don't think you need these API for them, but its my understanding it helps solve part of the equation.

Yeah. I, too, would like to see all these little cool things, and I don't think we need this API for them.
Flexible indexing is going to handle various different datatypes besides text, so I can only reiterate - it cannot rely on a generic stream-based text-handling API for consuming data.

> A bunch of issues have come up. To my knowledge, they have been addressed with vigor every time. If someone is unhappy with how something has been addressed, and it
> needs to be addressed further, please speak up. Otherwise, I don't think the sky is falling - I think the new API is being shaken out.

API is born dead without usecases. If a year later we get closer to the flexindexing it is supposed to support, and then we understand we missed some crucial thing - WHAM! our back-compat policy kicks in and makes our lives miserable once more.

--
Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
Re: who clears attributes? Mark Miller 2009-08-10, 22:57
Right - this API is not required, even for flexible indexing as it appears it will emerge. I think it's just there to help. Originally, I think the idea was to reduce how much casting was going to be needed. Also, a given chain will more easily be able to deal with just the attributes that it wants - rather than one stream with methods for a ton of attributes that may or may not be there.

Because Lucene is a full text engine, I feel it likely the ingestion process will continue to rely on text? That text can be converted to anything by the index chain, and stored however in whatever file? I think I'm missing something in your argument there.

--
- Mark
http://www.lucidimagination.com

---------------------------------------------------------------------
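Mark's point about casting, and about a chain touching only the attributes it wants, can be pictured with a small model. This is a minimal sketch in plain Java and deliberately not the Lucene API: `Attr`, `TermAttr`, `TypeAttr` and `AttrSource` are hypothetical stand-ins invented for illustration. Each stage asks for an attribute by class and gets back a typed instance shared by the whole chain, so stage code needs no casts, and an attribute nobody asked for is never created.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only; these are NOT Lucene classes.
interface Attr { }

final class TermAttr implements Attr { String term = ""; }

final class TypeAttr implements Attr { String type = "word"; }

final class AttrSource {
    private final Map<Class<? extends Attr>, Attr> attrs = new LinkedHashMap<>();

    // Returns the one shared instance registered for cls, creating it on first request.
    <T extends Attr> T addAttribute(Class<T> cls, Supplier<T> factory) {
        return cls.cast(attrs.computeIfAbsent(cls, key -> factory.get()));
    }

    boolean hasAttribute(Class<? extends Attr> cls) {
        return attrs.containsKey(cls);
    }
}

final class AttributeLookupSketch {
    static String demo() {
        AttrSource source = new AttrSource();
        // Producer and consumer both ask by type and share the same instance.
        TermAttr producerView = source.addAttribute(TermAttr.class, TermAttr::new);
        TermAttr consumerView = source.addAttribute(TermAttr.class, TermAttr::new);
        producerView.term = "lucene";
        // TypeAttr was never requested, so it does not exist in this chain.
        return consumerView.term + " " + source.hasAttribute(TypeAttr.class);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The single generic lookup is the design choice being illustrated: the alternative Mark mentions, one stream type with accessor methods for every possible attribute, would expose fields that a given chain may never populate.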
RE: who clears attributes? Uwe Schindler 2009-08-10, 22:49
> UIMA....

The new API looks like UIMA: you have streams that are annotated with various attributes that can be exchanged between TokenStreams/TokenFilters. Just like the current FlagsAttribute or TypeAttribute, which can easily be misused for such things.

About a real use case for the new API:

I talked some time ago with Grant in the podcast about NumericRange and the Publishing Network for Geoscientific Data called PANGAEA. At the end of the talk (available on the Lucid Imagination website), there were some explanations of how we index our XML documents so that one can ask for the contents of a specific XML element name (the element name is the field name) or use an XPath-like path as the field name. E.g. if you have an XML document like this: http://www.pangaea.de/PHP/getxml.php/51675 (please note: this is just a very simple XML schema we use for indexing our documents). When we index this document type into Lucene, we create a new field for each element name, e.g. "lastName", "firstName" and so on. One could easily search for any document where a specific "lastName" appears anywhere (not only in the citation). We also create fields for more general element names, so you could also look inside the field "citation" to search anywhere in the citation. You could also combine them, to find only documents where the "lastName" of an "author" is "Xyz", by using the field name "author:lastName". In the past (before the new API), I wrote this analyzer in a very complicated way: I created a StringBuffer for each element name, appended the text to it, and then analyzed it again for each field name.

Now I pass the XML document to my special XMLTokenStream, which uses STAX/DOM to retrieve the element names and contents. Each element creates a new TermAttribute (with the whole contents as one term) and a custom Attribute holding a reference to the current element name and all enclosing higher-level element names (the Attribute contains a Stack of element names). This special Attribute is then in the Tokenizer chain and is only updated by the root XMLTokenStream. The next filter in the chain is a WhitespaceFilter (which splits the tokens at white space), and so on, to further tokenize the element contents. The special element name stack attribute is untouched, but always contains the current element name for later filtering. The last step uses the new TeeSinkTokenFilter to index the stream into different fields. The TeeSinkTokenFilter gets a Sink for each field name/element name hierarchy (which are recorded beforehand); each Sink filters the tokens using the special element stack attribute, keeping those the field is interested in. That way I can analyze the whole XML document just once and distribute the contents to various field names using the additional attribute.

Here is an example (using the above schema) that shows all documents with a title of "Evidence from Fram Strait" in the publication to which the dataset is attached as a supplement: http://www.pangaea.de/search?q=supplementTo%3Atitle%3A%22Evidence+from+Fram+Strait%22 (which hits only the above example). The query parser is customized (not the Lucene one).

The final code of this TokenStream is a little bit more complicated than described here, but it shows a possible usage of the new API: annotate tokens with field identifiers to e.g. automatically put the title of a document in a title field, the authors in another one, and so on.

I hope somebody understood what we are doing here :-)

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]

_____
From: Shai Erera [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 10, 2009 11:13 PM
To: [EMAIL PROTECTED]
Subject: Re: who clears attributes?

It sounds like the 'old' API should stay a bit longer than 3.0. We'd like to give more people a chance to experiment w/ the new API before we claim it is the new Analysis API in Lucene. And that means that more users will have to live w/ the "bit of slowness" more than what is believed in this thread.

I personally worry much about needing to throw away the current API. I'll have a lot of code to port over and I haven't read anything so far that convinces me the new API is better. I don't have any problems w/ the current API today. I feel I have all the flexibility I need w/ indexing fields. I use payloads, Field.Index constants, write Analyzers, TokenStreams ... actually I have 0 complaints.

Maybe we should follow what I seem to read from Earwin and Grant - come up w/ real use cases, try to implement them w/ the current API, then if it's impossible, discuss how we can make the current API more adaptive. If at the end of this we'll get back to the new API, then we'll at least feel better about it, and more convinced it is the way to go.

Hack .. maybe we'll be convinced to base the Lucene analysis on UIMA? :)

Shai

On Mon, Aug 10, 2009 at 11:54 PM, Uwe Schindler <[EMAIL PROTECTED]> wrote:

But because of this flexibility, we added the backwards layer. The old style with setUseNewAPI was not flexible at all, and nobody would move his Tokenizers to the new API without that flexibility (maybe he uses external analyzer packages not yet updated).

With "a little bit" I mean the cost of wrapping the old and new API is really minimal, it is just an if statement and a method call, hopefully optimized away by the JVM. In my tests the standard deviation between different test runs was much higher than the difference between mixing old/new API (on Win32), so it is not really sure, that the cost comes from the delegation.

The only case that is really slower is (now minimized cost of creation in TokenStream.<init>, if you not reuse TokenStreams: Two LinkedHashMaps have to be created and setup). But this is not caused by the backwards layer.

Uwe
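The single-pass routing Uwe describes earlier in this message (each token carries a stack of enclosing element names, and one sink per field keeps the tokens it is interested in) can be sketched roughly as follows. This is plain Java and not his unpublished XMLTokenStream or Lucene's TeeSinkTokenFilter: `PathToken`, `matches` and `route` are hypothetical names, the element nesting in `demo()` is invented, and the matching rule (the parts of a field name such as "author:lastName" must occur in order somewhere in the element path) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ElementRoutingSketch {
    // Hypothetical token: element text plus the enclosing element names, outermost first.
    static final class PathToken {
        final String text;
        final List<String> elementPath;
        PathToken(String text, List<String> elementPath) {
            this.text = text;
            this.elementPath = elementPath;
        }
    }

    // Assumed rule: the ':'-separated parts of the field name occur in order within the path.
    static boolean matches(String fieldName, List<String> elementPath) {
        String[] parts = fieldName.split(":");
        int next = 0;
        for (String element : elementPath) {
            if (next < parts.length && element.equals(parts[next])) {
                next++;
            }
        }
        return next == parts.length;
    }

    // One pass over the elements: split at white space, then offer each term to every sink.
    static Map<String, List<String>> route(List<PathToken> elements, List<String> fieldNames) {
        Map<String, List<String>> sinks = new LinkedHashMap<>();
        for (String field : fieldNames) {
            sinks.put(field, new ArrayList<>());
        }
        for (PathToken element : elements) {
            for (String term : element.text.trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                for (String field : fieldNames) {
                    if (matches(field, element.elementPath)) {
                        sinks.get(field).add(term);
                    }
                }
            }
        }
        return sinks;
    }

    static Map<String, List<String>> demo() {
        List<PathToken> elements = Arrays.asList(
            new PathToken("Xyz", Arrays.asList("citation", "author", "lastName")),
            new PathToken("Evidence from Fram Strait",
                          Arrays.asList("citation", "supplementTo", "title")));
        return route(elements,
            Arrays.asList("lastName", "citation", "author:lastName", "supplementTo:title"));
    }

    public static void main(String[] args) {
        demo().forEach((field, terms) -> System.out.println(field + " -> " + terms));
    }
}
```

In this model the "citation" sink receives every term under a citation element, while "author:lastName" receives only the last name, which mirrors the general versus combined field names in the description; the document text is tokenized once regardless of how many fields consume it.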
Re: who clears attributes?DM Smith 2009-08-11, 15:53
Uwe,
Is this example available? I think that an example like this would help the user community see the current value in the change. At least, I'd love to see the code for it. -- DM
RE: who clears attributes? Uwe Schindler 2009-08-11, 22:14
Hi DM,

It is not public at the moment and still in development. I can publish the XML tokenizer when it is finished. In general it shows one possible use case for custom attributes. Maybe we will get something like this in the future: just tag all tokens with the field name (using a FieldNameAttribute) and the Document/Indexer can automatically create the fields?

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]
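The idea floated at the top of this message, tagging every token with its target field name and letting the indexer create the fields, could look something like the sketch below. It is a plain-Java model only: no `FieldNameAttribute` exists in the API discussed in this thread, and `TaggedToken` and `collectFields` are hypothetical names used purely to illustrate the grouping step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FieldTagSketch {
    // Hypothetical token carrying its target field name; not a Lucene class.
    static final class TaggedToken {
        final String fieldName;
        final String term;
        TaggedToken(String fieldName, String term) {
            this.fieldName = fieldName;
            this.term = term;
        }
    }

    // Groups terms by their field tag, creating a field the first time its tag is seen.
    static Map<String, List<String>> collectFields(List<TaggedToken> stream) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (TaggedToken token : stream) {
            fields.computeIfAbsent(token.fieldName, name -> new ArrayList<>()).add(token.term);
        }
        return fields;
    }

    static Map<String, List<String>> demo() {
        List<TaggedToken> stream = new ArrayList<>();
        stream.add(new TaggedToken("title", "Evidence"));
        stream.add(new TaggedToken("author", "Xyz"));
        stream.add(new TaggedToken("title", "from"));
        return collectFields(stream);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Compared with one sink per field, this puts the routing decision in the token itself, so the set of fields does not have to be known before analysis starts.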
RE: who clears attributes? Uwe Schindler 2009-08-10, 16:50
In my opinion, it is completely unnecessary to clear the attributes in CharTokenizer. The TermAttribute and OffsetAttribute are always initialized correctly (at least termLength gets set to 0) when incrementToken() returns true. I would simply remove the call to clearAttributes() altogether.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]

> -----Original Message-----
> From: Uwe Schindler [mailto:[EMAIL PROTECTED]]
> Sent: Monday, August 10, 2009 6:44 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: who clears attributes?
>
> I already removed the unmodifiable iterator, so one new instance is removed
> (see the JIRA issue). But you are right, the CharTokenizer should only clear
> the TermAttribute, as it is only using this attribute.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [EMAIL PROTECTED]
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley
> > Sent: Monday, August 10, 2009 6:01 PM
> > To: [EMAIL PROTECTED]
> > Subject: who clears attributes?
> >
> > CharTokenizer.incrementToken() clears *all* attributes in the entire
> > tokenizer chain.
> > StandardTokenizer.incrementToken() clears only the term attribute.
> >
> > So... which is right? Seems like the tokenizer should be responsible?
> >
> > On a performance related note, CharTokenizer.clearAttributes() could be
> > more efficient - 2 new objects (the unmodifiable map and the iterator
> > object) are created for every incrementToken.
> >
> > -Yonik
> > http://www.lucidimagination.com

---------------------------------------------------------------------
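To make the question in this sub-thread concrete, here is a small model of what changes depending on who resets which attributes. It is a sketch in plain Java under stated assumptions, not the CharTokenizer or StandardTokenizer code: `State` is a hypothetical stand-in for the attributes shared along a chain, and the "filter" that marks numeric terms is invented. It takes no position on which behaviour is right; it only shows that a value a filter wrote for one token stays visible on the next token unless some stage resets it.

```java
import java.util.ArrayList;
import java.util.List;

final class ClearAttributesSketch {
    // Hypothetical shared per-chain state; not the Lucene attribute classes.
    static final class State {
        String term = "";
        String type = "word";

        void clearTerm() { term = ""; }

        void clearAll() { term = ""; type = "word"; }
    }

    // Emits "term/type" per input word; the filter step sets the type for numeric terms only.
    static List<String> run(String[] input, boolean clearAllPerToken) {
        State state = new State();
        List<String> out = new ArrayList<>();
        for (String word : input) {
            // Tokenizer step: reset either everything or just the term, then fill in the term.
            if (clearAllPerToken) {
                state.clearAll();
            } else {
                state.clearTerm();
            }
            state.term = word;
            // Filter step: marks numbers and leaves the type alone for everything else.
            if (word.matches("\\d+")) {
                state.type = "num";
            }
            out.add(state.term + "/" + state.type);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] input = {"42", "abc"};
        System.out.println(run(input, false)); // the type set for "42" is still there for "abc"
        System.out.println(run(input, true));  // every token starts from default values
    }
}
```

In this model the leak goes away either when the tokenizer clears everything per token or when every filter always writes its attribute; which of the two a chain should rely on is exactly the responsibility question Yonik raises.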
|