PayloadNearQuery and AveragePayloadFunction
shyama 2012-02-02, 16:57
Hi List,
Apologies for such a long message. I have tried to include everything you might need to know to answer my question.
I am having difficulty understanding what AveragePayloadFunction is doing. Here is my example:
Title: Human|9 pineal|5 luteinizing hormone receptors.
Text: The presence of luteinizing hormone receptors in human|9 pineal|5 glands from five females and three males, ranging in age from 61-89 yr, was examined by in situ hybridization and immunocytochemistry. The results demonstrated the presence of these receptors at the mRNA|7 and protein levels in all the pineal|5 glands examined. Pineal|5 gland luteinizing hormone receptors could potentially be involved in the regulation of melatonin|7 synthesis.
3 is for class A
5 is for class B
7 is for class C
9 is for class D
These are the payloads stored in the index. At search time I use these values only to identify a term's class, and then return 3 for terms in the selected class.
I am using WhitespaceTokenizer and LowerCaseFilter. In my Similarity class (PayloadSearchSimilarity, shown further down), I map the payload so that, if I am interested in class A, scorePayload returns the value "x=3" only for terms in class A. I decide a term's class by checking its payload value.
Now, I query for "luteinizing hormone" using PayloadNearQuery with a slop of 5. First I try with interest in class B and next with interest in class A.
*Result of Class A interest:*
Explain: 10.97332 = (MATCH) sum of:
  2.5589073 = (MATCH) weight(payloadNear([AbstractText:luteinizing, AbstractText:hormone], 5, true) in 5362133), product of:
    0.68000716 = queryWeight(payloadNear([AbstractText:luteinizing, AbstractText:hormone], 5, true)), product of:
      14.045828 = idf(AbstractText: luteinizing=15481 hormone=164637)
      0.048413463 = queryNorm
    3.7630591 = (MATCH) fieldWeight(AbstractText:payloadNear([luteinizing, hormone], 5, true) in 5362133), product of:
      2.4494898 = PayloadNearQuery, product of:
        0.8164966 = tf(phraseFreq=0.6666667)
        *3.0 = AveragePayloadFunction(...)*
      14.045828 = idf(AbstractText: luteinizing=15481 hormone=164637)
      0.109375 = fieldNorm(field=AbstractText, doc=5362133)
  8.4144125 = (MATCH) weight(payloadNear([ArticleTitle:luteinizing, ArticleTitle:hormone], 5, true) in 5362133), product of:
    0.7332054 = queryWeight(payloadNear([ArticleTitle:luteinizing, ArticleTitle:hormone], 5, true)), product of:
      15.144659 = idf(ArticleTitle: hormone=86980 luteinizing=9765)
      0.048413463 = queryNorm
    11.476201 = (MATCH) fieldWeight(ArticleTitle:payloadNear([luteinizing, hormone], 5, true) in 5362133), product of:
      1.7320508 = PayloadNearQuery, product of:
        0.57735026 = tf(phraseFreq=0.33333334)
        *3.0 = AveragePayloadFunction(...)*
      15.144659 = idf(ArticleTitle: hormone=86980 luteinizing=9765)
      0.4375 = fieldNorm(field=ArticleTitle, doc=5362133)
---------------------------------------------------------------------
*Result of Class B Interest:*
Explain: 3.657773 = (MATCH) sum of:
  0.85296905 = (MATCH) weight(payloadNear([AbstractText:luteinizing, AbstractText:hormone], 5, true) in 5362133), product of:
    0.68000716 = queryWeight(payloadNear([AbstractText:luteinizing, AbstractText:hormone], 5, true)), product of:
      14.045828 = idf(AbstractText: luteinizing=15481 hormone=164637)
      0.048413463 = queryNorm
    1.254353 = (MATCH) fieldWeight(AbstractText:payloadNear([luteinizing, hormone], 5, true) in 5362133), product of:
      0.8164966 = PayloadNearQuery, product of:
        0.8164966 = tf(phraseFreq=0.6666667)
        *1.0 = AveragePayloadFunction(...)*
      14.045828 = idf(AbstractText: luteinizing=15481 hormone=164637)
      0.109375 = fieldNorm(field=AbstractText, doc=5362133)
  2.804804 = (MATCH) weight(payloadNear([ArticleTitle:luteinizing, ArticleTitle:hormone], 5, true) in 5362133), product of:
    0.7332054 = queryWeight(payloadNear([ArticleTitle:luteinizing, ArticleTitle:hormone], 5, true)), product of:
      15.144659 = idf(ArticleTitle: hormone=86980 luteinizing=9765)
      0.048413463 = queryNorm
    3.8254004 = (MATCH) fieldWeight(ArticleTitle:payloadNear([luteinizing, hormone], 5, true) in 5362133), product of:
      0.57735026 = PayloadNearQuery, product of:
        0.57735026 = tf(phraseFreq=0.33333334)
        *1.0 = AveragePayloadFunction(...)*
      15.144659 = idf(ArticleTitle: hormone=86980 luteinizing=9765)
      0.4375 = fieldNorm(field=ArticleTitle, doc=5362133)
As I understand it, when I am interested in class B I should get 3 from AveragePayloadFunction, whereas I should get 1 for class A, as there is no class A term in the text, so everything will have payload 1. If I am interested in class B, there is one class B term in the "Title" field, so the value returned by AveragePayloadFunction should be 3.
I do not understand what is going on. Maybe I am misunderstanding what exactly AveragePayloadFunction does.
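For readers following the numbers: the fieldWeight lines in the explain output above can be recomputed by hand. The following is a minimal sketch, assuming the tf shown in the output is the square root of phraseFreq (which matches the printed values); the class and method names are made up for illustration and are not Lucene API.

```java
final class ExplainArithmetic {
    // tf as printed in the explain output: sqrt(phraseFreq).
    static float tf(float phraseFreq) {
        return (float) Math.sqrt(phraseFreq);
    }

    // fieldWeight = (tf * average payload) * idf * fieldNorm,
    // following the "product of" nesting in the explain output.
    static float fieldWeight(float phraseFreq, float avgPayload, float idf, float fieldNorm) {
        return tf(phraseFreq) * avgPayload * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // AbstractText clause with an average payload of 3.0 (first run).
        System.out.println(fieldWeight(0.6666667f, 3.0f, 14.045828f, 0.109375f));
        // Same clause with an average payload of 1.0 (second run).
        System.out.println(fieldWeight(0.6666667f, 1.0f, 14.045828f, 0.109375f));
    }
}
```

The two runs differ only in the AveragePayloadFunction factor (3.0 versus 1.0), which is why the first total score is three times the second.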
My similarity class is as follows:
public class PayloadSearchSimilarity extends DefaultSimilarity {
    private static final long serialVersionUID = 1L;
    public static String semantic;

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end, byte[] bytes, int offset, int length) {
        //System.out.println("this is gett");
        if (bytes != null) {
            float payload = PayloadHelper.decodeFloat(bytes, offset);
            //System.out.println("this is getting called, load:"+payload);
            //i am now returning same payload for all semantic type so that we can compare the score. it was changed after we showed it to Dietrich.
            if (semantic.equals("A") && (payload == 3)) {
                //System.out.println("Doc id:"+docId+"field :"+fieldName+" Semantic:"+ semantic+" Payload:"+payload);
                return 3;
            } else {
                if (semantic.equals("B") && (payload == 5)) {
                    //System.out.println("Doc id:"+docId+"field :"+fieldName+" Semantic:"+ semantic+" Payload:"+payload);
                    return 3;
                } else {
                    if (semantic.equals("C") && (payload == 7))
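The listing above is cut off in the archive. The decision it appears to implement can be sketched without any Lucene dependency as follows. This is only an illustration: the class and method names are hypothetical, the handling of class D and the fallback value of 1 for non-matching terms are assumptions (the original code for those cases is not shown), and the selected class is passed per call rather than held in a static field.

```java
final class SemanticPayloadScorer {
    // Payload codes as listed in the post: A=3, B=5, C=7, D=9.
    static float codeFor(String semanticClass) {
        switch (semanticClass) {
            case "A": return 3f;
            case "B": return 5f;
            case "C": return 7f;
            case "D": return 9f;
            default: throw new IllegalArgumentException("unknown class: " + semanticClass);
        }
    }

    // Returns 3 when the decoded payload marks a term of the selected class.
    // ASSUMPTION: every other term scores 1; the truncated original does not show this branch.
    static float score(String selectedClass, float decodedPayload) {
        return decodedPayload == codeFor(selectedClass) ? 3f : 1f;
    }
}
```

With this shape, a search interested in class B would call score("B", payload) for each payload handed to scorePayload, so a "pineal|5" occurrence scores 3 and a "human|9" occurrence scores 1.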
Re: PayloadNearQuery and AveragePayloadFunction
Peter Keegan 2012-02-02, 21:39
I don't quite follow what you're doing, but is it possible that your payloads were not on the desired terms when you indexed them?
The first explanation shows that the matching document contained "luteinizing hormone" in both fields, 'AbstractText' and 'ArticleTitle'. The average payload value was '3.0', so either both terms had payloads that averaged 3.0 or only one had a payload of 3.0. In the 2nd query, the phrase was found in both fields again, but no payloads were found (thus the 1.0).
According to your 'scorePayload' method, the first match would return 3 only if semantic=A. But the Similarity class is associated with an IndexReader, so the same 'semantic' would be used for all queries.
Peter
Re: PayloadNearQuery and AveragePayloadFunction
shyama 2012-02-03, 09:13
Hi Peter,
I have checked the payloads associated with the terms, and they are fine in the index. I believe I was not clear enough. When I say I am interested in class A, the scorePayload function returns 3 only for class A terms. Likewise, when I say I am interested in class B, scorePayload returns 3 only for class B terms. These searches are done separately: they run on the same index, but each time I search I set the semantic in my Similarity class.
I am actually trying to do semantic ranking of documents, so that Lucene ranks highest those documents that contain the query terms and also have more terms from that semantic class.
I hope I have now made it clear why I do not understand the score returned by AveragePayloadFunction.
I hope to hear some more explanation.
Many thanks,
Shyama
--
View this message in context: http://lucene.472066.n3.nabble.com/PayloadNearQuery-and-AveragePayloadFunction-tp3710454p3712509.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: PayloadNearQuery and AveragePayloadFunction
Peter Keegan 2012-02-03, 13:35
AveragePayloadFunction is just what it sounds like:
    return numPayloadsSeen > 0 ? (payloadScore / numPayloadsSeen) : 1;
What values are you seeing returned from PayloadHelper.decodeFloat?
Peter
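The one-line formula quoted in this reply can be tried in isolation. A minimal sketch, assuming payloadScore is the running sum of the values returned by scorePayload and numPayloadsSeen is how many of them were seen; the wrapper class is made up for illustration and is not the Lucene class itself.

```java
final class AveragePayloadSketch {
    // Averages the per-payload scores; falls back to 1 when no payload was seen,
    // mirroring the quoted expression.
    static float average(float[] payloadScores) {
        int numPayloadsSeen = payloadScores.length;
        float payloadScore = 0f;
        for (float s : payloadScores) {
            payloadScore += s;
        }
        return numPayloadsSeen > 0 ? (payloadScore / numPayloadsSeen) : 1;
    }
}
```

Under this sketch, two payload scores of 3 average to 3, scores of 3 and 1 average to 2, and an empty input yields the default of 1, which is one way a 1.0 can appear in an explain output.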
Re: PayloadNearQuery and AveragePayloadFunction
shyama 2012-02-03, 16:50
Hi Peter,
Thanks for your reply. I think I have found the problem.
The scorePayload function is only called for query terms. The problem was that, when I was retrieving the payloads for each token in the token stream, I got misleading payloads because I did not skip the TermPositions entries that do not belong to the current document.
I still wonder whether AveragePayloadFunction counts only query terms in "payloads seen so far", or all terms in the current field of the current document. I will check this. In my previous testing I found that it only considers query terms.
Thanks again.
Shyama
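The mistake described in this message, reading a term's payload without first positioning on the current document, can be illustrated with a toy postings list. This is a sketch under stated assumptions: the parallel arrays and both method names are hypothetical stand-ins, not Lucene API. In Lucene 3.x the corresponding step would presumably be advancing the TermPositions enumeration to the target document and checking its document id before reading the payload.

```java
final class PostingsCursorSketch {
    // Hypothetical postings for one term: parallel arrays sorted by document id.
    // Returns the payload stored for targetDoc, or NaN if the term is absent there.
    static float payloadInDoc(int[] docIds, float[] payloads, int targetDoc) {
        int i = 0;
        // Skip postings that belong to earlier documents.
        while (i < docIds.length && docIds[i] < targetDoc) {
            i++;
        }
        if (i < docIds.length && docIds[i] == targetDoc) {
            return payloads[i];
        }
        return Float.NaN;
    }

    // The mistake: take whatever posting comes first, regardless of its document.
    static float payloadWithoutSkipping(int[] docIds, float[] payloads) {
        return docIds.length > 0 ? payloads[0] : Float.NaN;
    }
}
```

With postings for documents 7, 42 and 5362133 carrying payloads 3, 9 and 5, the first method returns 5 for document 5362133, while the second returns 3, a payload that belongs to a different document.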
Re: PayloadNearQuery and AveragePayloadFunction
Peter Keegan 2012-02-03, 17:28
All term queries, including payload queries, deal only with words from the query that exist in a document. They don't know what other terms are in a matching document, due to the inverted nature of the index.
Peter
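The point about the inverted index can be made concrete with a toy structure. A minimal sketch, assuming a map from term to per-document payload lists; the class, its methods and the payload values are all hypothetical and much simpler than a real Lucene index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InvertedIndexSketch {
    // term -> (document id -> payloads of that term's occurrences in the document)
    private final Map<String, Map<Integer, List<Float>>> postings = new HashMap<>();

    void add(int docId, String term, float payload) {
        postings.computeIfAbsent(term, t -> new HashMap<>())
                .computeIfAbsent(docId, d -> new ArrayList<>())
                .add(payload);
    }

    // Only the postings of the query terms are consulted; payloads attached to
    // other terms of the same document are never visited.
    List<Float> payloadsSeen(int docId, String... queryTerms) {
        List<Float> seen = new ArrayList<>();
        for (String term : queryTerms) {
            Map<Integer, List<Float>> byDoc = postings.get(term);
            if (byDoc != null && byDoc.containsKey(docId)) {
                seen.addAll(byDoc.get(docId));
            }
        }
        return seen;
    }
}
```

If document 1 contains "pineal" with payload 5 and the query terms "luteinizing" and "hormone" each with payload 1, a lookup for the two query terms sees only the two values of 1; the 5 on "pineal" cannot influence the average.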