-
some links to downloadable test collections
Robert Muir 2009-11-07, 08:37
fyi, I added a page to the wiki with some links to existing test collections that can be downloaded along with queries and relevance judgements. some of these are smaller, maybe not perfect, but something to start with for playing around. if you know of others, please add!

http://cwiki.apache.org/confluence/display/ORP/ExistingCollections

also, if you have some ideas on how to perhaps create some ant tasks to make downloading/running these thru the lucene benchmark package easier, that would be great too. this is a bit frustrating because many collections claim to be "trec" format but they are all formatted slightly differently...

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collections
Marvin Humphrey 2009-11-10, 06:07
Robert Muir:

> fyi, I added a page to the wiki with some links to existing test collections
> that can be downloaded along with queries and relevance judgements.

So, what comparisons can we set up using these collections?

I seem to recall you suggesting at ApacheCon that they would be handy when judging Analyzer mods.

These collections are all binary assertions -- relevant/not-relevant for a given query -- right? Am I correct in presuming that such corpora can't help us to judge scoring and ranking algorithms, or Similarity implementations?

> also, if you have some ideas on how to perhaps create some ant tasks to make
> downloading/running these thru the lucene benchmark package easier, that
> would be great too.

Hmm, that approach is specific to Lucene Java. It's not handy for either of the projects I work on (Lucy, KinoSearch).

At some point, I'd planned to write a loose port of the Lucene benchmarking suite so that Lucy (at least) could exploit it... The benchmarking code has gotten so elaborate and complex now, though -- I wonder how easy it will be to generalize...

> this is a bit frustrating because many collections claim to be "trec" format
> but they are all formatted slightly differently...

Sounds like we need one module per corpus to explode it into a common format.

Is ant the best approach here? Maybe we start off with a scripting language like Python?

Marvin Humphrey
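To make the "one module per corpus" idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the corpus names, the tag tables, and the common record shape (`id` plus `text`) are all hypothetical and not taken from any of the linked collections. The point is that each corpus only needs a small description of how its "trec"-ish markup differs.

```python
import re

# Hypothetical per-corpus tag tables. Collections that all claim to be
# "trec" format often differ in tag names, so each corpus gets one entry.
CORPUS_TAGS = {
    "corpus_a": {"id": "DOCNO", "text": "TEXT"},
    "corpus_b": {"id": "DOCID", "text": "BODY"},
}

# One <DOC>...</DOC> element per document, matched case-insensitively.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)


def _field(block, tag):
    """Return the stripped content of <tag>...</tag> in block, or ''."""
    pattern = r"<%s>(.*?)</%s>" % (tag, tag)
    match = re.search(pattern, block, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else ""


def explode(raw, corpus):
    """Yield common-format records (assumed shape: id + text) for one corpus."""
    tags = CORPUS_TAGS[corpus]
    for match in DOC_RE.finditer(raw):
        block = match.group(1)
        yield {"id": _field(block, tags["id"]), "text": _field(block, tags["text"])}
```

Real collections will need more than a tag table (encodings, nested markup, multiple text fields), so treat this as a starting shape rather than a finished converter.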
-
Re: some links to downloadable test collections
Robert Muir 2009-11-10, 06:33
> So, what comparisons can we set up using these collections?

I think we can be creative. for example I used one of these tonight to test LUCENE-1812, Andrzej's index pruning tool. Results showed that it works as he advertised at apachecon...

also, we should be careful about the english ones i linked to (or preferably, find bigger ones), because they are smallish collections.

> I seem to recall you suggesting at ApacheCon that they would be handy when
> judging Analyzer mods.

Yeah, definitely don't think any results should be gospel for analyzers or scoring or anything else, but then again I think we could detect if some change is completely broken or silly (bugs, etc).

> These collections are all binary assertions -- relevant/not-relevant for a
> given query -- right? Am I correct in presuming that such corpora can't help
> us to judge scoring and ranking algorithms, or Similarity implementations?

I think most of them are binary... but I think I disagree with your second statement, these kinds of collections are used to compare scoring/ranking algorithms all the time!

> > also, if you have some ideas on how to perhaps create some ant tasks to make
> > downloading/running these thru the lucene benchmark package easier, that
> > would be great too.
>
> Hmm, that approach is specific to Lucene Java. It's not handy for either of
> the projects I work on (Lucy, KinoSearch).

You raise a good point here. Really at the end of the day, you just want to produce a .txt file that you throw at the trec_eval commandline program or something similar. Doing it in a lucene-java specific way doesn't allow us to easily evaluate things even in solr, for example it has analysis components that affect relevance!

I guess one approach could be to create scripts and stuff here that download and munge these collections into a consistent format, and then lucy, lucene-java, solr, whatever would have an easier time running the evaluations?

> At some point, I'd planned to write a loose port of the Lucene benchmarking
> suite so that Lucy (at least) could exploit it... The benchmarking code has
> gotten so elaborate and complex now, though -- I wonder how easy it will be
> to generalize...
>
> > this is a bit frustrating because many collections claim to be "trec" format
> > but they are all formatted slightly differently...
>
> Sounds like we need one module per corpus to explode it into a common format.
>
> Is ant the best approach here? Maybe we start off with a scripting language
> like Python?

in reference to both your comments above, I don't modify the lucene benchmarking code really too much to run my tests, sometimes i change the analyzer or scoring but that's it.

instead, i use sed and perl and what not to reformat things into the format the benchmark package wants... so I guess this is already what I am doing (scripting language)

> Marvin Humphrey

--
Robert Muir
[EMAIL PROTECTED]
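On the question of whether binary judgements can say anything about ranking: a rank-sensitive measure such as average precision only needs a relevant/not-relevant label per document, but its value still depends on where the relevant documents land in the result list. A minimal sketch (the function name and inputs are illustrative, not from the Lucene benchmark package):

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against binary judgements.

    ranked_ids: document ids in the order the engine returned them
                (assumed to contain no duplicates).
    relevant_ids: ids judged relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)
```

Two scoring functions that retrieve exactly the same documents get different values if one ranks the relevant ones higher, which is why binary collections are still usable for comparing ranking or Similarity changes.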
-
Re: some links to downloadable test collections
Simon Willnauer 2009-11-10, 20:42
On Tue, Nov 10, 2009 at 7:33 AM, Robert Muir <[EMAIL PROTECTED]> wrote:

>> So, what comparisons can we set up using these collections?
>
> I think we can be creative. for example I used one of these tonight to test
> LUCENE-1812, Andrzej's index pruning tool. Results showed that it works as
> he advertised at apachecon...
>
> also, we should be careful about the english ones i linked to (or
> preferably, find bigger ones), because they are smallish collections.
>
>> I seem to recall you suggesting at ApacheCon that they would be handy when
>> judging Analyzer mods.
>
> Yeah, definitely don't think any results should be gospel for analyzers or
> scoring or anything else, but then again I think we could detect if some
> change is completely broken or silly (bugs, etc).

This would bring huge value to lucene and its derivatives. This sounds like a very good point to start from, especially until we have sorted out all the licensing issues, how to distribute collections or what we want to crawl. There is a huge +1 from my side to get started with the small collections - 100% more than we have today.

>> These collections are all binary assertions -- relevant/not-relevant for a
>> given query -- right? Am I correct in presuming that such corpora can't help
>> us to judge scoring and ranking algorithms, or Similarity implementations?
>
> I think most of them are binary... but I think I disagree with your second
> statement, these kinds of collections are used to compare scoring/ranking
> algorithms all the time!

Afaik, those collections yield pretty good results for all kinds of relevance judgements though.

>> > also, if you have some ideas on how to perhaps create some ant tasks to make
>> > downloading/running these thru the lucene benchmark package easier, that
>> > would be great too.
>>
>> Hmm, that approach is specific to Lucene Java. It's not handy for either of
>> the projects I work on (Lucy, KinoSearch).
>
> You raise a good point here. Really at the end of the day, you just want to
> produce a .txt file that you throw at the trec_eval commandline program or
> something similar. Doing it in a lucene-java specific way doesn't allow us
> to easily evaluate things even in solr, for example it has analysis
> components that affect relevance!

This is maybe the most important issue for the first step. I would really like to see a standard format which can be parsed easily by whatever language you use. I personally prefer JSON for almost everything as it is so easy to parse, read (human eyes) and write. Ant still sounds like a good plan as there are many functions already implemented and it is easy to extend.

+1 for creating an issue for format and transformation.

> I guess one approach could be to create scripts and stuff here that download
> and munge these collections into a consistent format, and then lucy,
> lucene-java, solr, whatever would have an easier time running the
> evaluations?

see above

>> At some point, I'd planned to write a loose port of the Lucene benchmarking
>> suite so that Lucy (at least) could exploit it... The benchmarking code has
>> gotten so elaborate and complex now, though -- I wonder how easy it will be
>> to generalize...
>>
>> > this is a bit frustrating because many collections claim to be "trec" format
>> > but they are all formatted slightly differently...
>>
>> Sounds like we need one module per corpus to explode it into a common format.
>>
>> Is ant the best approach here? Maybe we start off with a scripting language
>> like Python?

you wanna use your object model, right ? :)

> in reference to both your comments above, I don't modify the lucene
> benchmarking code really too much to run my tests, sometimes i change the
> analyzer or scoring but thats it.
>
> instead, i use sed and perl and what not to reformat things into the format
> the benchmark package wants... so I guess this is already what I am doing
> (scripting language)
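For reference, the ".txt file that you throw at trec_eval" is small enough to generate from any language. A results line has six whitespace-separated columns (query id, the literal `Q0`, document id, rank, score, run tag) and a judgements (qrels) line has four (query id, an iteration column that is usually `0`, document id, relevance). The sketch below assumes search results are already held in a plain dict; the helper names and the `myrun` tag are made up for illustration.

```python
def format_run(results, run_tag="myrun"):
    """Render results as trec_eval run lines.

    results: dict mapping query id -> list of (doc_id, score) pairs.
    """
    lines = []
    for query_id, scored in results.items():
        # Rank by descending score; rank numbering starts at 1.
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ordered, start=1):
            lines.append("%s Q0 %s %d %.4f %s\n" % (query_id, doc_id, rank, score, run_tag))
    return "".join(lines)


def format_qrels(judgements):
    """Render (query_id, doc_id, relevance) triples as qrels lines."""
    return "".join("%s 0 %s %d\n" % (q, d, rel) for q, d, rel in judgements)
```

With both strings written to files, the evaluation itself is `trec_eval <qrels file> <results file>`, so lucy, lucene-java or solr would only need to emit the six-column results file.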
-
Re: some links to downloadable test collections
Robert Muir 2009-11-10, 20:48
Hi Simon, thanks for your comments.

I guess in my opinion, the fastest way to having something would be to create scripts that munge these various collections into a standard format, as mentioned earlier. And I think the easiest format would actually be to format queries, judgements, and text into what the Lucene-java benchmark expects already. This format is pretty simple and I don't think it would be a headache to use for other projects such as lucy or solr or maybe even comparisons against other software.

This is of course biased by the fact that I am lazy and I don't want to mess with the lucene benchmark package :)

I would like to create a JIRA issue to start working on this task, as I am maintaining this various junk internally at the moment.

Does anyone have a specific preference as to what programming language/build system/etc is desired? I don't have a preference, I just care about relevance.

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collections
Grant Ingersoll 2009-11-10, 22:25
On Nov 10, 2009, at 3:48 PM, Robert Muir wrote:

> Hi Simon, thanks for your comments.
>
> I guess in my opinion, the fastest way to having something would be to
> create scripts that munge these various collections into a standard format,
> as mentioned earlier.
> And I think the easiest format would actually be to format queries,
> judgements, and text into what the Lucene-java benchmark expects already.
> This format is pretty simple and I don't think it would be a headache to use
> for other projects such as lucy or solr or maybe even comparisons against
> other software.
>
> This is of course biased by the fact that I am lazy and I don't want to mess
> with the lucene benchmark package :)
>
> I would like to create a JIRA issue to start working this task, as I am
> maintaining this various junk internally at the moment.
>
> Does anyone have specific preference to what programming language/build
> system/etc is desired? I don't have a preference, I just care about
> relevance.

Since most of our projects are in Java, I would probably lean that way, but if it is just meant to be lightweight, then we could just use a scripting lang.
-
Re: some links to downloadable test collections
Robert Muir 2009-11-10, 22:30
Grant, I am fine with java, really. Marvin brought up python, I am willing to learn the language if that's what it takes (I only have minor experience with it so far)

Really, I think that any code that munges these collections isn't something we should worry about being nice from a software devel principle.

To correct the formats of this stuff, I always use sed, grep, or even vi/notepad. It's a throwaway type of thing in my opinion.

If people feel strongly towards any particular language/build system, let me know. Otherwise I want to start working on a patch sooner than later. Someone smarter than me could always help improve it.

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collections
Simon Willnauer 2009-11-10, 22:42
IMO we should not waste too much time on a decision about a programming language. Let's just go for Java / ANT as we all know what we are doing.

Thoughts?

+1 for java / ANT
-
Re: some links to downloadable test collections
Robert Muir 2009-11-10, 23:09
+1 (for agreeing on just something, let's get going on this!)

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collectionsMarvin Humphrey 2009-11-10, 23:45
Simon Willnauer replied to me:
> >> Sounds like we need one module per corpus to explode it into a common > >> format. > >> > >> Is ant the best approach here? �Maybe we start off with a scripting > >> language like Python? > > you wanna use you object model, right ? :) Haha, no. :) The Lucy object model is completely unrelated and wouldn't have been touched under what I was suggesting. In order to launch benchmarking apps for indexing libraries written in different languages -- Java Lucene, Perl Lucy, Python Lucy, etc -- our central library will need to launch external processes. That's the very definition of a scripting task. From <http://en.wikipedia.org/wiki/Scripting_language>: A scripting language, script language or extension language is a programming language that allows control of one or more software applications. "Scripts" are distinct from the core code of the application, which is usually written in a different language, and are often created or at least modified by the end-user. I suggested Python in particular because from a distance it looks like Python 3.x has pretty decent Unicode support. Robert Muir wrote: > > instead, i use sed and perl and what not to reformat things into the format > > the benchmark package wants... so I guess this is already what I am doing > > (scripting language) I don't recommend Perl for this application. It's great for hacking up fast file wrangling stuff, but its Unicode support is hard to use and very hard to debug unless you understand the underlying implementation, which for backwards-compatibility reasons is very very complicated. I know it backwards and forwards so I could get good results, but I don't think other people should have to make that investment. The Java/ant combo wouldn't be my preference for the opposite reason: the Unicode support is there, but it's a lot more verbose and unwieldy for scripting tasks and quick file hackups. 
This would all matter more if we end up generalizing some portion of the Lucene benchmarking suite under Open Relevance so that other projects could use it. (The fact that Mike McCandless, primary author of the benchmarking suite, is fluent in Python, also drove the suggestion.) That seems natural because we need some framework to run the relevance tests under; exporting to a common intermediate format is nice, but running actual benchmarks is nicer. And since search-time benchmarking capabilities are sorely needed for Lucy, I'd get involved. However, I'd pretty much resigned myself to porting a separate implementation of the benchmarking suite eventually. And given the way this thread has progressed since I started writing this reply, looks like that's what I'll be falling back to after all. Oh well... no gain, no loss. Marvin Humphrey
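
The "launch external processes" idea above can be sketched in Python. This is only a minimal sketch under stated assumptions: it assumes each search library exposes some command-line entry point for its benchmark, and the function name `run_benchmark` and the placeholder command are hypothetical, not anything proposed in the thread.

```python
import subprocess
import sys
import time


def run_benchmark(name, command, timeout=600):
    """Launch one external benchmark process; return (name, exit code, seconds)."""
    start = time.perf_counter()
    # The child can be written in any language; we only see its exit code.
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    elapsed = time.perf_counter() - start
    return name, completed.returncode, elapsed


if __name__ == "__main__":
    # Hypothetical placeholder: a real driver would list one command per engine.
    commands = {"placeholder": [sys.executable, "-c", "print('indexing...')"]}
    for engine, cmd in commands.items():
        print(run_benchmark(engine, cmd))
```

The driver never needs to know about any library's object model, which is the point Marvin makes: it only starts processes and collects results.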
-
Re: some links to downloadable test collectionsAndrzej Bialecki 2009-11-10, 23:46
Robert Muir wrote:
> +1 (for agreeing on just something, lets get going on this!)
>
> On Tue, Nov 10, 2009 at 5:42 PM, Simon Willnauer <[EMAIL PROTECTED]> wrote:
>
>> IMO we should not waste too much time for a decision on a programming
>> language. Lets just go to for Java / ANT as we all know what we are
>> doing.
>>
>> Thoughts?
>>
>> +1 for java / ANT

In the spirit of a long-time Unix hacker I love to use one-liners as much as the next guy... but they are hard to maintain. So +1 to Java/ant, with a comment that we should not get too religious about _not_ using *nix utilities where it makes sense - ant can drive a shell script that munges the format with grep/sed/awk if that's easier/faster to do than writing a Java class.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
-
Re: some links to downloadable test collectionsRobert Muir 2009-11-11, 00:13
Marvin, I am a little concerned about your comments.
I think that there might be a little confusion:

1. the trec portion of the lucene benchmark suite is essentially standalone, it doesn't really interact with the other components.
2. creating some scripts/code to download and reformat collections into a standardized format, I think this is the way to go? Why not agree on the conventional TREC format that the lucene benchmark package expects?
3. we are talking about downloading and reformatting text files into text files, seriously I don't think you need to/should understand really anything about the lucene benchmark impl to make use of this.

by the way, if there's anything I can do to make this concept more amenable to you, please reply! actually I 100% agree that we should not limit anything to any specific lucene implementation.

In fact, I don't want to try to imply some scope creep, but I'm completely for the idea that this kind of thing could be re-used to compare even non-apache projects (other search engines, etc). Surely we have some stuff to learn from each other.

This is just about boilerplate code, build scripts, downloading, reformatting, very boring stuff :)

> However, I'd pretty much resigned myself to porting a separate implementation
> of the benchmarking suite eventually. And given the way this thread has
> progressed since I started writing this reply, looks like that's what I'll be
> falling back to after all. Oh well... no gain, no loss.
>
> Marvin Humphrey

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collectionsMarvin Humphrey 2009-11-11, 17:55
On Tue, Nov 10, 2009 at 07:13:58PM -0500, Robert Muir wrote:
> Why not agree on the conventional TREC format that the lucene benchmark
> package expects?

+0, seems logical, but I'm not well informed about either the format itself or possible alternatives.

> 3. we are talking about downloading and reformatting text files into text
> files, seriously I don't think you need to/should understand
> really anything about the lucene benchmark impl to make use of this.

OK. The next logical step is to actually do something with the files, and I figured you were going there. I didn't realize that simply converting the files was more-or-less sufficient for coaxing something useful out of the Lucene benchmark suite.

Please carry on.

> In fact, I don't want to try to imply some scope creep, but I'm completely
> for the idea that this kind of thing could be re-used to compare even
> non-apache projects (other search engines, etc).

If it's not going to work with other search engines, the project should be called "Open Irrelevance". :P

PS: I misremembered the authorship of the Lucene benchmarking suite earlier. McCandless has been modding it recently, but the original patch was a team effort from Grant Ingersoll and Doron Cohen with prior art contributions by Andrzej Bialecki and myself. Apologies.

Marvin Humphrey
-
Re: some links to downloadable test collectionsRobert Muir 2009-11-12, 11:34
Marvin, I'm not really sure it's the format that we want to stick with either?

For example, converting everything to a least common denominator will work for now, but some collections might have special properties (e.g. fields with categorization values, other interesting things).

just want to get something started and working, worst case: nobody likes the patch and we are back to where we are now!

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collectionsGrant Ingersoll 2009-11-12, 11:59
On Nov 10, 2009, at 7:13 PM, Robert Muir wrote:

> Marvin, I am a little concerned about your comments.
>
> I think that there might be a little confusion:
> 1. the trec portion of the lucene benchmark suite is essentially standalone,
> it doesn't really interact with the other components.

+1

> 2. creating some scripts/code to download and reformat collections into a
> standardized format, I think this is the way to go?
> Why not agree on the conventional TREC format that the lucene benchmark
> package expects?

+1. This makes the most sense. No sense in re-inventing the wheel.

> 3. we are talking about downloading and reformatting text files into text
> files, seriously I don't think you need to/should understand
> really anything about the lucene benchmark impl to make use of this.

+1

> by the way, If there's anything I can do to make this concept more amenable
> to you, please reply!
> actually I 100% agree that we should not limit anything to any specific
> lucene implementation.
>
> In fact, I don't want to try to imply some scope creep, but I'm completely
> for the idea that this kind of thing could be
> re-used to compare even non-apache projects (other search engines, etc).
> Surely we have some stuff to learn from each other.

+1

> This is just about boilerplate code, build scripts, downloading,
> reformatting, very boring stuff :)

Yawn.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
-
Re: some links to downloadable test collectionsNicola Ferro 2009-11-12, 12:18
Our experience in organizing and running CLEF for 10 years has been to not go for a least common denominator but to leave collections as they are.

The rationale is that:

1) you lose the link/alignment with the original collection
2) you lose or discard information (tags) that might be useful in the future for unforeseen evaluation tasks / reuses of the collection
3) you might introduce errors, if you miss something in the semantics of the original collection or you have bugs in the software
4) it is almost impossible to develop a format that fits all the domains (e.g. news, library collections, patent collections, juridical documents, ...) or mixed media collections (images+text, speech+text, ...)
5) errors / alternative transliterations (e.g. with accents, without accents) / documents with empty content/tags in the collection represent a real-world situation which search engines should be able to cope with.

The only thing we ask of the new collections (not the legacy ones) is that they be in XML and UTF-8, and that they ensure unique document identifiers (possibly according to some meaningful/agreed format).

All the best,
Nicola

---------------------------------------------------------------------------------
Nicola Ferro - Ph.D. in Computer Science
Assistant Professor

Department of Information Engineering (DEI)
University of Padua
Via Gradenigo, 6/A - 35131 Padova - Italy
Tel +39 049 827 7939 Fax: +39 049 827 7799

skype: nicola.ferro
e-mail: [EMAIL PROTECTED]
home page: http://ims.dei.unipd.it/members/ferro/
---------------------------------------------------------------------------------
-
Re: some links to downloadable test collectionsRobert Muir 2009-11-12, 12:39
Nicola,
I agree with your assessment, however if someone wants the collection 'as it is', they can already do this without any openrelevance project (just download the collection, you have it).

what I am proposing is some scripts to create a consistent format to make consumption easier, else every project that wants to run the tests must implement parsers/etc for each collection, due to these inconsistencies.

Most of the formatting differences I speak of are things such as using various different tags to refer to the document id: Docname, DOCNAME, DOCID, ..., different formatting of queries and judgements files.

I am not talking about changing any of the content (accents or errors), and I don't see how this really loses anything from the original collection...

I'll look at including all tags, for lucene-java we can change TrecContentSource to ignore tags that don't matter for the time being.

--
Robert Muir
[EMAIL PROTECTED]
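
The document-id tag differences Robert mentions (Docname, DOCNAME, DOCID) could be handled by a small rewrite step like the one below. This is a minimal sketch, not the patch discussed in the thread: the choice of `DOCNO` as the canonical tag and the function name `normalize_id_tags` are assumptions made for illustration, and only the tag names change, never the content.

```python
import re

# Variants named in the thread; the canonical target tag is an assumption.
ID_TAG_VARIANTS = ("Docname", "DOCNAME", "DOCID")
CANONICAL_ID_TAG = "DOCNO"

# Matches an opening or closing tag whose name is one of the variants.
_ID_TAG_RE = re.compile(
    r"<(/?)(?:%s)>" % "|".join(re.escape(tag) for tag in ID_TAG_VARIANTS),
    re.IGNORECASE,
)


def normalize_id_tags(text):
    """Rename document-id tags to one canonical tag, leaving content untouched."""
    return _ID_TAG_RE.sub(
        lambda match: "<%s%s>" % (match.group(1), CANONICAL_ID_TAG), text
    )


if __name__ == "__main__":
    print(normalize_id_tags("<DOC><Docname>a1</Docname><TEXT>café</TEXT></DOC>"))
```

Because accents and any errors in the text pass through unchanged, a step like this stays within the "structure only" scope Robert describes.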
-
Re: some links to downloadable test collectionsRobert Muir 2009-11-12, 13:09
here are some more examples:
some judgement files are tab delimited, some space delimited. some collections concatenate all documents into one big file, some have thousands of smaller files. some have these under subdirectories, some do not.

so, yeah it sounds like i'm not proposing much value-add, but all of these little inconsistencies add up into annoyances and processing to make use of the collection.

I'm definitely not proposing changing any of the actual content. For the openly-available collections I have worked with and listed on the wiki, once you fix these structural differences, it's easy to make use of them. I am already doing it.

I'm not concerned about any theoretical properties of collections that are not openly available (things with speech or whatever).

--
Robert Muir
[EMAIL PROTECTED]
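
The tab-versus-space difference in judgement files can be removed with a few lines of Python. This is a minimal sketch that assumes the four-column qrels layout read by trec_eval (topic, iteration, document id, relevance); a collection whose judgement file uses a different layout would need its own handling, and the function names are hypothetical.

```python
def normalize_qrels_line(line):
    """Split on any run of tabs or spaces and re-join with single spaces."""
    fields = line.split()
    if not fields:
        return None  # blank line, nothing to emit
    if len(fields) != 4:
        # Assumption: topic, iteration, docno, relevance
        raise ValueError("expected 4 qrels fields, got %d: %r" % (len(fields), line))
    return " ".join(fields)


def normalize_qrels(lines):
    """Normalize every non-blank judgement line, preserving order."""
    normalized = (normalize_qrels_line(line) for line in lines)
    return [line for line in normalized if line is not None]


if __name__ == "__main__":
    for row in normalize_qrels(["1\t0\tdoc-1\t1\n", "2  0 doc-2   0"]):
        print(row)
```

Failing loudly on an unexpected column count is a deliberate choice here, since silently guessing would risk the kind of conversion error Nicola warns about.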
-
Re: some links to downloadable test collectionsNicola Ferro 2009-11-12, 13:51
Dear Robert,
that's fine. Maybe I've been a little bit overly concerned.

But what do you plan to do with non-XML collections? Do you plan to XMLify them? And what about those that are in SGML with a DTD? Do you plan to translate them to XML and also provide translations of their document type, e.g. to XSchema? Do you in general plan to add a document type for all the collections? We usually do that in CLEF because it makes it possible to validate documents, and it is good documentation for the users of the collections, who know what to expect.

In general, I'm more than in favour of having standardised XML-based formats for topics, qrels, and runs instead of / in conjunction with the legacy TREC format - which is sometimes redundant and sometimes esoteric wrt unused fields.

We have developed straightforward XML formats in CLEF for runs, qrels, and topics, but we have publicly used only the one for topics because participants are used to trec_eval, which does not work with runs and qrels in XML. You can have a look at the XML topics at:

http://direct.dei.unipd.it/10.2452/100-AH

If you are interested, we would be happy to share those formats.

By the way, we have also developed a Java wrapper for trec_eval 8.0 (via JNI) which allows us to use trec_eval as a plain Java object while still using its original code for computations, to ensure compliance with the actual implementation used at TREC (same results, same bugs if any -> comparable performance figures). Maybe this could be of interest to you as well.

All the best,
Nicola
-
Re: some links to downloadable test collectionsRobert Muir 2009-11-12, 15:06
Nicola, actually for now I am thinking just to use the legacy trec format.
The only reason is that this way, I don't have to change the lucene-java benchmark package to make use of it.

I think an xml format for everything might be better in the future (and looks like you have some experience with this kind of thing already), perhaps someone would be interested in doing this as a later improvement, and also contributing code to make use of it for lucene-java benchmark.

as far as the actual collection content, right now this package only expects "body" and "docname" index fields to be present, so this is a very simple format for the time being. I don't really care about the format of the existing collection for the time being, I'm not going to do anything complicated involving xml schema or dtds. I'm just going to parse out the "body" and "docname" and output in a consistent way, so that we can have some repeatable relevance tests in the near future.

In the future, I think people can contribute improvements to all of this. I'm trying to start with the very barebones basics, expecting that others can contribute improvements, or even completely replace all of this!

--
Robert Muir
[EMAIL PROTECTED]
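
The "parse out body and docname and output in a consistent way" step might look roughly like this in Python. It is a minimal sketch under loud assumptions: the `<DOC>`/`<DOCNO>`/`<TEXT>` wrapper is only an illustrative TREC-style layout, not a confirmed description of what the lucene-java benchmark package expects, and `render_collection` is a hypothetical name. The duplicate check reflects Nicola's request for unique document identifiers.

```python
def render_collection(records):
    """Render (docname, body) pairs in one consistent, TREC-style layout.

    The tag names used here are an assumption made for illustration.
    """
    seen = set()
    parts = []
    for docname, body in records:
        docname = docname.strip()
        if not docname:
            raise ValueError("empty docname")
        if docname in seen:
            raise ValueError("duplicate docname: %r" % docname)
        seen.add(docname)
        # The body is copied as-is apart from surrounding newlines.
        parts.append(
            "<DOC>\n<DOCNO>%s</DOCNO>\n<TEXT>\n%s\n</TEXT>\n</DOC>\n"
            % (docname, body.strip("\n"))
        )
    return "".join(parts)


if __name__ == "__main__":
    print(render_collection([("a1", "first body"), ("a2", "second body")]), end="")
```

Each source collection would still need its own small reader to produce the (docname, body) pairs; only the output side is shared.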
-
Re: some links to downloadable test collectionsRobert Muir 2009-11-12, 15:43
sorry, I wanted to respond to this also.
This sounds interesting for the future I think.

For now, the lucene-java code simply outputs a submission.txt file, which can then be used from the commandline with trec_eval (invoked manually).

I think we want to continue to do this for starters. In the future it might be neat to have something that uses a wrapper like this to create, say jira-formatted tables for us to use in benchmarks, but there might be other ways to do the same thing...

I agree that for comparisons though, it would be best if everyone used the official trec_eval to make comparisons, and not the summary output from the lucene-java benchmark package.

> By the way we have also developed a Java wrapper for trec_eval 8.0 (via JNI)
> which allows use to use trec_eval as a plain Java object still using its
> original code for computations to ensure compliancy with the actual
> implementation used at TREC (same results, same bugs if any -> comparable
> performance figures). Maybe, this could be of your interest as well.
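
For engines other than lucene-java, a submission file for trec_eval can be produced with very little code. The sketch below writes the six-column run format that trec_eval reads (topic, the literal `Q0`, document id, rank, score, run tag). Whether this matches the lucene-java submission.txt byte for byte is not something the thread confirms, and the function name `format_run` and the default tag are hypothetical.

```python
def format_run(results, run_tag="example"):
    """Format {topic: [(docno, score), ...]} as trec_eval run lines."""
    lines = []
    for topic in sorted(results):
        # Highest score first; ties keep their input order.
        ranked = sorted(results[topic], key=lambda pair: pair[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(
                "%s Q0 %s %d %.4f %s" % (topic, docno, rank, score, run_tag)
            )
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(format_run({"1": [("d2", 0.5), ("d1", 1.5)]}, run_tag="demo"), end="")
```

Writing the run file from every engine and scoring it with the official trec_eval keeps the reported figures comparable, which is the point made in the message above.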
-
Re: some links to downloadable test collectionsNicola Ferro 2009-11-13, 08:10
Ok, I see your point.
Nicola

On 12 Nov 2009, at 16:06, Robert Muir wrote:

> Nicola, actually for now I am thinking just to use the legacy trec format.
> The only reason is that this way, I don't have to change the lucene-java
> benchmark package to make use of it.
>
> I think an xml format for everything might be better in the future (and
> it looks like you have some experience with this kind of thing already),
> perhaps someone would be interested in doing this as a later improvement,
> and also contributing code to make use of it for lucene-java benchmark.
>
> as far as the actual collection content, right now this package only expects
> "body" and "docname" index fields to be present, so this is a very simple
> format for the time being. I don't really care about the format of the
> existing collection for the time being, I'm not going to do anything
> complicated involving xml schema or dtds. I'm just going to parse out the
> "body" and "docname" and output in a consistent way, so that we can have
> some repeatable relevance tests in the near future.
>
> In the future, I think people can contribute improvements to all of this.
> I'm trying to start with the very barebones basics, expecting that others
> can contribute improvements, or even completely replace all of this!
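Editorial sketch: the plan discussed here (parse out "body" and "docname" from collections that name their tags differently, then write them out in one consistent way) could look roughly like the Python below. This is not the code from the thread. The accepted input tags, the output tag names `DOCNAME` and `BODY`, and the function names are assumptions made for illustration, and the document content itself is left unchanged.

```python
import re

# Input tags are matched case-insensitively. Which tags a given collection
# really uses is an assumption here; extend the alternations as needed.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.IGNORECASE | re.DOTALL)
ID_RE = re.compile(r"<(docname|docid)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
TEXT_RE = re.compile(r"<(text|body)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def normalize(raw):
    """Yield (docname, body) pairs from loosely TREC-like SGML text."""
    for doc in DOC_RE.finditer(raw):
        chunk = doc.group(1)
        id_match = ID_RE.search(chunk)
        text_match = TEXT_RE.search(chunk)
        if id_match is None or text_match is None:
            continue  # skip documents missing an id or a body
        yield id_match.group(2).strip(), text_match.group(2).strip()


def to_common_format(raw):
    """Re-emit every document with one fixed (hypothetical) tag layout."""
    out = []
    for docname, body in normalize(raw):
        out.append(
            "<DOC>\n<DOCNAME>%s</DOCNAME>\n<BODY>%s</BODY>\n</DOC>\n"
            % (docname, body)
        )
    return "".join(out)
```

One small module of this kind per collection would keep the collection-specific quirks out of the benchmark code that consumes the common format.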
-
Re: some links to downloadable test collections
Nicola Ferro 2009-11-13, 08:13

This sounds reasonable.

If, in the future, you are interested in this kind of approach (a wrapper to
trec_eval via JNI), let us know: it would be a pity to duplicate existing
work, and our package is well tested, since we have used it in CLEF since 2005.

All the best,
Nicola

On 12 Nov 2009, at 16:43, Robert Muir wrote:

> sorry, I wanted to respond to this also.
> This sounds interesting for the future I think.
>
> For now, the lucene-java code simply outputs a submission.txt file, which
> can then be used from the commandline with trec_eval (invoked manually).
>
> I think we want to continue to do this for starters. In the future it might
> be neat to have something that uses a wrapper like this to create, say
> jira-formatted tables for us to use in benchmarks, but there might be other
> ways to do the same thing...
>
> I agree that for comparisons though, it would be best if everyone used the
> official trec_eval to make comparisons, and not the summary output from the
> lucene-java benchmark package.
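Editorial sketch: the submission.txt file mentioned above is a run file in the usual TREC layout, one line per retrieved document with the columns query id, the literal `Q0`, document name, rank, score and run tag. The Python below shows one way such a file could be produced. The function name, the run tag and the choice to number ranks from 1 are assumptions of this sketch, not details taken from the lucene-java benchmark package.

```python
def format_run(results, run_tag="orp-sketch"):
    """Format search results as TREC run-file text.

    results maps a query id to a list of (docname, score) pairs.
    """
    lines = []
    for qid in sorted(results):
        # Highest score first; ranks are numbered from 1 in this sketch.
        ranked = sorted(results[qid], key=lambda pair: pair[1], reverse=True)
        for rank, (docname, score) in enumerate(ranked, start=1):
            lines.append(
                "%s Q0 %s %d %.4f %s\n" % (qid, docname, rank, score, run_tag)
            )
    return "".join(lines)


if __name__ == "__main__":
    # Tiny made-up result set, written to stdout instead of submission.txt.
    print(format_run({"1": [("doc-b", 0.5), ("doc-a", 1.5)]}), end="")
```

The output would then be evaluated by hand, for example with `trec_eval qrels.txt submission.txt`, where `qrels.txt` stands for whichever judgements file the collection ships with.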
-
Re: some links to downloadable test collections
Robert Muir 2009-11-13, 11:54

personally, I try to avoid writing JNI at all costs, so if I ever want such a
thing I will certainly shoot you an email!

I guess this is what you use to produce the charts/graphs for CLEF results?

On Fri, Nov 13, 2009 at 3:13 AM, Nicola Ferro <[EMAIL PROTECTED]> wrote:

> This sounds reasonable.
>
> If, in the future, you are interested in this kind of approach (wrapper to
> trec_eval via JNI) let us know: it would be a pity to duplicate existing
> work and our package is quite tested since we use it in CLEF since 2005.
>
> All the best,
> Nicola

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collections
Nicola Ferro 2009-11-13, 12:31

I don't like JNI-based solutions either, but results identical to those of
trec_eval are mandatory for us; e.g. participants use trec_eval on their own,
and this ensures exactly the same computations.

This is why I didn't write a pure Java version that redoes the computations
from scratch, and it was anyway better than calling an external OS process.

And yes, this is what we use in CLEF for computations inside the Java-based
system we use for managing the campaign.

Nicola

On 13 Nov 2009, at 12:54, Robert Muir wrote:

> personally, I try to avoid writing JNI at all costs so if I ever want such a
> thing I will certainly shoot you an email!
>
> I guess this is what you use to produce the charts/graphs for CLEF results?
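Editorial sketch: for comparison with the JNI wrapper, the "external OS process" alternative mentioned above could be as small as the Python below, which runs the official trec_eval binary and reads its summary lines. It assumes a `trec_eval` executable on the PATH and the customary three-column output (measure, query id, value); the function names are hypothetical and this is not the CLEF package.

```python
import subprocess


def parse_trec_eval(output):
    """Collect the summary rows (query id 'all') as measure -> value strings."""
    summary = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[1] != "all":
            continue  # ignore per-query rows and anything unexpected
        summary[parts[0]] = parts[2]
    return summary


def run_trec_eval(qrels_path, run_path, binary="trec_eval"):
    """Invoke trec_eval as a child process and parse what it prints."""
    completed = subprocess.run(
        [binary, qrels_path, run_path],
        capture_output=True,
        text=True,
        check=True,  # raise if trec_eval exits with an error
    )
    return parse_trec_eval(completed.stdout)
```

Values are kept as strings because not every row is numeric. Either route leaves the computations to the original trec_eval code, which is the property Nicola describes as mandatory.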
-
Re: some links to downloadable test collections
Andrzej Bialecki 2009-11-23, 09:29

Simon Willnauer wrote:

> IMO we should not waste too much time on a decision about a programming
> language. Let's just go for Java / ANT as we all know what we are
> doing.
>
> Thoughts?

As we start adding collections, IMHO it's important that we add a
per-collection LICENSE.txt and README.txt - what good is a collection from
some random URL without a record of its provenance and its suitability to be
used in this project?

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
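Editorial sketch: the per-collection LICENSE.txt and README.txt suggestion is easy to check mechanically. The Python below assumes a layout with one subdirectory per collection under a common root; that layout and the function name are assumptions for illustration, not something the project has decided.

```python
from pathlib import Path

# File names taken from the suggestion above.
REQUIRED = ("LICENSE.txt", "README.txt")


def missing_files(root):
    """Map each collection directory under root to the required files it lacks."""
    problems = {}
    for collection in sorted(Path(root).iterdir()):
        if not collection.is_dir():
            continue  # only subdirectories count as collections
        missing = [
            name for name in REQUIRED if not (collection / name).is_file()
        ]
        if missing:
            problems[collection.name] = missing
    return problems
```

An empty result means every collection carries both files; anything else lists what still has to be written before the collection is added.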
-
Re: some links to downloadable test collections
Robert Muir 2009-11-23, 12:03

you are right, let's open a JIRA issue

On Mon, Nov 23, 2009 at 4:29 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> As we start adding collections, IMHO it's important that we add a
> per-collection LICENSE.txt and README.txt - what good is a collection from
> some random URL without a record of its provenance and its suitability to
> be used in this project?

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collections
Andrzej Bialecki 2009-11-23, 14:01

Robert Muir wrote:

> you are right, lets open a JIRA issue

Done, ORP-3.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-
Re: some links to downloadable test collections
Robert Muir 2009-11-23, 15:24

thanks Andrzej, I added a few thoughts of my own.

I might be completely off-base, but I think we should exercise a lot of
caution not to give the impression that these are Apache works.

On Mon, Nov 23, 2009 at 9:01 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Done, ORP-3.

--
Robert Muir
[EMAIL PROTECTED]
-
Re: some links to downloadable test collections
Andrzej Bialecki 2009-11-23, 15:47

Robert Muir wrote:

> thanks Andrzej, I added a few thoughts of my own.
>
> I might be completely off-base, but I think we should exercise a lot of
> caution to not give the impression these are apache works.

Good thinking, I agree with your comments - this is a tricky issue, and it's
better to err on the side of caution.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
|