OpenRelevance, mail # dev - some links to downloadable test collections


Re: some links to downloadable test collections
Robert Muir 2009-11-10, 20:48
Hi Simon, thanks for your comments.

In my opinion, the fastest way to have something usable would be to
create scripts that munge these various collections into a standard format,
as mentioned earlier.
And I think the easiest option would actually be to format queries,
judgements, and text into what the Lucene-java benchmark already expects.
This format is pretty simple, and I don't think it would be a headache to use
for other projects such as Lucy or Solr, or maybe even for comparisons against
other software.
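
As a rough illustration of such a munging script, here is a minimal sketch that writes judgements in the whitespace-separated qrels layout that trec_eval reads (query id, iteration, document id, relevance). Treating that layout as the target is an assumption, and the function name, the tuple shape, and the sample ids are all made up for the example:

```python
def to_qrels_lines(judgments):
    """Turn (query_id, doc_id, relevant) tuples into qrels-style lines.

    Hypothetical helper: the input tuple shape is an assumption, not the
    layout of any particular collection. Each output line has the form
    "<query_id> 0 <doc_id> <relevance>", where the second column is the
    iteration field, conventionally written as 0.
    """
    lines = []
    for query_id, doc_id, relevant in judgments:
        relevance = 1 if relevant else 0  # binary judgements only
        lines.append(f"{query_id} 0 {doc_id} {relevance}")
    return lines


# Made-up sample judgements, purely for illustration.
sample = [
    ("1", "doc-17", True),
    ("1", "doc-42", False),
    ("2", "doc-03", True),
]

for line in to_qrels_lines(sample):
    print(line)
```

Each source collection would need its own small parser to produce the tuples; only the output step would be shared.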

This is of course biased by the fact that I am lazy and I don't want to mess
with the lucene benchmark package :)

I would like to create a JIRA issue to start working on this task, as I am
maintaining all this assorted junk internally at the moment.

Does anyone have a specific preference as to which programming language,
build system, etc. should be used? I don't have a preference, I just care
about relevance.

On Tue, Nov 10, 2009 at 3:42 PM, Simon Willnauer <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 10, 2009 at 7:33 AM, Robert Muir <[EMAIL PROTECTED]> wrote:
> >> So, what comparisons can we set up using these collections?
> >>
> >
> > I think we can be creative. for example I used one of these tonight to
> > test LUCENE-1812, Andrzej's index pruning tool. Results showed that it
> > works as he advertised at apachecon...
> >
> > also, we should be careful about the english ones i linked to (or
> > preferably, find bigger ones), because they are smallish collections.
> >
> >
> >>
> >> I seem to recall you suggesting at ApacheCon that they would be handy
> >> when judging Analyzer mods.
> >>
> >
> > Yeah, definitely don't think any results should be gospel for analyzers
> > or scoring or anything else, but then again I think we could detect if
> > some change is completely broken or silly (bugs, etc).
>
> This would bring huge value to Lucene and its derivatives. This
> sounds like a very good point to start from, especially until we have
> sorted out all the licensing issues, how to distribute collections, or
> what we want to crawl. There is a huge +1 from my side to get started with
> the small collections - 100% more than we have today.
> >
> >
> >>
> >> These collections are all binary assertions -- relevant/not-relevant for
> >> a given query -- right?  Am I correct in presuming that such corpora
> >> can't help us to judge scoring and ranking algorithms, or Similarity
> >> implementations?
> >>
> >
> > I think most of them are binary... but I think I disagree with your
> > second statement, these kinds of collections are used to compare
> > scoring/ranking algorithms all the time!
> Afaik, those collections yield pretty good results for all kinds of
> relevance judgements though.
> >
> >
> >>
> >> > also, if you have some ideas on how to perhaps create some ant tasks
> >> > to make downloading/running these thru the lucene benchmark package
> >> > easier, that would be great too.
> >>
> >> Hmm, that approach is specific to Lucene Java.  It's not handy for
> >> either of the projects I work on (Lucy, KinoSearch).
> >>
> >
> > You raise a good point here. Really at the end of the day, you just want
> > to produce a .txt file that you throw at the trec_eval commandline
> > program or something similar. Doing it in a lucene-java specific way
> > doesn't allow us to easily evaluate things even in solr, for example it
> > has analysis components that affect relevance!
>
> This is maybe the most important issue for the first step. I would
> really like to see a standard format which can be parsed easily by
> whatever language you use. I personally prefer JSON for almost
> everything as it is so easy to parse, read (human eyes) and write.
> Ant still sounds like a good plan as there are many, many functions
> already implemented and it is easy to extend.
> +1 for creating an issue for format and transformation.
> >
> > I guess one approach could be to create scripts and stuff here that
> > download

Robert Muir
[EMAIL PROTECTED]