OpenRelevance, mail # dev - some links to downloadable test collections


Re: some links to downloadable test collections
Grant Ingersoll 2009-11-10, 22:25

On Nov 10, 2009, at 3:48 PM, Robert Muir wrote:

> Hi Simon, thanks for your comments.
>
> I guess in my opinion, the fastest way to having something would be to
> create scripts that munge these various collections into a standard format,
> as mentioned earlier.
> And I think the easiest format would actually be to format queries,
> judgements, and text into what the Lucene-java benchmark expects already.
> This format is pretty simple and I don't think it would be a headache to use
> for other projects such as lucy or solr or maybe even comparisons against
> other software.
>
> This is of course biased by the fact that I am lazy and I don't want to mess
> with the lucene benchmark package :)
>
> I would like to create a JIRA issue to start working on this task, as I am
> maintaining this various junk internally at the moment.
>
> Does anyone have specific preference to what programming language/build
> system/etc is desired? I don't have a preference, I just care about
> relevance.

Since most of our projects are in Java, I would probably lean that  
way, but if it is just meant to be lightweight, then we could just use  
a scripting lang.
>
> On Tue, Nov 10, 2009 at 3:42 PM, Simon Willnauer <[EMAIL PROTECTED]> wrote:
>
>> On Tue, Nov 10, 2009 at 7:33 AM, Robert Muir <[EMAIL PROTECTED]> wrote:
>>>> So, what comparisons can we set up using these collections?
>>>>
>>>
>>> I think we can be creative. for example I used one of these tonight to
>>> test LUCENE-1812, Andrzej's index pruning tool. Results showed that it
>>> works as he advertised at apachecon...
>>>
>>> also, we should be careful about the english ones i linked to (or
>>> preferably, find bigger ones), because they are smallish collections.
>>>
>>>
>>>>
>>>> I seem to recall you suggesting at ApacheCon that they would be handy
>>>> when judging Analyzer mods.
>>>>
>>>
>>> Yeah, definitely don't think any results should be gospel for analyzers
>>> or scoring or anything else, but then again I think we could detect if
>>> some change is completely broken or silly (bugs, etc).
>>
>> This would bring a huge value to lucene and its derivatives. This
>> sounds like a very good point to start from especially until we sorted
>> out all the licensing issues, how to distribute collections or what we
>> want to crawl. There is a huge +1 from my side to get started with the
>> small collections - 100% more than we have today.
>>>
>>>
>>>>
>>>> These collections are all binary assertions -- relevant/not-relevant
>>>> for a given query -- right?  Am I correct in presuming that such
>>>> corpora can't help us to judge scoring and ranking algorithms, or
>>>> Similarity implementations?
>>>>
>>>
>>> I think most of them are binary... but I think I disagree with your
>>> second statement, these kinds of collections are used to compare
>>> scoring/ranking algorithms all the time!
>> Afaik, those collections yield pretty good results for all kinds of
>> relevance judgements though.
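
To illustrate the point quoted above, that binary judgements can still separate one ranking from another: a minimal sketch of average precision, a standard rank-sensitive measure computed from relevant/not-relevant labels alone. The function name and the toy document ids are hypothetical and do not come from any of the collections discussed in this thread.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query, using binary judgements only.

    ranked_doc_ids: document ids in the order the engine returned them.
    relevant_ids:   set of ids judged relevant for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


# Hypothetical toy data: the same three documents in two different orders.
relevant = {"a", "b"}
good_ranking = ["a", "b", "c"]
poor_ranking = ["c", "a", "b"]
```

Both rankings return the same documents, yet the one that puts the relevant documents first gets the higher value, which is why binary collections remain usable for comparing scoring changes.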
>>>
>>>
>>>>
>>>>> also, if you have some ideas on how to perhaps create some ant tasks
>>>>> to make downloading/running these thru the lucene benchmark package
>>>>> easier, that would be great too.
>>>>
>>>> Hmm, that approach is specific to Lucene Java.  It's not handy for
>>>> either of the projects I work on (Lucy, KinoSearch).
>>>>
>>>
>>> You raise a good point here. Really at the end of the day, you just
>>> want to produce a .txt file that you throw at the trec_eval commandline
>>> program or something similar. Doing it in a lucene-java specific way
>>> doesn't allow us to easily evaluate things even in solr, for example it
>>> has analysis components that affect relevance!
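
As a rough illustration of the ".txt file for trec_eval" idea quoted above: trec_eval reads a results file with one whitespace-separated line per retrieved document, in the order query id, the literal `Q0`, document id, rank, score, run tag. The sketch below assumes each engine can hand back (document id, score) pairs per query; the function name, the run tag, and the sample ids are hypothetical.

```python
def format_trec_run(results, run_tag="myrun"):
    """Turn per-query search results into trec_eval results-file lines.

    results: dict mapping query id -> list of (doc_id, score) pairs, any order.
    Returns a list of strings, one per retrieved document.
    """
    lines = []
    for query_id in sorted(results):
        # Rank by descending score; rank numbers start at 1.
        ranked = sorted(results[query_id], key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines


# Hypothetical results for one query from some engine.
sample_lines = format_trec_run({"1": [("d2", 0.5), ("d1", 1.5)]}, run_tag="t")
```

Writing these lines to a file keeps the evaluation step independent of the engine: anything that can emit them, whether Lucene Java, Solr, or Lucy, can be scored against the same judgements file.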
>>
>> This is maybe the most important issue for the first step. I would
>> really like to see a standard format which can be parsed easily by