OpenRelevance, mail # dev - some links to downloadable test collections


Re: some links to downloadable test collections
Robert Muir 2009-11-12, 12:39
Nicola,

I agree with your assessment. However, anyone who wants the collection 'as it
is' can already get it without any openrelevance project (just download the
collection and you have it).

What I am proposing is some scripts that convert each collection to a
consistent format to make consumption easier. Otherwise, because of these
inconsistencies, every project that wants to run the tests must implement
parsers etc. for each collection.

Most of the formatting differences I speak of are things such as the
different tags used for the document id (Docname, DOCNAME, DOCID, ...) and
the different formatting of query and judgement files.
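
As an illustration only (this is not the actual patch), a minimal sketch of
such a normalization step might look like the following. The names ID_TAGS,
TARGET_TAG and normalize_doc_ids are hypothetical, the tag list contains only
the three variants mentioned above, and the choice of DOCNO as the target tag
is an assumption based on the usual TREC document layout.

```python
import re

# Assumption: only the id tag variants named in this thread are handled.
ID_TAGS = ("Docname", "DOCNAME", "DOCID")

# Assumption: TREC-style collections conventionally use <DOCNO> for the id.
TARGET_TAG = "DOCNO"

# Match <TAG> id </TAG> for any known id tag; the closing tag must be the
# same tag that was opened.
_ID_TAG_RE = re.compile(
    r"<(?P<tag>%s)>\s*(?P<id>.*?)\s*</(?P=tag)>" % "|".join(ID_TAGS),
    re.DOTALL,
)


def normalize_doc_ids(text):
    """Rename known document-id tags to TARGET_TAG.

    Only the id tag name and the whitespace around the id change; all other
    tags and all document content are left exactly as they were.
    """
    return _ID_TAG_RE.sub(
        lambda m: "<%s>%s</%s>" % (TARGET_TAG, m.group("id"), TARGET_TAG),
        text,
    )
```

Because only the id tag is renamed, the other tags, accents and any errors in
the original text pass through unchanged.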

I am not talking about changing any of the content (accents or errors), and
I don't see how this really loses anything from the original collection...

I'll look at including all tags; for lucene-java we can change
TrecContentSource to ignore tags that don't matter for the time being.

On Thu, Nov 12, 2009 at 7:18 AM, Nicola Ferro <[EMAIL PROTECTED]> wrote:

> Our experience in organizing and running CLEF for 10 years has been to not
> go for a least common denominator but leave collections as they are.
>
> The rationale is that:
> 1) you lose the link/alignment with the original collection
> 2) you lose or discard information (tags) that might be useful in the
> future for unforeseen evaluation tasks / reuses of the collection
> 3) you might introduce errors, if you miss something in the semantics of
> the original collection or you have bugs in the software
> 4) it is almost impossible to develop a format that fits for all the
> domains (e.g. news, library collections, patent collections, juridical
> documents, ...) or mixed media collections (images+text, speech+text, ...)
> 5) errors / alternative transliterations (e.g. with accents, without
> accents) / documents with empty content/tags in the collection represent a
> real-world situation which search engines should be able to cope with.
>
> The only thing we ask of new collections (not the legacy ones) is that
> they be in XML and UTF-8, with unique document identifiers (possibly
> following some meaningful/agreed format).
>
> All the best,
> Nicola
>
>
>
>
> ---------------------------------------------------------------------------------
>     Nicola Ferro   -   Ph.D. in Computer Science
>     Assistant Professor
>
>     Department of Information Engineering (DEI)
>     University of Padua
>     Via Gradenigo, 6/A  -  35131 Padova - Italy
>     Tel +39 049 827 7939  Fax: +39 049 827 7799
>
>     skype: nicola.ferro
>     e-mail: [EMAIL PROTECTED]
>     home page: http://ims.dei.unipd.it/members/ferro/
>
> ----------------------------------------------------------------------------------
>
> On 12 Nov 2009, at 12:34, Robert Muir wrote:
>
>
>> Marvin, I'm not really sure it's the format that we want to stick with
>> either.
>>
>> For example, converting everything to a least common denominator will work
>> for now, but some collections might have special properties (i.e. fields
>> with categorization values, other interesting things).
>>
>> I just want to get something started and working. Worst case: nobody likes
>> the patch and we are back to where we are now!
>>
>> On Wed, Nov 11, 2009 at 12:55 PM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>>
>>> On Tue, Nov 10, 2009 at 07:13:58PM -0500, Robert Muir wrote:
>>>
>>>> Why not agree on the conventional TREC format that the lucene benchmark
>>>> package expects?
>>>
>>> +0, seems logical, but I'm not well informed about either the format
>>> itself or possible alternatives.
>>>
>>>> 3. we are talking about downloading and reformatting text files into
>>>> text files, seriously I don't think you need to/should understand
>>>> really anything about the lucene benchmark impl to make use of this.
>>>
>>> OK.  The next logical step is to actually do something with the files,
>>> and I figured you were going there.  I didn't realize that simply
>>> converting
Robert Muir
[EMAIL PROTECTED]