OpenRelevance, mail # dev - Re: [jira] Commented: (ORP-1) Use existing collections for relevance testing


Re: [jira] Commented: (ORP-1) Use existing collections for relevance testing
Grant Ingersoll 2009-11-18, 21:40
Simon,

Any luck on this?

Do you want me to try the patch?

-Grant

On Nov 14, 2009, at 7:06 PM, Simon Willnauer (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778023#action_12778023 ]
>
> Simon Willnauer commented on ORP-1:
> -----------------------------------
>
> Grant, I tried it on US and EU. I always get the same stupid error.
> I googled a bit and found a possible cause: the URL in the authz file may be slightly wrong (upper/lower case issues). Are you able to check this?
>
> simon
>
>> Use existing collections for relevance testing
>> ----------------------------------------------
>>
>>                Key: ORP-1
>>                URL: https://issues.apache.org/jira/browse/ORP-1
>>            Project: Open Relevance Project
>>         Issue Type: New Feature
>>         Components: Collections, Judgments, Queries
>>           Reporter: Robert Muir
>>           Assignee: Simon Willnauer
>>        Attachments: ORP-1.patch
>>
>>
>> I created a list of existing collections with queries and judgements on the wiki here: http://cwiki.apache.org/ORP/existingcollections.html
>> These can be downloaded from the internet.
>> (Please add more if you know of any.)
>> I've written source code (Ant and Java) to download these collections and reformat them into the TREC format that the Lucene benchmark expects.
>> Each collection has its own Ant script to download the collection and Java code to reformat it, although I have some shared code at the top level.
>> The resulting output for each collection is a "corpus.gz" file, a queries file, and a judgements file.
>> The corpus.gz is a gzipped file that can be indexed with Ant via the Lucene benchmark package (using TrecContentSource).
>> Once the index is created, the command-line tool QueryDriver under the Lucene benchmark quality/trec package can be used to run the evaluation.
>> It will print some summary output to stdout, but will also create a submission file that can be fed to trec_eval.
>> For starters, the patch will only support one collection, the Indonesian "Tempo" collection (around 23,000 docs).
>> We can simply add subdirectories for additional collections (it does a contrib-crawl-like thing).
>> Once I finish wrapping up some documentation (such as a description of the formats, some Javadocs, and an example), I'll upload the patch.
>> These formats are actually documented in the lucene-java benchmark package already, but I think it would be nice to add this for non-Java users.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
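
For readers unfamiliar with the file layouts mentioned in the issue description, here is a minimal sketch of the two line formats that trec_eval conventionally reads: a judgments (qrels) line and a submission (run) line. It is not taken from ORP-1.patch; the class name TrecFormats, the method names, the document id, and the run tag are all hypothetical, and the exact files produced by the patch may differ.

```java
import java.util.Locale;

// Hypothetical helper, not part of ORP-1.patch: formats single lines in the
// conventional TREC qrels and run layouts.
public class TrecFormats {

    // Qrels line: query id, iteration (conventionally 0), document id, relevance.
    public static String qrelLine(String queryId, String docNo, int relevance) {
        return String.format(Locale.ROOT, "%s 0 %s %d", queryId, docNo, relevance);
    }

    // Run line: query id, the literal Q0, document id, rank, score, run tag.
    public static String runLine(String queryId, String docNo, int rank,
                                 double score, String runTag) {
        return String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s",
                queryId, docNo, rank, score, runTag);
    }

    public static void main(String[] args) {
        // Made-up ids and scores, for illustration only.
        System.out.println(qrelLine("1", "doc-0001", 1));
        System.out.println(runLine("1", "doc-0001", 1, 12.5, "example"));
    }
}
```

Both formats are whitespace-separated with one record per line, which is why a plain formatter like this is enough to produce input for trec_eval.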