Re: [jira] Commented: (ORP-1) Use existing collections for relevance testing
Grant Ingersoll 2009-11-18, 21:40
Any luck on this?
Do you want me to try the patch?
On Nov 14, 2009, at 7:06 PM, Simon Willnauer (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778023#action_12778023 ]
> Simon Willnauer commented on ORP-1:
> Grant, I tried it on US and EU. I always get the same stupid error.
> I googled a bit and found that one possible cause is a slightly wrong URL in the authz file (upper/lower case issues). Are you able to check this?
>> Use existing collections for relevance testing
>> Key: ORP-1
>> URL: https://issues.apache.org/jira/browse/ORP-1
>> Project: Open Relevance Project
>> Issue Type: New Feature
>> Components: Collections, Judgments, Queries
>> Reporter: Robert Muir
>> Assignee: Simon Willnauer
>> Attachments: ORP-1.patch
>> I created a list of existing collections with queries and judgements on the wiki here: http://cwiki.apache.org/ORP/existingcollections.html
>> These can be downloaded from the internet.
>> (Please add more if you know of any.)
>> I've created source code (Ant and Java) to download these collections and reformat them to the TREC format that the Lucene benchmark expects.
>> Each collection has its own Ant script to download the collection and Java code to reformat it, although I have some shared code at the top level.
>> The resulting output for each collection is a "corpus.gz" file, a queries file, and a judgements file.
>> The corpus.gz is a gzipped file that can be indexed with Ant via the Lucene benchmark package (using TrecContentSource).
>> Once the index is created, the command-line tool QueryDriver under the Lucene benchmark quality/trec package can be used to run the evaluation.
>> It will print some summary output to stdout, but will also create a submission file that can be fed to trec_eval.
>> For starters, I will only have support for one collection in the patch, the Indonesian "Tempo" collection (around 23,000 docs).
>> We can simply add subdirectories for additional collections (the build does a contrib-crawl-like thing).
>> Once I finish wrapping up some documentation (such as description of the formats, some javadocs, and an example), I'll upload the patch.
>> These formats are actually documented in the lucene-java benchmark package already, but I think it would be nice to add this for non-Java users.
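For readers who have not seen the formats mentioned above, here is a minimal sketch of the two line-oriented layouts that trec_eval conventionally reads. A judgments (qrels) line is "query-id iteration doc-id relevance", and a submission line is "query-id Q0 doc-id rank score run-tag". The class name, method names, and sample values below are hypothetical and are not taken from ORP-1.patch or the Lucene benchmark package; the files the patch actually produces may differ in detail.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of ORP-1.patch or Lucene.
class TrecLines {

    // One judgments (qrels) line: query id, iteration (conventionally 0),
    // document name, and the relevance judgment.
    static String qrelsLine(String queryId, String docName, int relevance) {
        return queryId + " 0 " + docName + " " + relevance;
    }

    // One submission line: query id, the literal "Q0", document name,
    // rank, score, and a tag naming the run.
    static String submissionLine(String queryId, String docName,
                                 int rank, double score, String runTag) {
        // Locale.ROOT keeps the decimal separator a '.' on every machine.
        return String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s",
                queryId, docName, rank, score, runTag);
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        System.out.println(qrelsLine("1", "doc7", 1));
        System.out.println(submissionLine("1", "doc7", 1, 12.5, "tempo-run"));
    }
}
```

Running main prints one qrels line followed by one submission line; a file of each kind, one line per (query, document) pair, is what trec_eval takes as its two inputs.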