On Dec 23, 2009, at 2:37 PM, Mark Bennett wrote:
> Hello Simon and Robert,
> Robert, yes, I do have a private corpus and truth table. At this point I can't share it, though I'll ask my client at some point.
> I did find some code in JIRA, your patches, including the links here "for the record":
> Top languages for testing are English, Japanese, French and German.
> I'm excited to have others to talk to! I have some general comments / questions:
> 1: Although the qrels data format was originally binary yes/no, apparently there were more flexible dialects used in later years, that allowed for some weighting. Was there a particular dialect that y'all were considering?
I think it would be nice to have both binary and something like: relevant, somewhat relevant, not relevant, embarrassing or a scale of 1-5 or 1-10 depending on how hard core you want to be.
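To make that concrete, here is a minimal sketch of reading graded judgments, assuming the usual four-column TREC qrels layout (topic, iteration, docno, relevance) with an integer grade in the last column. The function names and the `min_grade` threshold are hypothetical, not an agreed design; binary judgments fall out by thresholding the grade.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse 'topic iteration docno relevance' lines into {topic: {docno: grade}}."""
    qrels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic, _iteration, docno, grade = line.split()
        qrels[topic][docno] = int(grade)
    return dict(qrels)


def to_binary(qrels, min_grade=1):
    """Collapse graded judgments to 0/1; min_grade is a hypothetical cutoff."""
    return {
        topic: {doc: int(grade >= min_grade) for doc, grade in docs.items()}
        for topic, docs in qrels.items()
    }


# Made-up sample judgments, for illustration only.
sample = [
    "301 0 DOC-A 3",
    "301 0 DOC-B 0",
    "302 0 DOC-A 1",
]
graded = parse_qrels(sample)
binary = to_binary(graded, min_grade=2)
```

Keeping the grade as an integer means a 1-5 or 1-10 scale needs no format change, only a different cutoff when a binary view is wanted.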
> 2: CAN WE use the TREC qrels format(s)?
> I believe TREC has various restrictions on the use of test results, source content and evaluation code (annoying since TREC is supposed to foster research and NIST is paid for by US tax dollars, but that's another whole rant) But do we think the file format is "open" or "closed" ?
We should be able to use the format. I think the only thing closed about TREC is the need to pay a small sum for the collection, but that isn't NIST's fault.
> 3: I do favor an optional "analog" scale. Do you agree?
> Our assertions are on an A-F scale; I can elaborate if you're interested. A floating point scale is more precise perhaps, but we have human graders, and explaining letter grades that approximate academic rankings was less confusing, plus we were already using numbers in two other aspects of the grading form.
> 4: Generally do you guys favor a simple file format (one line per record), or an XML format?
> TREC was born in the early 90's I guess, so is record oriented, and probably more efficient. We have our tests in an XML format, which though more verbose, affords a lot more flexibility including comments and optionally self-contained content. It also sidesteps encoding issues, since XML defaults to UTF-8. I've found that "text files" from other countries tend to come in numerous encodings. And Excel, which is often used for CSV and other delimited files, sadly does NOT do UTF-8 in CSV files.
Pretty wide open at this point.
> 5: How much do you value interoperability with Excel?
> It's VERY handy for non-techies, and the .xlsx format is a set of zipped XML files, so perhaps acceptably "open". I would not propose .xlsx as the standard format, but it'd be nice to inter-operate with it. We'd need some type of template.
That would be great.
> 6: "quiescent" vs. "dynamic" strategies
> Content: During in-house testing it's sometimes been hard to maintain a static set of content. You can have a separate system, but I suspect in some scenarios it won't be feasible to lock down the content. See item 10 below. My suggestion is to mix this into the thinking. Some researchers wouldn't accept the variables it adds, but for others if it's a choice between imperfect checks and no checks at all, they'll take the imperfect.
> Grades / Evaluations: It's VERY hard to get folks to grade an MxN matrix. I had a matrix of just 57 x 25 (> 1,400 spots) and, trust me, it's hard to do in one sitting. It'd be nice to handle spotty evaluations.
I think it's pretty important to be able to reproduce experiments across users/machines/etc. which means the content needs to be versioned. This is the one big issue I have w/ simply pointing at other data sets. Ultimately, we will need our own collection that we can version.
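One way to check that two people are running against the same version of the content is to fingerprint it. The following is only a sketch under my own assumptions: `corpus_fingerprint` is a hypothetical helper, and the corpus is taken to be a simple mapping of document id to text, which is not something we have settled on.

```python
import hashlib


def corpus_fingerprint(docs):
    """Return a hex SHA-256 digest of a {doc_id: text} mapping.

    Ids are sorted so the digest does not depend on insertion order.
    """
    digest = hashlib.sha256()
    for doc_id in sorted(docs):
        for part in (doc_id, docs[doc_id]):
            data = part.encode("utf-8")
            # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


# Made-up documents, for illustration only.
v1 = {"doc1": "lucene in action", "doc2": "mahout clustering"}
v1_reordered = {"doc2": "mahout clustering", "doc1": "lucene in action"}
v2 = {"doc1": "lucene in action", "doc2": "mahout clustering, revised"}

same = corpus_fingerprint(v1) == corpus_fingerprint(v1_reordered)
changed = corpus_fingerprint(v1) != corpus_fingerprint(v2)
```

Recording the fingerprint next to each set of results would at least make it obvious when two runs used different content.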
> 7: fuzzy evaluations vs. "unit testing"
> Given the variabilities (covered in other points), it'd be nice to come up with fuzzier assertions.
> * "Doc1 is more relevant to Search1 than Doc2"
> * "I'd like to see at least 3 of these docs in the top 10 matches for this search"
Nice to have, but likely further down the road. However, the door is wide open at this point, so scratch that itch!
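For anyone who does want to scratch it, the two example assertions above could be sketched roughly as follows. The function names are hypothetical, and a ranking is assumed to be a plain list of document ids, best match first.

```python
def more_relevant(ranking, better, worse):
    """True if `better` appears above `worse`; a missing doc counts as ranked last."""
    def rank(doc):
        return ranking.index(doc) if doc in ranking else len(ranking)
    return rank(better) < rank(worse)


def at_least_n_in_top_k(ranking, wanted, n, k):
    """True if at least `n` of the `wanted` docs appear in the top `k` results."""
    return len(set(ranking[:k]) & set(wanted)) >= n


# Made-up result list for one search, best match first.
results = ["doc7", "doc1", "doc4", "doc2", "doc9"]

ordered_ok = more_relevant(results, "doc1", "doc2")
top4_ok = at_least_n_in_top_k(results, {"doc1", "doc2", "doc3"}, n=2, k=4)
top3_ok = at_least_n_in_top_k(results, {"doc1", "doc2", "doc3"}, n=2, k=3)
```

Both checks only need a partial set of judgments, so they would also tolerate a sparsely graded matrix.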
Yep, this has been kicked around and would be quite nice.
It's an open source project at Apache, so anybody who is fine w/ the ASL can participate. Meaning both academics and commercial companies. Frankly, I don't care much about p@1000, but p@5 and p@10 are quite interesting, so I tend to be more real world focused, but a healthy cross-fertilization will be great.
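For reference, p@k is just the fraction of the top k results that are judged relevant. A minimal sketch, assuming binary judgments and a ranked list of document ids (the names here are illustrative, not an existing API):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top `k` results that are in `relevant`.

    Divides by k even when fewer than k results came back, so short
    result lists are penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k


# Made-up ranking and judgments, for illustration only.
ranking = ["d3", "d1", "d8", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p_at_2 = precision_at_k(ranking, relevant, 2)  # 1 hit in top 2
p_at_5 = precision_at_k(ranking, relevant, 5)  # 2 hits in top 5
```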
Indeed, not to preclude others, but I know I'm focused on how to use it for Lucene/Mahout etc. In other words, the latter two in the list. If other vendors want to participate, that is great too. All are welcome. Still, it's pretty hard to really do Engine A vs. Engine B tests in a fair way.
Maybe. Many clustering algorithms calculate distance much the same way that the engine scores, so it may just be a case of self-fulfilling prophecy.
Reproducibility is paramount. One of the biggest issues w/ these types of evaluations is the problem of managing the output of the tests and keeping track of them. I imagine we'll develop tools for that, too.
Sure, the more the merrier. Getting the word out is important, as is setting expectations on what they will find once they arrive.
See other response.
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search