Re: Re. Lots to talk about, ORP
Grant Ingersoll 2009-12-24, 11:46
On Dec 23, 2009, at 7:27 PM, Robert Muir wrote:
> Hey Mark, just some quick replies below.
> On Wed, Dec 23, 2009 at 5:49 PM, Mark Bennett <[EMAIL PROTECTED]> wrote:
> Are you guys on board with this? There were comments like "First and foremost, this project is a way for Lucene to talk about relevance in a standard way..." and "I think for starters, our primary focus should be to support improvements of apache lucene-related projects. Then we can expand later... "
> I should reword this, as Grant said... scratch your own itch. If you want to help support another search engine, fantastic! I did some very minimal work so lucene-java could run relevance tests, so that was my itch. But please don't let this discourage you from supporting search engine XYZ.
+1. This is how open source works. The overall goal of the project is to be able to judge relevance for a search engine in an open way. I personally won't be building any tools for things other than Lucene/Solr/Mahout (yes, I think we can use these same corpora for machine learning too!), but there's no reason others can't. We'll just need to properly structure things in the SVN for the various code points.
> If we push that too hard, we'll scare away folks from other communities. I agree that people should each scratch their worst itch; I think it's in part a question of positioning. Solr and Nutch are very heavily associated with Lucene, which is understandable. But virtually every client we work with has multiple engines, so we have a bit of a different itch, I guess.
> We welcome any patches to support these additional search engines... I mean, we can't even run tests against things like Solr yet... (which would also be cool)
I hope to have some time on that, but others should jump in too.
> 3: Multiple languages are good, even though some of the early content has been selected more because it was available. English might be a strategic language to get covered early. I'd really like to see a parallel set of test documents and searches in multiple languages; that's what my client is having to build.
> +1. It would be great to consider using parallel text to make it easier to support many languages, although it might require us to have some different search domains (but I think this is ok?).
We could do CLIR. I love it. Going back to my roots!
> We have limited resources here, so I think this kind of thing is interesting. English has been done time and time again, and while I still think it's important, what if we can build a multilingual relevance corpus with only a small additional effort? Yes, I realize this kind of approach probably wouldn't be as accurate as building individual collections for each language, but it's probably very close.
> Can we consider something like http://www.statmt.org/europarl/ ???
> More open parallel corpora available here: http://urd.let.rug.nl/tiedeman/OPUS/
> I mentioned europarl, even though it has fewer languages than, say, http://langtech.jrc.it/JRC-Acquis.html, because the note at the bottom is especially interesting: "We are not aware of any copyright restrictions of the material."
> If there is no problem with this, I'd like to help. Supporting more languages is my itch.
> Robert Muir
> [EMAIL PROTECTED]