Grant Ingersoll
2009-07-31, 23:01
Andrzej Bialecki
2009-07-31, 23:23
Peter Skomoroch
2009-07-31, 23:26
Simon Willnauer
2009-08-01, 18:27
Grant Ingersoll
2009-08-05, 02:41
Andrzej Bialecki
2009-08-05, 08:40
Nicola Ferro
2009-08-12, 17:04
-
Getting Started
Grant Ingersoll 2009-07-31, 23:01
OK, so how do we get this started? Seems like there are a lot of
collections out there we could use. Also, we can crawl. Seems the
tricky part is getting judgments.

Thoughts?

-Grant
-
Re: Getting Started
Andrzej Bialecki 2009-07-31, 23:23
Grant Ingersoll wrote:
> OK, so how do we get this started? Seems like there are a lot of
> collections out there we could use. Also, we can crawl. Seems the
> tricky part is getting judgments.

I think we should establish first what kind of relevance judgments we
want to collect:

1. given a corpus, and a query, define an ordered list of top-N
documents that are relevant to the query. This is our baseline. Getting
this sort of information is very time-consuming and subjective.

2. given a corpus, a query and a list of top-N results obtained from a
real search, define what results are relevant and how they should be
ordered. The reviewed list of top-N results becomes then the initial
approximation of our baseline. Calculate a distance metric between real
and reviewed result, and adjust ranking to maximize this metric.

The second scenario could be handled by a webapp, which could present
the following areas of functionality:

* corpus selection and browsing

* searching using selected search impl and its ranking parameters, and
storing tuples of <corpus, impl, query, results>

* review of the results (marking relevant / non-relevant, reordering),
and saving of tuples <corpus, impl, query, reviewed results>

* calculation of distance metrics.

* adjustment of ranking parameters for a given search implementation.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
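The message does not name a specific distance metric. As a minimal sketch, assuming precision at N and a normalized Kendall tau distance as two possible choices, the comparison between a real result list and its reviewed counterpart might look like this. The function names and the sample document ids are hypothetical. Kendall tau is a distance, so a lower value means the real ranking is closer to the reviewed one.

```python
from itertools import combinations


def precision_at_n(results, relevant, n):
    """Fraction of the top-n results that a reviewer marked relevant."""
    top = results[:n]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def kendall_tau_distance(real, reviewed):
    """Normalized count of document pairs ordered differently in the two lists.

    Only documents present in both lists are compared. 0.0 means the same
    order, 1.0 means fully reversed.
    """
    reviewed_set = set(reviewed)
    common = [doc for doc in real if doc in reviewed_set]
    if len(common) < 2:
        return 0.0
    position = {doc: i for i, doc in enumerate(reviewed)}
    pairs = list(combinations(common, 2))
    # A pair is discordant when the reviewer put the later document first.
    discordant = sum(1 for a, b in pairs if position[a] > position[b])
    return discordant / len(pairs)


# Hypothetical <corpus, impl, query, results> list and its reviewed version.
real = ["d3", "d1", "d7", "d2"]
reviewed = ["d1", "d3", "d2"]  # reviewer dropped d7 and reordered the rest

print(precision_at_n(real, set(reviewed), 4))
print(kendall_tau_distance(real, reviewed))
```

A ranking-parameter search would rerun the query with different settings and keep the settings that raise precision or lower the distance.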
-
Re: Getting Started
Peter Skomoroch 2009-07-31, 23:26
Mechanical Turk has built-in tasks for evaluating search relevance.

Seed queries could start with the AOL search logs or Wikipedia traffic
logs?

Pete

On Jul 31, 2009, at 7:01 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> OK, so how do we get this started? Seems like there are a lot of
> collections out there we could use. Also, we can crawl. Seems the
> tricky part is getting judgments.
>
> Thoughts?
>
> -Grant
-
Re: Getting Started
Simon Willnauer 2009-08-01, 18:27
On Sat, Aug 1, 2009 at 1:23 AM, Andrzej Bialecki<[EMAIL PROTECTED]> wrote:
> Grant Ingersoll wrote:
>>
>> OK, so how do we get this started? Seems like there are a lot of
>> collections out there we could use. Also, we can crawl. Seems the tricky
>> part is getting judgments.
>
> I think we should establish first what kind of relevance judgments we want
> to collect:

This looks like two different things.

One thing is deciding what we use to get "a" collection of documents -
a corpus. It seems to be a very good idea to me to create a
heterogeneous collection of documents such as Wikipedia to kick off
ORP. I guess we do not need a huge collection of documents to get
started, right?!

@Grant: I might have missed something, but do we have a list of available
collections on some wiki page?! Would be great to have something like
that.

Once we get this project going we can start building various
collections from all kinds of areas. I found it interesting that all
collections I have seen are built from large documents, but with the
advent of mobile devices collections could also be built from
"data-records" like SMS, address records, image metadata and audio
metadata, where the text/document is relatively small. I found that
searching on such "small" documents puts different requirements on
scoring parameters than web search...

Another thing is what we do with these collections. I kind of like the
idea of having something like a webapp that is able to perform corpus
selection, distance measurement etc.

I want to extend Andrzej's list and throw out some random thoughts...

- It would be nice to have something like an immediate representation
of a corpus that can be plugged into a relevance measurement app /
webapp.

- Such a relevance measurement should be able to work on top of custom
search applications. There could be an API which gives applications
access to the corpus for indexing and lets them search on this corpus
through the API. I can imagine lots of use cases where users want to
judge their custom search engine against a corpus and compare the
results.

simon (in the middle of moving his apartment)

> 1. given a corpus, and a query, define an ordered list of top-N documents
> that are relevant to the query. This is our baseline. Getting this sort of
> information is very time-consuming and subjective.
>
> 2. given a corpus, a query and a list of top-N results obtained from a real
> search, define what results are relevant and how they should be ordered. The
> reviewed list of top-N results becomes then the initial approximation of our
> baseline. Calculate a distance metric between real and reviewed result, and
> adjust ranking to maximize this metric.
>
> The second scenario could be handled by a webapp, which could present the
> following areas of functionality:
>
> * corpus selection and browsing
>
> * searching using selected search impl and its ranking parameters, and
> storing tuples of <corpus, impl, query, results>
>
> * review of the results (marking relevant / non-relevant, reordering), and
> saving of tuples <corpus, impl, query, reviewed results>
>
> * calculation of distance metrics.
>
> * adjustment of ranking parameters for a given search implementation.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
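The API idea in the message above is only described in prose. As a rough sketch of one possible shape, assuming a corpus that carries its own per-query relevance judgments and a pluggable engine with `index` and `search` methods, it might look like this. Every name here (`SearchEngine`, `Corpus`, `evaluate`, `SubstringEngine`) is hypothetical and not part of ORP or Lucene. Precision at N stands in for whatever measure the project would settle on.

```python
from typing import Dict, Iterable, List, Protocol, Set, Tuple


class SearchEngine(Protocol):
    """What a custom search application would implement (hypothetical)."""

    def index(self, docs: Iterable[Tuple[str, str]]) -> None: ...

    def search(self, query: str, top_n: int) -> List[str]: ...


class Corpus:
    """Documents plus per-query sets of relevant document ids."""

    def __init__(self, docs: Dict[str, str], judgments: Dict[str, Set[str]]):
        self.docs = docs
        self.judgments = judgments

    def documents(self) -> Iterable[Tuple[str, str]]:
        return self.docs.items()


def evaluate(corpus: Corpus, engine: SearchEngine, top_n: int = 10) -> Dict[str, float]:
    """Index the corpus with the engine and report precision per query."""
    engine.index(corpus.documents())
    scores = {}
    for query, relevant in corpus.judgments.items():
        hits = engine.search(query, top_n)
        hit_count = sum(1 for doc_id in hits if doc_id in relevant)
        scores[query] = hit_count / len(hits) if hits else 0.0
    return scores


class SubstringEngine:
    """Toy engine used only to exercise the interface."""

    def __init__(self) -> None:
        self._docs: List[Tuple[str, str]] = []

    def index(self, docs: Iterable[Tuple[str, str]]) -> None:
        self._docs = list(docs)

    def search(self, query: str, top_n: int) -> List[str]:
        needle = query.lower()
        matches = [doc_id for doc_id, text in self._docs if needle in text.lower()]
        return matches[:top_n]


corpus = Corpus(
    docs={"a": "Lucene scoring basics", "b": "Cooking pasta", "c": "Lucene index internals"},
    judgments={"lucene": {"a"}},
)
print(evaluate(corpus, SubstringEngine()))
```

Two engines could then be compared by running `evaluate` on the same corpus and looking at the per-query scores side by side.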
-
Re: Getting Started
Grant Ingersoll 2009-08-05, 02:41
On Aug 1, 2009, at 2:27 PM, Simon Willnauer wrote:
> On Sat, Aug 1, 2009 at 1:23 AM, Andrzej Bialecki<[EMAIL PROTECTED]> wrote:
>> Grant Ingersoll wrote:
>>>
>>> OK, so how do we get this started? Seems like there are a lot of
>>> collections out there we could use. Also, we can crawl. Seems
>>> the tricky part is getting judgments.
>>
>> I think we should establish first what kind of relevance judgments
>> we want to collect:
> This looks like two different things.
> One thing is deciding what we use to get "a" collection of documents -
> a corpus. It seems to be a very good idea to me to create a
> heterogeneous collection of documents such as wikipedia to kick off
> ORP. I guess we do not need a huge collection of documents to get
> started, right?!
> @Grant: I might have missed something but have we a list of available
> collections on some wiki page?! Would be great to have something like
> that.

Not yet, we have some on Mahout
-
Re: Getting Started
Andrzej Bialecki 2009-08-05, 08:40
Grant Ingersoll wrote:
>
> On Aug 1, 2009, at 2:27 PM, Simon Willnauer wrote:
>
>> On Sat, Aug 1, 2009 at 1:23 AM, Andrzej Bialecki<[EMAIL PROTECTED]> wrote:
>>> Grant Ingersoll wrote:
>>>>
>>>> OK, so how do we get this started? Seems like there are a lot of
>>>> collections out there we could use. Also, we can crawl. Seems the
>>>> tricky part is getting judgments.
>>>
>>> I think we should establish first what kind of relevance judgments we
>>> want to collect:
>> This looks like two different things.
>> One thing is deciding what we use to get "a" collection of documents -
>> a corpus. It seems to be a very good idea to me to create a
>> heterogeneous collection of documents such as wikipedia to kick off
>> ORP. I guess we do not need a huge collection of documents to get
>> started, right?!
>> @Grant: I might have missed something but have we a list of available
>> collections on some wiki page?! Would be great to have something like
>> that.
>
> Not yet, we have some on Mahout

This link may be of interest for us: http://evaluatir.org . Lucene
results are there, although they are disappointingly low.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-
Fw: Getting Started
Nicola Ferro 2009-08-12, 17:04
Dear All,

as far as producing a large number of relevance assessments is
concerned, you could have a look at the TREC Million Query Track
(http://ciir.cs.umass.edu/research/million/,
http://trec.nist.gov/pubs/trec17/t17_proceedings.html).

As far as producing a test collection in an interactive way is
concerned, you could look at:

Cormack et al., "Efficient construction of large test collections",
SIGIR 1998, http://doi.acm.org/10.1145/290941.291009

Sanderson & Joho, "Forming test collections with no system pooling",
SIGIR 2004, http://doi.acm.org/10.1145/1008992.1009001

With respect to the creation of pools (and sampling of collections)
targeted towards a specific metric, you could have a look at:

Aslam et al., "A statistical method for system evaluation using
incomplete judgments", SIGIR 2006,
http://doi.acm.org/10.1145/1148170.1148263

Finally, a system that may be of interest to you is DIRECT (Distributed
Information Retrieval Evaluation Campaign Tool), which we have built for
managing the CLEF evaluation campaigns. Among other things, it allows
for interactive topic creation by searching in document collections (by
the way, we use Lucene to do this) and interactive relevance
assessments. You can find some information about DIRECT at:
http://www.trebleclef.eu/getfile.php?id=75

All the best,

Nicola Ferro

--
Nicola Ferro - Ph.D. in Computer Science
Assistant Professor
Department of Information Engineering (DEI)
University of Padua
Via Gradenigo, 6/A - 35131 Padova - Italy
Tel +39 049 827 7939
Fax: +39 049 827 7799
skype: nicola.ferro
e-mail: [EMAIL PROTECTED]
home page: http://ims.dei.unipd.it/members/ferro/