-
Relevance score - Classification
Faizan(Aroha) 2011-11-23, 11:21
Hello,
We are working on using Classification as a Search.
I want to compute the relevance score of the output which is generated by the Naive Bayes Classifier or some other classifier.
Please give any guideline/hint!
Thanks.
--
Best Regards
Faizan Shaikh
Aroha Labs (Private) Limited
-
Re: Relevance score - Classification
Tanton Gibbs 2011-11-23, 17:23
Hi Faizan,
You can definitely do this. Basically, divide your training data into two sets (relevant and not relevant) and then build a classifier to tell the difference. However, I strongly recommend you look at a learning-to-rank solution instead. Even if you never want to show an irrelevant result, a ranking-based solution that is calibrated to a 0-1 scale is often better than a classification approach.
Can you tell us a bit more about your application? I might have specific recommendations.
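As a rough illustration of the two-set approach, here is a minimal sketch of a two-class multinomial Naive Bayes model whose posterior P(relevant | document) serves as a 0-1 relevance score. It is not Mahout's implementation; the function names `train_nb` and `relevance_score`, the token-list input format, and the smoothing parameter `alpha` are assumptions made for this example. It also assumes the training data contains at least one relevant and one non-relevant document. Naive Bayes posteriors tend to be pushed towards 0 or 1, so the raw value should probably not be treated as a well-calibrated probability.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Train a two-class multinomial Naive Bayes model.

    docs: list of token lists; labels: list of bools (True = relevant).
    alpha: additive (Laplace) smoothing constant, an assumption of this sketch.
    """
    vocab = {tok for doc in docs for tok in doc}
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter(labels)
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    model = {"vocab": vocab, "log_prior": {}, "log_like": {}}
    for label in (True, False):
        total = sum(counts[label].values()) + alpha * len(vocab)
        model["log_prior"][label] = math.log(n_docs[label] / len(docs))
        model["log_like"][label] = {
            tok: math.log((counts[label][tok] + alpha) / total) for tok in vocab
        }
    return model

def relevance_score(model, doc):
    """Return P(relevant | doc) in [0, 1]; out-of-vocabulary tokens are ignored."""
    log_joint = {}
    for label in (True, False):
        lp = model["log_prior"][label]
        for tok in doc:
            if tok in model["vocab"]:
                lp += model["log_like"][label][tok]
        log_joint[label] = lp
    # Normalise in log space to avoid underflow on long documents.
    m = max(log_joint.values())
    rel = math.exp(log_joint[True] - m)
    irr = math.exp(log_joint[False] - m)
    return rel / (rel + irr)
```

Sorting candidate documents by `relevance_score` in descending order gives a ranking, although a learning-to-rank method optimises that ordering directly.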
-
Re: Relevance score - Classification
Isabel Drost 2011-11-23, 20:47
On 23.11.2011 Faizan(Aroha) wrote:
> We are working on using Classification as a Search.
>
> I want to compute the relevance score of the output which is generated by
> the Naive Bayes Classifier or some other classifier.
>
> Please give any guideline/hint!
Can you please provide some more background to your use case? Which documents do you want to search? How is relevance defined in your setting?
Isabel
-
RE: Relevance score - Classification
Faizan(Aroha) 2011-11-24, 06:42
We are trying to implement relevance-based search (using machine learning) on a website that has 3 million visitors a week and 150k blog posts a day.
We are currently in the planning phase, so we are trying several different approaches.
I will use the newsgroups dataset example to explain my situation:
Let's say we apply the classifier to a new document X that may belong to "rec.sport.baseball", and we know of 397 documents in our collection that have been correctly identified by the classifier.
When we apply the classifier to X, it should return a list of documents sorted so that the first document is the most relevant to the query (document X) and the last document is the least relevant.
To do this, we need to devise a way to use these classifiers for information retrieval.
The classifier should act as a retrieval algorithm that first computes relevance scores for all the documents and produces a ranking. When that retrieval algorithm is applied to an individual query document, it should return a set of documents sorted so that the first documents are the most relevant to the blog post and the last document is the least relevant.
This is a little background.
Thanks.
-
Re: Relevance score - Classification
Ted Dunning 2011-11-24, 07:31
You have both a training problem and a scalability problem here.
The training problem is that to build a classifier, you need either to make fairly gross assumptions about which features matter and what weights they should have, or to learn them from labeled training documents. For most text retrieval systems, the first approach is taken: the user's query is used, (possibly) with a few extra terms added, and this set of terms is assumed to be the features that your classifier uses. Weights are generally either derived heuristically using something like IDF or simply ignored in favor of some other relevancy score like PageRank or other document quality measures. A middle road can be taken in which these different scores are combined.
In contrast, for most document classifiers, relevant terms are derived by examining a set of training documents that are labeled as positive and negative relative to the question of interest.
In your application, you don't have training data of the sort needed to build the second kind of classifier, so you need to build the first kind. But this is just the same as saying you should use a normal text retrieval system.
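The first kind of classifier (query terms as features, heuristic weights) can be sketched in a few lines. This is only an illustration under stated assumptions: documents are already tokenized into lists, the weight is the plain `log(N / df)` form of IDF, and the names `idf_weights`, `score`, and `rank` are made up for this example. Real retrieval systems use more refined scoring that also accounts for term frequency and document length.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """corpus: list of token lists. Returns {term: log(N / document_frequency)}."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {term: math.log(n / d) for term, d in df.items()}

def score(query_terms, doc, idf):
    """Sum the IDF weights of the distinct query terms that occur in doc."""
    present = set(doc)
    return sum(idf.get(t, 0.0) for t in set(query_terms) if t in present)

def rank(query_terms, corpus):
    """Return (score, doc_index) pairs, best first; ties broken by index."""
    idf = idf_weights(corpus)
    scored = [(score(query_terms, doc, idf), i) for i, doc in enumerate(corpus)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Here the "classifier" is just a linear model whose features are the query terms and whose weights come from corpus statistics rather than from labeled examples, which is why it amounts to ordinary text retrieval.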
The second issue is one of how the computations are arranged. With both kinds of systems, the computational problem is shaped just like approximate sparse matrix multiplication, but in the text retrieval system, considerable knowledge is used to avoid computations that cannot affect the final retrieval result. With a straightforward implementation using text classifiers, you need to evaluate the classifier for every document. This cannot scale as well as text retrieval, simply because you have to read data for far more documents.
It is possible to combine these two approaches and only evaluate the classifier on documents that contain the terms that have non-zero features in the classifier, but the number of terms involved makes the inverted index much less effective at avoiding work.
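A small sketch of that combination, assuming a linear classifier represented as a `{term: weight}` dictionary and an in-memory inverted index (both hypothetical simplifications, as are the function names): only documents that contain at least one non-zero-weight term are scored, and all other documents are never read.

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(corpus):
        for tok in doc:
            index[tok].add(doc_id)
    return index

def candidates(index, weights):
    """Document ids containing at least one term with a non-zero weight."""
    ids = set()
    for term, w in weights.items():
        if w != 0.0:
            ids |= index.get(term, set())
    return ids

def linear_score(weights, doc):
    """Linear classifier score: sum of weights over the token occurrences in doc."""
    return sum(weights.get(tok, 0.0) for tok in doc)

def retrieve(corpus, index, weights):
    """Score only the candidate documents and return (score, doc_id), best first."""
    scored = [(linear_score(weights, corpus[i]), i) for i in candidates(index, weights)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

With a short query, `candidates` touches a handful of posting lists. With a classifier that has thousands of non-zero terms, the union of posting lists approaches the whole collection, which is the loss of effectiveness described above.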
So how did you plan to derive the features and weights for the classifiers you mention?
-
Re: Relevance score - Classification
Tanton Gibbs 2011-11-24, 17:56
Hi Faizan,
It seems like you have an IR problem where the query is a document (and the documents are documents, too).
Have you looked at Lucene? It seems like that would be a good starting point. After you have done that, I would come back to clustering (which it seems you want to do here). You could add the generated cluster ids as unique terms in your index and then ensure you always match the cluster term; the IR features would then help correctly rank the documents with that cluster term.
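The cluster-term idea can be illustrated without any search library. The sketch below is plain Python, not Lucene's API: the `__cluster_<id>` term format, the function names, and the use of simple term overlap as a stand-in for the engine's relevance scoring are all assumptions made for this example. In a real index the cluster term would be a required clause of the query and the engine's own scoring would do the ranking.

```python
def cluster_term(cluster_id):
    """Encode a cluster id as a synthetic term that cannot clash with real words."""
    return f"__cluster_{cluster_id}"

def index_doc(tokens, cluster_id):
    """Return the term set stored for a document, including its cluster term."""
    return set(tokens) | {cluster_term(cluster_id)}

def search(query_tokens, query_cluster, indexed_docs):
    """Keep only documents in the query's cluster, then rank by term overlap."""
    required = cluster_term(query_cluster)
    query = set(query_tokens)
    hits = []
    for doc_id, terms in enumerate(indexed_docs):
        if required not in terms:
            continue  # the cluster term is mandatory
        hits.append((len(query & terms), doc_id))
    return sorted(hits, key=lambda h: (-h[0], h[1]))
```

The cluster acts as a filter and the remaining terms decide the order within it, so a document from another cluster is never returned even if it shares many words with the query.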
-
Re: Relevance score - Classification
Ted Dunning 2011-11-24, 18:50
+1 to Tanton's wise words.
-
Re: Relevance score - Classification
Lance Norskog 2011-11-27, 03:15
Solr is an application-level wrapper for Lucene. Carrot2 is a fine clustering system, and Solr has an integration for it. You can do a lot of research quickly using this combination of tools.
-- Lance Norskog [EMAIL PROTECTED]
-
RE: Relevance score - Classification
Faizan(Aroha) 2011-11-29, 10:42
We are still in the process of resolving this features-and-weights problem.
I think you normally convert documents into features when you have a dictionary of features defined. For an apple, we would need to define size, weight, and color; for geography, you have country, city, zip code, region, latitude, longitude, population, area, etc. In our case, I think we won't be looking much into features.
I am moving towards clustering, as Tanton mentioned.
Thanks.
Regards, Faizan
-
Re: Relevance score - Classification
Isabel Drost 2011-11-30, 20:38
On 29.11.2011 Faizan(Aroha) wrote:
> In our case, I think we won't be looking much into features
>
> I am moving towards clustering, as Tanton mentioned.
Hmm - what kind of similarity measure are you planning to use for that? What makes two items similar in your use case?
Isabel
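For readers following the thread, one common answer to the similarity question for text is cosine similarity over term-frequency vectors. Nothing in the thread says this is what was chosen; the sketch below only shows what such a measure looks like, and it also shows that clustering text still depends on features (here, the terms themselves). The function name and the token-list input are assumptions of this example.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two documents with no terms in common score 0 and documents with identical term proportions score 1, regardless of length. Weighting the term counts by IDF before taking the cosine is a frequent refinement.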