-
Solr 1.4 Clustering / mlt AS search? Mark Bennett 2009-08-11, 16:44
I'm going somewhere with this... be patient. :-) I had asked about this briefly at the SF meetup, but there was a lot going on.

1: Suppose you had Solr 1.4 and all the Carrot^2 DOCUMENT clustering was all in, and you had built the cluster index for all your docs.

2: Then, if you had a particular cluster, and one of the docs in that cluster happened to be your search, then the other documents in the cluster could be considered the results. In effect, the cluster is like the search results.

3: Now imagine you can take an arbitrary doc and find the clusters that document is in. (Some clustering engines let you do this.)

4: And then imagine that, when somebody submits a search, you quickly turn it into a document, add it to the index, redo the clusters, find the clusters this new temp doc is in, and use that as the results.

Benefits?

I'm not saying this would be practical, but would it be useful? Or, in particular, would it be more useful than the normal Solr/Lucene relevancy? As I recall Carrot^2 had 3 choices for clustering.

And let's assume that the searches coming in are more than the 1.4 words average. Maybe a few sentences or something. I'm not sure a 1 word query would really benefit from this. :-)

Some clustering algorithms don't allow you to find a cluster containing a specific document, so those wouldn't work as a "search engine".

More Like This as a "cluster" search?

A similar scenario could be made for the "more like this" feature. Take a user's search text (presumably lengthy), quickly index it, then use that new temp doc as a MLT seed doc. I haven't looked deep into the code, it might be that it uses essentially the same relevancy as a query.

--
Mark Bennett / New Idea Engineering, Inc. / [EMAIL PROTECTED]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
-
Re: Solr 1.4 Clustering / mlt AS search? Mark Bennett 2009-08-11, 16:56
With regard to my second question, about More Like This, I do see:

"The MoreLikeThisHandler can also use a ContentStream to find similar documents. It will extract the "interesting terms" from the posted text."

at http://wiki.apache.org/solr/MoreLikeThisHandler, and that it uses the TF/IDF stuff.

Still wondering if anybody's tried MLT or Carrot clustering as a primary search entry point.
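The ContentStream route quoted from the wiki can be tried without touching the index. The following is a minimal sketch that only builds the request URL. It assumes a MoreLikeThisHandler is registered at `/mlt` on a local Solr instance and that the schema has `title` and `body` fields; the host, handler path, field names, and sample text are all illustrative assumptions, and depending on Solr version and configuration `stream.body` may need to be enabled.

```python
from urllib.parse import urlencode


def build_mlt_request(base_url, query_text, fields, rows=10):
    """Build a URL that sends free text to a MoreLikeThisHandler.

    base_url and fields are assumptions about the local setup;
    the parameter names are standard Solr request parameters.
    """
    params = {
        "stream.body": query_text,       # the user's text, sent as a content stream
        "mlt.fl": ",".join(fields),      # fields to mine for "interesting terms"
        "mlt.mintf": 1,                  # short text: accept terms that occur once
        "mlt.mindf": 1,                  # accept terms found in a single document
        "mlt.interestingTerms": "list",  # ask Solr to report the terms it picked
        "rows": rows,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and fields, for illustration only.
url = build_mlt_request(
    "http://localhost:8983/solr/mlt",
    "how do I reset a forgotten password",
    ["title", "body"],
)
print(url)
```

Asking for `mlt.interestingTerms=list` makes it easy to see which terms a long query was reduced to before judging the results.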
-
Re: Solr 1.4 Clustering / mlt AS search? Grant Ingersoll 2009-08-11, 19:40
Inline...
On Aug 11, 2009, at 12:44 PM, Mark Bennett wrote:

> I'm going somewhere with this... be patient. :-) I had asked about this
> briefly at the SF meetup, but there was a lot going on.
>
> 1: Suppose you had Solr 1.4 and all the Carrot^2 DOCUMENT clustering was
> all in, and you had built the cluster index for all your docs.
>
> 2: Then, if you had a particular cluster, and one of the docs in that
> cluster happened to be your search, then the other documents in the
> cluster could be considered the results. In effect, the cluster is like
> the search results.
>
> 3: Now imagine you can take an arbitrary doc and find the clusters that
> document is in. (some clustering engines let you do this).
>
> 4: And then imagine that, when somebody submits a search, you quickly turn
> it into a document, add it to the index, redo the clusters, find the
> clusters this new temp doc is in, and use that as the results.

I guess I'd argue that this is already what Lucene does, except for the part about adding the query into the document set. The Lucene Query is just your arbitrary document. Really, the primary difference as I see it, I think, is that you want the Carrot2 scoring mechanism instead of the existing Lucene one, no? Otherwise, I don't see much benefit to actually indexing the query, other than it could potentially be used to skew results over time as people ask the same queries over and over again.

Under a certain lens, couldn't you just argue that search is finding all the docs that cluster around your query? (I know that isn't the traditional description, but regardless, the math underneath is often very similar.)

> Benefits?
>
> I'm not saying this would be practical, but would it be useful? Or, in
> particular, would it be more useful than the normal Solr/Lucene relevancy?
> As I recall Carrot^2 had 3 choices for clustering.
>
> And let's assume that the searches coming in are more than the 1.4 words
> average. Maybe a few sentences or something. I'm not sure a 1 word query
> would really benefit from this. :-)
>
> Some clustering algorithms don't allow you to find a cluster containing a
> specific document, so those wouldn't work as a "search engine".
>
> More Like This as a "cluster" search?
>
> A similar scenario could be made for the "more like this" feature. Take a
> user's search text (presumably lengthy), quickly index it, then use that
> new temp doc as a MLT seed doc. I haven't looked deep into the code, it
> might be that it uses essentially the same relevancy as a query.

Again, I don't see the benefit of indexing it. You slightly perturb the corpus statistics, but other than that, how is it different from just submitting the query and getting back the results?
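To make the "query is just an arbitrary document" point concrete, here is a toy sketch that vectorizes a query with the same TF-IDF weights as the documents and ranks by cosine similarity, without adding the query to the corpus statistics. It is a textbook vector-space model, not Lucene's actual scoring formula, and the function names and sample texts are invented for illustration.

```python
import math
from collections import Counter


def build_idf(token_lists):
    """Inverse document frequency computed from the documents only."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: 1.0 + math.log(n / d) for t, d in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector; terms unknown to the corpus are dropped."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, docs):
    """Return indexes of matching docs, best first.

    The query is treated as one more document for vectorizing,
    but it never changes the corpus statistics.
    """
    token_lists = [d.lower().split() for d in docs]
    idf = build_idf(token_lists)
    qvec = vectorize(query.lower().split(), idf)
    scored = [(cosine(qvec, vectorize(t, idf)), i) for i, t in enumerate(token_lists)]
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]


docs = [
    "solr clustering with carrot2",
    "lucene scoring uses tf idf",
    "baking bread at home",
]
print(search("how does lucene scoring work", docs))
```

Indexing the query first would only change the `idf` table slightly, which is the small perturbation of corpus statistics mentioned above.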
-
Re: Solr 1.4 Clustering / mlt AS search? Mark Bennett 2009-08-11, 20:19
Thanks Grant.
*** mlb: comments inline

On Tue, Aug 11, 2009 at 12:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Inline...
>
> On Aug 11, 2009, at 12:44 PM, Mark Bennett wrote:
>
>> I'm going somewhere with this... be patient. :-) I had asked about this
>> briefly at the SF meetup, but there was a lot going on.
>>
>> 1: Suppose you had Solr 1.4 and all the Carrot^2 DOCUMENT clustering was
>> all in, and you had built the cluster index for all your docs.
>>
>> 2: Then, if you had a particular cluster, and one of the docs in that
>> cluster happened to be your search, then the other documents in the
>> cluster could be considered the results. In effect, the cluster is like
>> the search results.
>>
>> 3: Now imagine you can take an arbitrary doc and find the clusters that
>> document is in. (some clustering engines let you do this).
>>
>> 4: And then imagine that, when somebody submits a search, you quickly turn
>> it into a document, add it to the index, redo the clusters, find the
>> clusters this new temp doc is in, and use that as the results.
>
> I guess I'd argue that this is already what Lucene does, except for the
> part about adding the query into the document set. The Lucene Query is just
> your arbitrary document. Really, the primary difference as I see it, I
> think, is that you want the Carrot2 scoring mechanism instead of the
> existing Lucene one, no? Otherwise, I don't see much benefit to actually
> indexing the query, other than it could potentially be used to skew results
> over time as people ask the same queries over and over again.

*** mlb: Yes, this is essentially what I'm suggesting. Carrot2 has several pluggable algorithms to choose from, though I have no evidence that they're "better" than Lucene's. Where TF/IDF is sort of a one step algebraic calculation, some clustering algorithms use iterative approaches, etc.

> Under a certain lens, couldn't you just argue that search is finding all
> the docs that cluster around your query? (I know that isn't the traditional
> description, but regardless, the math underneath is often very similar)

*** mlb: Yes, exactly. And so the question is whether some of these other methods might work better for certain applications, certain vocabularies, etc. So I guess it's about flexibility. Though you can plug in your own similarity class, that's still the one-shot algebraic model, regardless of the specific formulas. Some of the newer machine learning algorithms have other tricks up their sleeves that might fit some usage models better.

>> Benefits?
>>
>> I'm not saying this would be practical, but would it be useful? Or, in
>> particular, would it be more useful than the normal Solr/Lucene relevancy?
>> As I recall Carrot^2 had 3 choices for clustering.
>>
>> And let's assume that the searches coming in are more than the 1.4 words
>> average. Maybe a few sentences or something. I'm not sure a 1 word query
>> would really benefit from this. :-)
>>
>> Some clustering algorithms don't allow you to find a cluster containing a
>> specific document, so those wouldn't work as a "search engine".
>>
>> More Like This as a "cluster" search?
>>
>> A similar scenario could be made for the "more like this" feature. Take a
>> user's search text (presumably lengthy), quickly index it, then use that
>> new temp doc as a MLT seed doc. I haven't looked deep into the code, it
>> might be that it uses essentially the same relevancy as a query.
>
> Again, I don't see the benefit of indexing it. You slightly perturb the
> corpus statistics, but other than that, how is it different from just
> submitting the query and getting back the results?

*** mlb: Yeah, actually I'm not wild about changing the index for the sake of processing a search. And looking at MLT, they claim you can send in a stream, so no need to update the index.
-
Re: Solr 1.4 Clustering / mlt AS search? Stanislaw Osinski 2009-08-13, 16:39
Hi,
On Tue, Aug 11, 2009 at 22:19, Mark Bennett <[EMAIL PROTECTED]> wrote:

> Carrot2 has several pluggable algorithms to choose from, though I have no
> evidence that they're "better" than Lucene's. Where TF/IDF is sort of a
> one step algebraic calculation, some clustering algorithms use iterative
> approaches, etc.

I'm not sure if I completely follow the way in which you'd like to use Carrot2 for scoring -- would you cluster the whole index? Carrot2 was designed to be a post-retrieval clustering algorithm and optimized to cluster small sets of documents (up to ~1000) in real time. All processing is performed in-memory, which limits Carrot2's applicability to really large sets of documents.

S.
-
Re: Solr 1.4 Clustering / mlt AS search? Mark Bennett 2009-08-13, 17:29
* mlb: comments
On Thu, Aug 13, 2009 at 9:39 AM, Stanislaw Osinski <[EMAIL PROTECTED]> wrote:

> Hi,
>
> On Tue, Aug 11, 2009 at 22:19, Mark Bennett <[EMAIL PROTECTED]> wrote:
>
>> Carrot2 has several pluggable algorithms to choose from, though I have no
>> evidence that they're "better" than Lucene's. Where TF/IDF is sort of a
>> one step algebraic calculation, some clustering algorithms use iterative
>> approaches, etc.
>
> I'm not sure if I completely follow the way in which you'd like to use
> Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
> designed to be a post-retrieval clustering algorithm and optimized to
> cluster small sets of documents (up to ~1000) in real time. All processing
> is performed in-memory, which limits Carrot2's applicability to really
> large sets of documents.
>
> S.

* mlb: I agree with all of your assertions, but...

There are comments in the Solr materials about having an option to cluster based on the entire document set, and some warning about this being atypical and possibly slow. And from what you're saying, for a big enough docset, it might go from "slow" to "impossible", I'm not sure.

And so my question was, *if* you were willing to spend that much time and effort to cluster all the text of all the documents (and if it were even possible), would the result perform better than the standard TF/IDF techniques?

In the application I'm considering, the queries tend to be longer than average, more like full sentences or more. And they tend to be of a question and answer nature. I've seen references in several search engines that QandA search sometimes benefits from alternative search techniques. And, from a separate email, the IDF part of the standard similarity may be causing a problem, so I'm casting a wide net for other ideas. Just brainstorming here... :-)

So, given that, did you have any thoughts on it Stanislaw?

Mark
-
Re: Solr 1.4 Clustering / mlt AS search? Grant Ingersoll 2009-08-13, 20:24
On Aug 13, 2009, at 1:29 PM, Mark Bennett wrote:

> * mlb: comments
>
> On Thu, Aug 13, 2009 at 9:39 AM, Stanislaw Osinski <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> On Tue, Aug 11, 2009 at 22:19, Mark Bennett <[EMAIL PROTECTED]> wrote:
>>
>>> Carrot2 has several pluggable algorithms to choose from, though I have no
>>> evidence that they're "better" than Lucene's. Where TF/IDF is sort of a
>>> one step algebraic calculation, some clustering algorithms use iterative
>>> approaches, etc.
>>
>> I'm not sure if I completely follow the way in which you'd like to use
>> Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
>> designed to be a post-retrieval clustering algorithm and optimized to
>> cluster small sets of documents (up to ~1000) in real time. All processing
>> is performed in-memory, which limits Carrot2's applicability to really
>> large sets of documents.
>>
>> S.
>
> * mlb: I agree with all of your assertions, but...
>
> There are comments in the Solr materials about having an option to cluster
> based on the entire document set, and some warning about this being
> atypical and possibly slow. And from what you're saying, for a big enough
> docset, it might go from "slow" to "impossible", I'm not sure.

Those comments refer to an as-yet-unimplemented feature that will allow for pluggable background clustering, using something like Mahout to cluster the whole collection and then return the results later upon request.

> And so my question was, *if* you were willing to spend that much time and
> effort to cluster all the text of all the documents (and if it were even
> possible), would the result perform better than the standard TF/IDF
> techniques?
>
> In the application I'm considering, the queries tend to be longer than
> average, more like full sentences or more. And they tend to be of a
> question and answer nature. I've seen references in several search engines
> that QandA search sometimes benefits from alternative search techniques.
> And, from a separate email, the IDF part of the standard similarity may be
> causing a problem, so I'm casting a wide net for other ideas. Just
> brainstorming here... :-)

QA has a lot of factors at play. I can't recall anyone using clustering as a way of doing the initial passage retrieval, but it's been a few years since I kept up with that literature. You of course can turn off or downplay IDF if that is an issue. I think payloads can also play a useful hand in QA (or Lucene's new Attribute capabilities, but I won't quite go there yet) because you could store term-level information (POS often plays a role in helping QA, as does parsing information).

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
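As a rough illustration of what "turn off or downplay IDF" can do to a ranking, the toy scorer below raises IDF to an adjustable power: 1.0 is plain tf-idf and 0.0 ignores IDF. This is a simplified model for experimentation only. In Lucene the equivalent change would be made through a custom similarity class, which is not shown here, and the `idf_power` knob and sample texts are invented for this sketch.

```python
import math
from collections import Counter


def rank(query, docs, idf_power=1.0):
    """Rank docs by the sum of tf * idf**idf_power over the query terms.

    idf_power=1.0 is plain tf-idf; idf_power=0.0 switches IDF off,
    so every matching term counts the same.
    """
    doc_terms = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    # Iterating a Counter yields each distinct term once: document frequency.
    df = Counter(t for terms in doc_terms for t in terms)
    scores = []
    for i, terms in enumerate(doc_terms):
        score = 0.0
        for t in query.lower().split():
            if t in terms:
                idf = math.log(n / df[t])
                score += terms[t] * idf ** idf_power
        scores.append((-score, i))
    return [i for _, i in sorted(scores)]


docs = [
    "restart server",
    "zebra",
    "restart server logs",
    "restart the server now",
]
query = "restart server zebra"
print(rank(query, docs, idf_power=1.0))  # the rare term "zebra" dominates
print(rank(query, docs, idf_power=0.0))  # docs matching more terms win
```

With long, sentence-like queries, one incidental rare word can outweigh several common but relevant words, which is one way the IDF component could cause the kind of problem described above.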
-
Re: Solr 1.4 Clustering / mlt AS search? Stanislaw Osinski 2009-08-15, 10:43
Hi,
On Thu, Aug 13, 2009 at 19:29, Mark Bennett <[EMAIL PROTECTED]> wrote:

> There are comments in the Solr materials about having an option to cluster
> based on the entire document set, and some warning about this being
> atypical and possibly slow. And from what you're saying, for a big enough
> docset, it might go from "slow" to "impossible", I'm not sure.

For Carrot2, it would go to "impossible" I'd say. But as Grant mentioned earlier, Mahout is developing clustering algorithms that should be able to handle the whole-index types of docsets.

> And so my question was, *if* you were willing to spend that much time and
> effort to cluster all the text of all the documents (and if it were even
> possible), would the result perform better than the standard TF/IDF
> techniques?

Depends on the algorithm, really. In case of Carrot2, we don't do re-ranking of documents within clusters, we simply use whatever document order we got on input. As far as I'm aware, most clustering algorithms do pretty much the same: they concentrate on finding groups of documents and don't delve much into the issues of ranking documents within clusters.

> In the application I'm considering, the queries tend to be longer than
> average, more like full sentences or more. And they tend to be of a
> question and answer nature. I've seen references in several search engines
> that QandA search sometimes benefits from alternative search techniques.
> And, from a separate email, the IDF part of the standard similarity may be
> causing a problem, so I'm casting a wide net for other ideas. Just
> brainstorming here... :-)

Because of what I described above, clustering the whole index may not give you the best results. But you can try something different. You could try fetching a bunch (100 to 500) of more or less relevant documents for the question (MLT should be fine to start with), add your question as an extra document, perform clustering and see where the question-document ends up. If it doesn't end up in the Other Topics cluster, you could examine if the other documents from the cluster give an answer to the question. In this scenario, Carrot2 should be fine, at least performance-wise. I've not followed the QA literature very closely, so it's hard to say what the results would be quality-wise, but it should be very quick to try. Carrot2 Clustering Workbench [1][2] may come in handy for the experiments too.

S.

[1] http://download.carrot2.org/head/manual/#section.workbench
[2] http://download.carrot2.org/head/manual/#section.getting-started.xml-files
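The experiment described above (fetch candidates, add the question as an extra document, cluster, see where the question lands) can be sketched as a workflow. The clustering here is a deliberately naive single-pass, cosine-threshold grouping used as a stand-in; it is not one of Carrot2's algorithms, and the function names, threshold, and sample texts are assumptions for illustration. In practice the candidate list would come from a search or MLT request.

```python
import math
from collections import Counter


def tf_vector(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(c * b[t] for t, c in a.items())  # Counter gives 0 for missing terms
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering.

    Each text joins the first cluster whose seed (first member) is
    similar enough; otherwise it starts a new cluster.
    """
    vectors = [tf_vector(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def answers_for(question, candidates, threshold=0.3):
    """Add the question as an extra document and return its cluster mates.

    An empty list means the question ended up alone, roughly the
    analogue of landing in an "Other Topics" bucket.
    """
    texts = candidates + [question]
    q_index = len(texts) - 1
    for members in cluster(texts, threshold):
        if q_index in members:
            return [candidates[i] for i in members if i != q_index]
    return []


candidates = [
    "reset forgotten password email",
    "reset admin password login page",
    "configure solr clustering component",
]
print(answers_for("how reset password", candidates))
```

Because the question is appended last, it never becomes a cluster seed ahead of the retrieved documents, so it can only join an existing group or end up on its own.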