|
Sean Owen
2012-07-06, 09:35
Jens Grivolla
2012-07-06, 10:18
Razon, Oren
2012-07-06, 15:13
sam wu
2012-07-06, 16:01
sam wu
2012-07-06, 16:19
Ted Dunning
2012-07-06, 18:07
Sean Owen
2012-07-06, 18:32
Ted Dunning
2012-07-06, 18:51
|
-
Shortcut to finding the best recs from factored matrices?Sean Owen 2012-07-06, 09:35
Here's one I've been puzzling over for a bit. In a factorization based
on the SVD or what have you, you reconstruct the approximate original matrix (well, one row) by multiplying the matrices back together and looking for the largest elements. This is essentially multiplying a user feature vector by the entire item-feature matrix to reconstruct one approximate row of the input. That's not necessarily so slow, but it's not the fastest thing. I want to speed it up. It seems like there ought to be some shortcut, even if it means a probabilistic approach that could get it slightly wrong at times. I say so because most item feature vectors are nowhere near the user feature vector in feature space. Their dot product is not going to be the largest. In fact, given the user feature vector you can say exactly where in feature space (which direction) you want to look for the top items. For example, if the user feature vector is (2,1) you are looking for item vector (x,y) that maximizes 2x+y and that is largest in the direction of (2,1). When feature space is 50+-dimensional though, I'm having a hard time thinking of an efficient way to index those item feature vectors such that one could quickly find a few buckets of items to check and with high confidence have found the best recommendations. The bases I have are not necessarily orthogonal let alone orthonormal either. I bet, hope, someone will have an insight? You could cluster the items with k-means, quickly, I suppose. I was hoping for something less heavy. Sean
-
Re: Shortcut to finding the best recs from factored matrices?Jens Grivolla 2012-07-06, 10:18
Maybe locality-sensitive hashing can help to get candidates before
calculating the exact distance? Bye, Jens On 07/06/2012 11:35 AM, Sean Owen wrote: > Here's one I've been puzzling over for a bit. In a factorization based > on the SVD or what have you, you reconstruct the approximate original > matrix (well, one row) by multiplying the matrices back together and > looking for the largest elements. This is essentially multiplying a > user feature vector by the entire item-feature matrix to reconstruct > one approximate row of the input. > > That's not necessarily so slow, but it's not the fastest thing. I want > to speed it up. It seems like there ought to be some shortcut, even if > it means a probabilistic approach that could get it slightly wrong at > times. > > I say so because most item feature vectors are nowhere near the user > feature vector in feature space. Their dot product is not going to be > the largest. In fact, given the user feature vector you can say > exactly where in feature space (which direction) you want to look for > the top items. For example, if the user feature vector is (2,1) you > are looking for item vector (x,y) that maximizes 2x+y and that is > largest in the direction of (2,1). > > When feature space is 50+-dimensional though, I'm having a hard time > thinking of an efficient way to index those item feature vectors such > that one could quickly find a few buckets of items to check and with > high confidence have found the best recommendations. The bases I have > are not necessarily orthogonal let alone orthonormal either. I bet, > hope, someone will have an insight? > > You could cluster the items with k-means, quickly, I suppose. I was > hoping for something less heavy. > > Sean >
-
RE: Shortcut to finding the best recs from factored matrices?Razon, Oren 2012-07-06, 15:13
Thanks, it helped!
After having some thoughts about what the outcome prediction, I'm having a question about measuring the quality of my model. If I'm using a technique in which in the end I'm predicting a preference value (implicit \ explicit) I could easily measure my model by applying it on a test dataset and calculating RMSE and etc. But if I'm just estimating the possibility the user will like the item (such with the co-occurrence item based), it give me the ability to rank items, but how could I estimate my success? How could I measure the success of my ranking? -----Original Message----- From: Sean Owen [mailto:[EMAIL PROTECTED]] Sent: Friday, July 06, 2012 12:35 To: Mahout User List Subject: Shortcut to finding the best recs from factored matrices? Here's one I've been puzzling over for a bit. In a factorization based on the SVD or what have you, you reconstruct the approximate original matrix (well, one row) by multiplying the matrices back together and looking for the largest elements. This is essentially multiplying a user feature vector by the entire item-feature matrix to reconstruct one approximate row of the input. That's not necessarily so slow, but it's not the fastest thing. I want to speed it up. It seems like there ought to be some shortcut, even if it means a probabilistic approach that could get it slightly wrong at times. I say so because most item feature vectors are nowhere near the user feature vector in feature space. Their dot product is not going to be the largest. In fact, given the user feature vector you can say exactly where in feature space (which direction) you want to look for the top items. For example, if the user feature vector is (2,1) you are looking for item vector (x,y) that maximizes 2x+y and that is largest in the direction of (2,1). When feature space is 50+-dimensional though, I'm having a hard time thinking of an efficient way to index those item feature vectors such that one could quickly find a few buckets of items to check and with high confidence have found the best recommendations. The bases I have are not necessarily orthogonal let alone orthonormal either. I bet, hope, someone will have an insight? You could cluster the items with k-means, quickly, I suppose. I was hoping for something less heavy. Sean --------------------------------------------------------------------- Intel Electronics Ltd. This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
-
Re: Shortcut to finding the best recs from factored matrices?sam wu 2012-07-06, 16:01
LSH has many different flavors (based on the different similarity metric).
Normally Minhash, which is good for if you have boolean (yes-no, 0-1) features, and in the case of k-shingle, it fits well. In the latent topcis model, like ALS, the feature is no longer 0-1. I think Random Hyperplane (cosine-similarity for LSH) will be better. Another thought for finding NN, is to steal some idea from Ted's previous "K-Means Cluster at Scale", projection search for nearest cluster ( how to efficient to find k-NN centroids for a new vector). One TreeSet per feature with HeadSet & TailSet. Not sure will this scale to hugh data ? BTW, I recalled this streaming K-Means will be rolled into Mahout 0.8, but I didn't find it. is this in the pipeline ? Sam On Fri, Jul 6, 2012 at 3:18 AM, Jens Grivolla <j+[EMAIL PROTECTED]> wrote: > Maybe locality-sensitive hashing can help to get candidates before > calculating the exact distance? > > Bye, > Jens > > > On 07/06/2012 11:35 AM, Sean Owen wrote: > >> Here's one I've been puzzling over for a bit. In a factorization based >> on the SVD or what have you, you reconstruct the approximate original >> matrix (well, one row) by multiplying the matrices back together and >> looking for the largest elements. This is essentially multiplying a >> user feature vector by the entire item-feature matrix to reconstruct >> one approximate row of the input. >> >> That's not necessarily so slow, but it's not the fastest thing. I want >> to speed it up. It seems like there ought to be some shortcut, even if >> it means a probabilistic approach that could get it slightly wrong at >> times. >> >> I say so because most item feature vectors are nowhere near the user >> feature vector in feature space. Their dot product is not going to be >> the largest. In fact, given the user feature vector you can say >> exactly where in feature space (which direction) you want to look for >> the top items. For example, if the user feature vector is (2,1) you >> are looking for item vector (x,y) that maximizes 2x+y and that is >> largest in the direction of (2,1). >> >> When feature space is 50+-dimensional though, I'm having a hard time >> thinking of an efficient way to index those item feature vectors such >> that one could quickly find a few buckets of items to check and with >> high confidence have found the best recommendations. The bases I have >> are not necessarily orthogonal let alone orthonormal either. I bet, >> hope, someone will have an insight? >> >> You could cluster the items with k-means, quickly, I suppose. I was >> hoping for something less heavy. >> >> Sean >> >> > > >
-
Re: Shortcut to finding the best recs from factored matrices?sam wu 2012-07-06, 16:19
One more thought.
Cosine similarity kinds of measure the ratio of different feature preference. In recommendation job, I think ratio of feature preference is more relevant than the score itself ( kind reducing bias impact, some people rank score higher,..) Sam On Fri, Jul 6, 2012 at 9:01 AM, sam wu <[EMAIL PROTECTED]> wrote: > LSH has many different flavors (based on the different similarity metric). > Normally Minhash, which is good for if you have boolean (yes-no, 0-1) > features, and in the case of k-shingle, it fits well. > In the latent topcis model, like ALS, the feature is no longer 0-1. I > think Random Hyperplane (cosine-similarity for LSH) will be better. > > Another thought for finding NN, is to steal some idea from Ted's previous > "K-Means Cluster at Scale", projection search for nearest cluster ( how to > efficient to find k-NN centroids for a new vector). One TreeSet per feature > with HeadSet & TailSet. Not sure will this scale to hugh data ? > BTW, I recalled this streaming K-Means will be rolled into Mahout 0.8, but > I didn't find it. is this in the pipeline ? > > Sam > > > On Fri, Jul 6, 2012 at 3:18 AM, Jens Grivolla <j+[EMAIL PROTECTED]> wrote: > >> Maybe locality-sensitive hashing can help to get candidates before >> calculating the exact distance? >> >> Bye, >> Jens >> >> >> On 07/06/2012 11:35 AM, Sean Owen wrote: >> >>> Here's one I've been puzzling over for a bit. In a factorization based >>> on the SVD or what have you, you reconstruct the approximate original >>> matrix (well, one row) by multiplying the matrices back together and >>> looking for the largest elements. This is essentially multiplying a >>> user feature vector by the entire item-feature matrix to reconstruct >>> one approximate row of the input. >>> >>> That's not necessarily so slow, but it's not the fastest thing. I want >>> to speed it up. It seems like there ought to be some shortcut, even if >>> it means a probabilistic approach that could get it slightly wrong at >>> times. >>> >>> I say so because most item feature vectors are nowhere near the user >>> feature vector in feature space. Their dot product is not going to be >>> the largest. In fact, given the user feature vector you can say >>> exactly where in feature space (which direction) you want to look for >>> the top items. For example, if the user feature vector is (2,1) you >>> are looking for item vector (x,y) that maximizes 2x+y and that is >>> largest in the direction of (2,1). >>> >>> When feature space is 50+-dimensional though, I'm having a hard time >>> thinking of an efficient way to index those item feature vectors such >>> that one could quickly find a few buckets of items to check and with >>> high confidence have found the best recommendations. The bases I have >>> are not necessarily orthogonal let alone orthonormal either. I bet, >>> hope, someone will have an insight? >>> >>> You could cluster the items with k-means, quickly, I suppose. I was >>> hoping for something less heavy. >>> >>> Sean >>> >>> >> >> >> >
-
Re: Shortcut to finding the best recs from factored matrices?Ted Dunning 2012-07-06, 18:07
THere is a very lightweight LSH implementation in
https://github.com/tdunning/knn that will be what I am bringing into Mahout as part of 0.8. It is specifically design to approximate dot products to accelerate search of this sort. You should be able to decrease the number of actual dot products by 40x or so with 64 bit hashes. There is also a projection search implementation that might help. On Fri, Jul 6, 2012 at 9:01 AM, sam wu <[EMAIL PROTECTED]> wrote: > LSH has many different flavors (based on the different similarity metric). > Normally Minhash, which is good for if you have boolean (yes-no, 0-1) > features, and in the case of k-shingle, it fits well. > In the latent topcis model, like ALS, the feature is no longer 0-1. I think > Random Hyperplane (cosine-similarity for LSH) will be better. > > Another thought for finding NN, is to steal some idea from Ted's previous > "K-Means Cluster at Scale", projection search for nearest cluster ( how to > efficient to find k-NN centroids for a new vector). One TreeSet per feature > with HeadSet & TailSet. Not sure will this scale to hugh data ? > BTW, I recalled this streaming K-Means will be rolled into Mahout 0.8, but > I didn't find it. is this in the pipeline ? > > Sam > > > On Fri, Jul 6, 2012 at 3:18 AM, Jens Grivolla <j+[EMAIL PROTECTED]> wrote: > > > Maybe locality-sensitive hashing can help to get candidates before > > calculating the exact distance? > > > > Bye, > > Jens > > > > > > On 07/06/2012 11:35 AM, Sean Owen wrote: > > > >> Here's one I've been puzzling over for a bit. In a factorization based > >> on the SVD or what have you, you reconstruct the approximate original > >> matrix (well, one row) by multiplying the matrices back together and > >> looking for the largest elements. This is essentially multiplying a > >> user feature vector by the entire item-feature matrix to reconstruct > >> one approximate row of the input. > >> > >> That's not necessarily so slow, but it's not the fastest thing. I want > >> to speed it up. It seems like there ought to be some shortcut, even if > >> it means a probabilistic approach that could get it slightly wrong at > >> times. > >> > >> I say so because most item feature vectors are nowhere near the user > >> feature vector in feature space. Their dot product is not going to be > >> the largest. In fact, given the user feature vector you can say > >> exactly where in feature space (which direction) you want to look for > >> the top items. For example, if the user feature vector is (2,1) you > >> are looking for item vector (x,y) that maximizes 2x+y and that is > >> largest in the direction of (2,1). > >> > >> When feature space is 50+-dimensional though, I'm having a hard time > >> thinking of an efficient way to index those item feature vectors such > >> that one could quickly find a few buckets of items to check and with > >> high confidence have found the best recommendations. The bases I have > >> are not necessarily orthogonal let alone orthonormal either. I bet, > >> hope, someone will have an insight? > >> > >> You could cluster the items with k-means, quickly, I suppose. I was > >> hoping for something less heavy. > >> > >> Sean > >> > >> > > > > > > >
-
Re: Shortcut to finding the best recs from factored matrices?Sean Owen 2012-07-06, 18:32
LSH is probably my ticket, thanks all. I tried a form of this, but
just used the basis of the feature space to define the hyperplanes because I was lazy and experimenting. I didn't work well in the sense that the best recommendations were not hashed together unless you had fairly few buckets (i.e., not much speedup). But -- I imagine I did something wrong. On Fri, Jul 6, 2012 at 7:01 PM, sam wu <[EMAIL PROTECTED]> wrote: > LSH has many different flavors (based on the different similarity metric). > Normally Minhash, which is good for if you have boolean (yes-no, 0-1) > features, and in the case of k-shingle, it fits well. > In the latent topcis model, like ALS, the feature is no longer 0-1. I think > Random Hyperplane (cosine-similarity for LSH) will be better. > > Another thought for finding NN, is to steal some idea from Ted's previous > "K-Means Cluster at Scale", projection search for nearest cluster ( how to > efficient to find k-NN centroids for a new vector). One TreeSet per feature > with HeadSet & TailSet. Not sure will this scale to hugh data ? > BTW, I recalled this streaming K-Means will be rolled into Mahout 0.8, but > I didn't find it. is this in the pipeline ? > > Sam > > > On Fri, Jul 6, 2012 at 3:18 AM, Jens Grivolla <j+[EMAIL PROTECTED]> wrote: > >> Maybe locality-sensitive hashing can help to get candidates before >> calculating the exact distance? >> >> Bye, >> Jens >> >> >> On 07/06/2012 11:35 AM, Sean Owen wrote: >> >>> Here's one I've been puzzling over for a bit. In a factorization based >>> on the SVD or what have you, you reconstruct the approximate original >>> matrix (well, one row) by multiplying the matrices back together and >>> looking for the largest elements. This is essentially multiplying a >>> user feature vector by the entire item-feature matrix to reconstruct >>> one approximate row of the input. >>> >>> That's not necessarily so slow, but it's not the fastest thing. I want >>> to speed it up. It seems like there ought to be some shortcut, even if >>> it means a probabilistic approach that could get it slightly wrong at >>> times. >>> >>> I say so because most item feature vectors are nowhere near the user >>> feature vector in feature space. Their dot product is not going to be >>> the largest. In fact, given the user feature vector you can say >>> exactly where in feature space (which direction) you want to look for >>> the top items. For example, if the user feature vector is (2,1) you >>> are looking for item vector (x,y) that maximizes 2x+y and that is >>> largest in the direction of (2,1). >>> >>> When feature space is 50+-dimensional though, I'm having a hard time >>> thinking of an efficient way to index those item feature vectors such >>> that one could quickly find a few buckets of items to check and with >>> high confidence have found the best recommendations. The bases I have >>> are not necessarily orthogonal let alone orthonormal either. I bet, >>> hope, someone will have an insight? >>> >>> You could cluster the items with k-means, quickly, I suppose. I was >>> hoping for something less heavy. >>> >>> Sean >>> >>> >> >> >>
-
Re: Shortcut to finding the best recs from factored matrices?Ted Dunning 2012-07-06, 18:51
It is critical to use randomized projections here in order to get the
dimension independent characteristics. On Fri, Jul 6, 2012 at 11:32 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > LSH is probably my ticket, thanks all. I tried a form of this, but > just used the basis of the feature space to define the hyperplanes > because I was lazy and experimenting. I didn't work well in the sense > that the best recommendations were not hashed together unless you had > fairly few buckets (i.e., not much speedup). But -- I imagine I did > something wrong. > > On Fri, Jul 6, 2012 at 7:01 PM, sam wu <[EMAIL PROTECTED]> wrote: > > LSH has many different flavors (based on the different similarity > metric). > > Normally Minhash, which is good for if you have boolean (yes-no, 0-1) > > features, and in the case of k-shingle, it fits well. > > In the latent topcis model, like ALS, the feature is no longer 0-1. I > think > > Random Hyperplane (cosine-similarity for LSH) will be better. > > > > Another thought for finding NN, is to steal some idea from Ted's previous > > "K-Means Cluster at Scale", projection search for nearest cluster ( how > to > > efficient to find k-NN centroids for a new vector). One TreeSet per > feature > > with HeadSet & TailSet. Not sure will this scale to hugh data ? > > BTW, I recalled this streaming K-Means will be rolled into Mahout 0.8, > but > > I didn't find it. is this in the pipeline ? > > > > Sam > > > > > > On Fri, Jul 6, 2012 at 3:18 AM, Jens Grivolla <j+[EMAIL PROTECTED]> > wrote: > > > >> Maybe locality-sensitive hashing can help to get candidates before > >> calculating the exact distance? > >> > >> Bye, > >> Jens > >> > >> > >> On 07/06/2012 11:35 AM, Sean Owen wrote: > >> > >>> Here's one I've been puzzling over for a bit. In a factorization based > >>> on the SVD or what have you, you reconstruct the approximate original > >>> matrix (well, one row) by multiplying the matrices back together and > >>> looking for the largest elements. This is essentially multiplying a > >>> user feature vector by the entire item-feature matrix to reconstruct > >>> one approximate row of the input. > >>> > >>> That's not necessarily so slow, but it's not the fastest thing. I want > >>> to speed it up. It seems like there ought to be some shortcut, even if > >>> it means a probabilistic approach that could get it slightly wrong at > >>> times. > >>> > >>> I say so because most item feature vectors are nowhere near the user > >>> feature vector in feature space. Their dot product is not going to be > >>> the largest. In fact, given the user feature vector you can say > >>> exactly where in feature space (which direction) you want to look for > >>> the top items. For example, if the user feature vector is (2,1) you > >>> are looking for item vector (x,y) that maximizes 2x+y and that is > >>> largest in the direction of (2,1). > >>> > >>> When feature space is 50+-dimensional though, I'm having a hard time > >>> thinking of an efficient way to index those item feature vectors such > >>> that one could quickly find a few buckets of items to check and with > >>> high confidence have found the best recommendations. The bases I have > >>> are not necessarily orthogonal let alone orthonormal either. I bet, > >>> hope, someone will have an insight? > >>> > >>> You could cluster the items with k-means, quickly, I suppose. I was > >>> hoping for something less heavy. > >>> > >>> Sean > >>> > >>> > >> > >> > >> > |