Re: Understanding the SVD recommender
Ted Dunning 2010-06-04, 07:53
On Fri, Jun 4, 2010 at 12:38 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> Yes thanks a lot. Makes sense to me: we're just changing basis and V
> is the change-of-basis transformation. Glad to see that is all there
> is to it; not sure what the rest is about.
> I had thought of U S as "user preferences for features" and V as
> "expression of features in items". The paper breaks S in half by
> taking the square root S = B* B and putting B* with U and B with V.
This is pretty common usage actually. It allows some degree of
normalization of the vectors, but isn't strictly necessary.
Note that since this is an SVD, S is diagonal and all elements are real (and
positive, actually). Thus B* = B.
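To make the split concrete, here is a tiny numpy sketch (my own toy example, nothing from the paper):

import numpy as np

A = np.random.rand(5, 4)                  # toy users x items rating matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = np.diag(np.sqrt(s))                   # B = S^(1/2): diagonal, real, non-negative

user_factors = U @ B                      # "user preferences for features"
item_factors = B @ Vt                     # "expression of features in items"

assert np.allclose(B.conj().T, B)                    # B* = B, since B is real and diagonal
assert np.allclose(user_factors @ item_factors, A)   # same product as U S V'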
> But am I right that both are equivalent?
They are equivalent, yes, but keeping U S together is not the customary form. The old
LSI article makes a better case for the split than I can off the cuff just before bed.
> Because I'd rather think of
> maintaining and updating U S. Because conceptually S is just full of
> multipliers -- making users 3x more keen on feature 1 is the same as
> making the items express 3x less of feature 1. Certainly in the
> recommendation computation they show, which makes sense, it doesn't
> matter since the dot product is the same.
Actually the elements of S aren't item or user specific. Remember there are
only k non-zeros there.
They represent the strength of each of the singular vectors. A very useful
way to look at it is to consider the SVD as a sum of rank-1 outer products
u_i * v'_i. These rank-1 products are summed with weights s_i. This way of
looking at things makes a number of lemmas about SVDs pretty trivial.
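A quick numpy sanity check of that rank-1 view, again just a toy sketch:

import numpy as np

A = np.random.rand(5, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A is exactly the sum of rank-1 outer products u_i v_i', weighted by s_i.
full = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(full, A)

# Keeping only the k largest weights gives the usual rank-k approximation.
k = 2
A_k = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))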
> They also add on the "row" average to make a prediction, which is the
> average rating by the user, I'm guessing -- "row" is a row of A?
I would guess so, but that would only make sense if they subtracted it ahead
of time. In general, I don't see the point of that. I would rather cosine-normalize
each user row.
> Just working backwards, I'd assume this is because the generated
> predictions are otherwise "centered" in the sense that 0 will be
> predicted for an item that the user might be neutral on. But I guess I
> hadn't seen the intuitive reason this is the result. Is there any easy
> way to see it?
For a new user with no history, h = 0, so the corresponding k-dimensional
representation of this user will be h V S^(-1/2) = 0. The dot product with
any item vector will be identically 0.
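A toy numpy sketch of that fold-in (mine, not from the paper; h is the new user's rating row as above):

import numpy as np

A = np.random.rand(5, 4)                   # existing users x items matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

h = A[0]                                   # pretend this row is a new user's history
u_new = h @ V @ np.diag(1.0 / np.sqrt(s))  # fold-in: h V S^(-1/2)
assert np.allclose(u_new, (U @ np.diag(np.sqrt(s)))[0])   # recovers that user's U S^(1/2) row

h_empty = np.zeros(A.shape[1])             # no history at all
u_empty = h_empty @ V @ np.diag(1.0 / np.sqrt(s))
assert np.allclose(u_empty, 0.0)           # so every predicted dot product is 0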
I don't know that it would make any useful difference, but it would make me
happier to reduce the ratings to binary, normalize rows and then decompose.
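Roughly what I have in mind, as an untested sketch:

import numpy as np

ratings = np.array([[5., 0., 3., 0.],
                    [4., 2., 0., 1.],
                    [0., 0., 0., 0.]])      # 0 means "did not rate"

binary = (ratings > 0).astype(float)        # reduce ratings to binary
norms = np.linalg.norm(binary, axis=1, keepdims=True)
norms[norms == 0] = 1.0                     # leave all-zero rows alone
normalized = binary / norms                 # cosine-normalized user rows

U, s, Vt = np.linalg.svd(normalized, full_matrices=False)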
In general, I think that the BellKor team had a much better approach with
SVD++ and their time-dynamics trick. That is much the same as mean removal.
> On Fri, Jun 4, 2010 at 6:48 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > You are correct. The paper has an appalling treatment of the folding in
> > approach.
> > In fact, the procedure is dead simple.
> > The basic idea is to leave the coordinate system derived in the original
> > decomposition intact and simply project the new users into that space.
> > The easiest way to see what is happening is to start again with the
> > rating matrix A as decomposed:
> > A = U S V'
> > where A is users x items. If we multiply on the right by V, we get
> > A V = U S V' V = U S
> > (because V' V = I, by definition). This result is (users x items) x (items
> > x k) = users x k, that is, it gives a k-dimensional vector for each user.
> > Similarly, multiplication on the left by U' gives a k x items matrix which,
> > when transposed, gives a k-dimensional vector for each item.
> > This implies that if we augment U with new user row vectors U_new, we should
> > be able to simply compute new k-dimensional vectors for the new users and
> > adjoin these new vectors to the previous vectors. Concisely put,