Better Way of calculating Cosine Similarity between documents
Kasun Perera 2012-05-18, 09:19
I’m indexing a collection of documents using Lucene, specifying TermVector
at indexing time. I then retrieve the terms and their term frequencies by
reading the index and calculate a vector of TF-IDF scores for each document.
Using these TF-IDF vectors, I calculate the pairwise cosine similarity
between documents using the equation here.
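To make the pipeline above concrete, here is a minimal sketch of the computation in Python rather than against the Lucene API. It assumes each document has already been reduced to a list of terms (reading them from the term vectors is not shown), and the weighting used, raw term frequency times log(N / df) + 1, is only an assumption for illustration and may differ from the equation I am actually using. The names tf_idf_vectors and cosine_similarity are made up for this sketch.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency * (log(N / df) + 1).
        vectors.append(
            {t: f * (math.log(n_docs / df[t]) + 1.0) for t, f in tf.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)
```

With vectors = tf_idf_vectors(list_of_token_lists), the pairwise value for documents i and j would be cosine_similarity(vectors[i], vectors[j]).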
This is my problem:
Say I have two identical documents “A” and “B” in this collection (A and B
have more than 200 sentences).
If I calculate the pairwise cosine similarity between A and B, it gives me
a cosine value of 1, which is perfectly OK.
But if I remove a single sentence from document “B”, it gives me a cosine
similarity of around 0.85 between these two documents. The documents are
almost identical, but the cosine value does not reflect that. I understand
the problem is with the equation that I’m using.
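The comparison described above could be reproduced on toy data with something like the following sketch. It is only the shape of the test, not my real setup: the sentences are synthetic placeholders, the vectors hold raw term counts rather than my TF-IDF weights, and the helper names cosine and term_counts are made up here.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_counts(sentences):
    """Raw term counts over all sentences (whitespace tokenization, lowercased)."""
    return Counter(word for s in sentences for word in s.lower().split())


# Hypothetical stand-in for document A: 200 synthetic sentences.
doc_a = [f"sentence number {i} talks about topic{i % 10}" for i in range(200)]
# Document B is A with a single sentence removed.
doc_b = doc_a[:-1]

sim_same = cosine(term_counts(doc_a), term_counts(doc_a))
sim_removed = cosine(term_counts(doc_a), term_counts(doc_b))
print(sim_same, sim_removed)
```

Swapping term_counts for the real weighting would show whether the drop comes from the cosine formula itself or from how the weights are computed.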
Is there a better way, or a better equation, that I can use for calculating
cosine similarity between documents?