Better Way of calculating Cosine Similarity between documents
Kasun Perera 2012-05-18, 09:19
Hi all
I’m indexing a collection of documents using Lucene, specifying TermVector at indexing time. I then retrieve the terms and their term frequencies by reading the index and calculate a TF-IDF score vector for each document. Using those TF-IDF vectors, I calculate the pairwise cosine similarity between documents using the equation here: http://en.wikipedia.org/wiki/Cosine_similarity.

This is my problem. Say I have two identical documents “A” and “B” in this collection (A and B each have more than 200 sentences). If I calculate the pairwise cosine similarity between A and B, it gives me a cosine value of 1, which is perfectly OK. But if I remove a single sentence from Doc “B”, it gives me a cosine similarity of around 0.85 between the two documents. The documents are almost identical, but the cosine value does not reflect that. I understand the problem is with the equation that I’m using.

Is there a better way or a better equation that I can use for calculating cosine similarity between documents?

--
Regards
Kasun Perera
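For reference, the computation described in the post can be sketched roughly as follows. This is not the poster's code and does not use Lucene: the function names `tfidf_vectors` and `cosine` are hypothetical, documents are assumed to be already tokenized into lists of terms, and the weighting (raw term frequency times `1 + log(N / df)`) is an assumption, since the post does not say which TF-IDF variant is used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document.

    docs is a list of token lists. The weighting below is an assumed
    variant: raw tf * (1 + log(N / df)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {term: freq * (1.0 + math.log(n_docs / df[term]))
             for term, freq in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen here for empty vectors.
    return dot / (norm_u * norm_v)


# Toy collection: two identical documents and one unrelated document.
docs = [["a", "b", "c"], ["a", "b", "c"], ["d", "e"]]
vecs = tfidf_vectors(docs)
sim_identical = cosine(vecs[0], vecs[1])
sim_disjoint = cosine(vecs[0], vecs[2])
```

With this sketch, identical documents score approximately 1 (up to floating-point rounding) and documents sharing no terms score 0. Both vectors must be built from the same document-frequency table for the scores to be comparable.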
nemeskey.david@... 2012-05-18, 09:52
Akos Tajti 2012-05-18, 11:06