Re: Duplicate documents in a corpus
Ken Krugler 2011-07-28, 16:37
On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
> I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
> having trouble with many redundant docs in my corpus, which is causing
> inflated values and forcing users to process and reprocess much of the
> material. Can the redundancy be removed or managed in some sense by either
> Lucene at ingestion or Mahout at post-processing? The Vector Space Model
> seems notionally similar to PCA or Factor Analysis, both of which have
> similar ambitions. Thoughts?
Nutch has a TextProfileSignature class that creates a hash which is somewhat resilient to minor text changes between documents.
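The idea can be sketched in Python. This is a simplified, illustrative version of what a text-profile signature does (tokenize, count term frequencies, quantize the counts, hash the profile); the tokenizer, thresholds, and quantization rule here are my own placeholders, not Nutch's exact implementation:

```python
import hashlib
import re
from collections import Counter

def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash a quantized term-frequency profile so that minor edits
    (typos, small boilerplate changes) still yield the same signature.
    Loosely modeled on Nutch's TextProfileSignature; details are illustrative."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if len(t) >= min_token_len)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    max_freq = max(counts.values())
    # Quantization step: frequencies are rounded down to multiples of
    # `quant`, so small count differences between near-duplicates vanish.
    quant = max(2, round(max_freq * quant_rate)) if max_freq > 1 else 1
    profile = sorted(
        (tok, (freq // quant) * quant)
        for tok, freq in counts.items()
        if freq >= quant
    )
    serialized = "\n".join(f"{tok} {freq}" for tok, freq in profile)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()
```

Two documents that differ only in low-frequency terms collapse to the same signature, while genuinely different documents do not.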
Given such a hash, it's straightforward to use a Hadoop workflow to remove duplicates: key each document by its signature and keep one document per key.
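In a real Hadoop job that would be a map step (emit signature as key) followed by a reduce step (keep one value per key); collapsed into a single pass, the logic is just this (`dedupe` and `signature_fn` are hypothetical names for illustration):

```python
def dedupe(docs, signature_fn):
    """Keep one representative per signature.

    docs: iterable of (doc_id, text) pairs.
    signature_fn: any function mapping text to a hash/signature,
    e.g. a TextProfileSignature-style fuzzy hash.
    """
    seen = {}
    for doc_id, text in docs:
        sig = signature_fn(text)
        # First document with a given signature wins; later
        # (near-)duplicates are dropped.
        seen.setdefault(sig, (doc_id, text))
    return list(seen.values())
```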
Alternatively, Solr supports deduplication at index time; see http://wiki.apache.org/solr/Deduplication
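Per that wiki page, Solr's deduplication is configured as an update-processor chain in solrconfig.xml using SignatureUpdateProcessorFactory; a fragment along these lines (the field names here are placeholders you would replace with your own schema fields):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- overwriteDupes=true drops later docs with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- fields the signature is computed over (placeholder names) -->
    <str name="fields">title,body</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Using TextProfileSignature as the signatureClass gives fuzzy (near-duplicate) matching; an exact hash class gives strict duplicate matching instead.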