Nutch, mail # user - Near Duplicate Detection in nutch /Solr


Re: Near Duplicate Detection in nutch /Solr
remi tassing 2012-06-23, 09:59
I'm very interested in this topic as well. Please let the community know
if/when you get something cool implemented =)

On Saturday, June 23, 2012, parnab kumar wrote:

> Hi,
>
> I have crawled and indexed around 2.5 million web pages. However,
> almost 30% of the pages are near duplicates. Is there any functionality
> in Solr or Nutch to remove those near duplicates from the index? The Nutch
> dedup command only handles exact duplicates, I guess. Removing only exact
> duplicates won't serve my purpose.
>     Please help / advise me on how to address the problem.
>
> Thanks,
> Parnab
>