-
Near Duplicate Detection in nutch /Solr
parnab kumar 2012-06-23, 08:41
Hi,
I have crawled and indexed around 2.5 million web pages. However, almost 30% of the pages are near duplicates. Is there any functionality in Solr or Nutch to remove those near duplicates from the index? As far as I know, the Nutch dedup command only handles exact duplicates, and removing exact duplicates alone won't serve my purpose. Please help or advise me on how to address the problem.
Thanks,
Parnab
-
RE: Near Duplicate Detection in nutch /Solr
Markus Jelsma 2012-06-23, 08:46
You can use Nutch's TextProfileSignature to create a less-than-exact signature for pages. It can remove some near duplicates.
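To illustrate why a "less-than-exact" signature catches near duplicates, here is a rough Python sketch of a text-profile style hash. It is only an approximation of the idea and not Nutch's actual TextProfileSignature code: the function name, the tokenizer regex and the tie-breaking order are assumptions. The two parameters mirror what I believe are the Nutch settings `db.signature.text_profile.min_token_len` and `db.signature.text_profile.quant_rate`; check nutch-default.xml for your version before relying on them.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash a quantized term-frequency profile instead of the raw bytes."""
    # Lowercase, split on non-alphanumerics, drop very short tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step: frequencies are rounded down to a multiple of it.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:  # rare tokens fall out of the profile
            profile.append((token, quantized))

    # Sort by frequency (descending), then token, so the hash is stable.
    profile.sort(key=lambda item: (-item[1], item[0]))
    blob = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()
```

Because rare tokens are dropped and counts are rounded, two pages that differ only in a date, a counter or a session ID tend to hash to the same value, while an exact MD5 of the content would differ.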
-
Re: Near Duplicate Detection in nutch /Solr
remi tassing 2012-06-23, 09:59
I'm very interested in this topic as well. Please let the community know if/when you get something cool implemented =)
-
Re: Near Duplicate Detection in nutch /Solr
John McCormac 2012-06-23, 11:08
From experience, the problem is that many businesses effectively have multiple copies of their websites on the web because they do not use 301 redirects. This means that example.com, example.net, example.org and example.cctld may all be the same site, differing only in the domain name.
The solution involves identifying which of these clone sites is actually the main site and then excluding the clones from the indexing list. Sometimes you can use in-page cues such as URL construction or Base href tags to identify the main site. However, the best way to solve the clones problem is outside the main/live index.
Regards...jmcc
--
**********************************************************
John McCormac  *  e-mail: [EMAIL PROTECTED]
MC2            *  web: http://www.hosterstats.com/
22 Viewmount   *  Domain Registrations Statistics
Waterford      *  And Historical DNS Database.
Ireland        *  Over 275 Million Domains Tracked.
IE             *  http://www.hosterstats.com/blog
**********************************************************
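As a rough illustration of the in-page cue idea, the Python sketch below pulls the host out of a page's `<base href>` tag and compares it with the host the page was fetched from. The class and function names are hypothetical, and also treating `<link rel="canonical">` as a cue is my own assumption rather than something stated in this thread. A mismatch is only a hint that the fetched host may be a clone, not proof.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class MainSiteCues(HTMLParser):
    """Collects hosts named by <base href> and <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "base" or (tag == "link" and "canonical" in rel):
            host = urlsplit(href).hostname  # None for relative hrefs
            if host:
                self.hosts.append(host.lower())


def declared_hosts(html):
    """Return the hosts a page declares for itself, in document order."""
    parser = MainSiteCues()
    parser.feed(html)
    return parser.hosts


def looks_like_clone(fetched_host, html):
    """True if the page names a host other than the one it was fetched from."""
    hosts = declared_hosts(html)
    return bool(hosts) and fetched_host.lower() not in hosts
```

For example, a page fetched from www.example.com whose head contains a base href pointing at www.example.ie would be flagged, suggesting that the .ie site is the main one.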
-
RE: Near Duplicate Detection in nutch /Solr
Markus Jelsma 2012-06-23, 11:14
Nutch now has a HostURLNormalizer capable of normalizing source hosts to a target host. This prevents duplication of complete websites and bad hyperlinks.
https://issues.apache.org/jira/browse/NUTCH-1319
-
Re: Near Duplicate Detection in nutch /Solr
John McCormac 2012-06-23, 11:28
But does the HostURLNormalizer only normalize subdomains to the main site within the same TLD (sub.example.org to example.org etc), or can it also map clone sites in different TLDs to the main site?
Regards...jmcc
-
RE: Near Duplicate Detection in nutch /Solr
Markus Jelsma 2012-06-23, 12:17
Hello,
It maps anything to anything and has wildcard support:
*.example.com example.org
maps all URLs on the example.com domain to example.org.
Cheers
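The mapping described above can be mimicked with a small Python sketch. It is not the NUTCH-1319 code. It assumes a rules text with one "source target" pair per line, and it assumes that `*.example.com` matches both subdomains and the bare example.com host, which is how I read "all URLs on the example.com domain". The function names are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def parse_rules(text):
    """Parse 'source target' lines into a dict; '#' lines are comments."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split()
        rules[source.lower()] = target.lower()
    return rules


def normalize_host(url, rules):
    """Rewrite the host of url according to rules; otherwise return url."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    target = rules.get(host)  # an exact host rule wins
    if target is None:
        labels = host.split(".")
        # Try *.a.b.c, then *.b.c, ... (assumed wildcard semantics).
        for i in range(len(labels) - 1):
            target = rules.get("*." + ".".join(labels[i:]))
            if target is not None:
                break
    if target is None:
        return url
    # Keep an explicit port; any user:password part is dropped.
    netloc = target if parts.port is None else f"{target}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))
```

With the single rule `*.example.com example.org`, http://www.example.com/a?b=1 becomes http://example.org/a?b=1, while URLs on example.net pass through unchanged.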
-
Re: Near Duplicate Detection in nutch /Solr
John McCormac 2012-06-23, 13:10
Thanks.
The main problem, though, is still identifying the clone/original sites so that the mapping can be determined.
The process I use has the advantage that the set of websites to be indexed is predetermined, so the clone/original problem is dealt with (for the most part) before the main indexing run. It can be a complicated approach depending on the number of TLDs and target countries involved.
The logic behind this approach is preventing GIGO, as it is easier and more efficient to solve the clone problem before it takes cycles and bandwidth in the main index run.
What I have seen is that some businesses will use a number of keyword-type domains pointing (without a 301 redirect) to their main site. However, the main clone pair is the ccTLD/.com version of a site (same domain but different TLDs). The .net and .org may also exist for older businesses. The non-core TLDs (biz/info/mobi/eu/asia etc) are often less likely to be properly set up in DNS with a working website, as about 85% of a country's domain footprint will be concentrated on the ccTLD/.com axis.
Regards...jmcc
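One way to shortlist clone candidates before the main run is to group hosts that share the same name under different TLDs, as in the Python sketch below. This is not the process described above, whose details are not given here. The hard-coded suffix list is a stand-in assumption (a real run would need something like the Public Suffix List), and the function names are hypothetical. The output is only a list of candidates: the sites still have to be compared, for instance by content signature, before any mapping is decided.

```python
from collections import defaultdict

# Illustrative suffixes only; not a complete or authoritative list.
SUFFIXES = ("co.uk", "com", "net", "org", "ie")


def split_domain(host, suffixes=SUFFIXES):
    """Return (name, suffix) for the registered domain, or None."""
    host = host.lower().rstrip(".")
    # Longest suffix first so 'co.uk' is tried before any shorter match.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if host.endswith("." + suffix):
            name = host[: -len(suffix) - 1].split(".")[-1]
            return name, suffix
    return None


def candidate_clone_groups(hosts, suffixes=SUFFIXES):
    """Group registered domains that share a name across several TLDs."""
    groups = defaultdict(set)
    for host in hosts:
        parts = split_domain(host, suffixes)
        if parts:
            name, suffix = parts
            groups[name].add(f"{name}.{suffix}")
    return {name: sorted(domains)
            for name, domains in groups.items() if len(domains) > 1}
```

Given www.example.ie, example.com and example.co.uk, this reports one group for "example" with three registered domains to check.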
-
RE: Near Duplicate Detection in nutch /Solr
Markus Jelsma 2012-06-23, 13:21
Keep an eye on these open issues:
https://issues.apache.org/jira/browse/NUTCH-1324
https://issues.apache.org/jira/browse/NUTCH-1325
https://issues.apache.org/jira/browse/NUTCH-1326
They are a set of tools capable of deduplicating the various databases via the HostNormalizer. They collect information on hosts, most importantly the link score. They also collect information on duplicates within a host and then produce deduplication rules for the HostNormalizer based on the host and duplicate information.
It's limited to domain because that's a larger problem in terms of resources and a bit easier to deal with.
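The step from "host and duplicate information" to normalizer rules can be pictured with the Python sketch below. It is not the code from NUTCH-1324 to NUTCH-1326: the input shapes (groups of hosts already known to serve duplicate content, and a dict of per-host link scores) and the function name are assumptions made for illustration. Each group is mapped onto its highest-scoring host, and the output uses the "source target" line format mentioned earlier in the thread.

```python
def dedup_rules(duplicate_groups, link_scores):
    """Emit 'source target' rules mapping each group to its best host.

    duplicate_groups: iterable of host collections that serve the same
        content (hypothetical input, e.g. from a signature comparison).
    link_scores: dict of host -> link score; missing hosts count as 0.0.
    """
    rules = []
    for group in duplicate_groups:
        hosts = sorted(set(group))
        if len(hosts) < 2:
            continue  # nothing to deduplicate
        # Highest link score wins; ties go to the alphabetically first host.
        target = max(hosts, key=lambda h: link_scores.get(h, 0.0))
        rules.extend(f"{host} {target}" for host in hosts if host != target)
    return rules
```

Using the link score to pick the target is a design choice that follows the remark above that the link score is the most important host statistic; any other ranking could be swapped in through the key function.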
-
Re: Near Duplicate Detection in nutch /Solr
John McCormac 2012-06-23, 14:35
The HostDB patch looks interesting. (I'm still very much a novice as regards Nutch and Java.) It might be a good thing to add a DNS lookup field and an IP lookup field. Some hosters have domain graveyard IPs (and PPC parking pages) where they point undeveloped or unrenewed domains. This would help with the blacklisting process by allowing unrenewed sites to be identified simply by IP. In DNS terms, if a domain moves to a PPC (sedoparking.com etc) or auction hoster (afternic.com etc) then it is no longer worth including in an active index.
Regards...jmcc
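The IP-based blacklisting idea could look roughly like the Python sketch below. It is independent of the HostDB patch, and the function names and the graveyard IP set are hypothetical: you would have to build that set yourself from known parking and auction hosters. The resolver is passed in as a parameter so the logic can be tried without live DNS.

```python
import socket


def resolve(host):
    """Return the IPv4 address for host, or None if the lookup fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # includes socket.gaierror for unknown hosts
        return None


def flag_parked(hosts, graveyard_ips, resolver=resolve):
    """Return {host: ip} for hosts resolving to a known graveyard IP."""
    flagged = {}
    for host in hosts:
        ip = resolver(host)
        if ip is not None and ip in graveyard_ips:
            flagged[host] = ip
    return flagged
```

Hosts returned by flag_parked would be candidates for exclusion from the crawl list. Hosts that fail to resolve are not flagged here, although they may deserve separate handling.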
-
RE: Near Duplicate Detection in nutch /Solr
Markus Jelsma 2012-06-23, 15:25
Thanks for your comments. Please consider adding them to the issue so we can keep track of them.