-
Large website not fully crawled
Tolga 2012-05-24, 07:17
Hi,
I am crawling a large website, our university's. From the logs and some grepping, I see that some PDF files were not crawled. Why could this happen? I'm crawling with -depth 100 -topN 5.
Regards,
-
Re: Large website not fully crawled
Piet van Remortel 2012-05-24, 07:25
- your topN parameter limited the crawl: see the info at http://wiki.apache.org/nutch/NutchTutorial
or:
- file filters
- there is no link to the files (as you suggested yourself already)
- did you check the correct/all segments?
- did you check the fully correct filenames? wildcards don't work on all segmentreader approaches
- size limits of the crawler (see previous discussion)
- did you check file presence in the segment, or parse result? i.e. parsing could have failed (cf. the previous discussion of the last few days)
- your disk got full and crawling stopped
- the webserver(s) kicked you off
- your hadoop logs have overrun the local disk on which the crawler was running (i.e. disk full)
Piet
-
Re: Large website not fully crawled
Tolga 2012-05-24, 07:35
- I don't fully understand the use of the topN parameter. Should I increase it?
- You mean the parse-pdf thing? I've got that in my nutch-default.xml.
- I looked for the link; it was there. Besides, that was for another website I was experimenting on.
- How do I check segments?
- I didn't check filenames, but I've tried searching for a word in that PDF file.
- I've got more than 50 GB free.
- I'm not sure about the webserver kicking me off; I'll have to check that with the sysadmin.
Regards,
-
Re: Large website not fully crawled
Piet van Remortel 2012-05-24, 08:00
On Thu, May 24, 2012 at 9:35 AM, Tolga <[EMAIL PROTECTED]> wrote:
> - I don't fully understand the use of topN parameter. Should I increase it?
yes
> - You mean parse-pdf thing? I've got that in my nutch-default.xml.
good, should work then
> - I looked for the link, it was there. Besides, that was for another website I was experimenting on.
> - How do I check segments?
e.g. with segmentreader, a hadoop access command built in nutch
> - I didn't check filenames, but I've tried searching for a word in that PDF file.
then the reason could also be indexing
> - I've got more than 50gb free.
> - I'm not sure about webserver kicking me off, I'll have the check that with the sysadmin.
should be visible as something like timeouts or a similar message in the hadoop logs
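On the "How do I check segments?" point, here is a minimal sketch of one way to look for the missing PDFs. It assumes the segment has first been dumped to text with the segment reader (`bin/nutch readseg -dump <segment_dir> <output_dir>`) and that each record in the dump carries a line starting with `URL:: `. The dump layout is an assumption to verify against your Nutch version, and the helper name `pdf_urls_in_dump` and the `example.edu` URLs are made up for illustration.

```python
from urllib.parse import urlparse


def pdf_urls_in_dump(dump_text):
    """Return the distinct PDF URLs found in a segment dump, in first-seen order."""
    seen = []
    for line in dump_text.splitlines():
        # Assumed record layout: one "URL:: <url>" line per record.
        if not line.startswith("URL:: "):
            continue
        url = line[len("URL:: "):].strip()
        # Compare the path only, so a query string does not hide the extension.
        if urlparse(url).path.lower().endswith(".pdf") and url not in seen:
            seen.append(url)
    return seen


# Hypothetical dump excerpt, for illustration only.
sample = """Recno:: 0
URL:: http://example.edu/index.html

Recno:: 1
URL:: http://example.edu/docs/report.PDF

Recno:: 2
URL:: http://example.edu/docs/report.PDF
"""

print(pdf_urls_in_dump(sample))
```

If a PDF you expect is absent from every segment's dump, it was never fetched. If it is present but not searchable, parsing or indexing is the more likely culprit, as suggested above.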
-
Re: Large website not fully crawledTolga 2012-05-24, 08:19
On 5/24/12 11:00 AM, Piet van Remortel wrote: > On Thu, May 24, 2012 at 9:35 AM, Tolga<[EMAIL PROTECTED]> wrote: > >> - I don't fully understand the use of topN parameter. Should I increase it? >> > yes What would a sensible topN value be a for a large university website? > > >> - You mean parse-pdf thing? I've got that in my nutch-default.xml. >> > good, should work then > > >> - I looked for the link, it was there. Besides, that was for another >> website I was experimenting on. >> - How do I check segments? >> > e.g. with segmentreader, a hadoop access command built in nutch > > >> - I didn't check filenames, but I've tried searching for a word in that >> PDF file. >> > then the reason could also be indexing > > >> - I've got more than 50gb free. >> - I'm not sure about webserver kicking me off, I'll have the check that >> with the sysadmin. >> > should be visible as something like timeouts or a similar message in the > hadoop logs > > >> Regards, >> >> >> On 5/24/12 10:25 AM, Piet van Remortel wrote: >> >>> - your topN parameter limited the crawl : see the info at >>> http://wiki.apache.org/nutch/**NutchTutorial<http://wiki.apache.org/nutch/NutchTutorial> >>> >>> or : >>> >>> - file filters >>> - there is no link to the files (as you suggested yourself already) >>> - did you check the correct/all segments ? >>> - did you check the fully correct filenames ? wildcards don't work on all >>> segmentreader approaches >>> - size limits of the crawler (see previous discussion) >>> - did you check file presence in the segment, or parse result ? i.e. >>> parsing could have failed (cfr the previous discussion of the last few >>> days) >>> - your disk got full and crawling stopped >>> - the webserver(s) kicked you off >>> - your hadoop logs have overrun the local disk on which the crawler was >>> running (i.e. 
disk full) >>> >>> Piet >>> >>> >>> On Thu, May 24, 2012 at 9:17 AM, Tolga<[EMAIL PROTECTED]> wrote: >>> >>> Hi, >>>> I am crawling a large website, which is our university's. From the logs >>>> and some grep'ing, I see that some pdf files were not crawled. Why could >>>> this happen? I'm crawling with -depth 100 -topN 5. >>>> >>>> Regards, >>>> >>>>
-
Re: Large website not fully crawled
Piet van Remortel 2012-05-24, 08:28
I googled for you:
"Typically one starts testing one’s configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources."
Also, as the Nutch documentation shows, the topN parameter is optional.
Can I respectfully suggest that you go through the basic information that is available online to get familiar with Nutch. Copying the online information into this mailing list is not helping anybody.
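To make the effect of the settings concrete, a rough sketch: if -topN caps the pages fetched at each level and -depth is the number of levels, as the tutorial text quoted in this message describes, then their product is an upper bound on the pages a crawl can fetch. The function name `max_pages_fetched` and the second pair of values are only illustrative.

```python
def max_pages_fetched(depth, top_n):
    """Upper bound on pages fetched: at most top_n pages in each of depth rounds."""
    if depth < 0 or top_n < 0:
        raise ValueError("depth and top_n must be non-negative")
    return depth * top_n


# Settings from this thread: -depth 100 -topN 5
print(max_pages_fetched(100, 5))

# Hypothetical full-crawl settings in the range the tutorial mentions
print(max_pages_fetched(10, 50000))
```

With -depth 100 -topN 5 the bound is 500 pages, which a large university website can easily exceed, so some documents are skipped however deep the crawl goes.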
-
Re: Large website not fully crawled
Tolga 2012-05-24, 11:52
I might have figured out why. Our website has a lot of query strings in addresses. One example is http://www.sabanciuniv.edu/eng/?genel_bilgi/yonetim/yonetim_kapak/yonetim_kapak.html. Could this be why? If that's the case, how do I crawl it?
Regards,
-
Re: Large website not fully crawled
Piet van Remortel 2012-05-24, 12:10
That could be it indeed.
I googled it for you; this is the first hit when searching for "nutch crawl query pages":
http://stackoverflow.com/questions/7045716/nutch-1-2-why-wont-nutch-crawl-url-with-query-strings
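As a rough illustration of why a URL with a query string can be dropped: the stock conf/regex-urlfilter.txt shipped with Nutch 1.x contains, as far as I know, a rule `-[?*!@=]` that skips URLs containing those characters. The sketch below imitates the first-match-wins behaviour of that filter in Python. The rule lists are simplified assumptions, not a copy of your configuration, and `RELAXED_RULES` is a hypothetical edit, so check your own file before changing anything.

```python
import re

# Assumed rules, modelled on the stock conf/regex-urlfilter.txt (simplified).
DEFAULT_RULES = [
    ("-", r"[?*!@=]"),  # skip URLs containing characters typical of queries
    ("+", r"."),        # accept anything else
]

# Hypothetical relaxed rules: '?' and '=' removed from the skip rule.
RELAXED_RULES = [
    ("-", r"[*!@]"),
    ("+", r"."),
]


def accepts(url, rules):
    """The first rule whose pattern is found in the URL decides; no match means reject."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


url = "http://www.sabanciuniv.edu/eng/?genel_bilgi/yonetim/yonetim_kapak/yonetim_kapak.html"
print(accepts(url, DEFAULT_RULES))
print(accepts(url, RELAXED_RULES))
```

Under these assumptions the example URL from this thread is rejected by the default rules and accepted once '?' is no longer in the skip rule. Relaxing the rule can also let the crawler into endless dynamically generated pages, so it is worth pairing with a sensible -topN.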