-
PDF not crawled/indexed
Tolga 2012-05-22, 07:48
Hi,

I am crawling my website with this command:

    bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 20 -topN 5

Is it a good idea to modify the directory name? Should I always delete indexes prior to crawling and stick to the same directory name?

Regards,
-
Re: PDF not crawled/indexed
Lewis John Mcgibbney 2012-05-22, 09:13
Hi Tolga,

On Tue, May 22, 2012 at 8:48 AM, Tolga <[EMAIL PROTECTED]> wrote:
> Is it a good idea to modify the directory name?

I suppose this is up to you... do you want to?

> Should I always delete indexes prior to crawling and stick to the same directory name?

It depends on what you're trying to achieve. If you wish to keep separate crawldbs, segments, etc., then change the names. If not, either use one or more of the tools Nutch provides for merging, or use a single directory structure in the first place.

hth

--
Lewis
-
Re: PDF not crawled/indexed
Lewis John Mcgibbney 2012-05-22, 09:14
Check your http.content.limit, and also make sure that you haven't changed anything within the Tika mimeType mappings.

On Tue, May 22, 2012 at 9:06 AM, Tolga <[EMAIL PROTECTED]> wrote:
> Sorry, I forgot to also add my original problem. PDF files are not crawled.
> I even modified -topN to be 10.
>
> -------- Original Message --------
> Subject: PDF not crawled/indexed
> Date: Tue, 22 May 2012 10:48:15 +0300
> From: Tolga <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
>
> Hi,
>
> I am crawling my website with this command:
>
> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 20 -topN 5
>
> Is it a good idea to modify the directory name? Should I always delete
> indexes prior to crawling and stick to the same directory name?
>
> Regards,

--
Lewis
-
Re: PDF not crawled/indexed
Tolga 2012-05-22, 09:19
By Tika mimeType settings, do you mean protocol-http?

On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
> try your http.content.limit and also make sure that you haven't
> changed anything within the tika mimeType mappings.
-
Re: PDF not crawled/indexed
Lewis John Mcgibbney 2012-05-22, 09:26
Sorry, I should have been more explicit about the exact file location:

http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml

hth

On Tue, May 22, 2012 at 10:19 AM, Tolga <[EMAIL PROTECTED]> wrote:
> By, tika mimeType settings, do you mean protocol-http?

--
Lewis
-
Re: PDF not crawled/indexed
Tolga 2012-05-22, 09:27
Hmm, okay. I never touched that file.

On 5/22/12 12:26 PM, Lewis John Mcgibbney wrote:
> Sorry I should have been more explicit about the exact file location
>
> http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
>
> hth
-
Re: PDF not crawled/indexed
Tolga 2012-05-22, 09:31
The value is 65536

On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
> try your http.content.limit and also make sure that you haven't
> changed anything within the tika mimeType mappings.
-
Re: PDF not crawled/indexed
Lewis John Mcgibbney 2012-05-22, 09:34
Yes, I know.

If your PDFs are larger than this, they will either be truncated or may not be crawled. Please look thoroughly at your log output... you may wish to use the http.verbose and fetcher.verbose properties as well.

On Tue, May 22, 2012 at 10:31 AM, Tolga <[EMAIL PROTECTED]> wrote:
> The value is 65536

--
Lewis
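
For readers following along, the two logging properties mentioned in this reply could be switched on roughly as shown here. This is only a sketch: it assumes local overrides are kept in conf/nutch-site.xml and that both properties accept a boolean value. The comments are illustrative and are not Nutch's own description text.

```xml
<!-- Sketch only: place these inside the existing <configuration> element
     of conf/nutch-site.xml (assumed location for local overrides). -->
<property>
  <name>http.verbose</name>
  <!-- Assumed boolean; true asks the HTTP protocol plugin for more log detail. -->
  <value>true</value>
</property>
<property>
  <name>fetcher.verbose</name>
  <!-- Assumed boolean; true asks the fetcher for more log detail. -->
  <value>true</value>
</property>
```

With both enabled, the log output for the PDF URL should make it easier to see whether the file was fetched at all and how much content was kept.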
-
Re: PDF not crawled/indexed
Tolga 2012-05-22, 09:36
What is that value's unit? Kilobytes? My PDF file is 4.7 MB.

On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
> Yes I know.
>
> If your PDF's are larger than this then they will be either truncated
> or may not be crawled. Please look thoroughly at your log output...
> you may wish to use the http.verbose and fetcher.verbose properties as
> well.
-
RE: PDF not crawled/indexed
Markus Jelsma 2012-05-22, 09:39
Please read the description.

-----Original message-----
> From: Tolga <[EMAIL PROTECTED]>
> Sent: Tue 22-May-2012 11:37
> To: [EMAIL PROTECTED]
> Subject: Re: PDF not crawled/indexed
>
> What is that value's unit? kilobytes? My PDF file is 4.7mb.
-
Re: PDF not crawled/indexed
Lewis John Mcgibbney 2012-05-22, 09:44
Yes, well then you should either set this property to -1 (which is a safeguard to ensure that you definitely crawl and parse all of your PDFs) or to a safe, responsible value that reflects the size of the PDFs or other documents you expect to obtain during your crawl. The first option has the downside that on occasion the parser can choke on rather large files...

On Tue, May 22, 2012 at 10:36 AM, Tolga <[EMAIL PROTECTED]> wrote:
> What is that value's unit? kilobytes? My PDF file is 4.7mb.

--
Lewis
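
To make the two options in this reply concrete, here is a hedged sketch of an override for http.content.limit. It assumes the override goes in conf/nutch-site.xml. The value 10485760 (10 * 1024 * 1024 bytes) is purely illustrative, chosen because a 4.7 MB PDF is close to five million bytes and so far exceeds the 65536 reported earlier in the thread; pick a limit that matches your own documents.

```xml
<!-- Sketch only: place inside the existing <configuration> element
     of conf/nutch-site.xml (assumed location for local overrides). -->
<property>
  <name>http.content.limit</name>
  <!-- Illustrative value: 10485760 bytes = 10 * 1024 * 1024.
       Use -1 instead to disable the limit entirely, at the risk
       of the parser choking on very large files. -->
  <value>10485760</value>
</property>
```

A bounded value trades completeness for stability: documents above the limit are still cut off, but the parser is never handed an arbitrarily large file.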
-
Re: PDF not crawled/indexed
Piet van Remortel 2012-05-22, 09:47
I have been dealing with the exact same issues, and I wonder what happens to PDFs that exceed the file size limit: are they cropped (and partly parsed?) or fully ignored? I seem to observe parsing problems in PDFs since using a file size limit. Setting the limit to -1 indeed caused consistent choke errors on large pages/files, so setting a hard limit seemed the only option.

thanks

Piet

On Tue, May 22, 2012 at 11:44 AM, Lewis John Mcgibbney <[EMAIL PROTECTED]> wrote:
> yes well then you should either set this property to -1 (which is a
> safe guard to ensure that you definitely crawl and parse all of your
> PDF's) or a a safe guard, responsible value to reflect the size of
> PDF's or other documents which you envisage to be obtained during your
> crawl. The first option has the downside that on occasion the parser
> can choke on rather large files...
-
Re: PDF not crawled/indexed
Lewis John Mcgibbney 2012-05-22, 10:31
Well, the value is in bytes. So anything above the default (65536 bytes, i.e. 64 KB) is truncated.

Ferdy also introduced a parser.skip.truncated property, which is set to true by default. The justification is that parsing can sometimes take extremely high levels of CPU, which then leads to the parser choking.

On Tue, May 22, 2012 at 10:47 AM, Piet van Remortel <[EMAIL PROTECTED]> wrote:
> I have been dealing with the exact same issues, and I wonder what happens
> to PDF's that exceed the file size limit, are they cropped (and partly
> parsed?) or fully ignored ? I seem to observe parsing problems in PDFs
> since using a file size limit. Setting the limit to -1 indeed caused
> consistent choke errors on large pages/files so setting a hard limit seemed
> the only option.
>
> thanks
>
> Piet

--
Lewis
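
As a rough illustration of the property named in this reply, the fragment below sets parser.skip.truncated explicitly. It is a sketch under assumptions: the property is only available in builds that include it (trunk at the time of this thread, not the 1.4 release), the override is assumed to live in conf/nutch-site.xml, and the comments paraphrase the behaviour described in the thread rather than quoting Nutch's own description.

```xml
<!-- Sketch only: place inside the existing <configuration> element
     of conf/nutch-site.xml (assumed location for local overrides).
     Requires a Nutch build that includes this property. -->
<property>
  <name>parser.skip.truncated</name>
  <!-- true (the default per the thread): do not attempt to parse
       documents that were cut off by the content limit.
       false would let the parser try truncated content, which may
       fail or be CPU-intensive for formats such as PDF. -->
  <value>true</value>
</property>
```

Leaving the default in place and raising the content limit instead is the more conservative combination, since a truncated PDF is unlikely to parse cleanly anyway.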
-
Re: PDF not crawled/indexed
Piet van Remortel 2012-05-22, 10:43
Ok thanks, that property indeed seems the right solution, but it's not part of the 1.4 release that I currently use. Current source trunk includes it though.

On Tue, May 22, 2012 at 12:31 PM, Lewis John Mcgibbney <[EMAIL PROTECTED]> wrote:
> Well the value is in bytes. So anything above the default (~65000) is
> truncated.
> Ferdy also introduced a parser.skip.truncated property which is set to
> true by default. Justification on this is that parsing can sometimes
> take extremely high levels of CPU which then leads to the parser
> choking.
-
Re: PDF not crawled/indexed
Lewis John Mcgibbney 2012-05-22, 11:12
Hi Piet,

We will hopefully be pushing 1.5 in the next few days, so please watch this space.

Thanks

On Tue, May 22, 2012 at 11:43 AM, Piet van Remortel <[EMAIL PROTECTED]> wrote:
> Ok thanks, that property seems the right solution indeed, but it's not part
> of the 1.4 release that I currently use.
> Current source trunk includes it though.

--
Lewis
-
Re: PDF not crawled/indexed
Tolga 2012-05-22, 11:00
Hi again,

I am getting this error:

    org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf

I googled and found out that I have to add a plugin.includes line to include the pdf extension. However, I already have that line. Actually, the whole <property> block looks like this:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Some long description</description>
    </property>

However, I still get that error. What am I missing?

Thanks,

On 5/22/12 12:44 PM, Lewis John Mcgibbney wrote:
> yes well then you should either set this property to -1 (which is a
> safe guard to ensure that you definitely crawl and parse all of your
> PDF's) or a a safe guard, responsible value to reflect the size of
> PDF's or other documents which you envisage to be obtained during your
> crawl. The first option has the downside that on occasion the parser
> can choke on rather large files...
-
Re: PDF not crawled/indexedPiet van Remortel 2012-05-22, 11:06
another option is
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> which uses Tika, which parses PDF. On Tue, May 22, 2012 at 1:00 PM, Tolga <[EMAIL PROTECTED]> wrote: > Hi again, > > I am getting this error: org.apache.nutch.parse.**ParseException: parser > not found for contentType=application/pdf. I googled and found out that I > have to add a plugin.includes line to include pdf extension. However, I > already have that line. Actually, the whole <property> block looks like > this: > > <property> > <name>plugin.includes</name> > <value>protocol-http|**urlfilter-regex|parse-(text|** > html|js|msexcel|mspowerpoint|**msword|oo|pdf|swf|zip)|index-** > basic|query-(basic|site|url)|**summary-basic|scoring-opic|** > urlnormalizer-(pass|regex|**basic)</value> > <description>Some long description</description> > </property> > > However, I still get that error. > > What am I missing? > > Thanks, > > > On 5/22/12 12:44 PM, Lewis John Mcgibbney wrote: > >> yes well then you should either set this property to -1 (which is a >> safe guard to ensure that you definitely crawl and parse all of your >> PDF's) or a a safe guard, responsible value to reflect the size of >> PDF's or other documents which you envisage to be obtained during your >> crawl. The first option has the downside that on occasion the parser >> can choke on rather large files... >> >> On Tue, May 22, 2012 at 10:36 AM, Tolga<[EMAIL PROTECTED]> wrote: >> >>> What is that value's unit? kilobytes? My PDF file is 4.7mb. >>> >>> On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote: >>> >>>> Yes I know. >>>> >>>> If your PDF's are larger than this then they will be either truncated >>>> or may not be crawled. Please look thoroughly at your log output... >>>> you may wish to use the http.verbose and fetcher.verbose properties as >>>> well. 
>>>> >>>> On Tue, May 22, 2012 at 10:31 AM, Tolga<[EMAIL PROTECTED]> wrote: >>>> >>>>> The value is 65536 >>>>> >>>>> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote: >>>>> >>>>>> try your http.content.limit and also make sure that you haven't >>>>>> changed anything within the tika mimeType mappings. >>>>>> >>>>>> On Tue, May 22, 2012 at 9:06 AM, Tolga<[EMAIL PROTECTED]> wrote: >>>>>> >>>>>>> Sorry, I forgot to also add my original problem. PDF files are not >>>>>>> crawled. >>>>>>> I even modified -topN to be 10. >>>>>>> >>>>>>> >>>>>>> -------- Original Message -------- >>>>>>> Subject: PDF not crawled/indexed >>>>>>> Date: Tue, 22 May 2012 10:48:15 +0300 >>>>>>> From: Tolga<[EMAIL PROTECTED]> >>>>>>> To: [EMAIL PROTECTED] >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I am crawling my website with this command: >>>>>>> >>>>>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr >>>>>>> http://localhost:8983/solr/ -depth 20 -topN 5 >>>>>>> >>>>>>> Is it a good idea to modify the directory name? Should I always >>>>>>> delete >>>>>>> indexes prior to crawling and stick to the same directory name? >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> >>>> >> >> +
Piet van Remortel 2012-05-22, 11:06
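For reference, a minimal sketch of how the two settings discussed so far could sit together in conf/nutch-site.xml, the file that overrides nutch-default.xml. The plugin.includes value is the one Piet posted. http.content.limit is counted in bytes, so the 65536 Tolga reports is 64 KB, far below a 4.7 MB PDF, and a negative value turns truncation off. The 10485760 mentioned in the comment is only an illustrative figure (10 MB), not a value suggested in the thread.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml (sketch only) -->
<configuration>

  <property>
    <name>plugin.includes</name>
    <!-- parse-tika handles PDF; value as posted by Piet -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <!-- Unit is bytes. A value of -1 disables truncation. Alternatively use a
         byte count sized to your largest documents, e.g. 10485760
         (10 MB, an assumed figure). -->
    <value>-1</value>
  </property>

</configuration>
```

As Lewis points out, removing the limit has a cost: the parser can occasionally choke on very large files, so a bounded value may be the safer choice on a crawl with unknown content.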
-
Re: PDF not crawled/indexed Tolga 2012-05-22, 11:37
Hi again,
I'm getting this error:

The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file.

Should I add

<mimeType name="application/pdf">
<plugin id="parse-pdf" />
</mimeType>

to conf/parse-plugins.xml?

Regards,

On 5/22/12 2:06 PM, Piet van Remortel wrote:
> another option is
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> which uses Tika, which parses PDF.
>
> On Tue, May 22, 2012 at 1:00 PM, Tolga <[EMAIL PROTECTED]> wrote:
>
>> Hi again,
>>
>> I am getting this error: org.apache.nutch.parse.ParseException: parser
>> not found for contentType=application/pdf. I googled and found out that I
>> have to add a plugin.includes line to include pdf extension. However, I
>> already have that line. Actually, the whole <property> block looks like
>> this:
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> <description>Some long description</description>
>> </property>
>>
>> However, I still get that error.
>>
>> What am I missing?
>>
>> Thanks,
>>
>> On 5/22/12 12:44 PM, Lewis John Mcgibbney wrote:
>>
>>> yes well then you should either set this property to -1 (which is a
>>> safe guard to ensure that you definitely crawl and parse all of your
>>> PDF's) or a a safe guard, responsible value to reflect the size of
>>> PDF's or other documents which you envisage to be obtained during your
>>> crawl. The first option has the downside that on occasion the parser
>>> can choke on rather large files...
Tolga 2012-05-22, 11:37
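On the question above: the warning names org.apache.nutch.parse.tika.TikaParser, so the mapping would presumably point at parse-tika rather than parse-pdf. A parse-pdf id would only help if such a plugin were actually installed and listed in plugin.includes. What follows is a hedged sketch of the relevant pieces of conf/parse-plugins.xml, assuming the parse-tika plugin id and alias as they appear in a stock Nutch 1.x file; check both against your own copy before relying on them.

```xml
<parse-plugins>

  <!-- Map PDF explicitly to the Tika parser (assumed plugin id: parse-tika) -->
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>

  <aliases>
    <!-- Ties the plugin id above to the extension named in the warning -->
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>

</parse-plugins>
```

Only the PDF entry and its alias are shown; any other mimeType and alias entries already in the file would stay as they are.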
|