Nutch, mail # user - PDF not crawled/indexed


RE: PDF not crawled/indexed
Markus Jelsma 2012-05-22, 09:39
Please read the description.
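
For reference, the description of http.content.limit in conf/nutch-default.xml gives the limit in bytes, so 65536 is 64 KiB, far below a 4.7 MB PDF. Content longer than the limit is truncated, and a negative value turns truncation off. A minimal sketch of an override, assuming it goes in conf/nutch-site.xml and that 10485760 (10 MiB) is merely an illustrative value rather than a recommendation:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Unit is bytes. Default is 65536 (64 KiB).
         10485760 (10 MiB) is an assumed example value; use -1 to disable truncation. -->
    <value>10485760</value>
  </property>
</configuration>
```

After changing the value, the PDF has to be fetched again for the new limit to take effect.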
 
 
-----Original message-----
> From:Tolga <[EMAIL PROTECTED]>
> Sent: Tue 22-May-2012 11:37
> To: [EMAIL PROTECTED]
> Subject: Re: PDF not crawled/indexed
>
> What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
>
> On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
> > Yes I know.
> >
> > If your PDFs are larger than this, they will either be truncated
> > or may not be crawled. Please look thoroughly at your log output;
> > you may wish to use the http.verbose and fetcher.verbose properties as
> > well.
> >
> > On Tue, May 22, 2012 at 10:31 AM, Tolga<[EMAIL PROTECTED]>  wrote:
> >> The value is 65536
> >>
> >> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
> >>> Check your http.content.limit and also make sure that you haven't
> >>> changed anything within the Tika mimeType mappings.
> >>>
> >>> On Tue, May 22, 2012 at 9:06 AM, Tolga<[EMAIL PROTECTED]>    wrote:
> >>>> Sorry, I forgot to also add my original problem. PDF files are not
> >>>> crawled.
> >>>> I even modified -topN to be 10.
> >>>>
> >>>>
> >>>> -------- Original Message --------
> >>>> Subject:        PDF not crawled/indexed
> >>>> Date:   Tue, 22 May 2012 10:48:15 +0300
> >>>> From:   Tolga<[EMAIL PROTECTED]>
> >>>> To:     [EMAIL PROTECTED]
> >>>>
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> I am crawling my website with this command:
> >>>>
> >>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
> >>>> http://localhost:8983/solr/ -depth 20 -topN 5
> >>>>
> >>>> Is it a good idea to modify the directory name? Should I always delete
> >>>> indexes prior to crawling and stick to the same directory name?
> >>>>
> >>>> Regards,
> >>>>
> >>>
> >
> >
>