Nutch, mail # user - Re: nutch crawling file system SOLVED


Re: nutch crawling file system SOLVED
alessio crisantemi 2012-03-11, 22:51
Thank you, Remi, for your precious help. I will try again and write to you
with the results.
But I have another small question: how can I limit the crawl to my
selected root only?

Every time, Nutch also crawls the parent directories. I read that "The
code that is responsible for this is in
org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File
f)."

And someone suggested that I change the following line:
this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);

to
this.content = list2html(f.listFiles(), path, false);

and then recompile.
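
To make the effect of that third argument concrete, here is a minimal sketch. It is not the Nutch source: the class ListingSketch, the method listToHtml, and the HTML layout are hypothetical stand-ins, and it assumes the boolean only controls whether a "../" parent link is written into the generated directory listing. Under that assumption, passing false everywhere leaves the crawler no link that leads above the starting directory.

```java
// Hypothetical, simplified stand-in for a directory-listing builder.
// It is NOT the Nutch source; names and HTML layout are illustrative only.
class ListingSketch {

    // Builds an HTML listing for the entries found under `path`.
    // When includeDotDot is true, a "../" link to the parent is emitted,
    // which gives a crawler a way out of the starting directory.
    static String listToHtml(String[] names, String path, boolean includeDotDot) {
        StringBuilder html = new StringBuilder();
        html.append("<html><body><h1>Index of ").append(path).append("</h1>\n");
        if (includeDotDot) {
            html.append("<a href=\"../\">../</a>\n");
        }
        for (String name : names) {
            html.append("<a href=\"").append(name).append("\">")
                .append(name).append("</a>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        String[] names = {"a.pdf", "b.pdf"};
        String path = "/data/pdfs/";
        // Original expression: parent link everywhere except the filesystem root.
        System.out.print(listToHtml(names, path, "/".equals(path) ? false : true));
        // Suggested change: never emit the parent link.
        System.out.print(listToHtml(names, path, false));
    }
}
```

The first call prints a listing that contains the "../" link, the second does not; the file entries are the same in both.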

In my copy of the class I have exactly this line, but patching and
recompiling is not a simple way to do it.

Is there another method?

thank you

alessio

On 11 March 2012 18:32, Lewis John Mcgibbney <
[EMAIL PROTECTED]> wrote:

> Please see below
>
> On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi <
> [EMAIL PROTECTED]> wrote:
>
> >
> > [1]
> http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
> >
>
> I've now updated this link, thanks for pointing this out.
>
>
> > And now I have another problem:
> > I crawled my local file system: a directory with a lot of PDF files.
> > Everything works, and Nutch indexes the results in Solr.
> >
>
> OK
>
>
> > But this is the problem: when I submit a query to Solr, I can see only a
> > list of files, and not the PDF contents.
> > Why, in your opinion?
> >
>
> Well, this might be to do with your file.content.limit in nutch-site.xml;
> maybe your documents are being truncated if they are too large.
> Additionally, your Solr mappings and/or schema configuration may need to be
> tweaked slightly to permit you to view snippets of the PDF content within
> your Solr search results. In your schema configuration for index-basic, try
> changing
>
> <field name="content" type="text" stored="false" indexed="true"/>
>
> to
>
> <field name="content" type="text" stored="true" indexed="true"/>
>
>
> You will need to reindex your content if you wish to see the results
> through Solr.
>