Solr, mail # user - Solr and Nutch/Droids - to use or not to use?


MitchK 2010-06-16, 15:27
Re: Solr and Nutch/Droids - to use or not to use?
Otis Gospodnetic 2010-06-16, 15:37
My quick feedback would be:
Try using Nutch first, because it is a more complete "platform".  From what I know, Droids is just the crawler with an in-memory queue + link extractor.  We did use it for crawling Lucene project sites (for the index on http://search-lucene.com/ ), but that is because the data volume is low, the crawl very narrow, scaling requirements low, etc.

 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: MitchK <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wed, June 16, 2010 11:27:20 AM
> Subject: Solr and Nutch/Droids - to use or not to use?
>
Hello community,

from several discussions about Solr and Nutch, I have some questions about a virtual web search engine.
I know I posted this message to the mailing list a few days ago, but the thread got injected and I did not get any more replies on the topic, so I am trying to reopen it; hopefully no one gets upset here :-).
Please bear with me. Thank you.

The requirements:
I. I need a scalable solution for a growing index that becomes larger than one machine can handle. If I add more hardware, I want performance to improve linearly.

II. I want to use technologies like the OPIC algorithm (the default algorithm in Nutch) or PageRank or... whatever is out there to improve the ranking of the webpages.

III. I want to be able to easily add more fields to my documents. Imagine one retrieves information from a webpage's content; then I want to make it searchable.

IV. While fetching my data, I want to make special searches possible. For example, I want to retrieve pictures from a webpage and index picture-related content into another search index, plus I want to save a small thumbnail of the picture itself. Btw: this is (as far as I know) not possible with Solr, because Solr was not intended to do such special indexing logic.

V. I want to use filter queries (e.g. the main query "christopher lee" returns 1.5 million results and the user then searches for "action" within them: the main query would become the filter query and "action" would be the actual query). That way a search within search results would be easy to offer.

VI. I want to be able to use different logic for different pages. Maybe I have a pool of 100 domains that I know better than others, and I have special scripts that retrieve more specific information from those 100 domains. Then I want to apply my special logic to those 100 domains, but every other domain should use the default logic.

-----------------

The project is only virtual. So why am I asking?
I want to learn more about web search and I would like to gain some new experience.

What do I know about Solr + Nutch:
As stated on lucidimagination.com, Solr + Nutch does not scale if the index is too large.
The article was somewhat older and I don't know whether this problem has been fixed by Solr's new distributed capabilities.

Furthermore, I don't want to index the pages with Nutch and reindex them with Solr.
The only exception would be: if the content of a webpage gets indexed by Nutch, I want to use the already tokenized content of the body with some Solr copyField operations to extend the search (e.g. making fuzzy search possible). At the moment I don't think this is possible.

I don't know much about the Droids project and how well it is documented.
But from what I can read in some of Otis's posts, it seems to be usable as a crawler framework.
Pros for Nutch are: It is very scalable! Thanks to Hadoop and MapReduce it is a scaling monster (from what I've read).

Cons: The search is not as rich as is possible with Solr. Extending Nutch's search abilities *seems* to be more complicated than with Solr. Furthermore, if I want to use Solr to search Nutch's index, looking at my requirements I would need to reindex the whole thing - without

What I don't know at the moment is, how it is
like in II. mentioned with Solr.

I hope
be the
indexing.
Where should I dive deeper?
Solr + Droids?
Solr + Nutch?
Nutch + howToExtendNutchToMakeSearchBetter?
Thanks for the discussion!
- Mitch
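
Requirement V above (searching within earlier results) maps fairly directly onto Solr's standard `q` and `fq` request parameters. The following is only a minimal sketch of how such a request URL could be built. The helper name `build_search_within_results_url`, the host, and the `/solr/select` path are assumptions made for illustration and are not taken from the thread.

```python
from urllib.parse import urlencode


def build_search_within_results_url(base_url, previous_query, refinement):
    """Build a Solr select URL that searches `refinement` inside the
    result set of `previous_query` (hypothetical helper)."""
    params = [
        ("q", refinement),       # scored query: the new search terms
        ("fq", previous_query),  # filter query: restricts the result set only
    ]
    return base_url + "?" + urlencode(params)


# Assumed local Solr instance; adjust host, port, and path to your setup.
url = build_search_within_results_url(
    "http://localhost:8983/solr/select",
    '"christopher lee"',
    "action",
)
print(url)
```

Because a filter query restricts the result set without contributing to the relevance score, the results are ranked by how well they match "action" only, which matches the behaviour described in the requirement.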
MitchK 2010-06-16, 17:37
Otis Gospodnetic 2010-06-16, 17:50
MitchK 2010-06-16, 18:27
Otis Gospodnetic 2010-06-16, 19:02
Markus Jelsma 2010-06-16, 19:31
Otis Gospodnetic 2010-06-16, 19:40
Markus Jelsma 2010-06-16, 19:53
MitchK 2010-06-17, 05:52
Otis Gospodnetic 2010-06-17, 06:45
MitchK 2010-06-17, 08:15
Otis Gospodnetic 2010-06-17, 12:34
MitchK 2010-06-17, 16:07
Otis Gospodnetic 2010-06-17, 21:17
MitchK 2010-06-17, 22:03
MitchK 2010-06-12, 11:41
MitchK 2010-06-14, 13:30