Re: anyone use hadoop+solr?
Andrzej Bialecki 2010-09-06, 10:11
On 2010-09-04 19:53, MitchK wrote:
> this topic started a few months ago, however there are some questions on
> my side that I couldn't answer by looking at the SOLR-1301 issue nor the
> Let me try to explain my thoughts:
> Given: a Hadoop-cluster, a solr-search-cluster and nutch as a
> crawling-engine which also performs LinkRank and webgraph-related tasks.
> Once a list of documents is created by nutch, you put the list + the
> LinkRank-values etc. into a Solr+Hadoop-job like it is described in
> Solr-1301 to index or reindex the given documents.
There is no out-of-the-box integration between Nutch and SOLR-1301, so
there is a step that you omitted from this chain... e.g. "export from
Nutch segments to CSV".
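To make that missing step concrete, here is a hypothetical sketch of what
"export to CSV" could look like: formatting a crawled document's fields
(id, title, LinkRank score - all illustrative names, not actual Nutch API)
as one RFC 4180-quoted CSV row that a SOLR-1301-style indexing job could
then consume. This is an assumption about the shape of the export, not
code from either project.

```java
import java.util.List;

// Hypothetical sketch of the "export from Nutch segments to CSV" step.
// Field names and values are illustrative only.
public class CsvExport {
    // Quote a field per RFC 4180: wrap in quotes, double any embedded quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join already-quoted fields into a single CSV row.
    static String toCsvRow(List<String> fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) row.append(',');
            row.append(quote(fields.get(i)));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // e.g. one document: url (used as docId), title, LinkRank score
        String row = toCsvRow(List.of("http://example.com/",
                                      "Example \"home\" page",
                                      "0.42"));
        System.out.println(row);
    }
}
```

One such row per document, plus a header line matching the Solr schema
fields, would be enough for a CSV-driven indexing job to pick up.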
> When the shards are built, they will be sent over the network to the
> Is this description correct?
Not really. SOLR-1301 doesn't deal with how you deploy the results of
indexing - it simply creates the shards on HDFS. It produces the index
data only; serving that data is outside its scope...
> What makes me wonder is:
> Assumed I got a Document X on machine Y in shard Y...
> When I reindex that document X together with lots of other documents that
> are present or not present in Shard Y... and I put the resulting shard on a
> machine Z, how does machine Y notice that it has got an older version of
> document X than machine Z?
> Furthermore: Go on and assume that the shard Y was replicated to three other
> machines, how do they all notice, that their version of document X is not
> the newest available one?
> In such an environment, we do not have a master (right?), so far: How to
> keep the index as consistent as possible?
It's not possible to do it like this, at least for now...
Looking into the future: eventually, when SolrCloud arrives we will be
able to index straight to a SolrCloud cluster, assigning documents to
shards through a hashing scheme (e.g. 'md5(docId) % numShards'). Since
shards would be assigned in a consistent way, newer versions of
documents would end up in the same shards and replace the older
versions - thus the problem would be solved.
An additional benefit of this model is that it's not a disruptive,
copy-intensive operation like SOLR-1301 (where you have to "create
new indexes, deploy them and switch") but rather a regular online update,
which Solr already supports.
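The hashing scheme above can be sketched in a few lines - this is just an
illustration of 'md5(docId) % numShards', not code from SolrCloud itself;
the class and method names are made up for the example:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of consistent shard assignment: md5(docId) % numShards.
public class ShardAssigner {
    // Returns a shard index in [0, numShards) for the given document id.
    static int shardFor(String docId, int numShards) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(docId.getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a non-negative integer, then reduce it.
            BigInteger hash = new BigInteger(1, digest);
            return hash.mod(BigInteger.valueOf(numShards)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required by the JDK", e);
        }
    }

    public static void main(String[] args) {
        // The same docId always maps to the same shard, so a reindexed
        // document lands where its older version lives and replaces it.
        int shard = shardFor("http://example.com/page1", 4);
        System.out.println("shard = " + shard);
    }
}
```

The key property is determinism: reindexing document X can never create a
stale duplicate on another machine, which is exactly the consistency
problem raised above.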
Once this is in place, we can modify Nutch to send documents directly to
a SolrCloud cluster. Until then, you need to build and deploy indexes
more or less manually (or using Katta, but again Katta is not integrated
with Solr). SolrCloud is not far away from hitting the trunk (right,
Mark? ;) ), so medium-term I think this is your best bet.
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com