Re: anyone use hadoop+solr?
Andrzej Bialecki 2010-09-06, 10:11
On 2010-09-04 19:53, MitchK wrote:
> Hi,
>
> this topic started a few months ago, however there are some questions from
> my side, that I couldn't answer by looking at the SOLR-1301-issue nor the
> wiki-pages.
>
> Let me try to explain my thoughts:
> Given: a Hadoop-cluster, a solr-search-cluster and nutch as a
> crawling-engine which also performs LinkRank and webgraph-related tasks.
>
> Once a list of documents is created by nutch, you put the list + the
> LinkRank-values etc. into a Solr+Hadoop-job like it is described in
> Solr-1301 to index or reindex the given documents.

There is no out of the box integration between Nutch and SOLR-1301, so there is some step that you omitted from this chain... e.g. "export from Nutch segments to CSV".

> When the shards are built, they will be sent over the network to the
> solr-search-cluster.
> Is this description correct?

Not really. SOLR-1301 doesn't deal with how you deploy the results of indexing. It simply creates the shards on HDFS. SOLR-1301 just creates the index data - it doesn't deal with serving the data...

>
> What makes me thinking is:
> Assumed I got a Document X on machine Y in shard Y...
> When I reindex that document X together with lots of other documents that
> are present or not present in Shard Y... and I put the resulting shard on a
> machine Z, how does machine Y notice that it has got an older version of
> document X than machine Z?
>
> Furthermore: Go on and assume that the shard Y was replicated to three other
> machines, how do they all notice, that their version of document X is not
> the newest available one?
> In such an environment, we do not have a master (right?), so far: How to
> keep the index as consistent as possible?

It's not possible to do it like this, at least for now...

Looking into the future: eventually, when SolrCloud arrives we will be able to index straight to a SolrCloud cluster, assigning documents to shards through a hashing schema (e.g. 'md5(docId) % numShards').
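To make the 'md5(docId) % numShards' idea concrete, here is a minimal sketch in Python. It is only an illustration of the hashing scheme, not SolrCloud's actual routing code: the names shard_for and index_doc are hypothetical, and the "shards" are plain in-memory dicts standing in for real Solr indexes.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard index as md5(docId) % numShards (illustrative helper)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Interpret the 16-byte digest as one big integer, then reduce it.
    return int.from_bytes(digest, "big") % num_shards


def index_doc(shards: list, doc_id: str, content: str) -> int:
    """Add or overwrite a document in the shard chosen by its id.

    `shards` is a list of dicts, a toy stand-in for real shard indexes.
    Returns the index of the shard that received the document.
    """
    target = shard_for(doc_id, len(shards))
    shards[target][doc_id] = content  # same id -> same shard -> overwrite
    return target


if __name__ == "__main__":
    shards = [dict() for _ in range(4)]
    first = index_doc(shards, "doc-42", "old version")
    second = index_doc(shards, "doc-42", "new version")
    # Both versions were routed to the same shard, so only one copy exists.
    print(first == second, sum(len(s) for s in shards))
```

Because the shard is a pure function of the document id, a reindexed document always goes to the shard that holds its previous version. Note that this only holds while the number of shards stays fixed: changing numShards changes the result of the modulo for most ids.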
Since shards would be created in a consistent way, newer versions of documents would end up in the same shards and would replace the older versions of the same documents - thus the problem would be solved. An additional benefit of this model is that it's not a disruptive and copy-intensive operation like SOLR-1301 (where you have to do "create new indexes, deploy them and switch") but rather a regular online update that is already supported in Solr.

Once this is in place, we can modify Nutch to send documents directly to a SolrCloud cluster. Until then, you need to build and deploy indexes more or less manually (or using Katta, but again Katta is not integrated with Nutch).

SolrCloud is not far away from hitting the trunk (right, Mark? ;) ), so medium-term I think this is your best bet.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com