SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
Andrzej Bialecki 2010-09-06, 13:44
(I adjusted the subject to better reflect the content of this discussion).
On 2010-09-06 14:37, MitchK wrote:
> Thanks for your detailed feedback, Andrzej!
> From what I understood, SOLR-1301 becomes obsolete once Solr becomes
> cloud-ready, right?
Who knows... I certainly didn't expect this code to become so popular ;)
so even after SolrCloud becomes available it's likely that some people
will continue to use it. But SolrCloud should solve the original problem
that I tried to solve with this patch.
>> Looking into the future: eventually, when SolrCloud arrives we will be
>> able to index straight to a SolrCloud cluster, assigning documents to
>> shards through a hashing schema (e.g. 'md5(docId) % numShards')
> Hm, let's say md5(docId) would produce a value of 7 (it won't, but
> let's assume it).
> If I got a constant number of shards, the doc will be published to the same
> shard again and again.
> i.e.: 7 % numShards(5) = 2 -> doc 7 will be indexed at shard 2.
> A few days later the rest of the cluster is available, now it looks like
> 7 % numShards(10) = 7 -> doc 7 will be indexed at shard 7... and what
> about the older version at shard 2? I am no expert when it comes to
> cloud computing and the other stuff.
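The concern above is easy to demonstrate. A minimal Python sketch (illustrative only - this is not the actual SolrCloud routing code, and the document ids and shard counts are made up):

```python
# Naive doc-to-shard mapping: hash the id, take it modulo the shard count.
# When numShards changes, most documents map to a different shard and
# would have to be reindexed (or stale copies are left behind).
import hashlib

def shard_for(doc_id, num_shards):
    # Reduce the id to a stable integer via md5, then take the modulus.
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % num_shards

doc_ids = ["doc-%d" % i for i in range(1000)]
before = {d: shard_for(d, 5) for d in doc_ids}    # cluster with 5 shards
after = {d: shard_for(d, 10) for d in doc_ids}    # cluster grown to 10

moved = sum(1 for d in doc_ids if before[d] != after[d])
# About half of the docs move: h % 10 differs from h % 5 with prob. 1/2.
print("%d of %d docs changed shards" % (moved, len(doc_ids)))
```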
There are several possible solutions to this, and they all boil down to
the way you assign documents to shards... Keep in mind that nodes
(physical machines) can manage several shards, and the aggregate
collection of all unique shards across all nodes forms your whole index
- so there's also a related, but different issue, of how to assign
shards to nodes.
Here are some scenarios for how you can solve the doc-to-shard mapping
problem (note: I removed the issue of replication from the picture to
make this clearer):
a) keep the number of shards constant no matter how large the cluster
is. The mapping schema is then as simple as the one above. In this
scenario you create relatively small shards, so that a single physical
node can manage dozens of shards (each shard using one core, or perhaps
a more lightweight structure like MultiReader). This is also known as
micro-sharding. As the number of documents grows the size of each shard
will grow until you have to reduce the number of shards per node,
ultimately ending up with a single shard per node. After that, if your
collection continues to grow, you have to modify your hashing schema to
split some shards (and reindex some shards, or use an index splitter tool).
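A sketch of the micro-sharding scenario in (a): the doc-to-shard mapping stays fixed for the lifetime of the collection, and only whole shards move between nodes as the cluster grows. The shard count of 64 and the round-robin node assignment are made-up details for illustration; a real system would track the shard-to-node mapping explicitly.

```python
# Micro-sharding: numShards is fixed and large; growing the cluster only
# rebalances whole shards across nodes, so documents never change shards
# and no reindexing is needed until shards outgrow their nodes.
import hashlib

NUM_SHARDS = 64  # fixed for the collection's lifetime (assumed value)

def shard_for(doc_id):
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % NUM_SHARDS

def node_for(shard, nodes):
    # Deliberately trivial shard-to-node assignment for the sketch.
    return nodes[shard % len(nodes)]

s = shard_for("doc-42")
two_nodes = node_for(s, ["nodeA", "nodeB"])                     # 32 shards/node
four_nodes = node_for(s, ["nodeA", "nodeB", "nodeC", "nodeD"])  # 16 shards/node
# The shard id for "doc-42" is identical under both cluster layouts.
```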
b) use consistent hashing as the mapping schema to assign documents to a
changing number of shards. There are many explanations of this schema
on the net.
In this case, you can grow/shrink the number of shards (and their size)
as you see fit, incurring only a small reindexing cost.
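A minimal consistent-hashing sketch along the lines of (b), using a single ring point per shard to keep it short (real schemes place many virtual points per shard to balance load; the shard names here are made up):

```python
# Consistent hashing: shards are placed on a hash ring, and a document
# goes to the first shard clockwise from its own hash. Adding a shard
# only remaps the documents falling on the new shard's arc.
import bisect
import hashlib

def h(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, shards):
        self.points = sorted((h(s), s) for s in shards)
        self.keys = [p for p, _ in self.points]

    def shard_for(self, doc_id):
        # First ring point clockwise from the doc's hash, wrapping around.
        i = bisect.bisect(self.keys, h(doc_id)) % len(self.points)
        return self.points[i][1]

doc_ids = ["doc-%d" % i for i in range(1000)]
before = Ring(["shard-%d" % i for i in range(5)])
after = Ring(["shard-%d" % i for i in range(6)])   # one shard added

moved = sum(1 for d in doc_ids
            if before.shard_for(d) != after.shard_for(d))
# Only docs landing on the new shard's arc move; everything else stays put.
```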
> If you can point me to one or another reference where I can read about it,
> it would help me a lot, since I only want to understand how it works at the
> theoretical level.
> The problem with Solr is its lack of documentation in some classes and the
> lack of encapsulating some very complex things into different methods or
> extra classes. Of course, this is because it costs some extra time to do so,
> but it makes understanding and modifying things very complicated if you do
> not understand what's going on from a theoretical point of view.
In this case the lack of good docs and user-level API can be blamed on
the fact that this functionality is still under heavy development.
> Since the cloud-feature will be complex, a lack of documentation and no
> understanding of the theory behind the code will make contributing back
> very, very complicated.
For now, yes, it's an issue - though as soon as SolrCloud gets committed
I'm sure people will follow up with user-level convenience components
that will make it easier.
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com