-
Exposing Solr routing to SolrJ client
Per Steffensen 2012-03-12, 09:26
Hi
I believe Solr(Cloud) does some internal routing of update requests to make sure documents are stored in the correct core/shard, as decided by Solr's internal routing algorithm (I believe it basically finds out which core is the leader of the shard a given document belongs to, using shared information in ZK, info about the collection and hash(document.id)). All nice and cool.

I also believe realtime-gets are not forwarded internally in Solr through this routing algorithm, and that it is therefore "impossible" to do realtime-gets from a client: you don't know which core/shard to contact directly, because you don't know the routing algorithm. If I'm wrong, a few directions on how to do realtime-gets from a client against a Solr cluster containing many shards and collections would be very helpful. If I'm right, I think it would be very nice if the routing algorithm were somehow exposed to the client (in code reachable from SolrJ), so that you can do realtime-gets from a SolrJ-based client. Whether it is done automatically for you, or whether the client using SolrJ explicitly needs to call some code to find the core to contact, is not so important for now.

Such a solution would also make it possible to get rid of another performance-related "problem": most update requests have to be transported between JVMs twice to reach their destination. First from the client to some "random" Solr server, and then from this Solr server to the Solr server holding the core involved in the update. If routing information were available to the client, it could route its updates directly to the core involved in the update (the one currently playing the role of leader for the shard to which the routing algorithm maps the document).

ElasticSearch solves this problem with its "Node Client" (as opposed to the plain "Transport Client"). A node client is basically a real node in the system that just doesn't store documents, but has all the logic and shared information, e.g. the routing algorithm, available (see http://www.elasticsearch.org/guide/reference/java-api/client.html). It certainly doesn't have to be like that for Solr clients, but it would be nice if the routing logic were somehow available to SolrJ, so that it can send its updates (and realtime-gets) directly to the correct destination.

Hope to get some comments on this issue.

Regards, Per Steffensen
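The routing described in the message above (hash of the document id, mapped to a shard, mapped to that shard's current leader) can be pictured with a small sketch. This is only an illustration under stated assumptions, not Solr's actual code: it assumes the 32-bit hash space is cut into equal contiguous ranges, one per shard, and it keeps the shard-to-leader mapping in a plain map where a real client would read it from ZooKeeper. The class name, method names and the URL are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: maps a 32-bit document hash to a shard index and a leader URL. */
final class HashRangeRouterSketch {
    private final int numShards;
    // Assumption: a real client would keep this up to date from cluster state in ZooKeeper.
    private final Map<Integer, String> leaderUrlByShard = new HashMap<>();

    HashRangeRouterSketch(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    void setLeader(int shard, String leaderUrl) {
        leaderUrlByShard.put(shard, leaderUrl);
    }

    /** Returns the index of the equal-sized hash range that contains the given hash. */
    int shardFor(int hash) {
        long offset = (long) hash - Integer.MIN_VALUE;          // 0 .. 2^32 - 1
        long rangeSize = ((1L << 32) + numShards - 1) / numShards; // ceil(2^32 / numShards)
        return (int) (offset / rangeSize);
    }

    /** Returns the leader URL registered for the hash's shard, or null if none is known. */
    String leaderFor(int hash) {
        return leaderUrlByShard.get(shardFor(hash));
    }

    public static void main(String[] args) {
        HashRangeRouterSketch router = new HashRangeRouterSketch(4);
        router.setLeader(2, "http://host-c:8983/solr/collection1"); // hypothetical URL
        System.out.println("shard=" + router.shardFor(0) + " leader=" + router.leaderFor(0));
    }
}
```

With such a mapping on the client, both an update and a realtime-get for a given id could go straight to the leader instead of taking the extra hop through a "random" node.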
-
Re: Exposing Solr routing to SolrJ client
Mark Miller 2012-03-12, 12:15
Hey Per,
A couple things:

1. Distributed realtime get is coming - I know Yonik was looking at this recently but got caught up in some other things.

2. There is a SolrJ client that is aware of the cluster state - it's called CloudSolrServer. You give it the ZooKeeper address rather than a node's address. Currently it doesn't send directly to the leader, but this is planned - it's a little tricky due to lack of access to the Schema for hashing, but likely coming soon - there is a JIRA issue for it. Clients in other languages should be able to do the same thing.

- Mark Miller
lucidimagination.com
-
Re: Exposing Solr routing to SolrJ client
Per Steffensen 2012-03-12, 13:39
Mark Miller wrote:
> A couple things:
>
> 1. Distributed realtime get is coming - I know Yonik was looking at this recently but got caught up in some other things.

Fantastic! I believe that, if the client becomes "routing aware", distributed realtime get is only necessary when you send more than one id (using "ids") in your realtime-get request. Even then, the distribution (to several Solr servers, plus merging of the results from those) could happen in the client (or not, if you don't think that is appropriate).

> 2. There is a Solrj client that is aware of the cluster state - its called CloudSolrServer. You give it the zookeeper address rather than a node's address. Currently it doesn't send directly to the leader, but this is planned

Nice! So you plan to solve the "two hop" problem (as ElasticSearch calls it) that I was mentioning! http://www.elasticsearch.org/guide/reference/java-api/client.html

> - it's a little tricky due to lack of access to the Schema for hashing, but likely coming soon - there is a JIRA issue for it. Clients in other languages should be able to do the same thing.

But can I do realtime-get from a SolrJ client already, then? You say that CloudSolrServer does not go directly to the leader yet, and if I am correct in claiming that realtime-get (/get) requests are not routed to the leader on the server side, then I will still not be able to do realtime-get using CloudSolrServer. Am I correct that I can't do it yet, even using CloudSolrServer?

BTW, congratulations and thanks for the terrific work you guys are doing on Solr(Cloud)! I hope to get to contribute "versioning" (for optimistic locking) and a "unique key" feature that allows the operation to fail if the document already exists (instead of just automatically deleting what is already there).
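The client-side handling of a multi-id realtime-get mentioned in this message (split the ids per shard, ask each shard, merge the answers) could look roughly like the sketch below. It is a hedged illustration only: the routing function is passed in as a parameter rather than taken from Solr, documents are represented as plain strings, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

/** Illustrative only: client-side scatter/gather bookkeeping for a multi-id get. */
final class MultiGetGroupingSketch {

    /** Groups the requested ids by the shard that the given routing function assigns them to. */
    static Map<Integer, List<String>> groupByShard(List<String> ids, ToIntFunction<String> shardOf) {
        Map<Integer, List<String>> idsByShard = new TreeMap<>();
        for (String id : ids) {
            idsByShard.computeIfAbsent(shardOf.applyAsInt(id), shard -> new ArrayList<>()).add(id);
        }
        return idsByShard;
    }

    /** Merges per-shard answers back into request order, skipping ids that no shard returned. */
    static List<String> mergeInRequestOrder(List<String> ids, Map<String, String> docsById) {
        List<String> merged = new ArrayList<>();
        for (String id : ids) {
            String doc = docsById.get(id);
            if (doc != null) {
                merged.add(doc);
            }
        }
        return merged;
    }
}
```

Each group would then become one request to the leader of that shard, and the responses would be collected into the id-to-document map before merging.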
-
Re: Exposing Solr routing to SolrJ client
Mark Miller 2012-03-12, 14:24
On Mar 12, 2012, at 9:39 AM, Per Steffensen wrote:

> Am I correct that I can't do it yet, even using CloudSolrServer?

Right, you can't yet, even with CloudSolrServer - but I think it will be done soon - certainly before the 4 release anyway.

- Mark Miller
lucidimagination.com
-
Re: Exposing Solr routing to SolrJ client
Per Steffensen 2012-03-12, 20:54
> Right, you can't yet even with CloudSolrServer - but I think it will
> be done soon - certainly before the 4 release anyway.

Ok, I will cross my fingers for it to be done soon. Thanks for your kind help.

Regards, Steff
-
Re: Exposing Solr routing to SolrJ client
Yonik Seeley 2012-03-12, 21:32
On Mon, Mar 12, 2012 at 8:15 AM, Mark Miller <[EMAIL PROTECTED]> wrote:
> Currently it doesn't send directly to the leader, but this is planned - it's a little tricky due to lack of access to the Schema for hashing

Hmmm, why is this? Identification of the "uniqueKey" field? Maybe we just assume "id", or let the user configure it if it's something different. That should really be "best practice" along with sticking to normal java identifiers for field names.

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
-
Re: Exposing Solr routing to SolrJ client
Mark Miller 2012-03-12, 21:39
On Mar 12, 2012, at 5:32 PM, Yonik Seeley wrote:

> Hmmm, why is this? Identification of the "uniqueKey" field? Maybe we
> just assume "id", or let the user configure it if it's something
> different. That should really be "best practice" along with sticking
> to normal java identifiers for field names.

Yeah, for id my plan was to just let the user supply the field, perhaps defaulting to id.

The other issue is that we hash on the indexed value, though - which we get through a customizable field type method impl, last I recall. I think this tends to be the same as the raw text for the types we care about. But we have to make some assumptions - it's not really arbitrary support - though it should easily cover the current common types of numeric or string.

I think most impls end up using UnicodeUtil.UTF16toUTF8, and hopefully most toInternal methods simply return what is passed in (ie use the base class impl)...

  /** Given an indexed term, append the human readable representation*/
  public CharsRef indexedToReadable(BytesRef input, CharsRef output) {
    UnicodeUtil.UTF8toUTF16(input, output);
    return output;
  }

  public String toInternal(String val) {
    // - used in delete when a Term needs to be created.
    // - used by the default getTokenizer() and createField()
    return val;
  }

- Mark Miller
lucidimagination.com
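The assumption discussed here, that for a string id the indexed value is simply the UTF-8 encoding of the raw text, can be illustrated without any Solr classes. The sketch below is a stand-in using only the JDK: it assumes toInternal is the identity (as in the base class shown above) and that the JDK's UTF-8 encoder yields the same bytes as the Lucene utility for well-formed strings. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: raw string id to indexed bytes and back, assuming an identity toInternal. */
final class StringIdBytesSketch {

    /** The bytes a client would hash, under the assumption that indexed form == UTF-8 of the raw id. */
    static byte[] toIndexedBytes(String rawId) {
        return rawId.getBytes(StandardCharsets.UTF_8);
    }

    /** The inverse direction, mirroring what indexedToReadable does for string-like types. */
    static String toReadable(byte[] indexedBytes) {
        return new String(indexedBytes, StandardCharsets.UTF_8);
    }
}
```

A field type that overrides toInternal (or encodes numbers in a non-textual form) would break this equivalence, which is why a client that hashes on its own has to restrict the id types it supports.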
-
Re: Exposing Solr routing to SolrJ client
Yonik Seeley 2012-03-12, 21:46
On Mon, Mar 12, 2012 at 5:39 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
> Yeah, for id my plan was just let the user supply the field, perhaps default to id. The other issue is that we hash on the indexed value though - which we get though a customizable field type method impl last I recall. I think this tends to be the same as the raw text for the types we care about. But we have to make some assumptions - it's not really arbitrary support - though it should easily cover the current common types of numeric or string. I think most impls end up using UnicodeUtil.UTF16toUTF8 and hopefully most toInternal methods simply return what is passed in (ie use the base class impl)...

Non "string" (or compatible) ids are more trouble than they are worth... and since cloud is new, I think it would be fine to say "use a string id".

-Yonik
-
Re: Exposing Solr routing to SolrJ client
Per Steffensen 2012-03-14, 12:43
FYI (if it is of any interest), we just hacked CloudSolrServer locally to support routing of realtime-get requests. Limitations are:
- Only the "id" parameter, and not the "ids" parameter, is supported in realtime-get requests.
- Only schemas with uniqueKey on a field named "id" are supported, and only an "id" field of type string.

We did this to be able to start performance tests on our own system, which builds on SolrCloud. The performance of our system depends on being able to do realtime-gets from the client (our system), because we often update documents very quickly after they have been indexed for the first time (and we run with soft-commit = 1 sec - we can't wait for that). We use the "version control" (for optimistic locking) and the "unique key constraint where you fail instead of overwrite if document already exists" (http://wiki.apache.org/solr/Per%20Steffensen/Update%20semantics) in our highly concurrent performance test, so their performance will be tested as well.

What we did in CloudSolrServer was:

* Added the following to the request method, between the "if (collection == null)" statement and the "LBHttpSolrServer.Req req = new LBHttpSolrServer.Req(request, urlList);" statement:

  List<String> urlList = new ArrayList<String>();
  if ("/get".equals(reqParams.get(CommonParams.QT))) {
    // Realtime-get: send the request straight to the leader of the shard owning this id.
    // Only the single "id" parameter is handled here, not "ids".
    String id = reqParams.get("id");
    int hash = hash(id);
    String shardId = getShard(hash, collection, cloudState);
    ZkCoreNodeProps leaderProps = null;
    try {
      leaderProps = new ZkCoreNodeProps(zkStateReader.getLeaderProps(collection, shardId));
    } catch (InterruptedException ie) {
      throw new SolrServerException(ie);
    }
    String fullUrl = ensureUrlHasProtocolIdentifier(leaderProps.getCoreUrl());
    urlList.add(fullUrl);
  } else {
    // ... stuff that was already in request between the "if (collection == null)"
    // statement and the "LBHttpSolrServer.Req req = ..." statement
  }

* Added the following helper methods (stolen from DistributedUpdateProcessor etc.):

  private String ensureUrlHasProtocolIdentifier(String url) {
    if (!url.startsWith("http://") && !url.startsWith("https://")) {
      url = "http://" + url;
    }
    return url;
  }

  private String getShard(int hash, String collection, CloudState cloudState) {
    return cloudState.getShard(hash, collection);
  }

  // Hashes the UTF-8 form of the id, which matches the indexed value for string ids.
  private int hash(String id) {
    BytesRef indexedId = new BytesRef();
    UnicodeUtil.UTF16toUTF8(id, 0, id.length(), indexedId);
    return Hash.murmurhash3_x86_32(indexedId.bytes, indexedId.offset, indexedId.length, 0);
  }

It seems to work for us, but we very much look forward to the "real" solution.

Regards, Per Steffensen
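For clients that cannot reuse Solr's classes, the hash step in the helper above can be approximated with the JDK alone. The following is a sketch, not Solr's Hash class: it is an independent implementation of the public MurmurHash3 x86_32 algorithm with seed 0, applied to the JDK's UTF-8 encoding of the id, and it assumes that encoding matches what UnicodeUtil.UTF16toUTF8 produces for well-formed strings. The names Murmur3Sketch and hashId are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: MurmurHash3 x86_32 over the UTF-8 bytes of a string id. */
final class Murmur3Sketch {

    static int murmurhash3_x86_32(byte[] data, int offset, int len, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int roundedEnd = offset + (len & 0xfffffffc); // end of the last full 4-byte block

        // Body: mix in one little-endian 4-byte block at a time.
        for (int i = offset; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                   | ((data[i + 2] & 0xff) << 16) | (data[i + 3] << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: mix in the remaining 1 to 3 bytes, if any.
        int k1 = 0;
        switch (len & 3) {
            case 3:
                k1 = (data[roundedEnd + 2] & 0xff) << 16;
                // fall through
            case 2:
                k1 |= (data[roundedEnd + 1] & 0xff) << 8;
                // fall through
            case 1:
                k1 |= (data[roundedEnd] & 0xff);
                k1 *= c1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= c2;
                h1 ^= k1;
                break;
            default:
                break;
        }

        // Finalization: fold in the length and avalanche the bits.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    /** Hashes a string id the way the helper above does: UTF-8 bytes, seed 0. */
    static int hashId(String id) {
        byte[] utf8 = id.getBytes(StandardCharsets.UTF_8);
        return murmurhash3_x86_32(utf8, 0, utf8.length, 0);
    }
}
```

Whether the resulting value selects the same shard as the server depends on the client also using the same hash-range assignment as the cluster state, so a sketch like this is only useful together with shard information read from ZooKeeper.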