Replication issues after machine failure
Jamie Johnson 2012-05-11, 20:55
I've had a few instances where a machine needed to be restored from a prior state. After doing so and starting Solr again, replication sometimes doesn't seem to work properly. I have not seen any failures in the logs (I will have to keep a closer eye on this), but when this happens and I execute a query against each core with distrib=false, I see the following counts:

Shard @ host1(shard1) returned 95150
Shard @ host2(shard1) returned 95150
Shard @ host2(shard4) returned 94311
Shard @ host3(shard4) returned 8468
Shard @ host3(shard5) returned 8303
Shard @ host1(shard5) returned 96054
Shard @ host1(shard2) returned 95620
Shard @ host2(shard2) returned 95620
Shard @ host2(shard3) returned 93195
Shard @ host3(shard3) returned 8336
Shard @ host3(shard6) returned 8309
Shard @ host1(shard6) returned 96036

In this case host3 is the machine that failed, and as you can see every core on host3 has significantly fewer documents than the leader has. Has anyone else experienced this?
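For anyone who wants to repeat this check, here is a rough sketch of the comparison in Python. It only groups the counts listed above by shard and flags replicas that fall well behind the largest count for that shard. The function names, the 5% tolerance, and the example port and core path are my own assumptions for illustration, not anything Solr provides; only the distrib=false, rows, q and wt parameters are standard.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Counts from the post: (host, shard) -> documents found with distrib=false.
COUNTS = {
    ("host1", "shard1"): 95150, ("host2", "shard1"): 95150,
    ("host2", "shard4"): 94311, ("host3", "shard4"): 8468,
    ("host3", "shard5"): 8303,  ("host1", "shard5"): 96054,
    ("host1", "shard2"): 95620, ("host2", "shard2"): 95620,
    ("host2", "shard3"): 93195, ("host3", "shard3"): 8336,
    ("host3", "shard6"): 8309,  ("host1", "shard6"): 96036,
}


def shard_query_url(base_url):
    """Build a non-distributed match-all count query for a single core.

    base_url is a hypothetical core URL, e.g. "http://host3:8983/solr/core1";
    the port and core name are assumptions, adjust them to your setup.
    """
    params = {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def find_lagging_replicas(counts, tolerance=0.05):
    """Return (shard, host, count, best) for replicas far behind their shard.

    A replica is flagged when its count is more than `tolerance` (a fraction,
    chosen arbitrarily here) below the largest count seen for that shard.
    """
    by_shard = defaultdict(dict)
    for (host, shard), found in counts.items():
        by_shard[shard][host] = found

    lagging = []
    for shard, hosts in sorted(by_shard.items()):
        best = max(hosts.values())
        for host, found in sorted(hosts.items()):
            if best > 0 and (best - found) / best > tolerance:
                lagging.append((shard, host, found, best))
    return lagging


if __name__ == "__main__":
    for shard, host, found, best in find_lagging_replicas(COUNTS):
        print(f"{shard} on {host}: {found} docs, best replica has {best}")
```

The small tolerance is there so that replicas which differ only by a few in-flight updates are not reported; with the numbers above, only the host3 replicas are flagged.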
Re: Replication issues after machine failure
Mark Miller 2012-05-12, 03:08
So it's easy to reproduce? What do you mean by "restored from a prior state"?
What snapshot are you on these days, for future reference?
You have double-checked to make sure that shard is listed as ACTIVE, right?
- Mark Miller lucidimagination.com
Re: Replication issues after machine failure
Jamie Johnson 2012-05-13, 02:35
I have not tried to reproduce it yet but hope to do so Monday. The machine that had the issue was a VM outside my control, so I'm not certain how it was restored. I am using a fairly recent nightly build from within the last few weeks.
Re: Replication issues after machine failure
Jamie Johnson 2012-05-13, 02:45
Sorry, hit send too fast. The shards were listed as active. Also, the Solr instances were still running, but the file system they wrote to had become read-only. I thought that would make replication fail, and that once the issue was fixed and Solr restarted, replication would then succeed. Am I hitting some fringe case?