Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not a Bug
Version: Cheshire-Cat
Build: CB EE 7.0.0-5050
Triage: Untriaged
Environment: Centos 64-bit
Description
Filing this after talking to Meni Hillel and Steve Watanabe this morning, based on the comment at:
https://issues.couchbase.com/browse/MB-37842?focusedCommentId=494107&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-494107
Steps to reproduce
1. Set up a 3-node cluster: .215, .217, .219
2. Create a partial split-brain around .217 by blocking traffic from .219 on .215 and vice versa (one way to do this with iptables is sketched below)
So now we have two cluster halves:
First half - .215 and .217 with .219 unresponsive
Second half - .217 and .219 with .215 unresponsive
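For reference, one way to create this partial partition is to drop traffic between .215 and .219 on both of those nodes, leaving .217 untouched. This is just an illustrative sketch using iptables; the actual reproduction may have used a different mechanism.
# On .215: drop all traffic to/from .219
iptables -A INPUT -s 172.23.105.219 -j DROP
iptables -A OUTPUT -d 172.23.105.219 -j DROP
# On .219: drop all traffic to/from .215
iptables -A INPUT -s 172.23.105.215 -j DROP
iptables -A OUTPUT -d 172.23.105.215 -j DROP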
It now appears that we can't fail over the .215 (orchestrator) node out of the cluster, even unsafely, as all of the following options fail.
Option 1: Unsafe failover request to .215/.217
curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'
will fail as:
Unexpected server error: {error,
                          {not_in_peers,'ns_1@172.23.105.215',
                           ['ns_1@172.23.105.217',
This is because .217 and .215 can still see each other, so both consider the orchestrator (.215) healthy.
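One way to confirm how the first half sees .215 (a suggested check, not part of the original report) is to ask .217 for its view of node health via the pools REST endpoint, assuming jq is available; .215 should show up there as active/healthy:
curl -s -u Administrator:password http://172.23.105.217:8091/pools/default | \
  jq '.nodes[] | {hostname, clusterMembership, status}'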
Option 2: Unsafe failover request to .219
curl -v -X POST -u Administrator:password http://172.23.105.219:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'
never returns a response, as the request ultimately gets routed to .215, which .219 cannot communicate with.
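When reproducing this, it may be worth bounding the request so it doesn't hang indefinitely; this is just a small tweak on my part, not from the original report:
# Same request as above, but give up after 30 seconds instead of hanging
curl -v --max-time 30 -X POST -u Administrator:password http://172.23.105.219:8091/controller/failOver \
  -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'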
Option 3: Do a regular hard failover of .215 by making a request to the .217 node
curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215'
will fail as:
Failover exited with reason {{badmatch,
                              {error,
                               {no_quorum,
                                [{required_quorum,
                                  [majority,
                                   {majority,
                                    {set,2,16,16,8,80,48,
                                     {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                      [],[],[]},
                                     {{[],[],[],[],[],[],[],
                                       ['ns_1@172.23.105.217'],
                                       [],
                                       ['ns_1@172.23.105.219'],
                                       [],[],[],[],[],[]}}}}]},
                                 {leases,
                                  ['ns_1@172.23.105.215',
                                   'ns_1@172.23.105.217']}]}}},
                             [{failover,deactivate_nodes,2,
(Note that we can unsafe-failover .219; that works fine. The issue here is that we can't get .215, the orchestrator, out of the cluster.)
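For contrast, the unsafe failover that does work is presumably the analogous request targeting .219. The node it was issued against is not recorded above, so .215 in this sketch is only an illustrative guess:
curl -v -X POST -u Administrator:password http://172.23.105.215:8091/controller/failOver \
  -d 'otpNode=ns_1@172.23.105.219' -d 'allowUnsafe=true'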
Logs attached
Thanks for this Sumedh Basarkod. These failures are an artifact of the situation: we still have an orchestrator, combined with the asymmetric network partition and the fact that we require the nodes that survive a quorum failover to be up.
The following remedies should work in this situation, I would think.
Did you get a chance to test these?