Details
Bug
Resolution: Not a Bug
Major
Cheshire-Cat
CB EE 7.0.0-5050
Untriaged
Centos 64-bit
1
No
Description
Filing this after talking to Meni Hillel and Steve Watanabe this morning, based on the comment at:
https://issues.couchbase.com/browse/MB-37842?focusedCommentId=494107&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-494107
Steps to reproduce
1. 3-node cluster: .215, .217, .219
2. Create a split brain around .217 by blocking traffic from .219 on .215 and vice versa
So now we have two cluster halves:
First half - .215 and .217 with .219 unresponsive
Second half - .217 and .219 with .215 unresponsive
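For anyone reproducing step 2, here is a minimal sketch of the partition. It assumes the traffic was blocked with iptables, which this report does not actually state. The function name partition_rules is made up for illustration, and it only prints the rules so they can be reviewed before being run as root on each node.

```shell
#!/bin/sh
# Print (do not run) the iptables rules that hide two nodes from each other.
# Assumption: iptables was the blocking mechanism; the report does not say.
# partition_rules is a hypothetical helper, not part of any Couchbase tool.
partition_rules() {
    node_a="$1"
    node_b="$2"
    # To be run on node_a: drop everything arriving from node_b.
    echo "on ${node_a}: iptables -A INPUT -s ${node_b} -j DROP"
    # To be run on node_b: drop everything arriving from node_a.
    echo "on ${node_b}: iptables -A INPUT -s ${node_a} -j DROP"
}

# .215 and .219 stop seeing each other; .217 still sees both.
partition_rules 172.23.105.215 172.23.105.219
```

Replacing -A with -D in the same rules removes them again and heals the partition.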
Now it appears that we can't (unsafe) fail over the .215 (orchestrator) node out of the cluster, because each of the following options fails
Option 1: Unsafe failover request to .215/.217
curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'
will fail as:
Unexpected server error: {error,
    {not_in_peers,'ns_1@172.23.105.215',
        ['ns_1@172.23.105.217',
as .217 and .215 think that orchestrator (.215) is healthy
Option 2: Unsafe failover request to .219
curl -v -X POST -u Administrator:password http://172.23.105.219:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'
won't return a response, because the request is ultimately routed to .215, which .219 cannot reach.
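Because this request never returns, it can help to bound the wait when scripting the repro. The sketch below builds the same request with curl's --max-time option. The helper name failover_cmd and the 30-second default are illustrative assumptions, and the function only prints the command rather than sending it.

```shell
#!/bin/sh
# Build the unsafe-failover request from this report with a hard time limit,
# so a request that never gets a response hands control back to the caller.
# failover_cmd and TIMEOUT_SECS are hypothetical names; 30 is an arbitrary value.
failover_cmd() {
    target_host="$1"   # node that receives the REST call
    otp_node="$2"      # node to fail over, e.g. ns_1@172.23.105.215
    printf "curl -v --max-time %s -X POST -u Administrator:password http://%s:8091/controller/failOver -d 'otpNode=%s' -d 'allowUnsafe=true'\n" \
        "${TIMEOUT_SECS:-30}" "$target_host" "$otp_node"
}

# Print the Option 2 request with the time limit added.
failover_cmd 172.23.105.219 ns_1@172.23.105.215
```

When the limit is hit, curl exits with status 28, which makes the hang easy to tell apart from an HTTP error in a test script.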
Option 3: Do a regular hard-failover of .215 by making a request to .217 node
curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215'
will fail as:
Failover exited with reason {{badmatch,
    {error,
        {no_quorum,
            [{required_quorum,
                 [majority,
                  {majority,
                   {set,2,16,16,8,80,48,
                    {[],[],[],[],[],[],[],[],[],[],[],[],[],
                     [],[],[]},
                    {{[],[],[],[],[],[],[],
                      ['ns_1@172.23.105.217'],
                      [],
                      ['ns_1@172.23.105.219'],
                      [],[],[],[],[],[]}}}}]},
             {leases,
                 ['ns_1@172.23.105.215',
                  'ns_1@172.23.105.217']}]}}},
    [{failover,deactivate_nodes,2,
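One possible reading of the no_quorum error, offered as an interpretation rather than a confirmed explanation: the failover needs a majority of the whole cluster and also a majority of the nodes that would remain (.217 and .219), but .217 only holds leases from .215 and .217. The sketch below checks both conditions with a simple strict-majority rule. The helper has_majority is hypothetical and is not how ns_server computes quorum.

```shell
#!/bin/sh
# Strict-majority check: do the held leases cover more than half of the members?
# has_majority is a hypothetical helper; the rule is an assumed reading of the trace.
has_majority() {
    leases="$1"    # space-separated nodes we hold leases from
    members="$2"   # space-separated nodes whose majority is required
    total=0
    held=0
    for m in $members; do
        total=$((total + 1))
        for l in $leases; do
            if [ "$l" = "$m" ]; then
                held=$((held + 1))
            fi
        done
    done
    # Majority means strictly more than half of the members.
    [ $((held * 2)) -gt "$total" ]
}

LEASES=".215 .217"          # from the {leases, ...} term in the trace
ALL_NODES=".215 .217 .219"  # plain 'majority' requirement
REMAINING=".217 .219"       # {majority, set} requirement

if has_majority "$LEASES" "$ALL_NODES"; then echo "all nodes: quorum ok"; else echo "all nodes: no quorum"; fi
if has_majority "$LEASES" "$REMAINING"; then echo "remaining nodes: quorum ok"; else echo "remaining nodes: no quorum"; fi
```

Under this reading, the first check passes with two of three nodes, while the second fails because a majority of a two-node set needs both nodes and there is no lease from .219.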
(Note that we can unsafe failover .219; that works fine. The issue here is that we can't get .215 (orch) out of the cluster)
Logs attached