Couchbase Server / MB-46011

Unsafe failover orchestrator after split brain


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: Cheshire-Cat
    • Fix Version/s: 7.0.0
    • Component/s: ns_server
    • Environment: CB EE 7.0.0-5050

    Description

      Filing this after talking to Meni Hillel and Steve Watanabe this morning, based on the comment at:
      https://issues.couchbase.com/browse/MB-37842?focusedCommentId=494107&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-494107

       

      Steps to reproduce
      1. 3-node cluster: .215, .217, .219
      2. Split-brain the cluster around .217 by blocking traffic from .219 on .215 and vice versa (one way to do this is sketched below).
      So now we have two cluster halves:
      First half: .215 and .217, with .219 unresponsive
      Second half: .217 and .219, with .215 unresponsive
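      For reference, a minimal sketch of one way to create this asymmetric partition, assuming iptables is available on the nodes (the exact commands used in this run are not recorded in the ticket):

      # on .215: drop all traffic coming from .219
      iptables -A INPUT -s 172.23.105.219 -j DROP
      # on .219: drop all traffic coming from .215
      iptables -A INPUT -s 172.23.105.215 -j DROP
      # traffic to and from .217 is left untouched, so .217 still sees both halves;
      # to heal the partition later, delete the same rules with -D in place of -A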
      Now it appears that we can't fail over the .215 (orchestrator) node out of the cluster, even unsafely, as the following options all fail.

      Option 1: Unsafe failover request to .215/.217

      curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'

      will fail as:

      Unexpected server error: {error,
                                   {not_in_peers,'ns_1@172.23.105.215',
                                       ['ns_1@172.23.105.217',

      as .217 and .215 both think that the orchestrator (.215) is healthy

      Option 2: Unsafe failover request to .219

      curl -v -X POST -u Administrator:password http://172.23.105.219:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'

      won't return a response, as the request ultimately gets routed to .215, which .219 cannot communicate with.

      Option 3: Do a regular hard failover of .215 by making a request to the .217 node

      curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' 

      will fail as:

      Failover exited with reason
          {{badmatch,
            {error,
             {no_quorum,
              [{required_quorum,
                [majority,
                 {majority,
                  {set,2,16,16,8,80,48,
                   {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                   {{[],[],[],[],[],[],[],
                     ['ns_1@172.23.105.217'],
                     [],
                     ['ns_1@172.23.105.219'],
                     [],[],[],[],[],[]}}}}]},
               {leases,
                ['ns_1@172.23.105.215',
                 'ns_1@172.23.105.217']}]}}},
           [{failover,deactivate_nodes,2,
      (Note that we can unsafely fail over .219; that works fine. The issue here is that we can't get .215 (the orchestrator) out of the cluster.)
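      For reference, the unsafe failover of .219 that does work is presumably a request along the same lines as the ones above, issued to .215 or .217 (the exact target node is an assumption, not recorded in the ticket):

      curl -v -X POST -u Administrator:password http://172.23.105.215:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.219' -d 'allowUnsafe=true'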

      Logs attached

      Attachments


        Activity

          dfinlay Dave Finlay added a comment -

          Thanks for this, Sumedh Basarkod. These failures are an artifact of the fact that we still have an orchestrator, combined with the asymmetric network partition and the requirement that the nodes that survive a quorum failover be up.

          The following remedies should work in this situation, I would think.

          • user regularly fails over .219 from .215 or .217
          • user stops .215 and then, from .217 or .219, regularly fails over .215 (see the command sketch after this comment)
          • user stops .217 and, from .215, unsafely fails over .217 and .219, or, from .219, unsafely fails over .215 and .217

          Did you get a chance to test these?
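          A minimal command sketch of the second remedy, by analogy with the requests above (the service-stop command and the choice of .217 as the target node are assumptions, not taken from the actual run):

          # on .215: stop Couchbase Server so the remaining nodes see it as down
          systemctl stop couchbase-server

          # then, from .217 (or .219), request a regular hard failover of .215
          curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215'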


          sumedh.basarkod Sumedh Basarkod added a comment -

          Thanks Dave Finlay. I was looking for a way to get .215 out of the cluster, leaving a 2-node cluster of .217 and .219, and the second remedy works for this (stopping .215 was the key, I believe). Resolving.

          dfinlay Dave Finlay added a comment -

          Thanks Sumedh.


          People

            Assignee: dfinlay Dave Finlay
            Reporter: sumedh.basarkod Sumedh Basarkod
            Votes: 0
            Watchers: 4
