Couchbase Server / MB-37842

Partition-failover of the first node in the cluster fails with "500 Internal server error; config sync failed"


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version: Cheshire-Cat
    • Fix Version: 7.1.0
    • Component: ns_server

    Description

      Steps to reproduce this bug are as follows:

      1. Create a 4-node cluster (10.112.194.101, 10.112.194.102, 10.112.194.103, 10.112.194.104, with 10.112.194.101 being the node that initiates the cluster creation).
      2. Isolate nodes 10.112.194.101 and 10.112.194.102 from each other. This introduces a network partition such that the two nodes cannot communicate with each other but can still communicate with all other nodes (a cleanup sketch for undoing the partition follows the steps).
        This can be done by executing the following commands on node 1 and node 2 respectively:

      iptables -A INPUT -s 10.112.194.102 -j DROP 

      iptables -A INPUT -s 10.112.194.101 -j DROP

       3. Hard fail over the first node with a REST call to the third node, e.g. from node 1 execute:

      curl -v -X POST -u Administrator:password http://10.112.194.103:8091/controller/failOver -d 'otpNode=ns_1@10.112.194.101'

      The failover fails with the above-mentioned error. Screenshots are attached.
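
      For reference, the partition introduced in step 2 can be reverted after the test by deleting the same iptables rules; this is a minimal cleanup sketch, assuming the rules were added exactly as shown above:

      # On node 1 (10.112.194.101): stop dropping traffic from node 2
      iptables -D INPUT -s 10.112.194.102 -j DROP

      # On node 2 (10.112.194.102): stop dropping traffic from node 1
      iptables -D INPUT -s 10.112.194.101 -j DROP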

      I originally found this intermittent bug when running the jepsen-durability-misc-daily-new tests (http://qa.sc.couchbase.com/job/jepsen-durability-misc-daily-new/) for the partition-failover workload, in runs where the failed-over node happens to be the first node in the cluster.
      The nemesis crashes because the failover fails with the above-mentioned error, and the test result is reported as "unknown". The config for that workload is as follows:

      workload=partition-failover,node-count=6,replicas=2,no-autofailover,kv-timeout=30,durability=0:100:0:0

      Note that this may fail or succeed depending on whether the failed-over node is the first node of the cluster or not.
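
      A quick way to confirm that node 1 was not actually failed over is to inspect the node list on an unpartitioned node. This is an illustrative sketch using the standard /pools/default endpoint and jq; it is not part of the original test:

      # Run against any node that can see the whole cluster, e.g. 10.112.194.103
      curl -s -u Administrator:password http://10.112.194.103:8091/pools/default | \
        jq '.nodes[] | {hostname, status, clusterMembership}'

      # A successfully hard-failed-over node shows clusterMembership "inactiveFailed";
      # after the 500 error, .101 is expected to still show "active".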


        Activity

          Richard deMellow added a comment:

          Sumedh Basarkod I don't believe this is a bug with the Jepsen tests, unless it's a misunderstanding of how the cluster config gets propagated during hard failover, which, in my understanding, should not need to be synced to all nodes in the cluster before the node is expelled from the topology by the master.

          Dave Finlay can the ns_server team give some input on whether this is a valid bug or a misunderstanding in the test?

          Aliaksey Artamonau (Inactive) added a comment:

          Node .101 is still the orchestrator node even with the introduced partition, since it can maintain leases with a majority of the cluster, so the failover request is directed to .101. Because of the weaknesses of ns_config, we do require the orchestrator to synchronize the config with all non-failed nodes before proceeding with the failover, and that is exactly what happens here. Note that even before the durability changes the behavior was more or less the same: you wouldn't get an explicit error, but vbucket activation would not happen until the orchestrator could talk to all non-failed nodes.

          This behavior is obviously undesirable, but it cannot be addressed until we move to the new quorum-based metadata system and change the failover code itself not to require all remaining nodes to be healthy (aka partial janitoring).
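
          For anyone reproducing this, one way to confirm which node currently holds the orchestrator role is the diag/eval endpoint. This is a common diagnostic call shown here as an illustrative sketch (it requires Full Admin credentials and diag/eval to be enabled), not something from the original report:

          # Ask any reachable node which node is running the ns_orchestrator process
          curl -s -u Administrator:password http://10.112.194.103:8091/diag/eval \
            -d 'node(global:whereis_name(ns_orchestrator)).'
          # Per the comment above, this should still point at 'ns_1@10.112.194.101'
          # despite the partition, since .101 keeps its lease with the majority.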
          Meni Hillel (Inactive) added a comment (edited):

          In the case of an orchestrator failover, the failed-over node needs to reach all of the remaining nodes to successfully complete the failover. In this case, the orchestrator is asked to fail over itself. With chronicle, we can relax the janitor to require only a majority instead. Thus, moving to cc.next.

          Sumedh Basarkod added a comment:

          Meni Hillel / Dave Finlay, this may be unrelated, but would the same be expected to work now in CC if we did an unsafe failover (instead of a regular hard failover)? I.e., if we have three nodes A, B, C and we introduce a split-brain partition around node B such that A and C cannot communicate with each other, we end up with two halves:
          First half: nodes A and B
          Second half: nodes B and C
          If we now wished to fail over the orchestrator node (A), we would have to do an unsafe failover, and I think that wouldn't work either.
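
          For reference, quorum-loss ("unsafe") failover in CC is exposed through the same REST endpoint with an extra flag. The call below is an illustrative sketch based on the documented allowUnsafe parameter; the placeholder host names are hypothetical and this was not run as part of this ticket:

          # Issue an unsafe hard failover of node A from a node on the surviving side
          curl -v -X POST -u Administrator:password \
            http://<surviving-node>:8091/controller/failOver \
            -d 'otpNode=ns_1@<node-A>' -d 'allowUnsafe=true'
          # After an unsafe failover, the nodes on the losing side typically have to be
          # re-initialized and added back to the cluster rather than simply rejoining.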

          People

            Assignee: Sumedh Basarkod
            Reporter: Sumedh Basarkod

