Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
6.5.0
ubuntu1604
Untriaged
Is this a Regression?: Yes
Sprint: KV Sprint 2019-12
Description
While running the following Jepsen test, which performs a graceful failover of a node and then re-adds it to the cluster using delta node recovery, we observed a rebalance hang during the failover stage of the test.
lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --package /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/install --workload=failover --failover-type=graceful --recovery-type=delta --replicas=2 --no-autofailover --disrupt-count=1 --rate=0 --durability=0:100:0:0 --eviction-policy=value --cas --use-json-docs --doc-padding-size=3072 --hashdump --enable-memcached-debug-log-level --enable-tcp-capture
Points to note about the test:
- We are in DGM, less than 50% resident
- We have two replicas
- Each document is about 3MB
- We're performing Durability Majority writes
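The points above can be cross-checked against the flags on the command line. The following is a minimal Python sketch, not part of the Jepsen harness, that parses a trimmed copy of the command and summarises the relevant settings. Two things are assumptions inferred from the notes above rather than confirmed by the tool: that `--doc-padding-size` is expressed in KB (so 3072 is about 3MB), and that the four `--durability` weights are ordered none, majority, majority-and-persist, persist-to-majority (so `0:100:0:0` means all writes use Majority). The function names are hypothetical.

```python
import shlex

# Command line copied from the report, trimmed to the flags discussed here.
CMD = (
    "lein trampoline run test --workload=failover --failover-type=graceful "
    "--recovery-type=delta --replicas=2 --no-autofailover --disrupt-count=1 "
    "--rate=0 --durability=0:100:0:0 --eviction-policy=value --cas "
    "--use-json-docs --doc-padding-size=3072"
)

# ASSUMPTION: order of the durability levels in the weight string.
DURABILITY_LEVELS = (
    "none",
    "majority",
    "majority_and_persist",
    "persist_to_majority",
)


def parse_flags(cmd):
    """Return {flag: value} for every --flag or --flag=value token."""
    flags = {}
    for token in shlex.split(cmd):
        if token.startswith("--"):
            name, _, value = token[2:].partition("=")
            # Flags without a value (e.g. --cas) are stored as True.
            flags[name] = value if value else True
    return flags


def summarize(flags):
    """Summarise the settings called out in the report's notes."""
    weights = [int(w) for w in flags["durability"].split(":")]
    return {
        "replicas": int(flags["replicas"]),
        # ASSUMPTION: padding size is given in KB.
        "doc_padding_mb": int(flags["doc-padding-size"]) / 1024,
        "durability_mix": dict(zip(DURABILITY_LEVELS, weights)),
    }


if __name__ == "__main__":
    print(summarize(parse_flags(CMD)))
```

Under those assumptions the summary agrees with the notes: two replicas, roughly 3MB of padding per document, and all of the durability weight on Majority.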
I've also managed to collect core dumps of memcached on each node:
172.28.128.125=node1
172.28.128.126=node2
172.28.128.127=node3
172.28.128.128=node4
Build: couchbase-server-enterprise_6.5.1-6007-ubuntu16.04
Activity
Field | Original Value | New Value |
---|---|---|
Attachment | collectinfo-2019-12-17T132606-ns_1@172.28.128.125.zip [ 79462 ] | |
Attachment | collectinfo-2019-12-17T132606-ns_1@172.28.128.126.zip [ 79463 ] | |
Attachment | collectinfo-2019-12-17T132606-ns_1@172.28.128.127.zip [ 79464 ] | |
Attachment | collectinfo-2019-12-17T132606-ns_1@172.28.128.128.zip [ 79465 ] |
Assignee | Daniel Owen [ owend ] | Dave Rigby [ drigby ] |
Priority | Major [ 3 ] | Critical [ 2 ] |
Attachment | hang-screen-shot.png [ 79466 ] |
Description | (initial description text) | Embedded screenshot added: !hang-screen-shot.png|thumbnail! |
Description | (description with screenshot) | Core dump list and build added: 172.28.128.125=node1 172.28.128.126=node2 172.28.128.127=node3 172.28.128.128=node4 Build: couchbase-server-enterprise_6.5.1-6007-ubuntu16.04 |
Fix Version/s | 6.5.1 [ 16622 ] |
Description | (core dumps listed by node name only) | Core dump entries linked: 172.28.128.125=[node1|https://cb-jira.s3.us-east-2.amazonaws.com/logs/MB-37294/5ff93127/core.10508.node1] 172.28.128.126=[node2|https://cb-jira.s3.us-east-2.amazonaws.com/logs/MB-37294/d9706f00/core.11168.node2] 172.28.128.127=[node3|https://cb-jira.s3.us-east-2.amazonaws.com/logs/MB-37294/7b76fe90/core.9916.node3] 172.28.128.128=[node4|https://cb-jira.s3.us-east-2.amazonaws.com/logs/MB-37294/d2cd55a2/core.6059.node4] |
Attachment | jepsen-output-1.log [ 79472 ] |
Assignee | Dave Rigby [ drigby ] | Dave Finlay [ dfinlay ] |
Assignee | Dave Finlay [ dfinlay ] | Aliaksey Artamonau [ aliaksey artamonau ] |
Fix Version/s | 6.5.0 [ 16624 ] | |
Fix Version/s | 6.5.1 [ 16622 ] |
Component/s | ns_server [ 10019 ] |
Attachment | mem-usage-172.28.128.125.png [ 79480 ] |
Attachment | mem-usage-172.28.128.125.png [ 79480 ] |
Attachment | mem-usage-172.28.128.128.png [ 79481 ] |
Attachment | mem-usage-172.28.128.128.png [ 79481 ] |
Attachment | mem-usage-172.28.128.128.png [ 79482 ] |
Fix Version/s | 6.5.1 [ 16622 ] | |
Fix Version/s | 6.5.0 [ 16624 ] |
Component/s | ns_server [ 10019 ] |
Is this a Regression? | Unknown [ 10452 ] | No [ 10451 ] |
Assignee | Aliaksey Artamonau [ aliaksey artamonau ] | Daniel Owen [ owend ] |
Assignee | Daniel Owen [ owend ] | Richard deMellow [ richard.demellow ] |
Attachment | screenshot-1.png [ 79521 ] |
Attachment | image-2019-12-18-11-30-45-806.png [ 79522 ] |
Attachment | image-2019-12-18-11-31-01-008.png [ 79523 ] |
Attachment | test-with-512KB-docs-80%-resident.zip [ 79525 ] |
Attachment | 172.28.128.132-mem-used-512-experiment.png [ 79528 ] |
Is this a Regression? | No [ 10451 ] | Yes [ 10450 ] |
Fix Version/s | Mad-Hatter [ 15037 ] | |
Fix Version/s | 6.5.1 [ 16622 ] |
Sprint | KV Sprint 2019-12 [ 939 ] |
Rank | Ranked higher |
Attachment | image-2019-12-20-14-26-48-491.png [ 79698 ] |
Attachment | image-2019-12-20-14-26-48-491.png [ 79698 ] |
Attachment | image-2019-12-20-14-29-26-998.png [ 79699 ] |
Attachment | image-2019-12-20-14-31-07-630.png [ 79700 ] |
Attachment | image-2019-12-20-14-29-26-998.png [ 79699 ] |
Attachment | image-2019-12-20-14-31-07-630.png [ 79700 ] |
Attachment | image-2019-12-20-14-52-07-154.png [ 79701 ] | |
Attachment | image-2019-12-20-14-51-50-994.png [ 79702 ] |
Labels | jepsen | approved-for-mad-hatter jepsen |
Due Date | 20/Dec/19 |
Link | This issue blocks MB-36676 [ MB-36676 ] |
Resolution | Fixed [ 1 ] | |
Status | Open [ 1 ] | Resolved [ 5 ] |
Assignee | Richard deMellow [ richard.demellow ] | Ashwin Govindarajulu [ ashwin.govindarajulu ] |
Status | Resolved [ 5 ] | Closed [ 6 ] |