  1. Couchbase Server
  2. MB-37294

[Jepsen] Hang during rebalance in DGM scenario while performing graceful failover

Details

    Description

      While running the following Jepsen test, which performs a graceful failover of a node and then re-adds it to the cluster using delta node recovery, we observed a rebalance hang during the failover stage of the test.
      lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --package /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/install --workload=failover --failover-type=graceful --recovery-type=delta --replicas=2 --no-autofailover --disrupt-count=1 --rate=0 --durability=0:100:0:0 --eviction-policy=value --cas --use-json-docs --doc-padding-size=3072 --hashdump --enable-memcached-debug-log-level --enable-tcp-capture
      Points to note about the test:

      • We are in DGM, less than 50% resident
      • We have two replicas
      • Each document is about 3MB
      • We're performing Durability Majority writes
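
      As a reading aid for the --durability=0:100:0:0 flag in the command above, here is a minimal Python sketch. It is not part of the Jepsen test suite: parse_durability and the level names are hypothetical, and it assumes the four colon-separated fields are percentage weights for the levels none, majority, majority-and-persist-on-master and persist-to-majority, in that order. That ordering is not confirmed by this ticket; it is only consistent with the note that the test performs Durability Majority writes.

      ```python
      from typing import Dict

      # ASSUMPTION: field order of the --durability spec. These names are
      # illustrative only and are not taken from the Jepsen test suite.
      ASSUMED_LEVELS = (
          "none",
          "majority",
          "majority_and_persist_on_master",
          "persist_to_majority",
      )


      def parse_durability(spec: str) -> Dict[str, int]:
          """Split a spec such as '0:100:0:0' into per-level percentage weights."""
          weights = [int(field) for field in spec.split(":")]
          if len(weights) != len(ASSUMED_LEVELS):
              raise ValueError(
                  "expected %d fields, got %d" % (len(ASSUMED_LEVELS), len(weights))
              )
          if any(w < 0 for w in weights) or sum(weights) != 100:
              raise ValueError("weights must be non-negative and sum to 100")
          return dict(zip(ASSUMED_LEVELS, weights))


      if __name__ == "__main__":
          for level, weight in parse_durability("0:100:0:0").items():
              print("%s: %d%%" % (level, weight))
      ```

      Under that assumed ordering, the spec used in this test puts all of the weight on the majority level and none on the others.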

      I've also managed to collect core dumps of memcached on each node:
      172.28.128.125=node1
      172.28.128.126=node2
      172.28.128.127=node3
      172.28.128.128=node4

      Build: couchbase-server-enterprise_6.5.1-6007-ubuntu16.04

          Activity

            richard.demellow Richard deMellow created issue -
            richard.demellow Richard deMellow made changes -
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Dave Rigby [ drigby ]
            owend Daniel Owen made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            richard.demellow Richard deMellow made changes -
            Attachment hang-screen-shot.png [ 79466 ]
            richard.demellow Richard deMellow made changes -
            Description Added the screenshot thumbnail (!hang-screen-shot.png|thumbnail!) below the test notes; text otherwise unchanged.
            richard.demellow Richard deMellow made changes -
            Description Added the list of memcached core dumps collected on each node (172.28.128.125=node1 through 172.28.128.128=node4) and the build, couchbase-server-enterprise_6.5.1-6007-ubuntu16.04; text otherwise unchanged.
            owend Daniel Owen made changes -
            Fix Version/s 6.5.1 [ 16622 ]
            richard.demellow Richard deMellow made changes -
            Description Linked the per-node memcached core dumps; text otherwise unchanged:
            172.28.128.125=[node1|https://cb-jira.s3.us-east-2.amazonaws.com/logs/MB-37294/5ff93127/core.10508.node1]
            172.28.128.126=[node2|https://cb-jira.s3.us-east-2.amazonaws.com/logs/MB-37294/d9706f00/core.11168.node2]
            172.28.128.127=[node3|https://cb-jira.s3.us-east-2.amazonaws.com/logs/MB-37294/7b76fe90/core.9916.node3]
            172.28.128.128=[node4|https://cb-jira.s3.us-east-2.amazonaws.com/logs/MB-37294/d2cd55a2/core.6059.node4]
            richard.demellow Richard deMellow made changes -
            Attachment jepsen-output-1.log [ 79472 ]
            drigby Dave Rigby made changes -
            Assignee Dave Rigby [ drigby ] Dave Finlay [ dfinlay ]
            dfinlay Dave Finlay made changes -
            Assignee Dave Finlay [ dfinlay ] Aliaksey Artamonau [ aliaksey artamonau ]
            owend Daniel Owen made changes -
            Fix Version/s 6.5.0 [ 16624 ]
            Fix Version/s 6.5.1 [ 16622 ]
            owend Daniel Owen made changes -
            Component/s ns_server [ 10019 ]
            owend Daniel Owen made changes -
            Attachment mem-usage-172.28.128.125.png [ 79480 ]
            owend Daniel Owen made changes -
            Attachment mem-usage-172.28.128.125.png [ 79480 ]
            owend Daniel Owen made changes -
            Attachment mem-usage-172.28.128.128.png [ 79481 ]
            owend Daniel Owen made changes -
            Attachment mem-usage-172.28.128.128.png [ 79481 ]
            owend Daniel Owen made changes -
            Attachment mem-usage-172.28.128.128.png [ 79482 ]
            owend Daniel Owen made changes -
            Fix Version/s 6.5.1 [ 16622 ]
            Fix Version/s 6.5.0 [ 16624 ]
            owend Daniel Owen made changes -
            Component/s ns_server [ 10019 ]
            owend Daniel Owen made changes -
            Is this a Regression? Unknown [ 10452 ] No [ 10451 ]
            Aliaksey Artamonau Aliaksey Artamonau (Inactive) made changes -
            Assignee Aliaksey Artamonau [ aliaksey artamonau ] Daniel Owen [ owend ]
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Richard deMellow [ richard.demellow ]
            richard.demellow Richard deMellow made changes -
            Attachment screenshot-1.png [ 79521 ]
            richard.demellow Richard deMellow made changes -
            Attachment image-2019-12-18-11-30-45-806.png [ 79522 ]
            richard.demellow Richard deMellow made changes -
            Attachment image-2019-12-18-11-31-01-008.png [ 79523 ]
            richard.demellow Richard deMellow made changes -
            owend Daniel Owen made changes -
            owend Daniel Owen made changes -
            Is this a Regression? No [ 10451 ] Yes [ 10450 ]
            raju Raju Suravarjjala made changes -
            Fix Version/s Mad-Hatter [ 15037 ]
            Fix Version/s 6.5.1 [ 16622 ]
            owend Daniel Owen made changes -
            Sprint KV Sprint 2019-12 [ 939 ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment image-2019-12-20-14-26-48-491.png [ 79698 ]
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment image-2019-12-20-14-26-48-491.png [ 79698 ]
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment image-2019-12-20-14-29-26-998.png [ 79699 ]
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment image-2019-12-20-14-31-07-630.png [ 79700 ]
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment image-2019-12-20-14-29-26-998.png [ 79699 ]
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment image-2019-12-20-14-31-07-630.png [ 79700 ]
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment image-2019-12-20-14-52-07-154.png [ 79701 ]
            Attachment image-2019-12-20-14-51-50-994.png [ 79702 ]
            owend Daniel Owen made changes -
            Link This issue relates to MB-37329 [ MB-37329 ]
            owend Daniel Owen made changes -
            Link This issue relates to MB-37330 [ MB-37330 ]
            drigby Dave Rigby made changes -
            Link This issue relates to MB-37330 [ MB-37330 ]
            drigby Dave Rigby made changes -
            Link This issue relates to MB-37330 [ MB-37330 ]
            lynn.straus Lynn Straus made changes -
            Labels jepsen approved-for-mad-hatter jepsen
            lynn.straus Lynn Straus made changes -
            Due Date 20/Dec/19
            drigby Dave Rigby made changes -
            Link This issue blocks MB-36676 [ MB-36676 ]
            drigby Dave Rigby made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Assignee Richard deMellow [ richard.demellow ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              richard.demellow Richard deMellow
              Votes: 0
              Watchers: 8