
MB-34344: Graceful failover is allowed even if there are not enough replica nodes in the cluster


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: User Error
    • Affects Version/s: 5.5.4
    • Fix Version/s: 5.5.5
    • Component/s: ns_server
    • Environment: Enterprise Edition 5.5.4 build 4340
    • Triage: Untriaged
    • Operating System: Centos 64-bit
    • Is this a Regression?: No

    Description

      Build: 5.5.4 build 4340

      Scenario:

      1. Create a 3-node cluster
      2. Create a "default" couchbase bucket with replica=0
      3. Perform graceful failover of 1 node from the cluster (see the REST sketch after these steps)
      4. Before rebalancing the cluster, perform doc_ops that affect the vbuckets of the gracefully failed-over node
      5. All operations fail with memcached error #7: Not my vbucket
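
      A minimal REST sketch of steps 2 and 3, assuming one cluster node at 10.112.191.101 on the default port 8091 and placeholder credentials; the otpNode value must match the node being failed over:

      # Step 2: create the "default" bucket with 0 replicas
      curl -X POST -u Administrator:password http://10.112.191.101:8091/pools/default/buckets \
        -d name=default -d bucketType=couchbase -d ramQuotaMB=256 -d replicaNumber=0

      # Step 3: graceful failover of one node; with replica=0 this request is expected to be rejected
      curl -X POST -u Administrator:password http://10.112.191.101:8091/controller/startGracefulFailover \
        -d 'otpNode=ns_1@10.112.191.102'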

      The same can happen with graceful failover of:

      1. 2 nodes with replica=1
      2. 3 nodes with replica=2
      3. 4 nodes with replica=3

      i.e., in any case where the failover would remove the last remaining copy of some vbuckets.

      Expected behavior:

      The user should not be allowed to perform a graceful failover when it would leave no copy of the affected data behind in the cluster; the failover operation must fail.

      With a graceful failover, the user does not expect any failures in doc operations. In these cases, users should only be allowed to perform a hard failover.
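
      For reference, a hard failover is driven by the /controller/failOver endpoint; this sketch reuses the same placeholder host, credentials, and node name as above:

      # Hard failover of the node (the operation users should fall back to in these cases)
      curl -X POST -u Administrator:password http://10.112.191.101:8091/controller/failOver \
        -d 'otpNode=ns_1@10.112.191.102'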


        Activity

           

          Poonam Dhavale added a comment:

          Similar to the original scenario in the QA test, I created a 3-node cluster with 1 bucket with 0 replicas.

          As expected, graceful failover fails on MH and 5.5.2 with the error below.

          I do not think anything has changed in 5.5.4, and graceful failover should fail there as well, but I will create a Vulcan repo and verify.

           
           

          curl -X POST -u Administrator:asdasd http://127.0.0.1:9000/controller/startGracefulFailover -d 'otpNode=n_2@127.0.0.1'
          Failover cannot be done gracefully (would lose vbuckets).

           
           
           


           

          Poonam Dhavale added a comment:

          Repeated the above experiment with 5.5.4, and graceful failover fails as expected when there is a bucket with 0 replicas.

          Ashwin Govindarajulu, how is graceful failover initiated by the test? Can you please provide cbcollect logs as well as the QA test logs?

           

          /Users/poonam$ curl GET -u Administrator:asdasd http://127.0.0.1:9000/pools/default | jq '[.nodes]' | grep version
          ...
          "version": "5.5.4-0000-enterprise",
          "version": "5.5.4-0000-enterprise",
          "version": "5.5.4-0000-enterprise",
          "version": "5.5.4-0000-enterprise",

          /Users/poonam$ curl GET -u Administrator:asdasd http://127.0.0.1:9000/pools/default/buckets?basic_stats=true&skipMap=true 
          [{"name":"0replica","bucketType":"membase", ... "replicaNumber":0,...

          /Users/poonam$ curl -X POST -u Administrator:asdasd http://127.0.0.1:9000/controller/startGracefulFailover -d 'otpNode=n_2@127.0.0.1'
          Failover cannot be done gracefully (would lose vbuckets).
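
          As a side note, the replica count can also be pulled straight out of that buckets response with jq (a sketch against the same local node; the URL needs quoting so the shell does not split it at '&'):

          curl -s -u Administrator:asdasd 'http://127.0.0.1:9000/pools/default/buckets?basic_stats=true&skipMap=true' \
            | jq '.[] | {name, replicaNumber}'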


          Ashwin Govindarajulu added a comment:

          Poonam Dhavale, the failover is initiated from the UI manually.

          To reproduce using the UI:

          1. From the "Servers" tab, click on the node and select the "Failover" button.
          2. A pop-up appears with "Graceful Failover (default)" as the default option. Click the "Failover Node" button.
          3. Failover will be triggered successfully.

          UI logs:

          cbcollect logs:
          https://s3.amazonaws.com/bugdb/jira/MB_34344_on_4513/collectinfo-2019-06-18T051758-ns_1%4010.112.191.101.zip
          https://s3.amazonaws.com/bugdb/jira/MB_34344_on_4513/collectinfo-2019-06-18T051758-ns_1%4010.112.191.102.zip


          Ashwin Govindarajulu added a comment:

          Dave Finlay, Ajit Yagaty [X]: debugged this further, and it looks to be a testware issue. Will close this bug now and submit the fix for the test case as well.


          Ashwin Govindarajulu added a comment:

          Merged the test case fix, so closing the ticket.


          People

            Assignee: Ashwin Govindarajulu
            Reporter: Ashwin Govindarajulu
            Votes: 0
            Watchers: 7

