Couchbase C client library libcouchbase: CCBC-766

cbc-pillowfight does not recover after node failover triggered


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.7.3
    • Fix Version/s: 2.7.4
    • Component/s: None
    • Labels:
      None
    • Environment:
      Linux ip-10-0-129-240 4.4.11-23.53.amzn1.x86_64 #1 SMP Wed Jun 1 22:22:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

      Description

      When running cbc-pillowfight, if a node is stopped and a failover is triggered, cbc-pillowfight does not recover.

       

      Version tested as failing: 2.7.3

      Version tested as working: 2.6.4

       

      Steps to reproduce:

      Note: a brief video showing the reproduction steps is here:

      https://drive.google.com/open?id=0Bz1warxYkSR1REpGSEhWell6eVk

       

      • Create 3 node cluster, 1 bucket
      • From a separate application server, run cbc-pillowfight
      • On one node in the cluster, run:
        • service couchbase-server stop
      • The node will show up as 'down' in the UI
        • There are fewer than 1024 active vbuckets.
      • Pillowfight's traffic load will become sporadic due to timeouts (this is expected, I think).
      • Trigger a failover from the UI of the down node
      • There are now 1024 active vbuckets
      • Pillowfight does not recover, traffic remains sporadic.
      • ... ...
      • Start the couchbase node again
        • service couchbase-server start
      • Add the node back in with 'delta recovery' option.
      • Issue rebalance
      • As soon as the rebalance is started (and before it is complete!) pillowfight appears to start running full traffic loads again successfully. 

       

      Further notes:

      • I found the same behaviour regardless of whether Couchbase was shut down cleanly or had a 'hard stop'.
      • My tests showed that this worked successfully on 2.6.4; I will collect additional logs for this version and attach them to the ticket.

       

      Attachments:

      • pillowfight.log: a -vvv log from the pillowfight client, run during the full process outlined above with version 2.7.3.
      • Screenshots showing the traffic load at different stages (pre node down, during node down/failover, after rebalance started)

       

      Server side logs are available:
      https://s3.amazonaws.com/cb-customers/tom/collectinfo-2017-04-01T102252-ns_1%40ec2-34-248-41-12.eu-west-1.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-customers/tom/collectinfo-2017-04-01T102252-ns_1%40ec2-52-30-209-45.eu-west-1.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-customers/tom/collectinfo-2017-04-01T102252-ns_1%40ec2-52-31-18-152.eu-west-1.compute.amazonaws.com.zip


          Activity

          mnunberg Mark Nunberg (Inactive) added a comment -

          Compare:

           

          2.7.0

          43500ms [I0] {24403} [TRACE] (confmon - L:292) Start refresh requested
          43501ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          43501ms [I0] {24403} [TRACE] (confmon - L:271) Current provider is CCCP
          43501ms [I0] {24403} [INFO] (cccp - L:119) Re-Issuing CCCP Command on server struct 0xa06920 (ec2-34-248-41-12.eu-west-1.compute.amazonaws.com:11210)
          43502ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          53501ms [I0] {24403} [TRACE] (confmon - L:292) Start refresh requested
          53501ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          53501ms [I0] {24403} [TRACE] (confmon - L:271) Current provider is CCCP
          53501ms [I0] {24403} [INFO] (cccp - L:119) Re-Issuing CCCP Command on server struct 0xa092b0 (ec2-52-30-209-45.eu-west-1.compute.amazonaws.com:11210)
          53502ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          63503ms [I0] {24403} [TRACE] (confmon - L:292) Start refresh requested
          63503ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          63503ms [I0] {24403} [TRACE] (confmon - L:271) Current provider is CCCP
          63503ms [I0] {24403} [INFO] (cccp - L:119) Re-Issuing CCCP Command on server struct 0xa0a0f0 (ec2-52-31-18-152.eu-west-1.compute.amazonaws.com:11210)
          63519ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          73504ms [I0] {24403} [TRACE] (confmon - L:292) Start refresh requested
          73504ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          73504ms [I0] {24403} [TRACE] (confmon - L:271) Current provider is CCCP
          73504ms [I0] {24403} [INFO] (cccp - L:119) Re-Issuing CCCP Command on server struct 0xa06920 (ec2-34-248-41-12.eu-west-1.compute.amazonaws.com:11210)

          To (2.7.3):

          92693ms [I0] {3345} [TRACE] (confmon - L:252) Refreshing current cluster map
          92695ms [I0] {3345} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=396, B.rev=396
          92695ms [I0] {3345} [TRACE] (confmon - L:239) Attempting to retrieve cluster map via CCCP
          92695ms [I0] {3345} [INFO] (cccp - L:137) Re-Issuing CCCP Command on server struct 0x156a4a0 (ec2-34-248-41-12.eu-west-1.compute.amazonaws.com:11210)
          92697ms [I0] {3345} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=396, B.rev=396
          102708ms [I0] {3345} [TRACE] (confmon - L:252) Refreshing current cluster map
          102708ms [I0] {3345} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=396, B.rev=396
          102708ms [I0] {3345} [TRACE] (confmon - L:239) Attempting to retrieve cluster map via CCCP
          198857ms [I0] {3345} [INFO] (confmon - L:153) Setting new configuration. Received via CCCP
          

          ...

          The first snippet spans 30 seconds and shows the config actually being fetched, while the second snippet spans 100 seconds and shows only one fetch, at 92695ms. Presumably when the server responds with the config, the cccp subsystem doesn't unmark itself as being in progress...

           

          I am unsure why our testing does not cover this; maybe we should add a specific test for this?

           

          Akshata Trivedi [X], Jae Park [X]: Do we have a test where we just failover without rebalance?

          It seems that the rebalance accidentally unsticks it because, in that case, the config is received via the payload of a not-my-vbucket response.

          jaekwon.park Jae Park [X] (Inactive) added a comment - edited

          We have a test case that fails over one node (the entry-point node) by calling the /controller/failOver REST API, then ejects it after 30 seconds by calling the /controller/ejectNode REST API, and then rebounds to check for performance changes, but we didn't see significant changes. Here is one example: http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-4.5.1-xxxx/FoEptEject-HYBRID/03-21-17/015669/38fc246faa422dd7187e07a18bc7fdd3-CB.html

          Mark Nunberg, can you please let me know what might need to be changed in the above test? Then we will work on adding a proper test case.

          mnunberg Mark Nunberg (Inactive) added a comment - edited

          Is the cluster configured with replicas? If there are no replicas, a failover would cause the existing vbucket maps to change, triggering an NMVB payload. I have a suspicion that the FoEptEject test does not use replicas.

           

          I realize that the 'show default values' checkbox doesn't work; it used to display the actual default values (replicas being one of them). They are still visible in the page source, which does suggest the test is run with a replica node.

           

          Tom Green - is your test run with a replica node?

           

          Jae Park [X] - can you check if we run this test with or without a replica node?

           

          Furthermore, I'd like to get a negative test replicating Tom's experience. It seems like a valid bug and there's a code change here, but for the code change to be sufficient we need to know that it fixes the issue.

          tom.green Tom Green (Inactive) added a comment -

          The issue is 100% reproducible every time across multiple operating systems for me.

          Just fire up 3 couchbase nodes (or even 2 would probably do) in virtualbox on laptop, create a cluster using all default settings, and default bucket, add the extra nodes in, rebalance to get yourself a stable 2 or 3 node cluster.

          Hit the bucket with pillowfight, take one of the nodes offline then when it shows as 'down' in UI click failover.

          Alternatively you could enable autofailover and allow the failover to be triggered by the system instead of manually, if it's a 3 node cluster.

          I can confirm a replica was used, yes. If you look in the video at about 1min25 onwards, you can see I wave my mouse over the active vbucket count after the failover to show that we've taken the replica vbuckets and converted them to active. In the graphs in that section you can see the replica count has dropped and the active count has gone up to the full 1024, as expected when a failover is triggered.

          How to formally integrate that scenario into the proper test case environments is something I'm not going to be much help with though, as I just don't know enough about our test systems.

           

          Cheers,

          Tom

          mnunberg Mark Nunberg (Inactive) added a comment -

          I've managed to reproduce it, and the patch works: it seems to fix the issue quite nicely.


            People

            • Assignee: mnunberg Mark Nunberg (Inactive)
            • Reporter: tom.green Tom Green (Inactive)
            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
                Updated:
                Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
