Couchbase Server / MB-44452

[couchstore]: Graceful Failover -> Full Recovery -> Rebalance failed due to buckets_shutdown_wait_failed


Details

    Description

      1. Create a 17-node cluster.
      2. Create 1 bucket with 1 scope and 500 collections (num_buckets=1, num_scopes=1, num_collections=500).
      3. Create 100,000,000 items sequentially.
      4. Rebalance in while loading docs. Rebalance completed with progress: 100% in 11.52 sec.
      5. Sleep 61 seconds. Reason: Iteration:0 waiting to kill memc on all nodes
      6. Sleep 72 seconds. Reason: Iteration:1 waiting to kill memc on all nodes
      7. Rebalance out while loading docs. Rebalance completed with progress: 100% in 97.29 sec.
      8. Sleep 116 seconds. Reason: Iteration:0 waiting to kill memc on all nodes
      9. Sleep 93 seconds. Reason: Iteration:1 waiting to kill memc on all nodes
      10. Rebalance in/out while loading docs. Rebalance completed with progress: 100% in 96.77 sec.
      11. Sleep 69 seconds. Reason: Iteration:0 waiting to kill memc on all nodes
      12. Sleep 99 seconds. Reason: Iteration:1 waiting to kill memc on all nodes
      13. Swap rebalance while loading docs. Rebalance completed with progress: 100% in 12.12 sec.
      14. Sleep 81 seconds. Reason: Iteration:0 waiting to kill memc on all nodes
      15. Sleep 63 seconds. Reason: Iteration:1 waiting to kill memc on all nodes
      16. Fail over a node and rebalance that node out, with loading in parallel.
      17. Sleep 109 seconds. Reason: Iteration:0 waiting to kill memc on all nodes
      18. Sleep 87 seconds. Reason: Iteration:1 waiting to kill memc on all nodes
      19. Fail over a node, perform a full recovery of that node, then rebalance.

      Rebalance Failure

      Rebalance exited with reason {buckets_shutdown_wait_failed,
      [{'ns_1@172.23.121.115',
      {'EXIT',
      {old_buckets_shutdown_wait_failed,
      ["GleamBookUsers0"]}}}]}.
      Rebalance Operation Id = 27c0ef1569581833a25f2de5d9300d91
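
For triage, the failing node and bucket can be pulled out of the failure text mechanically. The following is a minimal sketch, not part of the test framework: the function name `summarize_failure` and the regular expressions are assumptions made for illustration, and they are only known to fit the exact message quoted above.

```python
import re

# Failure text copied verbatim from the report above.
FAILURE = """Rebalance exited with reason {buckets_shutdown_wait_failed,
[{'ns_1@172.23.121.115',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["GleamBookUsers0"]}}}]}.
Rebalance Operation Id = 27c0ef1569581833a25f2de5d9300d91"""


def summarize_failure(text):
    """Extract reason, node names, bucket names, and operation id (hypothetical helper)."""
    reason = re.search(r"exited with reason \{(\w+)", text)
    op_id = re.search(r"Rebalance Operation Id = (\w+)", text)
    return {
        "reason": reason.group(1) if reason else None,
        # Node names are single-quoted Erlang atoms of the form 'ns_1@<host>'.
        "nodes": re.findall(r"'(ns_1@[^']+)'", text),
        # Bucket names appear as double-quoted strings in the inner list.
        "buckets": re.findall(r'"([^"]+)"', text),
        "operation_id": op_id.group(1) if op_id else None,
    }


summary = summarize_failure(FAILURE)
print(summary)
```

Here the summary points at bucket GleamBookUsers0 on node ns_1@172.23.121.115, which is where the old bucket shutdown did not finish in time.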
      

      During the test, mem_used stayed between the low and high water marks (LWM and HWM).

      QE Test

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/test_job_magma.ini -p bucket_storage=couchstore,bucket_eviction_policy=fullEviction,rerun=False -t volumetests.Magma.volume.SystemTestMagma,nodes_init=17,replicas=1,skip_cleanup=True,num_items=100000000,num_buckets=1,bucket_names=GleamBook,doc_size=64,bucket_type=membase,compression_mode=off,iterations=20,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,maxttl=60,num_buckets=1,num_scopes=1,num_collections=500,doc_ops=create:update,durability=None,crashes=1,sdk_client_pool=True -m rest'
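
The `-t` argument above is long and sets some parameters more than once. As a reading aid, here is a small sketch that splits it into key/value pairs and lists the repeated keys. The helper name `parse_test_arg` is hypothetical, and how testrunner itself resolves a repeated key is not stated in this report, so the sketch only reports the repetition.

```python
from collections import Counter

# The -t argument copied from the command above.
T_ARG = (
    "volumetests.Magma.volume.SystemTestMagma,nodes_init=17,replicas=1,"
    "skip_cleanup=True,num_items=100000000,num_buckets=1,bucket_names=GleamBook,"
    "doc_size=64,bucket_type=membase,compression_mode=off,iterations=20,"
    "batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,"
    "rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,"
    "randomize_value=True,assert_crashes_on_load=True,maxttl=60,num_buckets=1,"
    "num_scopes=1,num_collections=500,doc_ops=create:update,durability=None,"
    "crashes=1,sdk_client_pool=True"
)


def parse_test_arg(arg):
    """Split '<test>,k=v,k=v,...' into (test name, list of (key, value) pairs)."""
    test_name, *pairs = arg.split(",")
    return test_name, [tuple(p.split("=", 1)) for p in pairs]


test_name, pairs = parse_test_arg(T_ARG)
counts = Counter(key for key, _ in pairs)
repeated = sorted(key for key, n in counts.items() if n > 1)
params = dict(pairs)  # assumption for display only: the last occurrence wins

print(test_name)
print("repeated keys:", repeated)
print("num_collections:", params["num_collections"])
```

In this command the repeated keys carry the same value each time, so the repetition does not change the effective configuration.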
      

      CPU usage appears to be within limits, at around 70% across all nodes.

            People

              steve.watanabe Steve Watanabe
              ritesh.agarwal Ritesh Agarwal