Couchbase Server / MB-26037

vbucket mover crashed if pending vBucket requires rollback

    Description

      The issue occurred 5 days into a longevity test using ephemeral buckets with no eviction policy.

      The logs show the rebalance starting, followed by several metadata overhead warnings and then an ns_server backtrace:

      2017-09-13T07:34:51.604-07:00, ns_orchestrator:4:info:message(ns_1@172.23.106.14) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.60','ns_1@172.23.105.61',
                                       'ns_1@172.23.105.62','ns_1@172.23.105.63',
                                       'ns_1@172.23.106.14','ns_1@172.23.106.213',
                                       'ns_1@172.23.106.96','ns_1@172.23.99.168',
                                       'ns_1@172.23.99.253'], EjectNodes = ['ns_1@172.23.105.83'], Failed over and being ejected nodes = []; no delta recovery nodes

      2017-09-13T07:40:32.197-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.106.14) - Bucket "default" rebalance appears to be swap rebalance
      2017-09-13T08:02:01.695-07:00, menelaus_web_alerts_srv:0:info:message(ns_1@172.23.99.253) - Metadata overhead warning. Over  50% of RAM allocated to bucket  "default" on node "172.23.99.253" is taken up by keys and metadata.
      2017-09-13T08:02:22.551-07:00, menelaus_web_alerts_srv:0:info:message(ns_1@172.23.99.253) - Metadata overhead warning. Over  50% of RAM allocated to bucket  "default" on node "172.23.99.253" is taken up by keys and metadata. (repeated 6 times)
      
      

      per_node_processes('ns_1@172.23.106.14') =
           {<0.32569.4081>,
            [{registered_name,[]},
             {status,waiting},
             {initial_call,{proc_lib,init_p,5}},
             {backtrace,
                 [<<"Program counter: 0x00007f460af7b288 (ns_single_vbucket_mover:spawn_and_wait/1 + 72)">>,
                  <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
                  <<"0x00007f4609bdd678 Return addr 0x00007f46533eee90 (misc:try_with_maybe_ignorant_after/2 + 80)">>,
                  <<"y(0)     []">>,<<"y(1)     []">>,<<"y(2)     <0.20357.4080>">>,
                  <<>>,
                  <<"0x00007f4609bdd698 Return addr 0x00007f460af7b0d8 (ns_single_vbucket_mover:mover/5 + 896)">>,
                  <<"y(0)     []">>,<<"y(1)     []">>,<<"y(2)     []">>,
                  <<"y(3)     []">>,
                  <<"y(4)     #Fun<ns_single_vbucket_mover.3.48828051>">>,
                  <<"y(5)     Catch 0x00007f46533eeeb0 (misc:try_with_maybe_ignorant_after/2 + 112)">>,
                  <<>>,
                  <<"0x00007f4609bdd6d0 Return addr 0x00007f465befc198 (proc_lib:init_p_do_apply/3 + 56)">>,
                  <<"y(0)     []">>,<<"y(1)     true">>,
                  <<"y(2)     ['ns_1@172.23.105.62','ns_1@172.23.106.213']">>,
                  <<"y(3)     ['ns_1@172.23.105.62','ns_1@172.23.105.83']">>,
                  <<"y(4)     27">>,<<"y(5)     <0.25037.4080>">>,<<>>,
                  <<"0x00007f4609bdd708 Return addr 0x0000000000893588 (<terminate process normally>)">>,
                  <<"y(0)     Catch 0x00007f465befc1b8 (proc_lib:init_p_do_apply/3 + 88)">>,
                  <<>>]},
      
      

      As a result, the rebalance hangs in the cluster.
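For anyone triaging a similar dump, the call chain can be pulled out of the backtrace binaries mechanically. The following is a minimal sketch, not part of ns_server or any Couchbase tooling: the helper name `extract_frames` and the regular expressions are assumptions made for illustration, and the sample is a subset of the backtrace above. It assumes the usual reading of an Erlang process backtrace, where the "Program counter" line names the function the process is currently in and each "Return addr" line names a caller.

```python
import re

# Matches the "(module:function/arity + offset)" part of a frame line.
# This pattern is an assumption fitted to the dump format shown in this report.
FRAME_RE = re.compile(r"\((?P<mfa>[\w.]+:[\w.]+/\d+) \+ (?P<offset>\d+)\)")


def extract_frames(backtrace_text):
    """Return module:function/arity names in stack order (innermost first).

    Lines containing 'Catch' are skipped: they record exception-handler
    locations in stack slots rather than call frames.
    """
    frames = []
    # Each backtrace line is an Erlang binary printed as <<"...">>.
    for chunk in re.findall(r'<<"(.*?)">>', backtrace_text):
        if "Catch" in chunk:
            continue
        match = FRAME_RE.search(chunk)
        if match:
            frames.append(match.group("mfa"))
    return frames


# Subset of the backtrace from this report.
sample = '''
[<<"Program counter: 0x00007f460af7b288 (ns_single_vbucket_mover:spawn_and_wait/1 + 72)">>,
 <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
 <<"0x00007f4609bdd678 Return addr 0x00007f46533eee90 (misc:try_with_maybe_ignorant_after/2 + 80)">>,
 <<"0x00007f4609bdd698 Return addr 0x00007f460af7b0d8 (ns_single_vbucket_mover:mover/5 + 896)">>,
 <<"y(5)     Catch 0x00007f46533eeeb0 (misc:try_with_maybe_ignorant_after/2 + 112)">>,
 <<"0x00007f4609bdd6d0 Return addr 0x00007f465befc198 (proc_lib:init_p_do_apply/3 + 56)">>]
'''

for frame in extract_frames(sample):
    print(frame)
```

Read this way, the process is waiting in `ns_single_vbucket_mover:spawn_and_wait/1`, reached via `ns_single_vbucket_mover:mover/5`, which is consistent with the mover being stuck rather than having exited.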

            People

              drigby Dave Rigby (Inactive)
              tommie Tommie McAfee (Inactive)