Couchbase Server
MB-35990

Rebalance In 2 nodes failed for ephemeral bucket


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not a Bug
    • Affects Version/s: 6.5.0
    • Fix Version/s: 6.5.0
    • Component/s: test-execution
    • Environment:
      6.5.0-4282

      Description

      Steps to reproduce:
      1. Initialize a 3 node cluster

      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.105.220 | [u'kv']  | Cluster node |
      | 172.23.105.221 | None     | <--- IN ---  |
      | 172.23.105.223 | None     | <--- IN ---  |
      +----------------+----------+--------------+
      

      2. Create one bucket of type ephemeral, with eviction policy noEviction and replicas = 2.

      http://172.23.105.220:8091/pools/default/buckets with param: maxTTL=0&flushEnabled=1&compressionMode=off&bucketType=ephemeral&name=default&replicaNumber=2&ramQuotaMB=1424&threadsNumber=3&evictionPolicy=noEviction
      
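      For reference, the same bucket-creation request can be issued from a short script. A minimal sketch using Python's requests library (parameters copied from the request above; the admin credentials are placeholders):

      import requests

      # Placeholder credentials; substitute the cluster's admin user/password.
      AUTH = ("Administrator", "password")

      # Same parameters as the logged request above.
      params = {
          "name": "default",
          "bucketType": "ephemeral",
          "evictionPolicy": "noEviction",
          "replicaNumber": 2,
          "ramQuotaMB": 1424,
          "maxTTL": 0,
          "flushEnabled": 1,
          "compressionMode": "off",
          "threadsNumber": 3,
      }

      resp = requests.post("http://172.23.105.220:8091/pools/default/buckets",
                           data=params, auth=AUTH)
      resp.raise_for_status()  # the server answers 202 Accepted on success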

      Load 100k docs with durability=MAJORITY

      +---------+-----------+----------+-----+--------+------------+-----------+-----------+
      | Bucket  | Type      | Replicas | TTL | Items  | RAM Quota  | RAM Used  | Disk Used |
      +---------+-----------+----------+-----+--------+------------+-----------+-----------+
      | default | ephemeral | 1        | 0   | 250000 | 5997854720 | 195164064 | 136       |
      +---------+-----------+----------+-----+--------+------------+-----------+-----------+
      
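      The durable load itself is done by the testrunner framework; as an illustration, an equivalent loader with the Couchbase Python SDK would look roughly like this (a sketch assuming SDK 4.x; host, bucket name, and credentials are placeholders matching the setup above):

      from couchbase.auth import PasswordAuthenticator
      from couchbase.cluster import Cluster
      from couchbase.durability import Durability, ServerDurability
      from couchbase.options import ClusterOptions, UpsertOptions

      cluster = Cluster("couchbase://172.23.105.220",
                        ClusterOptions(PasswordAuthenticator("Administrator", "password")))
      collection = cluster.bucket("default").default_collection()

      # MAJORITY: each mutation must reach a majority of replicas before it is acked.
      opts = UpsertOptions(durability=ServerDurability(Durability.MAJORITY))
      for i in range(100000):
          collection.upsert("doc-%08d" % i, {"val": i}, opts)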

      3. Load another 100k docs with durability=MAJORITY and do a rebalance IN of 2 nodes

      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.105.220 | [u'kv']  | Cluster node |
      | 172.23.105.223 | [u'kv']  | Cluster node |
      | 172.23.105.221 | [u'kv']  | Cluster node |
      | 172.23.105.225 | None     | <--- IN ---  |
      | 172.23.105.226 | None     | <--- IN ---  |
      +----------------+----------+--------------+
      
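      The rebalance itself is then triggered through ns_server's REST API; a minimal sketch (OTP node names taken from the table above; credentials are placeholders):

      import requests

      AUTH = ("Administrator", "password")
      known_nodes = ",".join([
          "ns_1@172.23.105.220", "ns_1@172.23.105.221", "ns_1@172.23.105.223",
          "ns_1@172.23.105.225", "ns_1@172.23.105.226",
      ])

      # Rebalance in: all five nodes are kept, none are ejected.
      resp = requests.post("http://172.23.105.220:8091/controller/rebalance",
                           data={"knownNodes": known_nodes, "ejectedNodes": ""},
                           auth=AUTH)
      resp.raise_for_status()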

      4. Rebalance failed:

      {pre_rebalance_janitor_run_failed,"default",
      {error,unsafe_nodes,['ns_1@172.23.105.223']}}.
      Rebalance Operation Id = c69663e53c2dfb3d6429ba5a9113da77
      

      Test to validate:

      gradle-site-plugin/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P args="-i /tmp/ritesh.ini -t rebalance_new.rebalance_in.RebalanceInTests.rebalance_in_with_warming_up,nodes_init=3,nodes_in=2,num_items=100000,replicas=2,GROUP=P1;durability -p num_items=100000,durability=MAJORITY,bucket_type=ephemeral,get-cbcollect-info=True,skip_cleanup=True -p infra_log_level=debug,log_level=debug -m rest"
      


          Activity

          Daniel Owen added a comment -

          Can't see any issues in the memcached logs that would cause the rebalance failure.
          Therefore assigning to ns_server to identify the cause.

          Ajit Yagaty [X] (Inactive) added a comment -

          Steve Watanabe - Can you please take a look at this?

           

          Steve Watanabe added a comment -

          ns_server.debug.log on .220 shows the rebalance starting:

          [user:info,2019-09-15T22:05:35.908-07:00,ns_1@172.23.105.220:<0.1762.0>:ns_orchestrator:idle:834]Starting rebalance, KeepNodes = ['ns_1@172.23.105.225','ns_1@172.23.105.226',
                                           'ns_1@172.23.105.220','ns_1@172.23.105.223',
                                           'ns_1@172.23.105.221'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = c69663e53c2dfb3d6429ba5a9113da77
          

          which fails due to "unsafe_nodes":

          [user:error,2019-09-15T22:05:37.278-07:00,ns_1@172.23.105.220:<0.1762.0>:ns_orchestrator:log_rebalance_completion:1472]Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default",
                                           {error,unsafe_nodes,['ns_1@172.23.105.223']}}.
          

          The unsafe-nodes check done by the janitor was introduced by this change:

          Id837767d627b5610739cd5244d37b6c007961ae7
          =========================================
           
          [3] The janitor cleanup now looks for unsafe nodes. These
          are the nodes on which the memcached process has restarted
          within auto-failover timeout and by virtue of which the
          ACTIVEs are found to be in missing state for certain vbuckets.
          If unsafe nodes are found then the janitor doesn't bring
          the bucket online at all but simply returns the list of
          unsafe nodes back to the orchestrator.
          
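          The actual check is Erlang code inside ns_server's janitor; paraphrased in Python, the logic that commit message describes amounts to something like the following (a hypothetical sketch, all names invented, not the real implementation):

          def find_unsafe_nodes(nodes, now, autofailover_timeout):
              """A node is unsafe if its memcached restarted within the
              auto-failover window and, as a result, vbuckets that should
              have an ACTIVE copy on it are in "missing" state."""
              unsafe = []
              for node in nodes:
                  memcached_uptime = now - node.memcached_start_time
                  if (memcached_uptime < autofailover_timeout
                          and node.has_missing_active_vbuckets):
                      unsafe.append(node.name)
              # If non-empty, the janitor leaves the bucket offline and the
              # orchestrator fails the rebalance with {error, unsafe_nodes, ...}.
              return unsafe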

          Looking at memcached.log on .223, we see that it did stop and restart:

          2019-09-15T22:05:09.700297-07:00 INFO Received shutdown request
          2019-09-15T22:05:09.700359-07:00 INFO Initiating graceful shutdown.
          2019-09-15T22:05:09.700369-07:00 INFO Stop all buckets
          <snip>
          2019-09-15T22:05:10.132778-07:00 INFO Shutting down logger extension
          2019-09-15T22:05:10.133260-07:00 INFO ---------- Closing logfile
          2019-09-15T22:05:36.200212-07:00 INFO ---------- Opening logfile: /opt/couchbase/var/lib/couchbase/logs/memcached.log.000001.txt
          2019-09-15T22:05:36.202104-07:00 INFO Couchbase version 6.5.0-4282 starting.
          

          Ok, so why did memcached restart? ns_server_babysitter.log on .223 shows:

          [ns_server:info,2019-09-15T22:05:09.693-07:00,babysitter_of_ns_1@cb.local:<0.327.0>:ns_babysitter_bootstrap:stop:34]94901: got shutdown request. Terminating.
          

          and this occurs when the Couchbase server is stopped. The couchbase-server script (generated from the template couchbase-server.sh.in) contains this code:

          _stop() {
              [ -f "$COOKIEFILE" -a -f "$NODEFILE" ] || return 1
           
              cookie=`cat "$COOKIEFILE"`
              node=`cat "$NODEFILE"`
           
              erl \
                  -name 'executioner@cb.local' \
                  -kernel dist_config_file "\"$CB_DIST_CFG_FILE\"" \
                          inetrc "\"$HOSTS_CFG_FILE\"" \
                  -proto_dist cb \
                  -epmd_module cb_epmd \
                  -ssl_dist_optfile $SSL_DIST_OPTFILE \
                  -noshell \
                  -hidden \
                  -setcookie "$cookie" \
                  -eval "ns_babysitter_bootstrap:remote_stop('$node')"
          }
          

          That erl invocation runs this code from ns_babysitter_bootstrap.erl:

          stop() ->
              (catch ?log_info("~s: got shutdown request. Terminating.", [os:getpid()])),
              application:stop(ns_babysitter),
              ale:sync_all_sinks(),
              ns_babysitter:delete_pidfile(),
              init:stop().
           
          remote_stop(Node) ->
              RV = rpc:call(Node, ns_babysitter_bootstrap, stop, []),
          

          This is the expected behavior when Couchbase Server is stopped during a rebalance: the test (rebalance_in_with_warming_up) evidently stops and restarts the server on .223 to exercise warmup, and the janitor correctly flags that node as unsafe, which is why the rebalance was aborted and the issue was resolved as Not a Bug.


            People

            Assignee: Ritesh Agarwal
            Reporter: Ritesh Agarwal

