Couchbase Server / MB-40536

[Ephemeral] Rebalance out failed with reason "unsafe nodes"


    Details

      Description

      Build: 6.6.0-7883

      Scenario:

      1. 7-node cluster with an Ephemeral bucket (replica=1)
      2. Load 250000 docs into the bucket (SyncWrites)
      3. Rebalance out 3 nodes from the cluster

        +----------------+----------+--------------+
        | Nodes          | Services | Status       |
        +----------------+----------+--------------+
        | 172.23.121.215 | kv       | Cluster node |
        | 172.23.105.5   | kv       | --- OUT ---> |
        | 172.23.105.173 | kv       | --- OUT ---> |
        | 172.23.105.200 | kv       | --- OUT ---> |
        | 172.23.105.163 | kv       | Cluster node |
        | 172.23.123.143 | kv       | Cluster node |
        | 172.23.123.161 | kv       | Cluster node |
        +----------------+----------+--------------+
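
      For reference, the rebalance-out in step 3 maps onto ns_server's REST endpoint POST /controller/rebalance, which takes comma-separated otpNode names (ns_1@<ip>). A minimal Python sketch of building that request; the node list is taken from the table above, but the helper name and the way the test framework actually drives rebalance are illustrative:

```python
# Sketch: build the POST /controller/rebalance call that ejects nodes.
# knownNodes must list every node currently in the cluster and
# ejectedNodes the subset being rebalanced out. Credentials and the
# orchestrator address are placeholders.

def rebalance_out_request(orchestrator, known_ips, eject_ips):
    otp = lambda ip: "ns_1@" + ip  # ns_server's otpNode naming scheme
    payload = {
        "knownNodes": ",".join(otp(ip) for ip in known_ips),
        "ejectedNodes": ",".join(otp(ip) for ip in eject_ips),
    }
    url = "http://%s:8091/controller/rebalance" % orchestrator
    return url, payload

url, payload = rebalance_out_request(
    "172.23.121.215",
    ["172.23.121.215", "172.23.105.5", "172.23.105.173",
     "172.23.105.200", "172.23.105.163", "172.23.123.143",
     "172.23.123.161"],
    ["172.23.105.5", "172.23.105.173", "172.23.105.200"])
# e.g. requests.post(url, data=payload, auth=("Administrator", password))
```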

      Observation:

      Rebalance failed with the following reason:

      Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default",
      {error,unsafe_nodes,['ns_1@172.23.105.163']}}.
      Rebalance Operation Id = bcddb51f49060183600aa7f4eaba6286

      Test log: http://qa.sc.couchbase.com/job/test_suite_executor-TAF/44667/consoleText

      Test case:

      ./testrunner -i /tmp/testexec.18646.ini sdk_retries=10,num_items=250000,GROUP=P0;durability,EXCLUDE_GROUP=not_for_ephemeral,durability=MAJORITY,bucket_type=ephemeral,rerun=False,get-cbcollect-info=True,collect_pcaps=True,log_level=info,upgrade_version=6.6.0-7883 -t rebalance_new.rebalance_out.RebalanceOutTests.rebalance_out_with_warming_up,value_size=1024,bucket_type=ephemeral,upgrade_version=6.6.0-7883,rerun=False,sdk_retries=10,GROUP=P0;durability,nodes_out=3,EXCLUDE_GROUP=not_for_ephemeral,max_verify=100000,get-cbcollect-info=False,replicas=1,durability=MAJORITY,log_level=debug,nodes_init=7,num_items=250000,infra_log_level=critical
      

        Attachments


          Activity

          ashwin.govindarajulu Ashwin Govindarajulu created issue -
          owend Daniel Owen added a comment -

          On .143 see:

          [user:error,2020-07-18T12:36:38.340-07:00,ns_1@172.23.123.143:<0.1613.0>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default",
                                           {error,unsafe_nodes,['ns_1@172.23.105.163']}}.
          

          On .163 see:

          2020-07-18T12:36:10.608957-07:00 INFO Received shutdown request
          2020-07-18T12:36:10.609032-07:00 INFO Initiating graceful shutdown.
          

          We then restart

          2020-07-18T12:36:37.280405-07:00 INFO Couchbase version 6.6.0-7883 starting.
          ..
          2020-07-18T12:36:37.290110-07:00 INFO Initialization complete. Accepting clients.
          ..
          2020-07-18T12:36:38.266766-07:00 INFO 46 Create bucket [default]
          ..
          2020-07-18T12:36:38.270303-07:00 INFO (default) EP Engine: Initialization of ephemeral bucket complete
          2020-07-18T12:36:38.270315-07:00 INFO 46 - Bucket [default] created successfully
          ....
          2020-07-18T12:36:43.364349-07:00 INFO (default) VBucket: created vb:902 with state:replica initialState:dead lastSeqno:0 persistedRange:{0,0} max_cas:0 uuid:247488021379162 topology:null
          ...
          2020-07-18T12:36:43.402127-07:00 INFO (default) VBucket: created vb:147 with state:replica initialState:dead lastSeqno:0 persistedRange:{0,0} max_cas:0 uuid:189357973142445 topology:null
          

          I don't see anything wrong in memcached.log.
          Ashwin Govindarajulu: the test description does not mention that .163 should receive a (graceful) shutdown request. Is this expected?

          owend Daniel Owen made changes -
          Assignee: Daniel Owen [ owend ] → Ashwin Govindarajulu [ ashwin.govindarajulu ]
          owend Daniel Owen added a comment -

          Hi Ashwin Govindarajulu,
          Also, I forgot to ask: is this a regression? If so, do you have the last known working build? Thanks.

          dfinlay Dave Finlay added a comment -

          Note that the "unsafe nodes" error means that a rebalance was attempted after a node hosting an ephemeral bucket was restarted. The vbuckets on that node are essentially empty, and if the rebalance goes ahead a lot of data will be lost. Auto-reprovisioning is the feature that kicks in here: the ephemeral vbuckets on the restarted node are provisioned elsewhere in the cluster, allowing the customer to then rebalance without losing data. It's described briefly on this docs page:

          The Node Availability panel also contains a For Ephemeral Buckets option. When opened, this provides an Enable auto-reprovisioning checkbox, with a configurable number of nodes. Checking this ensures that if a node containing active Ephemeral buckets becomes unavailable, its replicas on the specified number of other nodes are promoted to active status as appropriate, to avoid data-loss. Note, however, that this may leave the cluster in an unbalanced state, requiring a rebalance.

          At any rate, the reason rebalance is failing is that it's detected that memcached has restarted with an ephemeral bucket.
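
          For context, the auto-reprovisioning safeguard described above is configured cluster-wide via the REST endpoint POST /settings/autoReprovision. A minimal sketch of building that call; the host and the helper name are illustrative, not from this ticket:

```python
# Sketch: build the POST /settings/autoReprovision request body.
# When enabled, replica vbuckets of ephemeral buckets on up to
# max_nodes restarted nodes are promoted to active elsewhere, which
# is what lets a subsequent rebalance proceed without data loss.

def auto_reprovision_request(host, enabled=True, max_nodes=1):
    payload = {
        "enabled": "true" if enabled else "false",
        "maxNodes": str(max_nodes),
    }
    return "http://%s:8091/settings/autoReprovision" % host, payload

url, payload = auto_reprovision_request("172.23.121.215")
# e.g. requests.post(url, data=payload, auth=("Administrator", password))
```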

          jwalker Jim Walker added a comment -

          Dave Finlay, I've gone over the logs here for all nodes and can't see any case of memcached restarting, nor any CRITICAL log messages.

          Can the ns_server team take a look? Is there any more info in the ns_server logs about the type of issue detected?

          jwalker Jim Walker made changes -
          Assignee: Ashwin Govindarajulu [ ashwin.govindarajulu ] → Dave Finlay [ dfinlay ]
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Assignee: Dave Finlay [ dfinlay ] → Ashwin Govindarajulu [ ashwin.govindarajulu ]
          dfinlay Dave Finlay added a comment -

          Jim Walker: Dan refers to "graceful shutdown" and restarts in his comments.

          ritam.sharma Ritam Sharma made changes -
          Priority: Major [ 3 ] → Critical [ 2 ]
          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          This seems to be a testware issue.

          http://qa.sc.couchbase.com/job/test_suite_executor-TAF/45318/console - this is the successful run for the same case with the patch http://review.couchbase.org/c/TAF/+/132831

          The root cause of this issue was that the base logic of rebalance failure had been changed in the code, so the test needed updating to match.
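
          A hypothetical sketch of the kind of testware change described: treat an unsafe_nodes failure as a recoverable, expected condition after a memcached restart, rather than failing the test outright. This is illustrative only; the names below are not taken from the actual TAF patch.

```python
# Hypothetical testware logic (not from the real patch): retry the
# rebalance when it fails with unsafe_nodes, since auto-reprovisioning
# is expected to re-provision the ephemeral vbuckets between attempts.

def rebalance_with_unsafe_node_retry(run_rebalance, retries=3):
    """run_rebalance() returns a dict like {"status": ..., "reason": ...}."""
    for _ in range(retries):
        result = run_rebalance()
        if result.get("status") == "ok":
            return True
        if "unsafe_nodes" not in str(result.get("reason", "")):
            return False  # a genuine failure: surface it to the test
        # unsafe_nodes: give auto-reprovisioning time to run, then retry
    return False
```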

          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Labels: 6.6.0 durability functional-test → 6.6.0 Testing durability functional-test
          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Closing this issue since the test is fixed using the patch http://review.couchbase.org/c/TAF/+/132831

          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Resolution: User Error [ 10100 ]
          Status: Open [ 1 ] → Closed [ 6 ]

            People

            Assignee:
            ashwin.govindarajulu Ashwin Govindarajulu
            Reporter:
            ashwin.govindarajulu Ashwin Govindarajulu
            Votes:
            0
            Watchers:
            4

              Dates

              Created:
              Updated:
              Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
