Couchbase Server · MB-37104

[System test]: Autofailover failed with prepare_rebalance_failed


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not a Bug
    • Affects Version: 6.5.0
    • Fix Version: 6.5.0
    • Component: ns_server
    • Labels: None

    Description

      Build: 6.5.0-4908 (not seen on 4890)

      Test: MH longevity with durability

      Cycle: 2nd

      Day: 1st

      Test Step:

      Autofailover 1 kv node
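
      For reference, a minimal sketch of what the setting-autofailover call in the timeline below configures, expressed against the cluster REST API instead of couchbase-cli; the endpoint and parameter names follow the standard Couchbase REST interface, and the host and credentials are the placeholders used by the test:

        # Sketch: enable auto-failover with a 5 s timeout and at most 1 auto-failover,
        # mirroring the couchbase-cli setting-autofailover call in the test timeline.
        import requests

        CLUSTER = "http://172.23.108.103:8091"   # orchestrator node used by the test
        AUTH = ("Administrator", "password")     # test credentials

        resp = requests.post(
            f"{CLUSTER}/settings/autoFailover",
            auth=AUTH,
            data={"enabled": "true", "timeout": 5, "maxCount": 1},
        )
        resp.raise_for_status()
        print(requests.get(f"{CLUSTER}/settings/autoFailover", auth=AUTH).json())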

      [2019-11-29T22:02:12-08:00, sequoiatools/couchbase-cli:6.5:74e46b] setting-autofailover -c 172.23.108.103:8091 -u Administrator -p password --enable-auto-failover=1 --auto-failover-timeout=5 --max-failovers=1
      [2019-11-29T22:02:38-08:00, sequoiatools/cmd:681b7c] 10
      [2019-11-29T22:03:15-08:00, sequoiatools/cbutil:b46f53] /cbinit.py 172.23.106.100 root couchbase stop
      [2019-11-29T22:03:56-08:00, sequoiatools/cmd:641481] 10
      [2019-11-29T22:04:27-08:00, sequoiatools/couchbase-cli:6.5:adf868] rebalance -c 172.23.108.103:8091 -u Administrator -p password
      Error occurred on container - sequoiatools/couchbase-cli:6.5:[rebalance -c 172.23.108.103:8091 -u Administrator -p password]
      docker logs adf868
      docker start adf868
      Unable to display progress bar on this os
      ERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2019-11-29T22:05:26-08:00, sequoiatools/cmd:07a801] 60
      

      Rebalance failed

      [user:error,2019-11-29T22:04:58.567-08:00,ns_1@172.23.108.103:<0.12064.0>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {prepare_rebalance_failed,
                                    {error,
                                     {failed_nodes,
                                      [{'ns_1@172.23.106.100',{error,timeout}}]}}}.
      Rebalance Operation Id = b8a928a76ed5b8a39656c137ca54a1b9 
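
      The prepare stage timed out against 172.23.106.100 because the rebalance was started while that node was unreachable. A hedged sketch of the kind of pre-rebalance health check a harness could run first, assuming the standard /pools/default endpoint and the same placeholder credentials:

        # Sketch: refuse to start a rebalance while any cluster node is not healthy.
        import requests

        CLUSTER = "http://172.23.108.103:8091"
        AUTH = ("Administrator", "password")

        nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
        unhealthy = [n["hostname"] for n in nodes if n.get("status") != "healthy"]
        if unhealthy:
            # 172.23.106.100 would be listed here after being stopped.
            raise RuntimeError(f"not rebalancing, unhealthy nodes: {unhealthy}")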


        Activity

          Aliaksey Artamonau added a comment:

          Auto-failover failed because some of the vbuckets didn't have replicas:

          2019-11-29T22:03:25.623-08:00, auto_failover:0:info:message(ns_1@172.23.108.103) - Could not automatically fail over nodes (['ns_1@172.23.106.100']). Would lose vbuckets in the following buckets: ["ORDER_LINE","ORDERS",
                                                         "NEW_ORDER","ITEM","HISTORY",
                                                         "DISTRICT","CUSTOMER",
                                                         "default"]
          

          Then, even though the node 106.100 was down, a rebalance was initiated:

          2019-11-29T22:04:28.563-08:00, ns_orchestrator:0:info:message(ns_1@172.23.108.103) - Starting rebalance, KeepNodes = ['ns_1@172.23.104.155','ns_1@172.23.104.156',
                                           'ns_1@172.23.104.157','ns_1@172.23.104.164',
                                           'ns_1@172.23.104.61','ns_1@172.23.104.69',
                                           'ns_1@172.23.104.87','ns_1@172.23.104.88',
                                           'ns_1@172.23.106.100','ns_1@172.23.106.188',
                                           'ns_1@172.23.108.103','ns_1@172.23.96.148',
                                           'ns_1@172.23.96.251','ns_1@172.23.96.252',
                                           'ns_1@172.23.96.253','ns_1@172.23.96.95',
                                           'ns_1@172.23.97.119','ns_1@172.23.97.121',
                                           'ns_1@172.23.97.122','ns_1@172.23.97.239',
                                           'ns_1@172.23.97.242','ns_1@172.23.98.135',
                                           'ns_1@172.23.99.11','ns_1@172.23.99.21',
                                           'ns_1@172.23.99.25'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = b8a928a76ed5b8a39656c137ca54a1b9
          

          But since node 106.100 was down, the rebalance got stuck in the preparation stage waiting for the node to respond, and eventually timed out:

          2019-11-29T22:04:58.567-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.108.103) - Rebalance exited with reason {prepare_rebalance_failed,
                                        {error,
                                         {failed_nodes,
                                          [{'ns_1@172.23.106.100',{error,timeout}}]}}}.
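
          If a harness needs to detect this outcome without reading the ns_server logs, it can poll the rebalance status until the operation exits; the exit reason itself is still only in the logs, as the couchbase-cli error above indicates. A minimal polling sketch with the same placeholder cluster and credentials:

            # Sketch: wait until the rebalance is no longer running, then report.
            import time
            import requests

            CLUSTER = "http://172.23.108.103:8091"
            AUTH = ("Administrator", "password")

            while True:
                status = requests.get(
                    f"{CLUSTER}/pools/default/rebalanceProgress", auth=AUTH
                ).json().get("status")
                if status != "running":
                    # Reports "none" once the rebalance has exited, including after
                    # a failure such as prepare_rebalance_failed above.
                    print(f"rebalance finished or exited, status={status}")
                    break
                time.sleep(5)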
          

          In between these two events, we attempted auto-failover again, which failed for a different reason because of the running rebalance:

          2019-11-29T22:04:28.718-08:00, auto_failover:0:info:message(ns_1@172.23.108.103) - Could not automatically fail over nodes (['ns_1@172.23.106.100']). Rebalance is running.
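
          Since auto-failover stays suppressed while a rebalance is in flight, a test can check for an in-progress rebalance before expecting auto-failover to act. A sketch against the /pools/default/tasks endpoint, with the same placeholders:

            # Sketch: check whether a rebalance is currently running on the cluster.
            import requests

            CLUSTER = "http://172.23.108.103:8091"
            AUTH = ("Administrator", "password")

            tasks = requests.get(f"{CLUSTER}/pools/default/tasks", auth=AUTH).json()
            rebalance_running = any(
                t.get("type") == "rebalance" and t.get("status") == "running"
                for t in tasks
            )
            print("rebalance running:", rebalance_running)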
          

          None of this constitutes a bug.


          People

            Assignee: Vikas Chaudhary
            Reporter: Vikas Chaudhary
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
