  1. Couchbase Server
  2. MB-22767

Autofailover of node fails after network is restarted on a different node even after cluster is rebuilt.

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • 5.0.0
    • 5.0.0
    • ns_server
    • Untriaged
    • No

    Description

      1. Create a 3-node cluster with 1 bucket.
      2. Enable autofailover with a 5-second timeout.
      3. SSH into any one of the nodes and restart the network: service network stop && sleep 5 && service network start
      4. Wait for autofailover to kick in and fail over the node.
      5. Recreate the cluster with the same autofailover timeout enabled, then restart the network on a different node.
      6. We expect that node to be failed over, but instead we see the following in the UI logs:

      {u'node': u'ns_1@172.23.98.79', u'code': 3, u'text': u"Could not auto-failover node ('ns_1@172.23.98.79'). There was at least another node down.", u'shortText': u'message', u'serverTime': u'2017-01-21T01:29:30.141Z', u'module': u'auto_failover', u'tstamp': 1484990970141, u'type': u'info'}

      [2017-01-21 01:30:40,139] - [rest_client:2700] ERROR -

      {u'node': u'ns_1@172.23.98.79', u'code': 0, u'text': u"IP address seems to have changed. Unable to listen on 'ns_1@172.23.98.79'. (Underlaying POSIX error code: 'eaddrnotavail')", u'shortText': u'message', u'serverTime': u'2017-01-21T01:29:26.276Z', u'module': u'menelaus_web_alerts_srv', u'tstamp': 1484990966276, u'type': u'info'}

      [2017-01-21 01:30:40,139] - [rest_client:2700] ERROR -

      {u'node': u'ns_1@172.23.98.79', u'code': 0, u'text': u"IP address seems to have changed. Unable to listen on 'ns_1@172.23.98.79'. (Underlaying POSIX error code: 'eaddrnotavail') (repeated 8 times)", u'shortText': u'message', u'serverTime': u'2017-01-21T01:29:24.134Z', u'module': u'menelaus_web_alerts_srv', u'tstamp': 1484990964134, u'type': u'info'}
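      To make the order of events easier to read, the three entries above can be parsed and sorted by their tstamp field. This is a minimal sketch, assuming the entries are Python 2 repr output of dictionaries (as the u'...' prefixes suggest). The names RAW_ENTRIES, parse_entries, and summarize are illustrative and are not part of testrunner.

```python
import ast

# The three UI log entries quoted in this report, copied verbatim.
RAW_ENTRIES = [
    r"""{u'node': u'ns_1@172.23.98.79', u'code': 3, u'text': u"Could not auto-failover node ('ns_1@172.23.98.79'). There was at least another node down.", u'shortText': u'message', u'serverTime': u'2017-01-21T01:29:30.141Z', u'module': u'auto_failover', u'tstamp': 1484990970141, u'type': u'info'}""",
    r"""{u'node': u'ns_1@172.23.98.79', u'code': 0, u'text': u"IP address seems to have changed. Unable to listen on 'ns_1@172.23.98.79'. (Underlaying POSIX error code: 'eaddrnotavail')", u'shortText': u'message', u'serverTime': u'2017-01-21T01:29:26.276Z', u'module': u'menelaus_web_alerts_srv', u'tstamp': 1484990966276, u'type': u'info'}""",
    r"""{u'node': u'ns_1@172.23.98.79', u'code': 0, u'text': u"IP address seems to have changed. Unable to listen on 'ns_1@172.23.98.79'. (Underlaying POSIX error code: 'eaddrnotavail') (repeated 8 times)", u'shortText': u'message', u'serverTime': u'2017-01-21T01:29:24.134Z', u'module': u'menelaus_web_alerts_srv', u'tstamp': 1484990964134, u'type': u'info'}""",
]


def parse_entries(raw_entries):
    """Parse repr-style log entries and return them oldest first."""
    # literal_eval accepts the u'' prefix and evaluates literals only.
    entries = [ast.literal_eval(raw) for raw in raw_entries]
    return sorted(entries, key=lambda entry: entry["tstamp"])


def summarize(entries):
    """Return one short line per entry: server time, module, message."""
    return [
        "%s  %s  %s" % (e["serverTime"], e["module"], e["text"])
        for e in entries
    ]


if __name__ == "__main__":
    for line in summarize(parse_entries(RAW_ENTRIES)):
        print(line)
```

      Sorted this way, the two menelaus_web_alerts_srv alerts about the changed IP address (tstamp 1484990964134 and 1484990966276) come before the auto_failover message (tstamp 1484990970141).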

      The same behavior can be reproduced with automated tests:
      Clone testrunner from https://github.com/bharath-gp/testrunner.git and check out the autofailovertests branch, then create an ini file with at least 4 servers in it (examples are in the b/resources folder of testrunner).
      Run the following tests one after the other:

      ./testrunner -i <ini file here> -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover,timeout=5,num_node_failures=2,pause_between_failover_action=35,failover_action=restart_network,failover_action=restart_network,nodes_init=3

      ./testrunner -i <ini file here> -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover,timeout=5,num_node_failures=1,failover_orchestrator=True,failover_action=restart_network,nodes_init=3
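      The two runs above can also be driven from a small script so that they always execute back to back. This is a sketch only: it assumes it is started with the testrunner checkout as the working directory, and build_command, RUNS, run_all, and the ini file name in the usage line are hypothetical names introduced for illustration. The parameters, including the repeated failover_action in the first run, are copied from the commands above.

```python
import subprocess

TEST = "failover.AutoFailoverTests.AutoFailoverTests.test_autofailover"

# Parameters for each run, kept as ordered (key, value) pairs so the
# generated -t argument matches the commands in this report exactly.
RUNS = [
    [
        ("timeout", 5),
        ("num_node_failures", 2),
        ("pause_between_failover_action", 35),
        ("failover_action", "restart_network"),
        ("failover_action", "restart_network"),
        ("nodes_init", 3),
    ],
    [
        ("timeout", 5),
        ("num_node_failures", 1),
        ("failover_orchestrator", True),
        ("failover_action", "restart_network"),
        ("nodes_init", 3),
    ],
]


def build_command(ini_file, test, params):
    """Build the argument list for one testrunner invocation."""
    spec = ",".join([test] + ["%s=%s" % (key, value) for key, value in params])
    return ["./testrunner", "-i", ini_file, "-t", spec]


def run_all(ini_file, testrunner_dir="."):
    """Run every entry in RUNS in order and return the exit codes."""
    return [
        subprocess.call(build_command(ini_file, TEST, params), cwd=testrunner_dir)
        for params in RUNS
    ]
```

      For example, run_all("my_cluster.ini") would execute the first test and then the second against the servers listed in that ini file.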

          People

            bharath.gp Bharath G P