  1. Couchbase Server
  2. MB-27245

[FTS] Service rebalance fails during hard-failover, delta-recovery and add back to cluster

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • 5.5.0
    • 5.5.0
    • fts
    • Untriaged
    • Centos 64-bit
    • Yes

    Description

      Build
      5.1.0-1511

      Testcase
      ./testrunner -i /tmp/testexec.1880.ini -t fts.moving_topology_fts.MovingTopFTS.hard_failover_and_delta_recovery_during_index_building,items=30000,cluster=D,D+F,GROUP=P0

      bookkeeping - test17 in centos-p0-fts-vset00-00-moving-topology-P0

      Steps
      1. Set up a two-node cluster: D, D+F (one data node, one data + FTS node).
      2. Load 30k docs into a bucket and create an FTS index.
      3. While the index is building, hard-failover the D+F node, add it back with delta recovery, and rebalance. The rebalance fails, as shown in the log below.

      [2017-12-14 12:34:53,360] - [fts_base:2636] INFO - Starting failover for nodes:[ip:172.23.105.201 port:8091 ssh_username:root] at C1 cluster 172.23.105.200
      [2017-12-14 12:34:54,121] - [task:3364] INFO - Failing over 172.23.105.201:8091 with graceful=False
      [2017-12-14 12:34:54,637] - [rest_client:1363] INFO - fail_over node ns_1@172.23.105.201 successful
      [2017-12-14 12:34:54,638] - [task:3345] INFO - 0 seconds sleep after failover, for nodes to go pending....
      [2017-12-14 12:34:54,682] - [rest_client:1396] INFO - add_back_node ns_1@172.23.105.201 successful
      [2017-12-14 12:34:54,683] - [rest_client:1370] INFO - Going to set recoveryType=delta for node :: ns_1@172.23.105.201
      [2017-12-14 12:34:54,696] - [rest_client:1382] INFO - recoveryType for node ns_1@172.23.105.201 set successful
      [2017-12-14 12:34:55,680] - [rest_client:1418] INFO - rebalance params : {'password': 'password', 'ejectedNodes': '', 'user': 'Administrator', 'knownNodes': u'ns_1@172.23.105.201,ns_1@172.23.105.200'}
      [2017-12-14 12:34:55,723] - [rest_client:1423] INFO - rebalance operation started
      [2017-12-14 12:34:55,738] - [rest_client:1571] INFO - rebalance percentage : 0.00 %
      [2017-12-14 12:34:55,739] - [task:491] INFO - Rebalance - status: running, progress: 0
      [2017-12-14 12:35:05,772] - [rest_client:1554] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2017-12-14 12:35:05,808] - [rest_client:3029] INFO - Latest logs from UI on 172.23.105.200:
      [2017-12-14 12:35:05,808] - [rest_client:3030] ERROR - {u'node': u'ns_1@172.23.105.201', u'code': 0, u'text': u'Bucket "default" loaded on node \'ns_1@172.23.105.201\' in 0 seconds.', u'shortText': u'message', u'serverTime': u'2017-12-14T14:34:56.207Z', u'module': u'ns_memcached', u'tstamp': 1513283696207, u'type': u'info'}
      [2017-12-14 12:35:05,808] - [rest_client:3030] ERROR - {u'node': u'ns_1@172.23.105.201', u'code': 0, u'text': u'Shutting down bucket "default" on \'ns_1@172.23.105.201\' for server shutdown', u'shortText': u'message', u'serverTime': u'2017-12-14T14:34:56.184Z', u'module': u'ns_memcached', u'tstamp': 1513283696184, u'type': u'info'}
      [2017-12-14 12:35:05,809] - [rest_client:3030] ERROR - {u'node': u'ns_1@172.23.105.200', u'code': 0, u'text': u'Rebalance exited with reason {{badmatch,\n                                  {error,\n                                      {failed_nodes,[\'ns_1@172.23.105.201\']}}},\n                              [{ns_janitor,cleanup_with_states,6,\n                                   [{file,"src/ns_janitor.erl"},{line,136}]},\n                               {ns_rebalancer,do_run_janitor_pre_rebalance,1,\n                                   [{file,"src/ns_rebalancer.erl"},\n                                    {line,766}]}]}', u'shortText': u'message', u'serverTime': u'2017-12-14T14:34:56.032Z', u'module': u'ns_orchestrator', u'tstamp': 1513283696032, u'type': u'critical'}
      [2017-12-14 12:35:05,809] - [rest_client:3030] ERROR - {u'node': u'ns_1@172.23.105.200', u'code': 0, u'text': u'Started rebalancing bucket default', u'shortText': u'message', u'serverTime': u'2017-12-14T14:34:55.794Z', u'module': u'ns_rebalancer', u'tstamp': 1513283695794, u'type': u'info'}
      [2017-12-14 12:35:05,809] - [rest_client:3030] ERROR - {u'node': u'ns_1@172.23.105.200', u'code': 4, u'text': u"Starting rebalance, KeepNodes = ['ns_1@172.23.105.201','ns_1@172.23.105.200'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.105.201'],  Delta recovery buckets = all", u'shortText': u'message', u'serverTime': u'2017-12-14T14:34:55.719Z', u'module': u'ns_orchestrator', u'tstamp': 1513283695719, u'type': u'info'}
      [2017-12-14 12:35:05,809] - [rest_client:3030] ERROR - {u'node': u'ns_1@172.23.105.200', u'code': 0, u'text': u"Failed over 'ns_1@172.23.105.201': ok", u'shortText': u'message', u'serverTime': u'2017-12-14T14:34:54.630Z', u'module': u'ns_rebalancer', u'tstamp': 1513283694630, u'type': u'info'}
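
      The UI log entries above are printed newest first, which makes the sequence harder to follow. A minimal sketch that re-sorts them by `tstamp` is given here. The `node`, `module`, `type` and `tstamp` values are copied from the entries above; the `note` strings are paraphrases added for readability, and the helper name `chronological` is hypothetical.

```python
# Entries in the order they appear in the testrunner output (newest first).
# 'note' is a paraphrase of the 'text' field, not the original message.
ui_logs = [
    {"node": "ns_1@172.23.105.201", "module": "ns_memcached", "type": "info",
     "tstamp": 1513283696207, "note": "bucket default loaded"},
    {"node": "ns_1@172.23.105.201", "module": "ns_memcached", "type": "info",
     "tstamp": 1513283696184, "note": "shutting down bucket default"},
    {"node": "ns_1@172.23.105.200", "module": "ns_orchestrator", "type": "critical",
     "tstamp": 1513283696032, "note": "rebalance exited, failed_nodes"},
    {"node": "ns_1@172.23.105.200", "module": "ns_rebalancer", "type": "info",
     "tstamp": 1513283695794, "note": "started rebalancing bucket default"},
    {"node": "ns_1@172.23.105.200", "module": "ns_orchestrator", "type": "info",
     "tstamp": 1513283695719, "note": "starting rebalance, delta recovery"},
    {"node": "ns_1@172.23.105.200", "module": "ns_rebalancer", "type": "info",
     "tstamp": 1513283694630, "note": "failed over node .201"},
]


def chronological(entries):
    """Return the entries sorted oldest first by their millisecond timestamp."""
    return sorted(entries, key=lambda e: e["tstamp"])


def offsets_ms(entries):
    """Milliseconds elapsed since the earliest entry, in chronological order."""
    ordered = chronological(entries)
    start = ordered[0]["tstamp"]
    return [e["tstamp"] - start for e in ordered]


if __name__ == "__main__":
    for entry, delta in zip(chronological(ui_logs), offsets_ms(ui_logs)):
        print(f"+{delta:5d} ms  {entry['node']}  {entry['module']}  "
              f"{entry['type']}  {entry['note']}")
```

      Read oldest first, the critical `ns_orchestrator` entry comes before both `ns_memcached` entries from 172.23.105.201.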
      

      I don't have logs from a previous successful build, but I don't recall this test failing before. Attaching logs.
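
      For anyone reproducing this outside testrunner, the failover, add-back, delta-recovery and rebalance steps can be sketched as a sequence of REST requests. This is a rough sketch, not the testrunner implementation: the function name `recovery_requests` is hypothetical, the endpoint paths are the ones the Couchbase cluster REST API is understood to expose for these operations and should be checked against the server documentation, and credentials are left out. The `knownNodes` and `ejectedNodes` values mirror the rebalance params in the log above.

```python
from urllib.parse import urlencode


def recovery_requests(failed_node, known_nodes):
    """Build the (path, form params) pairs for the sequence in the steps above.

    Nothing is sent; the caller would POST each pair to the cluster's
    admin port (8091 in the log) with admin credentials.
    """
    return [
        # Hard failover of the D+F node (graceful=False in the log).
        ("/controller/failOver", {"otpNode": failed_node}),
        # Add the failed-over node back to the cluster.
        ("/controller/reAddNode", {"otpNode": failed_node}),
        # Request delta recovery for that node.
        ("/controller/setRecoveryType",
         {"otpNode": failed_node, "recoveryType": "delta"}),
        # Rebalance with every node kept and none ejected.
        ("/controller/rebalance",
         {"knownNodes": ",".join(known_nodes), "ejectedNodes": ""}),
    ]


if __name__ == "__main__":
    steps = recovery_requests(
        "ns_1@172.23.105.201",
        ["ns_1@172.23.105.201", "ns_1@172.23.105.200"],
    )
    for path, params in steps:
        print("POST", path, urlencode(params))
```

      In the failing run the test slept 0 seconds between failover and add-back, so all four requests were issued within roughly one second.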

          People

            Sreekanth Sivasankaran (Inactive)
            Aruna Piravi (Inactive)