Couchbase Server / MB-8148

Rebalance exited with reason bulk_set_vbucket_state_failed {unexpected_reason,killed} when rebalancing out 2 failed-over nodes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      centos64

      Description

      2.0.2-769-rel
      http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/200/consoleFull

      ./testrunner -i /tmp/rebalance_regression.ini wait_timeout=100,get-cbcollect-info=True -t swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=2,num-buckets=3,num-swap=2,keys-count=1000000

      steps:
      Nothing special: 7 servers, fail over 2 nodes, then rebalance them out. 3 buckets * 1M items.
      [2013-04-22 02:34:45,289] - INFO - current nodes : [u'ns_1@10.3.121.94', u'ns_1@10.3.121.92', u'ns_1@10.3.121.98', u'ns_1@10.3.121.96', u'ns_1@10.3.121.93', u'ns_1@10.3.121.97', u'ns_1@10.3.121.95']
      [2013-04-22 02:34:45,860] - INFO - failover node ns_1@10.3.121.94
      [2013-04-22 02:34:51,468] - INFO - failover node ns_1@10.3.121.92
      [2013-04-22 02:34:56,963] - INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92&user=Administrator&knownNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.98%2Cns_1%4010.3.121.96%2Cns_1%4010.3.121.93%2Cns_1%4010.3.121.97%2Cns_1%4010.3.121.95
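The rebalance parameters in the log line above are a URL-encoded form body. As a minimal sketch (the variable names are illustrative and not part of testrunner), the body can be decoded with the Python standard library to show which nodes are ejected and which should remain after the rebalance:

```python
from urllib.parse import parse_qs

# Form body copied from the "rebalance params" log line.
params = (
    "password=password"
    "&ejectedNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92"
    "&user=Administrator"
    "&knownNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.98"
    "%2Cns_1%4010.3.121.96%2Cns_1%4010.3.121.93%2Cns_1%4010.3.121.97"
    "%2Cns_1%4010.3.121.95"
)

# parse_qs percent-decodes the values (%40 -> "@", %2C -> ",").
decoded = parse_qs(params)
ejected = decoded["ejectedNodes"][0].split(",")
known = decoded["knownNodes"][0].split(",")

# Nodes expected to stay in the cluster once the rebalance completes.
remaining = [node for node in known if node not in ejected]

print("ejected:  ", ejected)
print("remaining:", remaining)
```

The two ejected nodes are the same two that were failed over in the preceding log lines, which leaves five of the seven known nodes in the cluster.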

      [2013-04-22 02:45:51,264] - [rest_client:1000] INFO - rebalance percentage : 20.1342244449 %
      [2013-04-22 02:45:53,280] - [rest_client:1000] INFO - rebalance percentage : 20.1342244449 %
      [2013-04-22 02:45:55,289] - [rest_client:1000] INFO - rebalance percentage : 20.1342244449 %
      [2013-04-22 02:45:57,296] - [rest_client:1000] INFO - rebalance percentage : 20.1342244449 %
      [2013-04-22 02:45:59,305] - [rest_client:1000] INFO - rebalance percentage : 20.1342244449 %
      [2013-04-22 02:45:59,834] - [rest_client:665] ERROR - http://10.3.121.98:8091/nodes/self error 500 reason: unknown ["Unexpected server error, request logged."]
      [2013-04-22 02:46:01,313] - [rest_client:983] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
      [2013-04-22 02:46:01,313] - [rest_client:984] INFO - Latest logs from UI:
      [2013-04-22 02:46:01,454] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.93', u'code': 2, u'text': u"Rebalance exited with reason {{bulk_set_vbucket_state_failed,\n [{'ns_1@10.3.121.98',\n {'EXIT',\n {{{{unexpected_reason,killed},\n [{misc,executing_on_new_process,1},\n {tap_replication_manager,\n change_vbucket_filter,4},\n {tap_replication_manager,\n '-do_set_incoming_replication_map/3-lc$^2/1-2-',\n 2},\n {tap_replication_manager,\n do_set_incoming_replication_map,3},\n {tap_replication_manager,handle_call,3},\n {gen_server,handle_msg,5},\n {proc_lib,init_p_do_apply,3}]},\n {gen_server,call,\n ['tap_replication_manager-bucket-2',\n {change_vbucket_replication,412,\n 'ns_1@10.3.121.96'},\n infinity]}},\n {gen_server,call,\n [{'janitor_agent-bucket-2',\n 'ns_1@10.3.121.98'},\n {if_rebalance,<0.11754.18>,\n {update_vbucket_state,412,replica,\n undefined,'ns_1@10.3.121.96'}},\n infinity]}}}}]},\n [{janitor_agent,bulk_set_vbucket_state,4},\n {ns_vbucket_mover,\n update_replication_post_move,3},\n {ns_vbucket_mover,on_move_done,2},\n {gen_server,handle_msg,5},\n {proc_lib,init_p_do_apply,3}]}\n", u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1366624376395, u'type': u'info'}
      [2013-04-22 02:46:01,455] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.98', u'code': 0, u'text': u"Haven't heard from a higher priority node or a master, so I'm taking over. (repeated 1 times)", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1366624376173, u'type': u'info'}
      [2013-04-22 02:46:01,455] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.98', u'code': 19, u'text': u'Server error during processing: ["web request failed",\n {path,"/nodes/self"},\n {type,exit},\n {what,\n {timeout,\n {gen_server,call,\n [ns_cookie_manager,cookie_get]}}},\n {trace,\n [{gen_server,call,2},\n {menelaus_web,build_nodes_info_fun,3},\n {menelaus_web,build_full_node_info,2},\n {menelaus_web,handle_node,3},\n {menelaus_web,loop,3},\n {mochiweb_http,headers,5},\n {proc_lib,init_p_do_apply,3}]}]', u'shortText': u'server error during request processing', u'module': u'menelaus_web', u'tstamp': 1366624345004, u'type': u'warning'}
      [2013-04-22 02:46:01,455] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.98', u'code': 0, u'text': u"Haven't heard from a higher priority node or a master, so I'm taking over.", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1366624327663, u'type': u'info'}
      [2013-04-22 02:46:01,455] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.93', u'code': 0, u'text': u'Bucket "bucket-2" rebalance does not seem to be swap rebalance', u'shortText': u'message', u'module': u'ns_vbucket_mover', u'tstamp': 1366623715033, u'type': u'info'}
      [2013-04-22 02:46:01,456] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.97', u'code': 5, u'text': u"Node 'ns_1@10.3.121.97' saw that node 'ns_1@10.3.121.92' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1366623714329, u'type': u'warning'}
      [2013-04-22 02:46:01,456] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.93', u'code': 5, u'text': u"Node 'ns_1@10.3.121.93' saw that node 'ns_1@10.3.121.92' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1366623714313, u'type': u'warning'}
      [2013-04-22 02:46:01,456] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.97', u'code': 5, u'text': u"Node 'ns_1@10.3.121.97' saw that node 'ns_1@10.3.121.94' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1366623714309, u'type': u'warning'}
      [2013-04-22 02:46:01,457] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.95', u'code': 5, u'text': u"Node 'ns_1@10.3.121.95' saw that node 'ns_1@10.3.121.92' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1366623714299, u'type': u'warning'}
      [2013-04-22 02:46:01,457] - [rest_client:985] ERROR - {u'node': u'ns_1@10.3.121.98', u'code': 5, u'text': u"Node 'ns_1@10.3.121.98' saw that node 'ns_1@10.3.121.92' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1366623714296, u'type': u'warning'}
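The UI log entries above are listed newest first and carry a numeric `tstamp` field. Assuming those values are milliseconds since the Unix epoch (an assumption, though the 13-digit values are consistent with April 2013), a short sketch like the following puts a few of the entries in chronological order with readable UTC times. The `entries` list and the `to_utc` helper are illustrative names, not part of testrunner:

```python
from datetime import datetime, timezone

# (tstamp, node, module) triples copied from the UI log entries.
entries = [
    (1366624376395, "ns_1@10.3.121.93", "ns_orchestrator"),   # rebalance exited
    (1366624376173, "ns_1@10.3.121.98", "mb_master"),         # taking over (repeated)
    (1366624345004, "ns_1@10.3.121.98", "menelaus_web"),      # /nodes/self timeout
    (1366624327663, "ns_1@10.3.121.98", "mb_master"),         # taking over
    (1366623715033, "ns_1@10.3.121.93", "ns_vbucket_mover"),  # not a swap rebalance
]

def to_utc(tstamp_ms):
    """Convert an assumed millisecond epoch timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(tstamp_ms / 1000, tz=timezone.utc)

# Tuples sort by their first element, so this prints oldest first.
for tstamp, node, module in sorted(entries):
    print(to_utc(tstamp).isoformat(timespec="milliseconds"), node, module)
```

Read in this order, the `mb_master` takeover messages on 10.3.121.98 come before the `ns_orchestrator` failure reported by 10.3.121.93.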


        Activity

        andreibaranouski Andrei Baranouski added a comment -

        oops, can't get logs, slv-0703 server is lost in jenkins

        andreibaranouski Andrei Baranouski added a comment -

        https://s3.amazonaws.com/bugdb/jira/MB-8148/e837dda5/10.3.121.93-4222013-247-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8148/e837dda5/10.3.121.98-4222013-248-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8148/e837dda5/10.3.121.94-4222013-249-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8148/e837dda5/10.3.121.95-4222013-250-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8148/e837dda5/10.3.121.96-4222013-251-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8148/e837dda5/10.3.121.97-4222013-251-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8148/e837dda5/10.3.121.92-4222013-252-diag.zip

        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        memcached on the master node got killed by someone:

        2013-04-22 01:34:58.191 ns_log:0:info:message(ns_1@10.3.121.93) - Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 137. Restarting.
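The exit status 137 in the log line above is what supports the "killed" reading. Under the common Unix convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9. The sketch below assumes that convention applies to the status reported here; `describe_exit_status` is a hypothetical helper, not an ns_server function:

```python
import signal

def describe_exit_status(status):
    """Interpret an exit status using the common 128 + N convention for signals."""
    if status > 128:
        signum = status - 128
        try:
            # Resolve the platform's name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited normally with code {status}"

print(describe_exit_status(137))
```

On Linux, signal 9 is SIGKILL, which cannot be caught by the process. It is typically sent by an explicit `kill -9` or by the kernel's out-of-memory killer; the status alone does not say which.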

        maria Maria McDuff (Inactive) added a comment -

        andrei, we can close this.


          People

          • Assignee:
            andreibaranouski Andrei Baranouski
          • Reporter:
            andreibaranouski Andrei Baranouski