Couchbase Server / MB-7508

Rebalance exited with reason {{bulk_set_vbucket_state_failed, ...}} after crasher in new_ns_replicas_builder (rebalance continued after killing memcached)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: couchbase-bucket, ns_server
    • Security Level: Public
    • Labels: None

      Description

      build 2.0.1-120-rel
      http://qa.hq.northscale.net/job/centos-32-2.0-swaprebalance-tests/250/consoleFull
      /testrunner -i /tmp/swaprebalance-cent-32.ini get-logs=True,GROUP=P0 -t swaprebalance.SwapRebalanceFailedTests.test_failed_swap_rebalance,replica=1,num-buckets=4,num-swap=3,swap-orchestrator=True,GROUP=P0

      steps:
      1. rebalance params: password=password&ejectedNodes=ns_1%4010.3.2.146%2Cns_1%4010.3.2.152%2Cns_1%4010.3.2.145&user=Administrator&knownNodes=ns_1%4010.3.2.146%2Cns_1%4010.3.2.147%2Cns_1%4010.3.2.149%2Cns_1%4010.3.2.148%2Cns_1%4010.3.2.145%2Cns_1%4010.3.2.152
      2. kill memcached when rebalance reached >40%
      [2013-01-08 13:47:49,441] - [rest_client:881] INFO - rebalance percentage : 40.0627668879 %
      [2013-01-08 13:47:51,444] - [rest_client:73] INFO - rebalance reached >40.0627668879% in 663.750247002 seconds
      [2013-01-08 13:47:51,971] - [remote_util:116] INFO - connecting to 10.3.2.149 with username : root password : couchbase ssh_key:
      [2013-01-08 13:47:52,196] - [remote_util:148] INFO - Connected
      [2013-01-08 13:47:52,557] - [remote_util:1128] INFO - running command.raw sudo cat /proc/cpuinfo
      [2013-01-08 13:47:52,657] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:52,658] - [remote_util:1128] INFO - running command.raw sudo df -Th
      [2013-01-08 13:47:52,774] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:52,775] - [remote_util:1128] INFO - running command.raw sudo cat /proc/meminfo
      [2013-01-08 13:47:52,882] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:52,883] - [remote_util:1128] INFO - running command.raw hostname
      [2013-01-08 13:47:52,991] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:52,993] - [remote_util:1128] INFO - running command.raw ps -eo comm,pid | awk '$1 == "memcached" { print $2 }'
      [2013-01-08 13:47:53,107] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:53,108] - [swaprebalance:401] INFO - os:cmd("kill -9 3663 ")
      [2013-01-08 13:47:53,130] - [rest_client:803] INFO - /diag/eval status: True content: [] command: os:cmd("kill -9 3663 ")
      [2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]')
      [2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached
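      The kill step above has two halves: the harness first parses `ps -eo comm,pid` output for the memcached PID (the awk filter in the log), then runs `kill -9 <pid>` on the node through the REST `/diag/eval` endpoint, which is why the kill shows up under rest_client. A minimal sketch of the PID-extraction half (hypothetical helper name, not the actual swaprebalance/remote_util code):

      ```python
      def memcached_pids(ps_output):
          """Return PIDs whose command name is exactly 'memcached',
          mirroring: ps -eo comm,pid | awk '$1 == "memcached" { print $2 }'."""
          pids = []
          for line in ps_output.splitlines():
              fields = line.split()
              # awk compares only the first field, so 'memcachedtest' etc. is skipped
              if len(fields) >= 2 and fields[0] == "memcached" and fields[1].isdigit():
                  pids.append(int(fields[1]))
          return pids
      ```

      The exact-match comparison on the first field matters: a substring match would also kill unrelated processes whose names merely contain "memcached".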
      ...
      [2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758
      [2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running
      [2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached
      [2013-01-08 13:49:05,069] - [swaprebalance:379] INFO - FAIL SWAP REBALANCE PHASE @ 60
      [2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 %
      ...
      [2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 %
      [2013-01-08 13:52:36,597] - [rest_client:866] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed

      [ns_server:info,2013-01-08T12:32:23.377,ns_1@10.3.2.145:<0.32321.1>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.2.145': [<<"replication_building_138_'ns_1@10.3.2.147'">>,
      <<"replication_building_138_'ns_1@10.3.2.149'">>]
      [error_logger:error,2013-01-08T12:32:23.377,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.32321.1> terminating

        ** Last message in was {'EXIT',<20175.7759.0>,{badmatch,{error,closed}}}
        ** When Server state == {state,"default",138,'ns_1@10.3.2.145',
                                 [{'ns_1@10.3.2.147',<20177.4979.0>},
                                  {'ns_1@10.3.2.149',<20175.7759.0>}]}
        ** Reason for termination ==
        ** {{badmatch,[{<20177.4979.0>,{badmatch,{error,closed}}},
                       {<20175.7759.0>,noproc}]},
            [{misc,sync_shutdown_many_i_am_trapping_exits,1},
             {misc,try_with_maybe_ignorant_after,2},
             {gen_server,terminate,6},
             {proc_lib,init_p_do_apply,3}]}

          [ns_server:error,2013-01-08T12:32:23.378,ns_1@10.3.2.145:<0.32285.1>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed:
          [{<0.32321.1>,
            {{badmatch,[{<20177.4979.0>,{badmatch,{error,closed}}},
                        {<20175.7759.0>,noproc}]},
             [{misc,sync_shutdown_many_i_am_trapping_exits,1},
              {misc,try_with_maybe_ignorant_after,2},
              {gen_server,terminate,6},
              {proc_lib,init_p_do_apply,3}]}}]
          [ns_server:error,2013-01-08T12:32:23.379,ns_1@10.3.2.145:<0.32285.1>:misc:try_with_maybe_ignorant_after:1444]Eating exception from ignorant after-block:
          {error,{badmatch,[{<0.32321.1>,
                             {{badmatch,[{<20177.4979.0>,{badmatch,{error,closed}}},
                                         {<20175.7759.0>,noproc}]},
                              [{misc,sync_shutdown_many_i_am_trapping_exits,1},
                               {misc,try_with_maybe_ignorant_after,2},
                               {gen_server,terminate,6},
                               {proc_lib,init_p_do_apply,3}]}}]},
           [{misc,sync_shutdown_many_i_am_trapping_exits,1},
            {misc,try_with_maybe_ignorant_after,2},
            {ns_single_vbucket_mover,mover,6},
            {proc_lib,init_p_do_apply,3}]}
          [error_logger:error,2013-01-08T12:32:23.379,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: new_ns_replicas_builder:init/1
          pid: <0.32321.1>
          registered_name: []
          exception exit: {{badmatch,[{<20177.4979.0>,{badmatch,{error,closed}}},
                                      {<20175.7759.0>,noproc}]},
                           [{misc,sync_shutdown_many_i_am_trapping_exits,1},
                            {misc,try_with_maybe_ignorant_after,2},
                            {gen_server,terminate,6},
                            {proc_lib,init_p_do_apply,3}]}
          in function gen_server:terminate/6
          ancestors: [<0.32285.1>,<0.13837.1>,<0.13752.1>]
          messages: [{'EXIT',<0.32285.1>,shutdown}]
          links: [<0.32285.1>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 196418
          stack_size: 24
          reductions: 24429
          neighbours:

          [views:debug,2013-01-08T12:32:23.379,ns_1@10.3.2.145:mc_couch_events<0.425.0>:capi_set_view_manager:handle_mc_couch_event:529]Got set_vbucket event for default/438. Updated state: dead (1)
          [ns_server:info,2013-01-08T12:32:23.383,ns_1@10.3.2.145:janitor_agent-default<0.643.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
          [user:info,2013-01-08T12:32:23.383,ns_1@10.3.2.145:<0.2801.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {{bulk_set_vbucket_state_failed,
            [{'ns_1@10.3.2.146',
              {'EXIT',
               {{{{unexpected_reason,
                   {{badmatch,{error,closed}},
                    [{mc_binary,quick_stats_recv,3},
                     {mc_binary,quick_stats_loop,5},
                     {mc_binary,quick_stats,5},
                     {mc_client_binary,get_zero_open_checkpoint_vbuckets,3},
                     {ebucketmigrator_srv,handle_call,3},
                     {gen_server,handle_msg,5},
                     {proc_lib,init_p_do_apply,3}]}},
                  [{misc,executing_on_new_process,1},
                   {tap_replication_manager,change_vbucket_filter,4},
                   {tap_replication_manager,'-do_set_incoming_replication_map/3-lc$^5/1-5-',2},
                   {tap_replication_manager,do_set_incoming_replication_map,3},
                   {tap_replication_manager,handle_call,3},
                   {gen_server,handle_msg,5},
                   {proc_lib,init_p_do_apply,3}]},
                 {gen_server,call,
                  ['tap_replication_manager-default',
                   {change_vbucket_replication,139,undefined},
                   infinity]}},
                {gen_server,call,
                 [{'janitor_agent-default','ns_1@10.3.2.146'},
                  {if_rebalance,<0.13837.1>,
                   {update_vbucket_state,139,replica,undefined,undefined}},
                  infinity]}}}}]},
           [{janitor_agent,bulk_set_vbucket_state,4},
            {ns_vbucket_mover,update_replication_post_move,3},
            {ns_vbucket_mover,on_move_done,2},
            {gen_server,handle_msg,5},
            {proc_lib,init_p_do_apply,3}]}

          [ns_server:debug,2013-01-08T12:32:23.384,ns_1@10.3.2.145:compaction_daemon<0.467.0>:compaction_daemon:handle_info:367]Looks like vbucket mover inhibiting view compaction for for bucket "default" is dead. Canceling inhibition
          [ns_server:debug,2013-01-08T12:32:23.383,ns_1@10.3.2.145:<0.13845.1>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.13837.1>} exited with reason {{bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.2.146',
          {'EXIT',
          {{{{unexpected_reason,
          {{badmatch,
          {error,
          closed}},
          [{mc_binary, quick_stats_recv, 3},
          {mc_binary, quick_stats_loop, 5},
          {mc_binary, quick_stats, 5},
          {mc_client_binary, get_zero_open_checkpoint_vbuckets, 3},
          {ebucketmigrator_srv, handle_call, 3},
          {gen_server, handle_msg, 5},
          {proc_lib, init_p_do_apply, 3}]}},
          [{misc, executing_on_new_process, 1},
          {tap_replication_manager, change_vbucket_filter, 4},
          {tap_replication_manager, '-do_set_incoming_replication_map/3-lc$^5/1-5-', 2},
          {tap_replication_manager, do_set_incoming_replication_map, 3},
          {tap_replication_manager, handle_call, 3},
          {gen_server, handle_msg, 5},
          {proc_lib, init_p_do_apply, 3}]},
          {gen_server,
          call,
          ['tap_replication_manager-default',
          {change_vbucket_replication, 139, undefined},
          infinity]}},
          {gen_server,
          call,
          [{'janitor_agent-default', 'ns_1@10.3.2.146'},
          {if_rebalance,
          <0.13837.1>,
          {update_vbucket_state,
          139,
          replica,
          undefined,
          undefined}},
          infinity]}}}}]},
          [{janitor_agent, bulk_set_vbucket_state, 4},
          {ns_vbucket_mover, update_replication_post_move, 3},
          {ns_vbucket_mover, on_move_done, 2},
          {gen_server, handle_msg, 5},
          {proc_lib, init_p_do_apply, 3}]}
          [ns_server:debug,2013-01-08T12:32:23.384,ns_1@10.3.2.145:<0.32356.1>:mc_connection:do_notify_vbucket_update:112]Signaled mc_couch_event: {set_vbucket,"default",438,dead,1}
          [ns_server:debug,2013-01-08T12:32:23.387,ns_1@10.3.2.145:capi_set_view_manager-default<0.617.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
          [error_logger:error,2013-01-08T12:32:23.387,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: ns_single_vbucket_mover:mover/6
          pid: <0.32285.1>
          registered_name: []
          exception exit: {unexpected_exit,
          {'EXIT',<0.13837.1>,
          {{bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.2.146',
          {'EXIT',
          {{{{unexpected_reason,
              {{badmatch,{error,closed}},
               [{mc_binary,quick_stats_recv,3},
                {mc_binary,quick_stats_loop,5},
                {mc_binary,quick_stats,5},
                {mc_client_binary,get_zero_open_checkpoint_vbuckets,3},
                {ebucketmigrator_srv,handle_call,3},
                {gen_server,handle_msg,5},
                {proc_lib,init_p_do_apply,3}]}},
             [{misc,executing_on_new_process,1},
              {tap_replication_manager,change_vbucket_filter,4},
              {tap_replication_manager,'-do_set_incoming_replication_map/3-lc$^5/1-5-',2},
              {tap_replication_manager,do_set_incoming_replication_map,3},
              {tap_replication_manager,handle_call,3},
              {gen_server,handle_msg,5},
              {proc_lib,init_p_do_apply,3}]},
          {gen_server,call,
          ['tap_replication_manager-default',
          {change_vbucket_replication,139,undefined},
          infinity]}},
          {gen_server,call,
          [{'janitor_agent-default','ns_1@10.3.2.146'},
          {if_rebalance,<0.13837.1>,
          {update_vbucket_state,139,replica,undefined,
          undefined}},
          infinity]}}}}]},
          [{janitor_agent,bulk_set_vbucket_state,4},
          {ns_vbucket_mover,update_replication_post_move,3},
          {ns_vbucket_mover,on_move_done,2},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}}}
          in function ns_single_vbucket_mover:spawn_and_wait/1
          in call from ns_single_vbucket_mover:mover_inner/6
          in call from misc:try_with_maybe_ignorant_after/2
          in call from ns_single_vbucket_mover:mover/6
          ancestors: [<0.13837.1>,<0.13752.1>]
          messages: [{'EXIT',<0.13837.1>,
          {{bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.2.146',
          {'EXIT',
          {{{{unexpected_reason,
              {{badmatch,{error,closed}},
               [{mc_binary,quick_stats_recv,3},
                {mc_binary,quick_stats_loop,5},
                {mc_binary,quick_stats,5},
                {mc_client_binary,get_zero_open_checkpoint_vbuckets,3},
                {ebucketmigrator_srv,handle_call,3},
                {gen_server,handle_msg,5},
                {proc_lib,init_p_do_apply,3}]}},
             [{misc,executing_on_new_process,1},
              {tap_replication_manager,change_vbucket_filter,4},
              {tap_replication_manager,'-do_set_incoming_replication_map/3-lc$^5/1-5-',2},
              {tap_replication_manager,do_set_incoming_replication_map,3},
              {tap_replication_manager,handle_call,3},
              {gen_server,handle_msg,5},
              {proc_lib,init_p_do_apply,3}]},
          {gen_server,call,
          ['tap_replication_manager-default',
          {change_vbucket_replication,139,undefined},
          infinity]}},
          {gen_server,call,
          [{'janitor_agent-default','ns_1@10.3.2.146'},
          {if_rebalance,<0.13837.1>,
          {update_vbucket_state,139,replica,undefined,
          undefined}},
          infinity]}}}}]},
          [{janitor_agent,bulk_set_vbucket_state,4},
          {ns_vbucket_mover,update_replication_post_move,3},
          {ns_vbucket_mover,on_move_done,2},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}}]
          links: [<0.13837.1>,<0.32352.1>]
          dictionary: [{cleanup_list,[<0.32321.1>,<0.32322.1>]}]
          trap_exit: true
          status: running
          heap_size: 6765
          stack_size: 24
          reductions: 9680
          neighbours:


        Activity

        farshid Farshid Ghods (Inactive) added a comment -

        >>> 3) 'If the rebalance status is not changed even after an hour' - do you mean it is stuck?

        Yes. I think we already have code in monitorRebalance that detects when rebalanceProgress is not changing, declares the rebalance stuck, and fails it.
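        The stuck-rebalance check described in this comment can be sketched as follows. This is a minimal illustration, not the actual monitorRebalance code: `get_progress` is an assumed callable that polls the REST rebalance-progress API and returns a percentage.

        ```python
        import time

        def wait_for_rebalance(get_progress, timeout=3600, poll_interval=10):
            """Poll rebalance progress; fail if it stops changing for `timeout` seconds.

            get_progress: assumed callable returning the current progress as a float
            percentage (0.0-100.0). Raises RuntimeError when progress is stuck.
            """
            last_progress = None
            last_change = time.time()
            while True:
                progress = get_progress()
                if progress >= 100.0:
                    return True
                if progress != last_progress:
                    # progress moved: reset the stuck timer
                    last_progress = progress
                    last_change = time.time()
                elif time.time() - last_change > timeout:
                    raise RuntimeError("rebalance stuck at %.2f%%" % progress)
                time.sleep(poll_interval)
        ```

        The key design point is that the timer resets on any change in the reported percentage, so a slow-but-moving rebalance is never declared stuck; only a flat progress value for the full window fails the test.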
        andreibaranouski Andrei Baranouski added a comment -

        test was changed http://review.couchbase.org/#/c/23987
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        please close the ticket if test passes after changes are merged

        Show
        farshid Farshid Ghods (Inactive) added a comment - please close the ticket if test passes after changes are merged
        andreibaranouski Andrei Baranouski added a comment -

        all tests passed http://qa.hq.northscale.net/view/2.0.1/job/centos-32-2.0-swaprebalance-tests/263/

          People

          • Assignee:
            andreibaranouski Andrei Baranouski
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes:
            0
          • Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes