Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-7508

Rebalance exited with reason {{bulk_set_vbucket_state_failed after crasher in new_ns_replicas_builder(rebalance continued after killing memcached)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: couchbase-bucket, ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      build 2.0.1-120-rel
      http://qa.hq.northscale.net/job/centos-32-2.0-swaprebalance-tests/250/consoleFull
      /testrunner -i /tmp/swaprebalance-cent-32.ini get-logs=True,GROUP=P0 -t swaprebalance.SwapRebalanceFailedTests.test_failed_swap_rebalance,replica=1,num-buckets=4,num-swap=3,swap-orchestrator=True,GROUP=P0

      steps:
      1.rebalance params : password=password&ejectedNodes=ns_1%4010.3.2.146%2Cns_1%4010.3.2.152%2Cns_1%4010.3.2.145&user=Administrator&knownNodes=ns_1%4010.3.2.146%2Cns_1%4010.3.2.147%2Cns_1%4010.3.2.149%2Cns_1%4010.3.2.148%2Cns_1%4010.3.2.145%2Cns_1%4010.3.2.152
      2. kill memcached when rebalance reached >40%
      [2013-01-08 13:47:49,441] - [rest_client:881] INFO - rebalance percentage : 40.0627668879 %
      [2013-01-08 13:47:51,444] - [rest_client:73] INFO - rebalance reached >40.0627668879% in 663.750247002 seconds
      [2013-01-08 13:47:51,971] - [remote_util:116] INFO - connecting to 10.3.2.149 with username : root password : couchbase ssh_key:
      [2013-01-08 13:47:52,196] - [remote_util:148] INFO - Connected
      [2013-01-08 13:47:52,557] - [remote_util:1128] INFO - running command.raw sudo cat /proc/cpuinfo
      [2013-01-08 13:47:52,657] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:52,658] - [remote_util:1128] INFO - running command.raw sudo df -Th
      [2013-01-08 13:47:52,774] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:52,775] - [remote_util:1128] INFO - running command.raw sudo cat /proc/meminfo
      [2013-01-08 13:47:52,882] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:52,883] - [remote_util:1128] INFO - running command.raw hostname
      [2013-01-08 13:47:52,991] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:52,993] - [remote_util:1128] INFO - running command.raw ps -eo comm,pid | awk '$1 == "memcached"

      { print $2 }

      '
      [2013-01-08 13:47:53,107] - [remote_util:1157] INFO - command executed successfully
      [2013-01-08 13:47:53,108] - [swaprebalance:401] INFO - os:cmd("kill -9 3663 ")
      [2013-01-08 13:47:53,130] - [rest_client:803] INFO - /diag/eval status: True content: [] command: os:cmd("kill -9 3663 ")
      [2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]')
      [2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached
      ...
      2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758
      [2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running
      [2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached
      [2013-01-08 13:49:05,069] - [swaprebalance:379] INFO - FAIL SWAP REBALANCE PHASE @ 60
      [2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 %
      ...
      [2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 %
      [2013-01-08 13:52:36,597] - [rest_client:866] ERROR -

      {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

      - rebalance failed

      [ns_server:info,2013-01-08T12:32:23.377,ns_1@10.3.2.145:<0.32321.1>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.2.145': [<<"replication_building_138_'ns_1@10.3.2.147'">>,
      <<"replication_building_138_'ns_1@10.3.2.149'">>]
      [error_logger:error,2013-01-08T12:32:23.377,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.32321.1> terminating

        • Last message in was {'EXIT',<20175.7759.0>,{badmatch, {error,closed}}}
          ** When Server state == {state,"default",138,'ns_1@10.3.2.145',
          [{'ns_1@10.3.2.147',<20177.4979.0>},
          {'ns_1@10.3.2.149',<20175.7759.0>}]}
          ** Reason for termination ==
          ** {{badmatch,[{<20177.4979.0>,{badmatch,{error,closed}

          }},

          {<20175.7759.0>,noproc}]},
          [{misc,sync_shutdown_many_i_am_trapping_exits,1},
          {misc,try_with_maybe_ignorant_after,2},
          {gen_server,terminate,6},
          {proc_lib,init_p_do_apply,3}]}

          [ns_server:error,2013-01-08T12:32:23.378,ns_1@10.3.2.145:<0.32285.1>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<0.32321.1>,
          {{badmatch,
          [{<20177.4979.0>,
          {badmatch,{error,closed}}},
          {<20175.7759.0>,noproc}

          ]},
          [

          {misc, sync_shutdown_many_i_am_trapping_exits, 1}

          ,

          {misc,try_with_maybe_ignorant_after,2},
          {gen_server,terminate,6},
          {proc_lib,init_p_do_apply,3}]}}]
          [ns_server:error,2013-01-08T12:32:23.379,ns_1@10.3.2.145:<0.32285.1>:misc:try_with_maybe_ignorant_after:1444]Eating exception from ignorant after-block:
          {error,{badmatch,[{<0.32321.1>,
          badmatch,[{<20177.4979.0>,{badmatch,{error,closed},
          {<20175.7759.0>,noproc}]},
          [{misc,sync_shutdown_many_i_am_trapping_exits,1},
          {misc,try_with_maybe_ignorant_after,2}

          ,

          {gen_server,terminate,6},
          {proc_lib,init_p_do_apply,3}]}}]},
          [{misc,sync_shutdown_many_i_am_trapping_exits,1},
          {misc,try_with_maybe_ignorant_after,2},
          {ns_single_vbucket_mover,mover,6},
          {proc_lib,init_p_do_apply,3}]}
          [error_logger:error,2013-01-08T12:32:23.379,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: new_ns_replicas_builder:init/1
          pid: <0.32321.1>
          registered_name: []
          exception exit: badmatch,[{<20177.4979.0>,{badmatch,{error,closed},
          {<20175.7759.0>,noproc}]},
          [{misc,sync_shutdown_many_i_am_trapping_exits,1},
          {misc,try_with_maybe_ignorant_after,2},
          {gen_server,terminate,6}

          ,

          {proc_lib,init_p_do_apply,3}]}
          in function gen_server:terminate/6
          ancestors: [<0.32285.1>,<0.13837.1>,<0.13752.1>]
          messages: [{'EXIT',<0.32285.1>,shutdown}]
          links: [<0.32285.1>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 196418
          stack_size: 24
          reductions: 24429
          neighbours:

          [views:debug,2013-01-08T12:32:23.379,ns_1@10.3.2.145:mc_couch_events<0.425.0>:capi_set_view_manager:handle_mc_couch_event:529]Got set_vbucket event for default/438. Updated state: dead (1)
          [ns_server:info,2013-01-08T12:32:23.383,ns_1@10.3.2.145:janitor_agent-default<0.643.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
          [user:info,2013-01-08T12:32:23.383,ns_1@10.3.2.145:<0.2801.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {{bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.2.146',
          {'EXIT',
          {{{{unexpected_reason,
          badmatch,{error,closed,
          [{mc_binary,quick_stats_recv,3},
          {mc_binary,quick_stats_loop,5},
          {mc_binary,quick_stats,5},
          {mc_client_binary, get_zero_open_checkpoint_vbuckets,3},
          {ebucketmigrator_srv,handle_call,3},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}

          ]}},
          [

          {misc,executing_on_new_process,1},
          {tap_replication_manager, change_vbucket_filter,4},
          {tap_replication_manager, '-do_set_incoming_replication_map/3-lc$^5/1-5-', 2},
          {tap_replication_manager, do_set_incoming_replication_map,3},
          {tap_replication_manager,handle_call,3},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]},
          {gen_server,call,
          ['tap_replication_manager-default',
          {change_vbucket_replication,139, undefined},
          infinity]}},
          {gen_server,call,
          [{'janitor_agent-default', 'ns_1@10.3.2.146'},
          {if_rebalance,<0.13837.1>,
          {update_vbucket_state,139,replica,
          undefined,undefined}},
          infinity]}}}}]},
          [{janitor_agent,bulk_set_vbucket_state,4},
          {ns_vbucket_mover, update_replication_post_move,3},
          {ns_vbucket_mover,on_move_done,2},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}

          [ns_server:debug,2013-01-08T12:32:23.384,ns_1@10.3.2.145:compaction_daemon<0.467.0>:compaction_daemon:handle_info:367]Looks like vbucket mover inhibiting view compaction for for bucket "default" is dead. Canceling inhibition
          [ns_server:debug,2013-01-08T12:32:23.383,ns_1@10.3.2.145:<0.13845.1>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.13837.1>} exited with reason {{bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.2.146',
          {'EXIT',
          {{{{unexpected_reason,
          {{badmatch,
          {error,
          closed}},
          [{mc_binary, quick_stats_recv, 3},
          {mc_binary, quick_stats_loop, 5},
          {mc_binary, quick_stats, 5},
          {mc_client_binary, get_zero_open_checkpoint_vbuckets, 3},
          {ebucketmigrator_srv, handle_call, 3},
          {gen_server, handle_msg, 5},
          {proc_lib, init_p_do_apply, 3}]}},
          [{misc, executing_on_new_process, 1},
          {tap_replication_manager, change_vbucket_filter, 4},
          {tap_replication_manager, '-do_set_incoming_replication_map/3-lc$^5/1-5-', 2},
          {tap_replication_manager, do_set_incoming_replication_map, 3},
          {tap_replication_manager, handle_call, 3},
          {gen_server, handle_msg, 5},
          {proc_lib, init_p_do_apply, 3}]},
          {gen_server,
          call,
          ['tap_replication_manager-default',
          {change_vbucket_replication, 139, undefined},
          infinity]}},
          {gen_server,
          call,
          [{'janitor_agent-default', 'ns_1@10.3.2.146'},
          {if_rebalance,
          <0.13837.1>,
          {update_vbucket_state,
          139,
          replica,
          undefined,
          undefined}},
          infinity]}}}}]},
          [{janitor_agent, bulk_set_vbucket_state, 4},
          {ns_vbucket_mover, update_replication_post_move, 3},
          {ns_vbucket_mover, on_move_done, 2},
          {gen_server, handle_msg, 5},
          {proc_lib, init_p_do_apply, 3}]}
          [ns_server:debug,2013-01-08T12:32:23.384,ns_1@10.3.2.145:<0.32356.1>:mc_connection:do_notify_vbucket_update:112]Signaled mc_couch_event: {set_vbucket,"default",438,dead,1}
          [ns_server:debug,2013-01-08T12:32:23.387,ns_1@10.3.2.145:capi_set_view_manager-default<0.617.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
          [error_logger:error,2013-01-08T12:32:23.387,ns_1@10.3.2.145:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: ns_single_vbucket_mover:mover/6
          pid: <0.32285.1>
          registered_name: []
          exception exit: {unexpected_exit,
          {'EXIT',<0.13837.1>,
          {{bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.2.146',
          {'EXIT',
          {{{{unexpected_reason,
          badmatch,{error,closed,
          [{mc_binary,quick_stats_recv,3},
          {mc_binary,quick_stats_loop,5},
          {mc_binary,quick_stats,5},
          {mc_client_binary, get_zero_open_checkpoint_vbuckets,3},
          {ebucketmigrator_srv,handle_call,3},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}},
          [{misc,executing_on_new_process,1}

          ,

          {tap_replication_manager, change_vbucket_filter,4}

          ,

          {tap_replication_manager, '-do_set_incoming_replication_map/3-lc$^5/1-5-', 2}

          ,

          {tap_replication_manager, do_set_incoming_replication_map,3}

          ,

          {tap_replication_manager,handle_call,3},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]},
          {gen_server,call,
          ['tap_replication_manager-default',
          {change_vbucket_replication,139,undefined},
          infinity]}},
          {gen_server,call,
          [{'janitor_agent-default','ns_1@10.3.2.146'},
          {if_rebalance,<0.13837.1>,
          {update_vbucket_state,139,replica,undefined,
          undefined}},
          infinity]}}}}]},
          [{janitor_agent,bulk_set_vbucket_state,4},
          {ns_vbucket_mover,update_replication_post_move,3},
          {ns_vbucket_mover,on_move_done,2},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}}}
          in function ns_single_vbucket_mover:spawn_and_wait/1
          in call from ns_single_vbucket_mover:mover_inner/6
          in call from misc:try_with_maybe_ignorant_after/2
          in call from ns_single_vbucket_mover:mover/6
          ancestors: [<0.13837.1>,<0.13752.1>]
          messages: [{'EXIT',<0.13837.1>,
          {{bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.2.146',
          {'EXIT',
          {{{{unexpected_reason,
          badmatch,{error,closed,
          [{mc_binary,quick_stats_recv,3},
          {mc_binary,quick_stats_loop,5},
          {mc_binary,quick_stats,5},
          {mc_client_binary, get_zero_open_checkpoint_vbuckets,3},
          {ebucketmigrator_srv,handle_call,3},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}},
          [{misc,executing_on_new_process,1},
          {tap_replication_manager,change_vbucket_filter,4},
          {tap_replication_manager, '-do_set_incoming_replication_map/3-lc$^5/1-5-', 2},
          {tap_replication_manager, do_set_incoming_replication_map,3},
          {tap_replication_manager,handle_call,3}

          ,

          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]},
          {gen_server,call,
          ['tap_replication_manager-default',
          {change_vbucket_replication,139,undefined},
          infinity]}},
          {gen_server,call,
          [{'janitor_agent-default','ns_1@10.3.2.146'},
          {if_rebalance,<0.13837.1>,
          {update_vbucket_state,139,replica,undefined,
          undefined}},
          infinity]}}}}]},
          [{janitor_agent,bulk_set_vbucket_state,4},
          {ns_vbucket_mover,update_replication_post_move,3},
          {ns_vbucket_mover,on_move_done,2},
          {gen_server,handle_msg,5}

          ,

          {proc_lib,init_p_do_apply,3}

          ]}}]
          links: [<0.13837.1>,<0.32352.1>]
          dictionary: [

          {cleanup_list,[<0.32321.1>,<0.32322.1>]}

          ]
          trap_exit: true
          status: running
          heap_size: 6765
          stack_size: 24
          reductions: 9680
          neighbours:

      No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

        andreibaranouski Andrei Baranouski created issue -
        Hide
        andreibaranouski Andrei Baranouski added a comment -

        no memcached crash

        Show
        andreibaranouski Andrei Baranouski added a comment - no memcached crash
        Show
        andreibaranouski Andrei Baranouski added a comment - - edited https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.145-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.146-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.147-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.148-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.149-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7508/d662d812-0e7b-4a5d-bda0-204208588e75-10.3.2.152-diag.txt.gz
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Indeed no memcached crash, but errors indicate that memcached dropped connection.

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - Indeed no memcached crash, but errors indicate that memcached dropped connection.
        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Field Original Value New Value
        Assignee Aleksey Kondratenko [ alkondratenko ] Chiyoung Seo [ chiyoung ]
        Component/s couchbase-bucket [ 10173 ]
        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.0.1 [ 10399 ]
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        test code :

        for i in [1, 2, 3]:
        expected_progress = 20 * i
        self.log.info("FAIL SWAP REBALANCE PHASE @

        {0}".format(expected_progress))
        RestHelper(rest).rebalance_reached(expected_progress)
        bucket = rest.get_buckets()[0].name
        pid = None
        if self.swap_orchestrator:
        # get PID via remote connection if master is a new node
        shell = RemoteMachineShellConnection(master)
        o, _ = shell.execute_command("ps -eo comm,pid | awk '$1 == \"memcached\" { print $2 }'")
        pid = o[0]
        shell.disconnect()
        else:
        for i in xrange(2):
        try:
        _mc = MemcachedClientHelper.direct_client(master, bucket)
        pid = _mc.stats()["pid"]
        break
        except EOFError as e:
        self.log.error("{0}

        .Retry in 2 sec".format(e))
        time.sleep(1)
        if pid is None:
        self.fail("impossible to get a PID")
        command = "os:cmd(\"kill -9

        {0} \")".format(pid)
        self.log.info(command)
        killed = rest.diag_eval(command)
        self.log.info("killed {0}

        :

        {1}

        ??

        {2}

        ".format(master.ip, master.port, killed))
        self.log.info("sleep for 10 sec after kill memcached")
        time.sleep(10)

        1. we can't get stats for new node when rebalance falls
          if not self.swap_orchestrator:
          ClusterOperationHelper._wait_warmup_completed(self, [master], bucket, wait_time=600)
          i = 0
          #we expect that rebalance will be failed
          while rest._rebalance_progress_status() == "running" and i < 60:
          self.log.info("rebalance progress: {0}".format(rest._rebalance_progress()))
          time.sleep(1)
          i += 1
          self.log.info("rebalance progress status:{0}

          ".format(rest._rebalance_progress_status()))
          if rest._rebalance_progress_status() == "running":
          self.log.info("rebalance is still running even after restarting memcached")
          continue
          knownNodes = rest.node_statuses();
          self.log.info("nodes are still in cluster:

          {0}".format([(node.ip, node.port) for node in knownNodes]))
          ejectedNodes = list(set(optNodesIds) & set([node.id for node in knownNodes]))
          rest.rebalance(otpNodes=[node.id for node in knownNodes],
          ejectedNodes=ejectedNodes)

          self.assertTrue(rest.monitorRebalance(),
          msg="rebalance operation failed after adding node {0}

          ".format(toBeEjectedNodes))

        rebalance ,
        then kill a node
        wait until node is warmed up
        wait until rebalance fails because node is down

        then restart the rebalance.

        Show
        farshid Farshid Ghods (Inactive) added a comment - test code : for i in [1, 2, 3] : expected_progress = 20 * i self.log.info("FAIL SWAP REBALANCE PHASE @ {0}".format(expected_progress)) RestHelper(rest).rebalance_reached(expected_progress) bucket = rest.get_buckets() [0] .name pid = None if self.swap_orchestrator: # get PID via remote connection if master is a new node shell = RemoteMachineShellConnection(master) o, _ = shell.execute_command("ps -eo comm,pid | awk '$1 == \"memcached\" { print $2 }'") pid = o [0] shell.disconnect() else: for i in xrange(2): try: _mc = MemcachedClientHelper.direct_client(master, bucket) pid = _mc.stats() ["pid"] break except EOFError as e: self.log.error("{0} .Retry in 2 sec".format(e)) time.sleep(1) if pid is None: self.fail("impossible to get a PID") command = "os:cmd(\"kill -9 {0} \")".format(pid) self.log.info(command) killed = rest.diag_eval(command) self.log.info("killed {0} : {1} ?? {2} ".format(master.ip, master.port, killed)) self.log.info("sleep for 10 sec after kill memcached") time.sleep(10) we can't get stats for new node when rebalance falls if not self.swap_orchestrator: ClusterOperationHelper._wait_warmup_completed(self, [master] , bucket, wait_time=600) i = 0 #we expect that rebalance will be failed while rest._rebalance_progress_status() == "running" and i < 60: self.log.info("rebalance progress: {0}".format(rest._rebalance_progress())) time.sleep(1) i += 1 self.log.info("rebalance progress status:{0} ".format(rest._rebalance_progress_status())) if rest._rebalance_progress_status() == "running": self.log.info("rebalance is still running even after restarting memcached") continue knownNodes = rest.node_statuses(); self.log.info("nodes are still in cluster: {0}".format( [(node.ip, node.port) for node in knownNodes] )) ejectedNodes = list(set(optNodesIds) & set( [node.id for node in knownNodes] )) rest.rebalance(otpNodes= [node.id for node in knownNodes] , ejectedNodes=ejectedNodes) self.assertTrue(rest.monitorRebalance(), msg="rebalance operation failed after adding node {0} ".format(toBeEjectedNodes)) rebalance , then kill a node wait until node is warmed up wait until rebalance fails because node is down then restart the rebalance.
        Hide
        chiyoung Chiyoung Seo added a comment - - edited

        Andrei,

        Aliaskey A, Farshid, and I discussed the test case that you ran, and identified some stuffs that we can improve in the test case.

        Basically, we should wait enough until the first rebalance fails instead of waiting for 60 sec in three iterations.

        Therefore, the suggestion is

        1) Remove the for loop in the code

        2) Wait for one hour as the timeout for the first rebalance failure

        3) If the rebalance status is not changed even after an hour, print out the error message and exit the test case.

        Can you rerun the test case after adapting the test code?

        Thanks,

        Show
        chiyoung Chiyoung Seo added a comment - - edited Andrei, Aliaskey A, Farshid, and I discussed the test case that you ran, and identified some stuffs that we can improve in the test case. Basically, we should wait enough until the first rebalance fails instead of waiting for 60 sec in three iterations. Therefore, the suggestion is 1) Remove the for loop in the code 2) Wait for one hour as the timeout for the first rebalance failure 3) If the rebalance status is not changed even after an hour, print out the error message and exit the test case. Can you rerun the test case after adapting the test code? Thanks,
        Hide
        chiyoung Chiyoung Seo added a comment -

        See my comments above.

        Show
        chiyoung Chiyoung Seo added a comment - See my comments above.
        chiyoung Chiyoung Seo made changes -
        Assignee Chiyoung Seo [ chiyoung ] Andrei Baranouski [ andreibaranouski ]
        Hide
        andreibaranouski Andrei Baranouski added a comment -

        Hi,
        Let me first clarify what happened in this case

        1. started rebalance and when it reach > 40% we killed memcached on one of the nodes
        here we handle 2 situation: reblance failed and it continued without failure

        in our case, the second option worked:
        [2013-01-08 13:47:49,441] - [rest_client:881] INFO - rebalance percentage : 40.0627668879 %
        ...
        [2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]')
        [2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached
        ...
        2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758
        [2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running
        [2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached

        2. when rebalance reached > 60% we killed memcashed in second time, but this did not happen, and it fell before:

        [2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 %
        ...
        [2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 %
        [2013-01-08 13:52:36,597] - [rest_client:866] ERROR -

        {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

        - rebalance failed

        So we got to rebalance fell in ~5 minutes after we killed memcached(13:47:53-13:52:36), and all this time the progress rebalance changed(~40->50%)

        I would not want to change the logic of this test, so it works for a long time. In addition, I have a question about your recommendations:
        1) why I need to remove loop?
        2) why do we need to wait an hour for the first rebalance failure? In our case it turns out that we were waiting for the fall of about 5 minutes after the killing memcached
        3)'If the rebalance status is not changed even after an hour' - do you mean it stuck?

        all the logs for clarity

        [2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]')
        [2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached
        [2013-01-08 13:48:03,182] - [rest_client:881] INFO - rebalance percentage : 40.4774970989 %
        [2013-01-08 13:48:03,183] - [swaprebalance:412] INFO - rebalance progress: 40.4774970989
        [2013-01-08 13:48:04,223] - [rest_client:881] INFO - rebalance percentage : 40.5263729542 %
        [2013-01-08 13:48:04,224] - [swaprebalance:412] INFO - rebalance progress: 40.5263729542
        [2013-01-08 13:48:05,254] - [rest_client:881] INFO - rebalance percentage : 40.550739426 %
        [2013-01-08 13:48:05,254] - [swaprebalance:412] INFO - rebalance progress: 40.550739426
        [2013-01-08 13:48:06,289] - [rest_client:881] INFO - rebalance percentage : 40.5995438254 %
        [2013-01-08 13:48:06,289] - [swaprebalance:412] INFO - rebalance progress: 40.5995438254
        [2013-01-08 13:48:07,311] - [rest_client:881] INFO - rebalance percentage : 40.623981753 %
        [2013-01-08 13:48:07,311] - [swaprebalance:412] INFO - rebalance progress: 40.623981753
        [2013-01-08 13:48:08,331] - [rest_client:881] INFO - rebalance percentage : 40.6727861524 %
        [2013-01-08 13:48:08,331] - [swaprebalance:412] INFO - rebalance progress: 40.6727861524
        [2013-01-08 13:48:09,366] - [rest_client:881] INFO - rebalance percentage : 40.7216620077 %
        [2013-01-08 13:48:09,367] - [swaprebalance:412] INFO - rebalance progress: 40.7216620077
        [2013-01-08 13:48:10,389] - [rest_client:881] INFO - rebalance percentage : 40.7704664071 %
        [2013-01-08 13:48:10,390] - [swaprebalance:412] INFO - rebalance progress: 40.7704664071
        [2013-01-08 13:48:11,404] - [rest_client:881] INFO - rebalance percentage : 40.7949043348 %
        [2013-01-08 13:48:11,405] - [swaprebalance:412] INFO - rebalance progress: 40.7949043348
        [2013-01-08 13:48:12,446] - [rest_client:881] INFO - rebalance percentage : 40.8437087342 %
        [2013-01-08 13:48:12,447] - [swaprebalance:412] INFO - rebalance progress: 40.8437087342
        [2013-01-08 13:48:13,468] - [rest_client:881] INFO - rebalance percentage : 40.8681466619 %
        [2013-01-08 13:48:13,468] - [swaprebalance:412] INFO - rebalance progress: 40.8681466619
        [2013-01-08 13:48:14,560] - [rest_client:881] INFO - rebalance percentage : 40.9169510613 %
        [2013-01-08 13:48:14,560] - [swaprebalance:412] INFO - rebalance progress: 40.9169510613
        [2013-01-08 13:48:15,708] - [rest_client:881] INFO - rebalance percentage : 40.9413889889 %
        [2013-01-08 13:48:15,708] - [swaprebalance:412] INFO - rebalance progress: 40.9413889889
        [2013-01-08 13:48:16,722] - [rest_client:881] INFO - rebalance percentage : 40.9657554607 %
        [2013-01-08 13:48:16,723] - [swaprebalance:412] INFO - rebalance progress: 40.9657554607
        [2013-01-08 13:48:17,737] - [rest_client:881] INFO - rebalance percentage : 40.9901933883 %
        [2013-01-08 13:48:17,737] - [swaprebalance:412] INFO - rebalance progress: 40.9901933883
        [2013-01-08 13:48:18,762] - [rest_client:881] INFO - rebalance percentage : 41.0389977877 %
        [2013-01-08 13:48:18,763] - [swaprebalance:412] INFO - rebalance progress: 41.0389977877
        [2013-01-08 13:48:19,810] - [rest_client:881] INFO - rebalance percentage : 41.0634357154 %
        [2013-01-08 13:48:19,811] - [swaprebalance:412] INFO - rebalance progress: 41.0634357154
        [2013-01-08 13:48:20,833] - [rest_client:881] INFO - rebalance percentage : 41.1122401148 %
        [2013-01-08 13:48:20,833] - [swaprebalance:412] INFO - rebalance progress: 41.1122401148
        [2013-01-08 13:48:21,848] - [rest_client:881] INFO - rebalance percentage : 41.1366780425 %
        [2013-01-08 13:48:21,848] - [swaprebalance:412] INFO - rebalance progress: 41.1366780425
        [2013-01-08 13:48:22,864] - [rest_client:881] INFO - rebalance percentage : 41.1611159701 %
        [2013-01-08 13:48:22,865] - [swaprebalance:412] INFO - rebalance progress: 41.1611159701
        [2013-01-08 13:48:23,881] - [rest_client:881] INFO - rebalance percentage : 41.1854824418 %
        [2013-01-08 13:48:23,881] - [swaprebalance:412] INFO - rebalance progress: 41.1854824418
        [2013-01-08 13:48:24,897] - [rest_client:881] INFO - rebalance percentage : 41.2099203695 %
        [2013-01-08 13:48:24,897] - [swaprebalance:412] INFO - rebalance progress: 41.2099203695
        [2013-01-08 13:48:25,928] - [rest_client:881] INFO - rebalance percentage : 41.2343582972 %
        [2013-01-08 13:48:25,929] - [swaprebalance:412] INFO - rebalance progress: 41.2343582972
        [2013-01-08 13:48:26,959] - [rest_client:881] INFO - rebalance percentage : 41.2587247689 %
        [2013-01-08 13:48:26,959] - [swaprebalance:412] INFO - rebalance progress: 41.2587247689
        [2013-01-08 13:48:28,008] - [rest_client:881] INFO - rebalance percentage : 41.2831626966 %
        [2013-01-08 13:48:28,008] - [swaprebalance:412] INFO - rebalance progress: 41.2831626966
        [2013-01-08 13:48:29,054] - [rest_client:881] INFO - rebalance percentage : 41.331967096 %
        [2013-01-08 13:48:29,054] - [swaprebalance:412] INFO - rebalance progress: 41.331967096
        [2013-01-08 13:48:30,070] - [rest_client:881] INFO - rebalance percentage : 41.3564050236 %
        [2013-01-08 13:48:30,071] - [swaprebalance:412] INFO - rebalance progress: 41.3564050236
        [2013-01-08 13:48:31,094] - [rest_client:881] INFO - rebalance percentage : 41.3808429513 %
        [2013-01-08 13:48:31,094] - [swaprebalance:412] INFO - rebalance progress: 41.3808429513
        [2013-01-08 13:48:32,111] - [rest_client:881] INFO - rebalance percentage : 41.405280879 %
        [2013-01-08 13:48:32,112] - [swaprebalance:412] INFO - rebalance progress: 41.405280879
        [2013-01-08 13:48:33,133] - [rest_client:881] INFO - rebalance percentage : 41.4540852784 %
        [2013-01-08 13:48:33,133] - [swaprebalance:412] INFO - rebalance progress: 41.4540852784
        [2013-01-08 13:48:34,152] - [rest_client:881] INFO - rebalance percentage : 41.5028896778 %
        [2013-01-08 13:48:34,153] - [swaprebalance:412] INFO - rebalance progress: 41.5028896778
        [2013-01-08 13:48:35,197] - [rest_client:881] INFO - rebalance percentage : 41.5273276054 %
        [2013-01-08 13:48:35,197] - [swaprebalance:412] INFO - rebalance progress: 41.5273276054
        [2013-01-08 13:48:36,216] - [rest_client:881] INFO - rebalance percentage : 41.5517655331 %
        [2013-01-08 13:48:36,217] - [swaprebalance:412] INFO - rebalance progress: 41.5517655331
        [2013-01-08 13:48:37,231] - [rest_client:881] INFO - rebalance percentage : 41.6006413884 %
        [2013-01-08 13:48:37,232] - [swaprebalance:412] INFO - rebalance progress: 41.6006413884
        [2013-01-08 13:48:38,244] - [rest_client:881] INFO - rebalance percentage : 41.6250793161 %
        [2013-01-08 13:48:38,244] - [swaprebalance:412] INFO - rebalance progress: 41.6250793161
        [2013-01-08 13:48:39,262] - [rest_client:881] INFO - rebalance percentage : 41.6495172437 %
        [2013-01-08 13:48:39,263] - [swaprebalance:412] INFO - rebalance progress: 41.6495172437
        [2013-01-08 13:48:40,285] - [rest_client:881] INFO - rebalance percentage : 41.6983930991 %
        [2013-01-08 13:48:40,285] - [swaprebalance:412] INFO - rebalance progress: 41.6983930991
        [2013-01-08 13:48:41,302] - [rest_client:881] INFO - rebalance percentage : 41.7228310267 %
        [2013-01-08 13:48:41,302] - [swaprebalance:412] INFO - rebalance progress: 41.7228310267
        [2013-01-08 13:48:42,323] - [rest_client:881] INFO - rebalance percentage : 41.7472689544 %
        [2013-01-08 13:48:42,323] - [swaprebalance:412] INFO - rebalance progress: 41.7472689544
        [2013-01-08 13:48:43,341] - [rest_client:881] INFO - rebalance percentage : 41.7717068821 %
        [2013-01-08 13:48:43,341] - [swaprebalance:412] INFO - rebalance progress: 41.7717068821
        [2013-01-08 13:48:44,356] - [rest_client:881] INFO - rebalance percentage : 41.7961448097 %
        [2013-01-08 13:48:44,357] - [swaprebalance:412] INFO - rebalance progress: 41.7961448097
        [2013-01-08 13:48:45,372] - [rest_client:881] INFO - rebalance percentage : 41.8205827374 %
        [2013-01-08 13:48:45,373] - [swaprebalance:412] INFO - rebalance progress: 41.8205827374
        [2013-01-08 13:48:46,391] - [rest_client:881] INFO - rebalance percentage : 41.8450206651 %
        [2013-01-08 13:48:46,391] - [swaprebalance:412] INFO - rebalance progress: 41.8450206651
        [2013-01-08 13:48:47,445] - [rest_client:881] INFO - rebalance percentage : 41.8694585927 %
        [2013-01-08 13:48:47,446] - [swaprebalance:412] INFO - rebalance progress: 41.8694585927
        [2013-01-08 13:48:48,460] - [rest_client:881] INFO - rebalance percentage : 41.918334448 %
        [2013-01-08 13:48:48,460] - [swaprebalance:412] INFO - rebalance progress: 41.918334448
        [2013-01-08 13:48:49,485] - [rest_client:881] INFO - rebalance percentage : 41.9672103034 %
        [2013-01-08 13:48:49,485] - [swaprebalance:412] INFO - rebalance progress: 41.9672103034
        [2013-01-08 13:48:50,506] - [rest_client:881] INFO - rebalance percentage : 42.0160861587 %
        [2013-01-08 13:48:50,507] - [swaprebalance:412] INFO - rebalance progress: 42.0160861587
        [2013-01-08 13:48:51,522] - [rest_client:881] INFO - rebalance percentage : 42.0405240864 %
        [2013-01-08 13:48:51,523] - [swaprebalance:412] INFO - rebalance progress: 42.0405240864
        [2013-01-08 13:48:52,537] - [rest_client:881] INFO - rebalance percentage : 42.0893284858 %
        [2013-01-08 13:48:52,538] - [swaprebalance:412] INFO - rebalance progress: 42.0893284858
        [2013-01-08 13:48:53,552] - [rest_client:881] INFO - rebalance percentage : 42.1381328852 %
        [2013-01-08 13:48:53,553] - [swaprebalance:412] INFO - rebalance progress: 42.1381328852
        [2013-01-08 13:48:54,591] - [rest_client:881] INFO - rebalance percentage : 42.1870087405 %
        [2013-01-08 13:48:54,592] - [swaprebalance:412] INFO - rebalance progress: 42.1870087405
        [2013-01-08 13:48:55,709] - [rest_client:881] INFO - rebalance percentage : 42.2358131399 %
        [2013-01-08 13:48:55,710] - [swaprebalance:412] INFO - rebalance progress: 42.2358131399
        [2013-01-08 13:48:56,723] - [rest_client:881] INFO - rebalance percentage : 42.2602510676 %
        [2013-01-08 13:48:56,724] - [swaprebalance:412] INFO - rebalance progress: 42.2602510676
        [2013-01-08 13:48:57,746] - [rest_client:881] INFO - rebalance percentage : 42.309055467 %
        [2013-01-08 13:48:57,747] - [swaprebalance:412] INFO - rebalance progress: 42.309055467
        [2013-01-08 13:48:58,768] - [rest_client:881] INFO - rebalance percentage : 42.3579313223 %
        [2013-01-08 13:48:58,769] - [swaprebalance:412] INFO - rebalance progress: 42.3579313223
        [2013-01-08 13:48:59,821] - [rest_client:881] INFO - rebalance percentage : 42.382297794 %
        [2013-01-08 13:48:59,821] - [swaprebalance:412] INFO - rebalance progress: 42.382297794
        [2013-01-08 13:49:00,878] - [rest_client:881] INFO - rebalance percentage : 42.4311021934 %
        [2013-01-08 13:49:00,879] - [swaprebalance:412] INFO - rebalance progress: 42.4311021934
        [2013-01-08 13:49:01,925] - [rest_client:881] INFO - rebalance percentage : 42.4799780487 %
        [2013-01-08 13:49:01,926] - [swaprebalance:412] INFO - rebalance progress: 42.4799780487
        [2013-01-08 13:49:02,950] - [rest_client:881] INFO - rebalance percentage : 42.5043445205 %
        [2013-01-08 13:49:02,951] - [swaprebalance:412] INFO - rebalance progress: 42.5043445205
        [2013-01-08 13:49:04,036] - [rest_client:881] INFO - rebalance percentage : 42.5532203758 %
        [2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758
        [2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running
        [2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached
        [2013-01-08 13:49:05,069] - [swaprebalance:379] INFO - FAIL SWAP REBALANCE PHASE @ 60
        [2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 %
        [2013-01-08 13:49:07,088] - [rest_client:881] INFO - rebalance percentage : 42.6752671023 %
        [2013-01-08 13:49:09,134] - [rest_client:881] INFO - rebalance percentage : 42.772947357 %
        [2013-01-08 13:49:11,151] - [rest_client:881] INFO - rebalance percentage : 42.8218232123 %
        [2013-01-08 13:49:13,164] - [rest_client:881] INFO - rebalance percentage : 42.8949940834 %
        [2013-01-08 13:49:15,179] - [rest_client:881] INFO - rebalance percentage : 42.9682364105 %
        [2013-01-08 13:49:17,189] - [rest_client:881] INFO - rebalance percentage : 43.0414787376 %
        [2013-01-08 13:49:19,208] - [rest_client:881] INFO - rebalance percentage : 43.1390875364 %
        [2013-01-08 13:49:21,219] - [rest_client:881] INFO - rebalance percentage : 43.2124013194 %
        [2013-01-08 13:49:23,229] - [rest_client:881] INFO - rebalance percentage : 43.3100815741 %
        [2013-01-08 13:49:25,239] - [rest_client:881] INFO - rebalance percentage : 43.3588859735 %
        [2013-01-08 13:49:27,252] - [rest_client:881] INFO - rebalance percentage : 43.4321283005 %
        [2013-01-08 13:49:29,261] - [rest_client:881] INFO - rebalance percentage : 43.5054420835 %
        [2013-01-08 13:49:31,272] - [rest_client:881] INFO - rebalance percentage : 43.5542464829 %
        [2013-01-08 13:49:33,292] - [rest_client:881] INFO - rebalance percentage : 43.6519267377 %
        [2013-01-08 13:49:35,303] - [rest_client:881] INFO - rebalance percentage : 43.7251690647 %
        [2013-01-08 13:49:37,317] - [rest_client:881] INFO - rebalance percentage : 43.7740449201 %
        [2013-01-08 13:49:39,326] - [rest_client:881] INFO - rebalance percentage : 43.8472872471 %
        [2013-01-08 13:49:41,337] - [rest_client:881] INFO - rebalance percentage : 43.9205295742 %
        [2013-01-08 13:49:43,346] - [rest_client:881] INFO - rebalance percentage : 43.9937719012 %
        [2013-01-08 13:49:45,354] - [rest_client:881] INFO - rebalance percentage : 44.0425763006 %
        [2013-01-08 13:49:47,368] - [rest_client:881] INFO - rebalance percentage : 44.1158186277 %
        [2013-01-08 13:49:49,375] - [rest_client:881] INFO - rebalance percentage : 44.1890609548 %
        [2013-01-08 13:49:51,387] - [rest_client:881] INFO - rebalance percentage : 44.2623032818 %
        [2013-01-08 13:49:53,442] - [rest_client:881] INFO - rebalance percentage : 44.3355456089 %
        [2013-01-08 13:49:55,451] - [rest_client:881] INFO - rebalance percentage : 44.3843500083 %
        [2013-01-08 13:49:57,465] - [rest_client:881] INFO - rebalance percentage : 44.4576637913 %
        [2013-01-08 13:49:59,481] - [rest_client:881] INFO - rebalance percentage : 44.5309061183 %
        [2013-01-08 13:50:01,491] - [rest_client:881] INFO - rebalance percentage : 44.5797819737 %
        [2013-01-08 13:50:03,508] - [rest_client:881] INFO - rebalance percentage : 44.6773907725 %
        [2013-01-08 13:50:05,525] - [rest_client:881] INFO - rebalance percentage : 44.7750710272 %
        [2013-01-08 13:50:07,544] - [rest_client:881] INFO - rebalance percentage : 44.8239468825 %
        [2013-01-08 13:50:09,562] - [rest_client:881] INFO - rebalance percentage : 44.8971892096 %
        [2013-01-08 13:50:11,570] - [rest_client:881] INFO - rebalance percentage : 44.9947980084 %
        [2013-01-08 13:50:13,592] - [rest_client:881] INFO - rebalance percentage : 45.0680403354 %
        [2013-01-08 13:50:15,606] - [rest_client:881] INFO - rebalance percentage : 45.1412826625 %
        [2013-01-08 13:50:17,625] - [rest_client:881] INFO - rebalance percentage : 45.2145964455 %
        [2013-01-08 13:50:19,653] - [rest_client:881] INFO - rebalance percentage : 45.336571716 %
        [2013-01-08 13:50:21,665] - [rest_client:881] INFO - rebalance percentage : 45.3853761154 %
        [2013-01-08 13:50:23,674] - [rest_client:881] INFO - rebalance percentage : 45.4586184425 %
        [2013-01-08 13:50:25,699] - [rest_client:881] INFO - rebalance percentage : 45.5807366249 %
        [2013-01-08 13:50:27,710] - [rest_client:881] INFO - rebalance percentage : 45.6539789519 %
        [2013-01-08 13:50:29,729] - [rest_client:881] INFO - rebalance percentage : 45.727221279 %
        [2013-01-08 13:50:31,739] - [rest_client:881] INFO - rebalance percentage : 45.7760971343 %
        [2013-01-08 13:50:33,769] - [rest_client:881] INFO - rebalance percentage : 45.8737773891 %
        [2013-01-08 13:50:35,779] - [rest_client:881] INFO - rebalance percentage : 45.9469482602 %
        [2013-01-08 13:50:37,799] - [rest_client:881] INFO - rebalance percentage : 46.0202620432 %
        [2013-01-08 13:50:39,813] - [rest_client:881] INFO - rebalance percentage : 46.0935043702 %
        [2013-01-08 13:50:41,831] - [rest_client:881] INFO - rebalance percentage : 46.1666752414 %
        [2013-01-08 13:50:43,841] - [rest_client:881] INFO - rebalance percentage : 46.2155510967 %
        [2013-01-08 13:50:45,851] - [rest_client:881] INFO - rebalance percentage : 46.2887934238 %
        [2013-01-08 13:50:47,860] - [rest_client:881] INFO - rebalance percentage : 46.3620357508 %
        [2013-01-08 13:50:49,874] - [rest_client:881] INFO - rebalance percentage : 46.4352780779 %
        [2013-01-08 13:50:51,883] - [rest_client:881] INFO - rebalance percentage : 46.4841539332 %
        [2013-01-08 13:50:53,893] - [rest_client:881] INFO - rebalance percentage : 46.6062006597 %
        [2013-01-08 13:50:55,901] - [rest_client:881] INFO - rebalance percentage : 46.6794429867 %
        [2013-01-08 13:50:57,911] - [rest_client:881] INFO - rebalance percentage : 46.7526853138 %
        [2013-01-08 13:50:59,921] - [rest_client:881] INFO - rebalance percentage : 46.8503655685 %
        [2013-01-08 13:51:01,936] - [rest_client:881] INFO - rebalance percentage : 46.9236793515 %
        [2013-01-08 13:51:03,947] - [rest_client:881] INFO - rebalance percentage : 46.9968502227 %
        [2013-01-08 13:51:05,974] - [rest_client:881] INFO - rebalance percentage : 47.0701640056 %
        [2013-01-08 13:51:07,984] - [rest_client:881] INFO - rebalance percentage : 47.118968405 %
        [2013-01-08 13:51:10,010] - [rest_client:881] INFO - rebalance percentage : 47.2166486598 %
        [2013-01-08 13:51:12,017] - [rest_client:881] INFO - rebalance percentage : 47.2898909868 %
        [2013-01-08 13:51:14,032] - [rest_client:881] INFO - rebalance percentage : 47.3631333139 %
        [2013-01-08 13:51:16,055] - [rest_client:881] INFO - rebalance percentage : 47.4364470969 %
        [2013-01-08 13:51:18,069] - [rest_client:881] INFO - rebalance percentage : 47.4852514963 %
        [2013-01-08 13:51:20,089] - [rest_client:881] INFO - rebalance percentage : 47.5340558957 %
        [2013-01-08 13:51:22,103] - [rest_client:881] INFO - rebalance percentage : 47.6317361504 %
        [2013-01-08 13:51:24,113] - [rest_client:881] INFO - rebalance percentage : 47.7293449492 %
        [2013-01-08 13:51:26,122] - [rest_client:881] INFO - rebalance percentage : 47.8026587322 %
        [2013-01-08 13:51:28,133] - [rest_client:881] INFO - rebalance percentage : 47.8759010593 %
        [2013-01-08 13:51:30,146] - [rest_client:881] INFO - rebalance percentage : 47.9247054587 %
        [2013-01-08 13:51:32,157] - [rest_client:881] INFO - rebalance percentage : 47.9735098581 %
        [2013-01-08 13:51:34,168] - [rest_client:881] INFO - rebalance percentage : 48.0467521851 %
        [2013-01-08 13:51:36,178] - [rest_client:881] INFO - rebalance percentage : 48.1445038958 %
        [2013-01-08 13:51:38,205] - [rest_client:881] INFO - rebalance percentage : 48.2176747669 %
        [2013-01-08 13:51:40,220] - [rest_client:881] INFO - rebalance percentage : 48.2909885499 %
        [2013-01-08 13:51:42,231] - [rest_client:881] INFO - rebalance percentage : 48.3397929493 %
        [2013-01-08 13:51:44,241] - [rest_client:881] INFO - rebalance percentage : 48.4131067323 %
        [2013-01-08 13:51:46,268] - [rest_client:881] INFO - rebalance percentage : 48.4863490594 %
        [2013-01-08 13:51:48,288] - [rest_client:881] INFO - rebalance percentage : 48.5595199305 %
        [2013-01-08 13:51:50,299] - [rest_client:881] INFO - rebalance percentage : 48.6328337135 %
        [2013-01-08 13:51:52,320] - [rest_client:881] INFO - rebalance percentage : 48.7060045846 %
        [2013-01-08 13:51:54,330] - [rest_client:881] INFO - rebalance percentage : 48.7793183676 %
        [2013-01-08 13:51:56,341] - [rest_client:881] INFO - rebalance percentage : 48.8769271664 %
        [2013-01-08 13:51:58,357] - [rest_client:881] INFO - rebalance percentage : 48.9989738929 %
        [2013-01-08 13:52:00,371] - [rest_client:881] INFO - rebalance percentage : 49.0722876759 %
        [2013-01-08 13:52:02,386] - [rest_client:881] INFO - rebalance percentage : 49.1210920753 %
        [2013-01-08 13:52:04,396] - [rest_client:881] INFO - rebalance percentage : 49.1943344023 %
        [2013-01-08 13:52:06,404] - [rest_client:881] INFO - rebalance percentage : 49.2431388017 %
        [2013-01-08 13:52:08,416] - [rest_client:881] INFO - rebalance percentage : 49.3163811288 %
        [2013-01-08 13:52:10,429] - [rest_client:881] INFO - rebalance percentage : 49.4140613835 %
        [2013-01-08 13:52:12,440] - [rest_client:881] INFO - rebalance percentage : 49.4873037106 %
        [2013-01-08 13:52:14,450] - [rest_client:881] INFO - rebalance percentage : 49.5605460376 %
        [2013-01-08 13:52:16,460] - [rest_client:881] INFO - rebalance percentage : 49.6338598206 %
        [2013-01-08 13:52:18,469] - [rest_client:881] INFO - rebalance percentage : 49.7070306918 %
        [2013-01-08 13:52:20,491] - [rest_client:881] INFO - rebalance percentage : 49.7802015629 %
        [2013-01-08 13:52:22,501] - [rest_client:881] INFO - rebalance percentage : 49.8534438899 %
        [2013-01-08 13:52:24,512] - [rest_client:881] INFO - rebalance percentage : 49.926686217 %
        [2013-01-08 13:52:26,520] - [rest_client:881] INFO - rebalance percentage : 50.0 %
        [2013-01-08 13:52:28,532] - [rest_client:881] INFO - rebalance percentage : 50.0 %
        [2013-01-08 13:52:30,544] - [rest_client:881] INFO - rebalance percentage : 50.0 %
        [2013-01-08 13:52:32,554] - [rest_client:881] INFO - rebalance percentage : 50.0 %
        [2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 %
        [2013-01-08 13:52:36,597] - [rest_client:866] ERROR -

        {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

        - rebalance failed

        Show
        andreibaranouski Andrei Baranouski added a comment - Hi, Let me first clarify what happened in this case 1. started rebalance and when it reach > 40% we killed memcached on one of the nodes here we handle 2 situation: reblance failed and it continued without failure in our case, the second option worked: [2013-01-08 13:47:49,441] - [rest_client:881] INFO - rebalance percentage : 40.0627668879 % ... [2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]') [2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached ... 2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758 [2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running [2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached 2. when rebalance reached > 60% we killed memcashed in second time, but this did not happen, and it fell before: [2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 % ... [2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 % [2013-01-08 13:52:36,597] - [rest_client:866] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed So we got to rebalance fell in ~5 minutes after we killed memcached(13:47:53-13:52:36), and all this time the progress rebalance changed(~40->50%) I would not want to change the logic of this test, so it works for a long time. In addition, I have a question about your recommendations: 1) why I need to remove loop? 2) why do we need to wait an hour for the first rebalance failure? In our case it turns out that we were waiting for the fall of about 5 minutes after the killing memcached 3)'If the rebalance status is not changed even after an hour' - do you mean it stuck? all the logs for clarity [2013-01-08 13:47:53,130] - [swaprebalance:403] INFO - killed 10.3.2.149:8091?? (True, '[]') [2013-01-08 13:47:53,131] - [swaprebalance:404] INFO - sleep for 10 sec after kill memcached [2013-01-08 13:48:03,182] - [rest_client:881] INFO - rebalance percentage : 40.4774970989 % [2013-01-08 13:48:03,183] - [swaprebalance:412] INFO - rebalance progress: 40.4774970989 [2013-01-08 13:48:04,223] - [rest_client:881] INFO - rebalance percentage : 40.5263729542 % [2013-01-08 13:48:04,224] - [swaprebalance:412] INFO - rebalance progress: 40.5263729542 [2013-01-08 13:48:05,254] - [rest_client:881] INFO - rebalance percentage : 40.550739426 % [2013-01-08 13:48:05,254] - [swaprebalance:412] INFO - rebalance progress: 40.550739426 [2013-01-08 13:48:06,289] - [rest_client:881] INFO - rebalance percentage : 40.5995438254 % [2013-01-08 13:48:06,289] - [swaprebalance:412] INFO - rebalance progress: 40.5995438254 [2013-01-08 13:48:07,311] - [rest_client:881] INFO - rebalance percentage : 40.623981753 % [2013-01-08 13:48:07,311] - [swaprebalance:412] INFO - rebalance progress: 40.623981753 [2013-01-08 13:48:08,331] - [rest_client:881] INFO - rebalance percentage : 40.6727861524 % [2013-01-08 13:48:08,331] - [swaprebalance:412] INFO - rebalance progress: 40.6727861524 [2013-01-08 13:48:09,366] - [rest_client:881] INFO - rebalance percentage : 40.7216620077 % [2013-01-08 13:48:09,367] - [swaprebalance:412] INFO - rebalance progress: 40.7216620077 [2013-01-08 13:48:10,389] - [rest_client:881] INFO - rebalance percentage : 40.7704664071 % [2013-01-08 13:48:10,390] - [swaprebalance:412] INFO - rebalance progress: 40.7704664071 [2013-01-08 13:48:11,404] - [rest_client:881] INFO - rebalance percentage : 40.7949043348 % [2013-01-08 13:48:11,405] - [swaprebalance:412] INFO - rebalance progress: 40.7949043348 [2013-01-08 13:48:12,446] - [rest_client:881] INFO - rebalance percentage : 40.8437087342 % [2013-01-08 13:48:12,447] - [swaprebalance:412] INFO - rebalance progress: 40.8437087342 [2013-01-08 13:48:13,468] - [rest_client:881] INFO - rebalance percentage : 40.8681466619 % [2013-01-08 13:48:13,468] - [swaprebalance:412] INFO - rebalance progress: 40.8681466619 [2013-01-08 13:48:14,560] - [rest_client:881] INFO - rebalance percentage : 40.9169510613 % [2013-01-08 13:48:14,560] - [swaprebalance:412] INFO - rebalance progress: 40.9169510613 [2013-01-08 13:48:15,708] - [rest_client:881] INFO - rebalance percentage : 40.9413889889 % [2013-01-08 13:48:15,708] - [swaprebalance:412] INFO - rebalance progress: 40.9413889889 [2013-01-08 13:48:16,722] - [rest_client:881] INFO - rebalance percentage : 40.9657554607 % [2013-01-08 13:48:16,723] - [swaprebalance:412] INFO - rebalance progress: 40.9657554607 [2013-01-08 13:48:17,737] - [rest_client:881] INFO - rebalance percentage : 40.9901933883 % [2013-01-08 13:48:17,737] - [swaprebalance:412] INFO - rebalance progress: 40.9901933883 [2013-01-08 13:48:18,762] - [rest_client:881] INFO - rebalance percentage : 41.0389977877 % [2013-01-08 13:48:18,763] - [swaprebalance:412] INFO - rebalance progress: 41.0389977877 [2013-01-08 13:48:19,810] - [rest_client:881] INFO - rebalance percentage : 41.0634357154 % [2013-01-08 13:48:19,811] - [swaprebalance:412] INFO - rebalance progress: 41.0634357154 [2013-01-08 13:48:20,833] - [rest_client:881] INFO - rebalance percentage : 41.1122401148 % [2013-01-08 13:48:20,833] - [swaprebalance:412] INFO - rebalance progress: 41.1122401148 [2013-01-08 13:48:21,848] - [rest_client:881] INFO - rebalance percentage : 41.1366780425 % [2013-01-08 13:48:21,848] - [swaprebalance:412] INFO - rebalance progress: 41.1366780425 [2013-01-08 13:48:22,864] - [rest_client:881] INFO - rebalance percentage : 41.1611159701 % [2013-01-08 13:48:22,865] - [swaprebalance:412] INFO - rebalance progress: 41.1611159701 [2013-01-08 13:48:23,881] - [rest_client:881] INFO - rebalance percentage : 41.1854824418 % [2013-01-08 13:48:23,881] - [swaprebalance:412] INFO - rebalance progress: 41.1854824418 [2013-01-08 13:48:24,897] - [rest_client:881] INFO - rebalance percentage : 41.2099203695 % [2013-01-08 13:48:24,897] - [swaprebalance:412] INFO - rebalance progress: 41.2099203695 [2013-01-08 13:48:25,928] - [rest_client:881] INFO - rebalance percentage : 41.2343582972 % [2013-01-08 13:48:25,929] - [swaprebalance:412] INFO - rebalance progress: 41.2343582972 [2013-01-08 13:48:26,959] - [rest_client:881] INFO - rebalance percentage : 41.2587247689 % [2013-01-08 13:48:26,959] - [swaprebalance:412] INFO - rebalance progress: 41.2587247689 [2013-01-08 13:48:28,008] - [rest_client:881] INFO - rebalance percentage : 41.2831626966 % [2013-01-08 13:48:28,008] - [swaprebalance:412] INFO - rebalance progress: 41.2831626966 [2013-01-08 13:48:29,054] - [rest_client:881] INFO - rebalance percentage : 41.331967096 % [2013-01-08 13:48:29,054] - [swaprebalance:412] INFO - rebalance progress: 41.331967096 [2013-01-08 13:48:30,070] - [rest_client:881] INFO - rebalance percentage : 41.3564050236 % [2013-01-08 13:48:30,071] - [swaprebalance:412] INFO - rebalance progress: 41.3564050236 [2013-01-08 13:48:31,094] - [rest_client:881] INFO - rebalance percentage : 41.3808429513 % [2013-01-08 13:48:31,094] - [swaprebalance:412] INFO - rebalance progress: 41.3808429513 [2013-01-08 13:48:32,111] - [rest_client:881] INFO - rebalance percentage : 41.405280879 % [2013-01-08 13:48:32,112] - [swaprebalance:412] INFO - rebalance progress: 41.405280879 [2013-01-08 13:48:33,133] - [rest_client:881] INFO - rebalance percentage : 41.4540852784 % [2013-01-08 13:48:33,133] - [swaprebalance:412] INFO - rebalance progress: 41.4540852784 [2013-01-08 13:48:34,152] - [rest_client:881] INFO - rebalance percentage : 41.5028896778 % [2013-01-08 13:48:34,153] - [swaprebalance:412] INFO - rebalance progress: 41.5028896778 [2013-01-08 13:48:35,197] - [rest_client:881] INFO - rebalance percentage : 41.5273276054 % [2013-01-08 13:48:35,197] - [swaprebalance:412] INFO - rebalance progress: 41.5273276054 [2013-01-08 13:48:36,216] - [rest_client:881] INFO - rebalance percentage : 41.5517655331 % [2013-01-08 13:48:36,217] - [swaprebalance:412] INFO - rebalance progress: 41.5517655331 [2013-01-08 13:48:37,231] - [rest_client:881] INFO - rebalance percentage : 41.6006413884 % [2013-01-08 13:48:37,232] - [swaprebalance:412] INFO - rebalance progress: 41.6006413884 [2013-01-08 13:48:38,244] - [rest_client:881] INFO - rebalance percentage : 41.6250793161 % [2013-01-08 13:48:38,244] - [swaprebalance:412] INFO - rebalance progress: 41.6250793161 [2013-01-08 13:48:39,262] - [rest_client:881] INFO - rebalance percentage : 41.6495172437 % [2013-01-08 13:48:39,263] - [swaprebalance:412] INFO - rebalance progress: 41.6495172437 [2013-01-08 13:48:40,285] - [rest_client:881] INFO - rebalance percentage : 41.6983930991 % [2013-01-08 13:48:40,285] - [swaprebalance:412] INFO - rebalance progress: 41.6983930991 [2013-01-08 13:48:41,302] - [rest_client:881] INFO - rebalance percentage : 41.7228310267 % [2013-01-08 13:48:41,302] - [swaprebalance:412] INFO - rebalance progress: 41.7228310267 [2013-01-08 13:48:42,323] - [rest_client:881] INFO - rebalance percentage : 41.7472689544 % [2013-01-08 13:48:42,323] - [swaprebalance:412] INFO - rebalance progress: 41.7472689544 [2013-01-08 13:48:43,341] - [rest_client:881] INFO - rebalance percentage : 41.7717068821 % [2013-01-08 13:48:43,341] - [swaprebalance:412] INFO - rebalance progress: 41.7717068821 [2013-01-08 13:48:44,356] - [rest_client:881] INFO - rebalance percentage : 41.7961448097 % [2013-01-08 13:48:44,357] - [swaprebalance:412] INFO - rebalance progress: 41.7961448097 [2013-01-08 13:48:45,372] - [rest_client:881] INFO - rebalance percentage : 41.8205827374 % [2013-01-08 13:48:45,373] - [swaprebalance:412] INFO - rebalance progress: 41.8205827374 [2013-01-08 13:48:46,391] - [rest_client:881] INFO - rebalance percentage : 41.8450206651 % [2013-01-08 13:48:46,391] - [swaprebalance:412] INFO - rebalance progress: 41.8450206651 [2013-01-08 13:48:47,445] - [rest_client:881] INFO - rebalance percentage : 41.8694585927 % [2013-01-08 13:48:47,446] - [swaprebalance:412] INFO - rebalance progress: 41.8694585927 [2013-01-08 13:48:48,460] - [rest_client:881] INFO - rebalance percentage : 41.918334448 % [2013-01-08 13:48:48,460] - [swaprebalance:412] INFO - rebalance progress: 41.918334448 [2013-01-08 13:48:49,485] - [rest_client:881] INFO - rebalance percentage : 41.9672103034 % [2013-01-08 13:48:49,485] - [swaprebalance:412] INFO - rebalance progress: 41.9672103034 [2013-01-08 13:48:50,506] - [rest_client:881] INFO - rebalance percentage : 42.0160861587 % [2013-01-08 13:48:50,507] - [swaprebalance:412] INFO - rebalance progress: 42.0160861587 [2013-01-08 13:48:51,522] - [rest_client:881] INFO - rebalance percentage : 42.0405240864 % [2013-01-08 13:48:51,523] - [swaprebalance:412] INFO - rebalance progress: 42.0405240864 [2013-01-08 13:48:52,537] - [rest_client:881] INFO - rebalance percentage : 42.0893284858 % [2013-01-08 13:48:52,538] - [swaprebalance:412] INFO - rebalance progress: 42.0893284858 [2013-01-08 13:48:53,552] - [rest_client:881] INFO - rebalance percentage : 42.1381328852 % [2013-01-08 13:48:53,553] - [swaprebalance:412] INFO - rebalance progress: 42.1381328852 [2013-01-08 13:48:54,591] - [rest_client:881] INFO - rebalance percentage : 42.1870087405 % [2013-01-08 13:48:54,592] - [swaprebalance:412] INFO - rebalance progress: 42.1870087405 [2013-01-08 13:48:55,709] - [rest_client:881] INFO - rebalance percentage : 42.2358131399 % [2013-01-08 13:48:55,710] - [swaprebalance:412] INFO - rebalance progress: 42.2358131399 [2013-01-08 13:48:56,723] - [rest_client:881] INFO - rebalance percentage : 42.2602510676 % [2013-01-08 13:48:56,724] - [swaprebalance:412] INFO - rebalance progress: 42.2602510676 [2013-01-08 13:48:57,746] - [rest_client:881] INFO - rebalance percentage : 42.309055467 % [2013-01-08 13:48:57,747] - [swaprebalance:412] INFO - rebalance progress: 42.309055467 [2013-01-08 13:48:58,768] - [rest_client:881] INFO - rebalance percentage : 42.3579313223 % [2013-01-08 13:48:58,769] - [swaprebalance:412] INFO - rebalance progress: 42.3579313223 [2013-01-08 13:48:59,821] - [rest_client:881] INFO - rebalance percentage : 42.382297794 % [2013-01-08 13:48:59,821] - [swaprebalance:412] INFO - rebalance progress: 42.382297794 [2013-01-08 13:49:00,878] - [rest_client:881] INFO - rebalance percentage : 42.4311021934 % [2013-01-08 13:49:00,879] - [swaprebalance:412] INFO - rebalance progress: 42.4311021934 [2013-01-08 13:49:01,925] - [rest_client:881] INFO - rebalance percentage : 42.4799780487 % [2013-01-08 13:49:01,926] - [swaprebalance:412] INFO - rebalance progress: 42.4799780487 [2013-01-08 13:49:02,950] - [rest_client:881] INFO - rebalance percentage : 42.5043445205 % [2013-01-08 13:49:02,951] - [swaprebalance:412] INFO - rebalance progress: 42.5043445205 [2013-01-08 13:49:04,036] - [rest_client:881] INFO - rebalance percentage : 42.5532203758 % [2013-01-08 13:49:04,037] - [swaprebalance:412] INFO - rebalance progress: 42.5532203758 [2013-01-08 13:49:05,058] - [swaprebalance:415] INFO - rebalance progress status:running [2013-01-08 13:49:05,068] - [swaprebalance:417] INFO - rebalance is still running even after restarting memcached [2013-01-08 13:49:05,069] - [swaprebalance:379] INFO - FAIL SWAP REBALANCE PHASE @ 60 [2013-01-08 13:49:05,079] - [rest_client:881] INFO - rebalance percentage : 42.5775868475 % [2013-01-08 13:49:07,088] - [rest_client:881] INFO - rebalance percentage : 42.6752671023 % [2013-01-08 13:49:09,134] - [rest_client:881] INFO - rebalance percentage : 42.772947357 % [2013-01-08 13:49:11,151] - [rest_client:881] INFO - rebalance percentage : 42.8218232123 % [2013-01-08 13:49:13,164] - [rest_client:881] INFO - rebalance percentage : 42.8949940834 % [2013-01-08 13:49:15,179] - [rest_client:881] INFO - rebalance percentage : 42.9682364105 % [2013-01-08 13:49:17,189] - [rest_client:881] INFO - rebalance percentage : 43.0414787376 % [2013-01-08 13:49:19,208] - [rest_client:881] INFO - rebalance percentage : 43.1390875364 % [2013-01-08 13:49:21,219] - [rest_client:881] INFO - rebalance percentage : 43.2124013194 % [2013-01-08 13:49:23,229] - [rest_client:881] INFO - rebalance percentage : 43.3100815741 % [2013-01-08 13:49:25,239] - [rest_client:881] INFO - rebalance percentage : 43.3588859735 % [2013-01-08 13:49:27,252] - [rest_client:881] INFO - rebalance percentage : 43.4321283005 % [2013-01-08 13:49:29,261] - [rest_client:881] INFO - rebalance percentage : 43.5054420835 % [2013-01-08 13:49:31,272] - [rest_client:881] INFO - rebalance percentage : 43.5542464829 % [2013-01-08 13:49:33,292] - [rest_client:881] INFO - rebalance percentage : 43.6519267377 % [2013-01-08 13:49:35,303] - [rest_client:881] INFO - rebalance percentage : 43.7251690647 % [2013-01-08 13:49:37,317] - [rest_client:881] INFO - rebalance percentage : 43.7740449201 % [2013-01-08 13:49:39,326] - [rest_client:881] INFO - rebalance percentage : 43.8472872471 % [2013-01-08 13:49:41,337] - [rest_client:881] INFO - rebalance percentage : 43.9205295742 % [2013-01-08 13:49:43,346] - [rest_client:881] INFO - rebalance percentage : 43.9937719012 % [2013-01-08 13:49:45,354] - [rest_client:881] INFO - rebalance percentage : 44.0425763006 % [2013-01-08 13:49:47,368] - [rest_client:881] INFO - rebalance percentage : 44.1158186277 % [2013-01-08 13:49:49,375] - [rest_client:881] INFO - rebalance percentage : 44.1890609548 % [2013-01-08 13:49:51,387] - [rest_client:881] INFO - rebalance percentage : 44.2623032818 % [2013-01-08 13:49:53,442] - [rest_client:881] INFO - rebalance percentage : 44.3355456089 % [2013-01-08 13:49:55,451] - [rest_client:881] INFO - rebalance percentage : 44.3843500083 % [2013-01-08 13:49:57,465] - [rest_client:881] INFO - rebalance percentage : 44.4576637913 % [2013-01-08 13:49:59,481] - [rest_client:881] INFO - rebalance percentage : 44.5309061183 % [2013-01-08 13:50:01,491] - [rest_client:881] INFO - rebalance percentage : 44.5797819737 % [2013-01-08 13:50:03,508] - [rest_client:881] INFO - rebalance percentage : 44.6773907725 % [2013-01-08 13:50:05,525] - [rest_client:881] INFO - rebalance percentage : 44.7750710272 % [2013-01-08 13:50:07,544] - [rest_client:881] INFO - rebalance percentage : 44.8239468825 % [2013-01-08 13:50:09,562] - [rest_client:881] INFO - rebalance percentage : 44.8971892096 % [2013-01-08 13:50:11,570] - [rest_client:881] INFO - rebalance percentage : 44.9947980084 % [2013-01-08 13:50:13,592] - [rest_client:881] INFO - rebalance percentage : 45.0680403354 % [2013-01-08 13:50:15,606] - [rest_client:881] INFO - rebalance percentage : 45.1412826625 % [2013-01-08 13:50:17,625] - [rest_client:881] INFO - rebalance percentage : 45.2145964455 % [2013-01-08 13:50:19,653] - [rest_client:881] INFO - rebalance percentage : 45.336571716 % [2013-01-08 13:50:21,665] - [rest_client:881] INFO - rebalance percentage : 45.3853761154 % [2013-01-08 13:50:23,674] - [rest_client:881] INFO - rebalance percentage : 45.4586184425 % [2013-01-08 13:50:25,699] - [rest_client:881] INFO - rebalance percentage : 45.5807366249 % [2013-01-08 13:50:27,710] - [rest_client:881] INFO - rebalance percentage : 45.6539789519 % [2013-01-08 13:50:29,729] - [rest_client:881] INFO - rebalance percentage : 45.727221279 % [2013-01-08 13:50:31,739] - [rest_client:881] INFO - rebalance percentage : 45.7760971343 % [2013-01-08 13:50:33,769] - [rest_client:881] INFO - rebalance percentage : 45.8737773891 % [2013-01-08 13:50:35,779] - [rest_client:881] INFO - rebalance percentage : 45.9469482602 % [2013-01-08 13:50:37,799] - [rest_client:881] INFO - rebalance percentage : 46.0202620432 % [2013-01-08 13:50:39,813] - [rest_client:881] INFO - rebalance percentage : 46.0935043702 % [2013-01-08 13:50:41,831] - [rest_client:881] INFO - rebalance percentage : 46.1666752414 % [2013-01-08 13:50:43,841] - [rest_client:881] INFO - rebalance percentage : 46.2155510967 % [2013-01-08 13:50:45,851] - [rest_client:881] INFO - rebalance percentage : 46.2887934238 % [2013-01-08 13:50:47,860] - [rest_client:881] INFO - rebalance percentage : 46.3620357508 % [2013-01-08 13:50:49,874] - [rest_client:881] INFO - rebalance percentage : 46.4352780779 % [2013-01-08 13:50:51,883] - [rest_client:881] INFO - rebalance percentage : 46.4841539332 % [2013-01-08 13:50:53,893] - [rest_client:881] INFO - rebalance percentage : 46.6062006597 % [2013-01-08 13:50:55,901] - [rest_client:881] INFO - rebalance percentage : 46.6794429867 % [2013-01-08 13:50:57,911] - [rest_client:881] INFO - rebalance percentage : 46.7526853138 % [2013-01-08 13:50:59,921] - [rest_client:881] INFO - rebalance percentage : 46.8503655685 % [2013-01-08 13:51:01,936] - [rest_client:881] INFO - rebalance percentage : 46.9236793515 % [2013-01-08 13:51:03,947] - [rest_client:881] INFO - rebalance percentage : 46.9968502227 % [2013-01-08 13:51:05,974] - [rest_client:881] INFO - rebalance percentage : 47.0701640056 % [2013-01-08 13:51:07,984] - [rest_client:881] INFO - rebalance percentage : 47.118968405 % [2013-01-08 13:51:10,010] - [rest_client:881] INFO - rebalance percentage : 47.2166486598 % [2013-01-08 13:51:12,017] - [rest_client:881] INFO - rebalance percentage : 47.2898909868 % [2013-01-08 13:51:14,032] - [rest_client:881] INFO - rebalance percentage : 47.3631333139 % [2013-01-08 13:51:16,055] - [rest_client:881] INFO - rebalance percentage : 47.4364470969 % [2013-01-08 13:51:18,069] - [rest_client:881] INFO - rebalance percentage : 47.4852514963 % [2013-01-08 13:51:20,089] - [rest_client:881] INFO - rebalance percentage : 47.5340558957 % [2013-01-08 13:51:22,103] - [rest_client:881] INFO - rebalance percentage : 47.6317361504 % [2013-01-08 13:51:24,113] - [rest_client:881] INFO - rebalance percentage : 47.7293449492 % [2013-01-08 13:51:26,122] - [rest_client:881] INFO - rebalance percentage : 47.8026587322 % [2013-01-08 13:51:28,133] - [rest_client:881] INFO - rebalance percentage : 47.8759010593 % [2013-01-08 13:51:30,146] - [rest_client:881] INFO - rebalance percentage : 47.9247054587 % [2013-01-08 13:51:32,157] - [rest_client:881] INFO - rebalance percentage : 47.9735098581 % [2013-01-08 13:51:34,168] - [rest_client:881] INFO - rebalance percentage : 48.0467521851 % [2013-01-08 13:51:36,178] - [rest_client:881] INFO - rebalance percentage : 48.1445038958 % [2013-01-08 13:51:38,205] - [rest_client:881] INFO - rebalance percentage : 48.2176747669 % [2013-01-08 13:51:40,220] - [rest_client:881] INFO - rebalance percentage : 48.2909885499 % [2013-01-08 13:51:42,231] - [rest_client:881] INFO - rebalance percentage : 48.3397929493 % [2013-01-08 13:51:44,241] - [rest_client:881] INFO - rebalance percentage : 48.4131067323 % [2013-01-08 13:51:46,268] - [rest_client:881] INFO - rebalance percentage : 48.4863490594 % [2013-01-08 13:51:48,288] - [rest_client:881] INFO - rebalance percentage : 48.5595199305 % [2013-01-08 13:51:50,299] - [rest_client:881] INFO - rebalance percentage : 48.6328337135 % [2013-01-08 13:51:52,320] - [rest_client:881] INFO - rebalance percentage : 48.7060045846 % [2013-01-08 13:51:54,330] - [rest_client:881] INFO - rebalance percentage : 48.7793183676 % [2013-01-08 13:51:56,341] - [rest_client:881] INFO - rebalance percentage : 48.8769271664 % [2013-01-08 13:51:58,357] - [rest_client:881] INFO - rebalance percentage : 48.9989738929 % [2013-01-08 13:52:00,371] - [rest_client:881] INFO - rebalance percentage : 49.0722876759 % [2013-01-08 13:52:02,386] - [rest_client:881] INFO - rebalance percentage : 49.1210920753 % [2013-01-08 13:52:04,396] - [rest_client:881] INFO - rebalance percentage : 49.1943344023 % [2013-01-08 13:52:06,404] - [rest_client:881] INFO - rebalance percentage : 49.2431388017 % [2013-01-08 13:52:08,416] - [rest_client:881] INFO - rebalance percentage : 49.3163811288 % [2013-01-08 13:52:10,429] - [rest_client:881] INFO - rebalance percentage : 49.4140613835 % [2013-01-08 13:52:12,440] - [rest_client:881] INFO - rebalance percentage : 49.4873037106 % [2013-01-08 13:52:14,450] - [rest_client:881] INFO - rebalance percentage : 49.5605460376 % [2013-01-08 13:52:16,460] - [rest_client:881] INFO - rebalance percentage : 49.6338598206 % [2013-01-08 13:52:18,469] - [rest_client:881] INFO - rebalance percentage : 49.7070306918 % [2013-01-08 13:52:20,491] - [rest_client:881] INFO - rebalance percentage : 49.7802015629 % [2013-01-08 13:52:22,501] - [rest_client:881] INFO - rebalance percentage : 49.8534438899 % [2013-01-08 13:52:24,512] - [rest_client:881] INFO - rebalance percentage : 49.926686217 % [2013-01-08 13:52:26,520] - [rest_client:881] INFO - rebalance percentage : 50.0 % [2013-01-08 13:52:28,532] - [rest_client:881] INFO - rebalance percentage : 50.0 % [2013-01-08 13:52:30,544] - [rest_client:881] INFO - rebalance percentage : 50.0 % [2013-01-08 13:52:32,554] - [rest_client:881] INFO - rebalance percentage : 50.0 % [2013-01-08 13:52:34,570] - [rest_client:881] INFO - rebalance percentage : 50.0 % [2013-01-08 13:52:36,597] - [rest_client:866] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
        andreibaranouski Andrei Baranouski made changes -
        Assignee Andrei Baranouski [ andreibaranouski ] Chiyoung Seo [ chiyoung ]
        farshid Farshid Ghods (Inactive) made changes -
        Assignee Chiyoung Seo [ chiyoung ] Farshid Ghods [ farshid ]
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        Andrei,

        lets discuss this test case on monday.

        Show
        farshid Farshid Ghods (Inactive) added a comment - Andrei, lets discuss this test case on monday.
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        >>I would not want to change the logic of this test, so it works for a long time. In addition, I have a question about your recommendations:
        if the test logic in incorrect and have been passing we still need to fix the test logic.

        >>2) why do we need to wait an hour for the first rebalance failure? In our case it turns out that we were waiting for the fall of about 5 minutes after the killing memcached

        so currently we only wait up to 60 seconds for rebalance to fail or complete. that is not enough. rebalance could continue even though memcached which was shutdown on one of the nodes. consider a use case where rebalance has already rebalanced out node X from 6 node cluster and now removing node Y. if test shuts down memcached X it would not impact rebalance operation and it will succeed.

        so the test should wait for rebalance to finish and ignore whether rebalance after memcached shutdown has succeeded or failed.

        >>1) Remove the for loop in the code
        I actually dont know know why there is a for loop there. can you take a look and see why we needed the for looop there ?

        Show
        farshid Farshid Ghods (Inactive) added a comment - >>I would not want to change the logic of this test, so it works for a long time. In addition, I have a question about your recommendations: if the test logic in incorrect and have been passing we still need to fix the test logic. >>2) why do we need to wait an hour for the first rebalance failure? In our case it turns out that we were waiting for the fall of about 5 minutes after the killing memcached so currently we only wait up to 60 seconds for rebalance to fail or complete. that is not enough. rebalance could continue even though memcached which was shutdown on one of the nodes. consider a use case where rebalance has already rebalanced out node X from 6 node cluster and now removing node Y. if test shuts down memcached X it would not impact rebalance operation and it will succeed. so the test should wait for rebalance to finish and ignore whether rebalance after memcached shutdown has succeeded or failed. >>1) Remove the for loop in the code I actually dont know know why there is a for loop there. can you take a look and see why we needed the for looop there ?
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        >>>3)'If the rebalance status is not changed even after an hour' - do you mean it stuck?

        yes. I think we already have a code in monitorRebalance that detects if rebalanceProgress is not changing and declares that it is stuck and fails the rebalance.

        Show
        farshid Farshid Ghods (Inactive) added a comment - >>>3)'If the rebalance status is not changed even after an hour' - do you mean it stuck? yes. I think we already have a code in monitorRebalance that detects if rebalanceProgress is not changing and declares that it is stuck and fails the rebalance.
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        >>>3)'If the rebalance status is not changed even after an hour' - do you mean it stuck?

        yes. I think we already have a code in monitorRebalance that detects if rebalanceProgress is not changing and declares that it is stuck and fails the rebalance.

        Show
        farshid Farshid Ghods (Inactive) added a comment - >>>3)'If the rebalance status is not changed even after an hour' - do you mean it stuck? yes. I think we already have a code in monitorRebalance that detects if rebalanceProgress is not changing and declares that it is stuck and fails the rebalance.
        Hide
        andreibaranouski Andrei Baranouski added a comment -
        Show
        andreibaranouski Andrei Baranouski added a comment - test was changed http://review.couchbase.org/#/c/23987
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        please close the ticket if test passes after changes are merged

        Show
        farshid Farshid Ghods (Inactive) added a comment - please close the ticket if test passes after changes are merged
        farshid Farshid Ghods (Inactive) made changes -
        Assignee Farshid Ghods [ farshid ] Andrei Baranouski [ andreibaranouski ]
        Show
        andreibaranouski Andrei Baranouski added a comment - all tests passed http://qa.hq.northscale.net/view/2.0.1/job/centos-32-2.0-swaprebalance-tests/263/
        andreibaranouski Andrei Baranouski made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        mikew Mike Wiederhold made changes -
        Sprint Status Current Sprint
        andreibaranouski Andrei Baranouski made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            andreibaranouski Andrei Baranouski
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes