Couchbase Server
MB-3255

Rebalance fails to Restart, after stopping a rebalance operation near completion

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.5
    • Fix Version/s: 1.7 beta
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Win 64

      Description

Rebalancing from a 2-node to a 3-node cluster. Started the rebalance and allowed it to run for quite a while. Stopped the rebalance process. Restarted the process and allowed it to run until it was very, very close to completion. Stopped it a second time. Now, when attempting to restart to finish up, I am getting multiple errors in the logs. Diagnostics attached.

Attachments

1. ns-diag-20110103134811.txt (5.60 MB), James Phillips
2. Capture.GIF (353 kB)

        Activity

        james James Phillips (Inactive) created issue -
        sharon Sharon Barr (Inactive) added a comment -

Chiyoung, please do an initial analysis.

        sharon Sharon Barr (Inactive) made changes -
Assignee Sharon Barr [ sharon ] → Chiyoung Seo [ chiyoung ]
        james James Phillips (Inactive) added a comment -

In addition to the rebalance failing to start up again, the cluster is now behaving very badly when trying to use it. See the attached screen capture of the monitoring screen, taken while turning on a client to load data. This client was getting 4k ops/sec before this rebalance failure. I've turned off the client, but the servers are still up at 10.2.1.100/101/102. Administrator/j4958ph is the UI login; Administrator/northscale!23 is the Windows login.

        james James Phillips (Inactive) made changes -
        Attachment Capture.GIF [ 10927 ]
        chiyoung Chiyoung Seo added a comment -

Sean,

I saw a bunch of the following Erlang messages in the log file. Can you tell me why they happened? If you think they are not directly related to ns_server, please reassign the bug to me.

CRASH REPORT <0.11911.0> 2011-01-03 13:47:59
===============================================================================
Crashing process
  initial_call     {ns_janitor,cleanup,['Argument__1']}
  pid              <0.11911.0>
  registered_name  []
  error_info       {exit,{{{badmatch,{error,timeout}},
                           [{mc_client_binary,cmd_binary_vocal_recv,5},
                            {mc_client_binary,delete_vbucket,2},
                            {ns_memcached,handle_call,3},
                            {gen_server,handle_msg,5},
                            {proc_lib,init_p_do_apply,3}]},
                          {gen_server,call,
                           [{'ns_memcached-default','ns_1@10.2.1.101'},
                            {delete_vbucket,43},
                            30000]}},
                         [{gen_server,call,3},
                          {lists,foreach,2},
                          {ns_vbm_sup,start_replicas,4},
                          {lists,foreach,2},
                          {ns_vbm_sup,'-set_replicas/2-fun-1-',3},
                          {lists,foreach,2},
                          {proc_lib,init_p_do_apply,3}]}
  ancestors        [<0.110.0>,ns_server_sup,ns_server_cluster_sup,<0.60.0>]
  messages         []
  links            [<0.110.0>]
  dictionary       []
  trap_exit        false
  status           running
  heap_size        17711
  stack_size       24
  reductions       2384777

INFO REPORT <0.110.0> 2011-01-03 13:47:59
===============================================================================
ns_1@10.2.1.100:ns_orchestrator:174: Janitor run completed for bucket "default" with reason
  {{{badmatch,{error,timeout}},
    [{mc_client_binary,cmd_binary_vocal_recv,5},
     {mc_client_binary,delete_vbucket,2},
     {ns_memcached,handle_call,3},
     {gen_server,handle_msg,5},
     {proc_lib,init_p_do_apply,3}]},
   {gen_server,call,
    [{'ns_memcached-default','ns_1@10.2.1.101'},
     {delete_vbucket,43},
     30000]}}

INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 0 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',active},{'ns_1@10.2.1.101',dead}}

INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 1 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',active},{'ns_1@10.2.1.101',dead}}

INFO REPORT <0.11992.0> 2011-01-03 13:48:00
===============================================================================
ns_1@10.2.1.100:ns_janitor:193: Killing replicators for vbucket 2 on master 'ns_1@10.2.1.102' because of {{'ns_1@10.2.1.102',active},{'ns_1@10.2.1.101',dead}}
...

        chiyoung Chiyoung Seo made changes -
Assignee Chiyoung Seo [ chiyoung ] → Sean Lynch [ sean ]
        sean Sean Lynch (Inactive) added a comment -

The delete command timed out talking to memcached after 5 seconds. We can increase this timeout, but I don't think memcached is supposed to have any delay in responding to commands.
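For reference, the failing request in the crash report above is an Erlang/OTP gen_server:call/3 with an explicit timeout. The sketch below is illustrative only (the module and function names are hypothetical, not the actual ns_server source); the worker name, message shape, and 30000 ms value are taken from the report:

    %% Illustrative sketch, not the actual ns_server code.
    -module(delete_vbucket_sketch).
    -export([delete_vbucket/3]).

    %% Send {delete_vbucket, VBucket} to the per-bucket memcached worker on
    %% Node and wait up to 30 seconds for a reply. If the worker process
    %% exits while handling the request (in the report above: a badmatch on
    %% the {error,timeout} returned by mc_client_binary), the caller exits
    %% with that reason wrapped in {gen_server,call,...}, which is the shape
    %% of the janitor crash shown.
    delete_vbucket(Worker, Node, VBucket) ->
        Timeout = 30000,  %% ms
        gen_server:call({Worker, Node}, {delete_vbucket, VBucket}, Timeout).

For example, delete_vbucket_sketch:delete_vbucket('ns_memcached-default', 'ns_1@10.2.1.101', 43) mirrors the arguments visible in the crash report.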

        james James Phillips (Inactive) added a comment -

More information...

As noted above, I left the cluster alone for the last couple of days in case someone wanted to take a look. Today I came in and looked at it, and a bunch of vbucket migrator errors were being continuously reported in the log. I started a rebalance again. This time it seemed to start, but then there were more errors. See the attached diag.

        james James Phillips (Inactive) made changes -
        Attachment ns-diag-20110105101559.zip [ 10928 ]
        chiyoung Chiyoung Seo made changes -
Assignee Sean Lynch [ sean ] → Chiyoung Seo [ chiyoung ]
        chiyoung Chiyoung Seo made changes -
        Original Estimate 8h [ 28800 ]
        Remaining Estimate 8h [ 28800 ]
        Fix Version/s 2011-01-20 [ 10142 ]
        chiyoung Chiyoung Seo added a comment -

It seems to me that this rebalance failure is caused by 1) quite slow vbucket deletion in the memory hashtable and 2) frequent TAP ack timeouts.

All of the vbucket migrator errors were caused by TAP ack timeout events, which consequently killed the vbucket migrator processes. I observed that this frequent ack timeout issue also caused the slow replication problem in 1.6.5. Trond is currently working on resolving this timeout issue.

Regarding the slow vbucket deletion in the memory hashtable, I couldn't reproduce it on my Linux testing machines, but the timing stats from one of James's Windows machines show that deleting a single vbucket from the memory hashtable sometimes takes more than 2 sec:

        del_vb_cmd (1415 total)
        0 - 1us : ( 46.08%) 652 #########################################
        8ms - 16ms : ( 50.32%) 60 ###
        16ms - 32ms : ( 51.59%) 18 #
        32ms - 65ms : ( 53.99%) 34 ##
        65ms - 131ms : ( 56.68%) 38 ##
        131ms - 262ms : ( 62.26%) 79 #####
        262ms - 524ms : ( 70.60%) 118 #######
        524ms - 1s : ( 78.37%) 110 ######
        1s - 2s : ( 91.87%) 191 ############
        2s - 4s : ( 98.09%) 88 #####
        4s - 8s : ( 99.93%) 26 #
        8s - 16s : (100.00%) 1

As Sean mentioned, this is why ns_server got so many vbucket deletion timeout events from the membase server. Note that on-disk vbucket deletion is scheduled and executed asynchronously. I'm currently investigating this issue...
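To make the causal chain described above concrete (a multi-second server-side vbucket deletion surfacing as a timeout exit in the calling process), here is a tiny self-contained Erlang demo. It is not membase code; the module name, message, and 8-second sleep are invented purely for illustration:

    %% Demo only: a deliberately slow gen_server whose callers time out.
    -module(slow_call_demo).
    -behaviour(gen_server).
    -export([start_link/0, delete_vbucket/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Exits the caller with {timeout,{gen_server,call,...}} if the reply
    %% takes longer than TimeoutMs, mirroring the deletion timeouts that
    %% ns_server saw while the slow hashtable work was in progress.
    delete_vbucket(TimeoutMs) ->
        gen_server:call(?MODULE, delete_vbucket, TimeoutMs).

    init([]) -> {ok, []}.

    handle_call(delete_vbucket, _From, State) ->
        timer:sleep(8000),  %% stands in for a multi-second hashtable walk
        {reply, ok, State}.

    handle_cast(_Msg, State) -> {noreply, State}.
    handle_info(_Msg, State) -> {noreply, State}.
    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.

With the server started via slow_call_demo:start_link(), a call such as slow_call_demo:delete_vbucket(5000) exits with a timeout even though the work eventually completes, while slow_call_demo:delete_vbucket(30000) returns ok. The fix discussed below attacks the server-side delay itself rather than the caller's timeout value.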

        chiyoung Chiyoung Seo made changes -
        Fix Version/s 1.6.5 [ 10140 ]
Priority Major [ 3 ] → Blocker [ 1 ]
        perry Perry Krug made changes -
Summary Rebalance fails to REstart, after stopping a rebalance operation near completion → Rebalance fails to Restart, after stopping a rebalance operation near completion
        perry Perry Krug made changes -
        Flagged [Release Note]
        chiyoung Chiyoung Seo added a comment -

We reduced the time spent deleting a vbucket from the memory hashtable by avoiding the hashtable visit during vbucket deletion (http://review.membase.org/4251).

We observed that the aggregated execution time of all steps performed in a single vbucket deletion on Windows is always less than 2 sec, but the wall-clock execution time of a single vbucket deletion sometimes exceeds 5 sec under heavy CPU and memory usage; with the above fix this happens much less frequently. It seems to me that this execution-time delay is related to the Windows scheduler.

        If a rebalance fails in this scenario, simply restarting it resolves the failure.

        We will investigate the above scheduling-related issue again soon.
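The change referenced above lives in the memcached/ep-engine hashtable code (C++), but the general idea (skip the per-item hashtable visit on the deletion path and drop the structure wholesale) can be sketched in Erlang with an ETS table standing in for the hashtable. This is only an analogy, not the content of review 4251:

    %% Analogy only; the real change is in the ep-engine hashtable (C++).
    %% Tables are assumed to hold {Key, Value} tuples.
    -module(vbucket_delete_analogy).
    -export([delete_slow/1, delete_fast/1]).

    %% Slow path: visit and delete every item individually before dropping
    %% the table, the kind of per-item work that makes del_vb_cmd take
    %% seconds for a large vbucket.
    delete_slow(Tab) ->
        lists:foreach(fun(Key) -> true = ets:delete(Tab, Key) end,
                      [K || {K, _V} <- ets:tab2list(Tab)]),
        true = ets:delete(Tab),
        ok.

    %% Fast path: drop the whole table at once; no per-item visit happens
    %% on the caller's critical path.
    delete_fast(Tab) ->
        true = ets:delete(Tab),
        ok.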

        chiyoung Chiyoung Seo made changes -
        Fix Version/s 1.7 beta [ 10110 ]
        Fix Version/s 1.6.5 [ 10140 ]
        Fix Version/s 2011-01-20 [ 10142 ]
Priority Blocker [ 1 ] → Major [ 3 ]
        sharon Sharon Barr (Inactive) added a comment -

Tony,
Please try to reproduce this on 1.7 and close this bug if it is not reproducible.

        sharon Sharon Barr (Inactive) made changes -
Assignee Chiyoung Seo [ chiyoung ] → Thuan Nguyen [ thuan ]
        perry Perry Krug made changes -
        Component/s ns_server [ 10019 ]
        thuan Thuan Nguyen added a comment -

Tested build basestar-191-g79a0550 on Ubuntu 10.04 64-bit, repeating the steps in the bug report. I cannot reproduce this bug, so I think this issue is fixed. I will close this bug.

        thuan Thuan Nguyen made changes -
Status Open [ 1 ] → Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        farshid Farshid Ghods (Inactive) made changes -
        Flagged [Release Note]

          People

• Assignee: thuan Thuan Nguyen
• Reporter: james James Phillips (Inactive)
• Votes: 0
• Watchers: 2


Time Tracking

Original Estimate: 8h
Remaining Estimate: 8h
Time Spent: Not Specified

                Gerrit Reviews

                There are no open Gerrit changes