Couchbase Server / MB-4375

rebalance failing with retry_not_ready_vbuckets error if the ns_server janitor sets the vbucket state from pending to dead when rebalance fails or stops

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.7.0, 1.7.1.1, 1.8.0
    • Fix Version/s: 1.7.2
    • Component/s: ns_server
    • Security Level: Public
    • Labels: None

      Description

      When a rebalance fails or is stopped by the user, the vbuckets whose takeover was still in progress are left in the pending state. The ns_server janitor, which runs every few seconds, then changes those vbuckets from pending to dead.

      If the user restarts the rebalance sooner than 5 minutes later, ep-engine tries to reuse the existing TAP stream and does not send TAP_VBUCKET_SET when restarting the takeover. Since the vbucket state is now dead, ep-engine does not start the vbucket transfer, and the rebalance gets stuck.
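
      To make the interaction concrete, here is a minimal, hypothetical sketch of the janitor pass described above (module and function names are illustrative, not the actual ns_server code): it resets any pending vbucket to dead.

          -module(janitor_sketch).
          -export([cleanup/1]).

          %% Periodic janitor pass: any vbucket left in `pending' by a failed
          %% or stopped rebalance is reset to `dead'.
          cleanup(VBucketStates) ->
              lists:foreach(fun maybe_reset/1, VBucketStates).

          maybe_reset({VBucket, pending}) ->
              %% In the real system this would be a SET_VBUCKET_STATE call to
              %% ep-engine; here we only print the transition.
              set_vbucket_state(VBucket, dead);
          maybe_reset({_VBucket, _State}) ->
              ok.

          set_vbucket_state(VBucket, State) ->
              io:format("vbucket ~p -> ~p~n", [VBucket, State]).

      A rebalance restarted after such a pass finds the destination vbucket in the dead state rather than pending, which is the condition under which ep-engine does not restart the transfer.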


        Activity

        Farshid Ghods (Inactive) added a comment - https://github.com/membase/ep-engine/commit/4af628082a4407c37b33dd16196c5a47a51332aa
        Matt Ingenthron added a comment -

        It seems a user of 1.8.0 has run into this issue. The specific message in the log is:

        CRASH REPORT <0.2530.13> 2012-03-01 09:40:29
        ===============================================================================
        Crashing process
          initial_call     {ebucketmigrator_srv,init,['Argument__1']}
          pid              <0.2530.13>
          registered_name  []
          error_info       {exit,retry_not_ready_vbuckets,
                             [{ebucketmigrator_srv,init,1},
                              {proc_lib,init_p_do_apply,3}]}
          ancestors        ['ns_vbm_sup-default','single_bucket_sup-default',<0.1025.0>]
          messages         []
          links            [<0.1063.0>]
          dictionary       []
          trap_exit        false
          status           running
          heap_size        4181
          stack_size       24
          reductions       218220

        Farshid Ghods (Inactive) added a comment -

        Hi Matt,

        do you have access to the user to grab diags from their cluster?

        Aleksey Kondratenko (Inactive) added a comment -

        Folks, retry_not_ready_vbuckets is actually a "voluntary crash". We do that in order to restart replication later. That is, when replicating from some node, if some of the vbuckets we need to replicate from are not ready yet (e.g. we are the second replica and the first is not yet ready to be replicated from), we just don't replicate those vbuckets, but after 30 seconds we perform harakiri so that the supervisor restarts us and we check again. This was a quick fix made a few days before 1.7.0, and I'm really sorry for not making the log message clearer that it is not a problem at all. 1.8.1 will fix that.

        So this message has nothing at all to do with the rebalance failing. May we ask for logs from the master node? The master node can be identified by looking at the user-visible logs: the server that logs the "rebalance failed" message is the master node for that failed rebalance.
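
        For illustration, here is a minimal sketch of that "voluntary crash" pattern, assuming a plain process rather than the real ebucketmigrator_srv gen_server (module and function names below are hypothetical):

            -module(migrator_sketch).
            -export([replicate/2]).

            %% Replicate the vbuckets that are ready now. If any vbuckets are
            %% not ready yet, wait 30 seconds and then exit on purpose with
            %% retry_not_ready_vbuckets so the supervisor restarts the process
            %% and it checks again.
            replicate(Ready, []) ->
                stream_vbuckets(Ready);
            replicate(Ready, _NotReady) ->
                erlang:send_after(30000, self(), retry_not_ready_vbuckets),
                stream_vbuckets(Ready),
                receive
                    retry_not_ready_vbuckets ->
                        %% Shows up in the logs as a crash report, but it is a
                        %% deliberate exit, not a rebalance failure.
                        exit(retry_not_ready_vbuckets)
                end.

            stream_vbuckets(VBuckets) ->
                %% Placeholder for the actual TAP replication of VBuckets.
                io:format("replicating vbuckets ~p~n", [VBuckets]).

        In this pattern the exit reason is expected to appear in crash reports like the one above whenever some source vbuckets are not yet ready; it is a retry mechanism, not the cause of a failed rebalance.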


          People

          • Assignee: Farshid Ghods (Inactive)
          • Reporter: Farshid Ghods (Inactive)
          • Votes: 0
          • Watchers: 0


          Gerrit Reviews

          There are no open Gerrit changes