Couchbase Server - MB-6497

Vbuckets that are not ready to be replicated from cause rebalance failure due to bad_replicas when replica count > 1

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta
    • Fix Version/s: 2.0-beta
    • Component/s: ns_server
    • Security Level: Public
    • Labels: None

      Description

      See reopening of MB-4673.

      We recently changed the replication supervisor's children to have restart type temporary, which means the supervisor will not try to restart failed children.
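
      For context, here is a minimal sketch of a supervisor child spec with restart type temporary. The module names, arguments, timeout and restart intensity are illustrative only, not the actual ns_server code; the point is just that a temporary child is never restarted by its supervisor, however it exits. The worker module is hypothetical (a sketch of it appears after the next paragraph).

        -module(replication_sup_sketch).
        -behaviour(supervisor).
        -export([start_link/0, init/1]).

        start_link() ->
            supervisor:start_link({local, ?MODULE}, ?MODULE, []).

        init([]) ->
            %% 'temporary': the supervisor never restarts this child, even when
            %% it exits abnormally (unlike 'permanent' or 'transient' children).
            ChildSpec = {replicator_sketch,
                         {replicator_sketch, start_link, [[]]},  % hypothetical worker module
                         temporary,
                         60000,
                         worker,
                         [replicator_sketch]},
            {ok, {{one_for_one, 3, 10}, [ChildSpec]}}.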

      But when some vbuckets on the source are not yet ready to be replicated from (see below for when this happens), we deal with that by 'crashing' the replicator after 30 seconds, expecting it to be restarted so it can pick up the new set of ready vbuckets.
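
      A rough sketch of that crash-and-get-restarted pattern is below. The 30-second delay and the 'not ready' notion come from the description; the module name, state record and message name are hypothetical. With a temporary child spec the final {stop, ...} simply ends the process for good instead of triggering a restart, which is the failure described in this ticket.

        -module(replicator_sketch).
        -behaviour(gen_server).
        -export([start_link/1]).
        -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
                 terminate/2, code_change/3]).

        -record(state, {not_ready = [] :: [integer()]}).

        start_link(NotReadyVBuckets) ->
            gen_server:start_link(?MODULE, NotReadyVBuckets, []).

        init(NotReadyVBuckets) ->
            case NotReadyVBuckets of
                [] -> ok;
                _  -> erlang:send_after(30000, self(), retry_not_ready_vbuckets)
            end,
            {ok, #state{not_ready = NotReadyVBuckets}}.

        handle_info(retry_not_ready_vbuckets, State) ->
            %% Deliberate abnormal exit after 30 seconds; the design assumed the
            %% supervisor would restart us to retry the now-ready vbuckets.
            {stop, retry_not_ready_vbuckets, State};
        handle_info(_Msg, State) ->
            {noreply, State}.

        handle_call(_Req, _From, State) -> {reply, ok, State}.
        handle_cast(_Msg, State) -> {noreply, State}.
        terminate(_Reason, _State) -> ok.
        code_change(_OldVsn, State, _Extra) -> {ok, State}.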

      There is also a tiny race, in both the 1.8.1-style and the new-style vbucket filter change logic, where the vbucket filter change command can be sent to a dying ebucketmigrator. That is another, related bug.

      When does this happen? It is 'typical' for the replica count > 2 case, even with our 'reliable' replica building attempt. We build replicas in a star formation, which means that when a vbucket movement is done, replicas later in the chain may be slightly ahead of replicas earlier in the chain (but of course never ahead of the master). If 'being ahead' means a later checkpoint id, that causes a backfill into the later replica, which leaves that replica with open checkpoint 0 for some time, a condition in which it cannot be replicated from. So that is the case where we cannot replicate some subset of vbuckets and have to restart ourselves later.
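
      In other words, the 'not ready' set is the set of vbuckets whose open checkpoint is 0. A small sketch of that readiness check (not the actual ns_server code; the dict of open checkpoint ids per vbucket is an assumed input):

        -module(vbucket_readiness_sketch).
        -export([partition_ready/2]).

        %% OpenCheckpoints: a dict mapping vbucket id -> open checkpoint id,
        %% e.g. collected from checkpoint stats (illustrative shape).
        partition_ready(RequestedVBuckets, OpenCheckpoints) ->
            lists:partition(
              fun (VB) ->
                      case dict:find(VB, OpenCheckpoints) of
                          {ok, 0} -> false;   % being backfilled; cannot replicate from it yet
                          {ok, _} -> true;    % has a real open checkpoint; ready
                          error   -> false    % unknown vbucket; treat as not ready
                      end
              end, RequestedVBuckets).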


        Activity

        farshid Farshid Ghods (Inactive) added a comment -

        related to failure seen in rebalance regression runs

        test: http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/23/consoleFull

        [user:info,2012-08-30T8:37:45.519,ns_1@10.3.121.92:<0.4827.52>:ns_rebalancer:verify_replication:380]Bad replicators after rebalance:
        Missing = [{'ns_1@10.3.121.98','ns_1@10.3.121.94',278},
                   {'ns_1@10.3.121.98','ns_1@10.3.121.94',279},
                   {'ns_1@10.3.121.98','ns_1@10.3.121.94',280},
                   {'ns_1@10.3.121.98','ns_1@10.3.121.94',281},
                   {'ns_1@10.3.121.98','ns_1@10.3.121.94',282},
                   {'ns_1@10.3.121.98','ns_1@10.3.121.94',283},
                   {'ns_1@10.3.121.98','ns_1@10.3.121.94',284},
                   {'ns_1@10.3.121.98','ns_1@10.3.121.94',285}]
        Extras = []
        [user:info,2012-08-30T8:37:45.521,ns_1@10.3.121.92:<0.8968.2>:ns_orchestrator:handle_info:295]Rebalance exited with reason bad_replicas
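
        The check behind this log compares which {SrcNode, DstNode, VBucket} replication streams should exist against which ones are actually running; anything expected but absent is reported as Missing, anything running but unexpected as Extras, and a non-empty result fails the rebalance with bad_replicas. A minimal sketch of that comparison (illustrative only, not the actual ns_rebalancer code):

          -module(verify_replication_sketch).
          -export([diff/2]).

          %% Both arguments are lists of {SrcNode, DstNode, VBucket} triples.
          %% Returns {Missing, Extras} as plain set differences.
          diff(ExpectedReplicators, ActualReplicators) ->
              Expected = ordsets:from_list(ExpectedReplicators),
              Actual   = ordsets:from_list(ActualReplicators),
              Missing  = ordsets:subtract(Expected, Actual),
              Extras   = ordsets:subtract(Actual, Expected),
              {Missing, Extras}.

          %% e.g. diff([{'ns_1@a','ns_1@b',278}], []) -> {[{'ns_1@a','ns_1@b',278}], []}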

        farshid Farshid Ghods (Inactive) added a comment -

        MB-4673
        farshid Farshid Ghods (Inactive) added a comment -

        Andrei,

        please provide the test case which can be run to reproduce this issue.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We merged the fix.

        thuan Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #461 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/461/)
        MB-6497: separated replication management into own gen_server (Revision f13f9c77ac8b3cf21295c1bc5043b1112172c62b)

        Result = SUCCESS
        pwansch :
        Files :

        • src/tap_replication_manager.erl
        • src/ebucketmigrator_srv.erl
        • src/replication_changes.erl
        • src/ns_vbm_new_sup.erl
        • src/ns_memcached_sup.erl
        • src/janitor_agent.erl
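
        The commit above moves replication management out of the crash-and-restart child pattern and into its own long-lived gen_server (tap_replication_manager). A hypothetical sketch of that idea, assuming a simple "set the desired replications" call; the actual module's API and state are not reproduced here:

          -module(replication_manager_sketch).
          -behaviour(gen_server).
          -export([start_link/0, set_desired_replications/1]).
          -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
                   terminate/2, code_change/3]).

          start_link() ->
              gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

          %% Desired: a list of {SrcNode, [VBucket]} pairs (illustrative shape).
          set_desired_replications(Desired) ->
              gen_server:call(?MODULE, {set_desired, Desired}, infinity).

          init([]) ->
              {ok, []}.

          handle_call({set_desired, Desired}, _From, Current) ->
              %% A real implementation would diff Desired against Current here and
              %% start/stop/update the underlying replicator processes accordingly,
              %% instead of encoding changes as crashes of supervisor children.
              _Changed = Desired -- Current,
              {reply, ok, Desired}.

          handle_cast(_Msg, State) -> {noreply, State}.
          handle_info(_Msg, State) -> {noreply, State}.
          terminate(_Reason, _State) -> ok.
          code_change(_OldVsn, State, _Extra) -> {ok, State}.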

          People

          • Assignee: alkondratenko Aleksey Kondratenko (Inactive)
          • Reporter: alkondratenko Aleksey Kondratenko (Inactive)
          • Votes: 0
          • Watchers: 1


              Gerrit Reviews

              There are no open Gerrit changes