  Couchbase Server / MB-5298

Rebalance failed with reason {case_clause,{{ok,replica},{ok,replica}}} when rebalancing out a node which was failed over due to network connectivity issues but it re-appears while rebalancing

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.8.1-release-candidate
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Environment:
      1.8.1-815-rel

      Description

      Failing testcase
      failovertests.FailoverTests.test_failover_firewall,replica=3,load_ratio=10

      [ns_server:info] [2012-05-14 13:26:46] [ns_1@10.1.3.55:<0.3846.2>:ns_janitor:wait_for_memcached:286] Waiting for "default" on ['ns_1@10.1.3.50','ns_1@10.1.3.51','ns_1@10.1.3.52',
      'ns_1@10.1.3.54']
      [ns_server:debug] [2012-05-14 13:26:46] [ns_1@10.1.3.55:ns_bucket_worker:ns_bucket_sup:update_childs:91] Stopping child for dead bucket: {{per_bucket_sup,"default"},
      <0.23773.0>,supervisor,
      [single_bucket_sup]}

      [ns_server:debug] [2012-05-14 13:26:46] [ns_1@10.1.3.55:<0.23773.0>:single_bucket_sup:top_loop:28] Delegating exit

      {'EXIT',<0.23699.0>,shutdown}

      to child supervisor: <0.23774.0>

      [error_logger:error] [2012-05-14 13:26:03] [ns_1@10.1.3.50:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_vbucket_mover:init/1
      pid: <0.29733.2>
      registered_name: []
      exception exit: {case_clause,{{ok,replica},{ok,replica}}}
      in function gen_server:terminate/6
      ancestors: [<0.28176.2>]
      messages: [{'EXIT',<0.19829.3>,
                  {exited,
                   {'EXIT',<0.29733.2>,
                    {case_clause,{{ok,replica},{ok,replica}}}}}},
                 {'EXIT',<0.19792.3>,
                  {exited,
                   {'EXIT',<0.29733.2>,
                    {case_clause,{{ok,replica},{ok,replica}}}}}},
                 {'EXIT',<0.15598.3>,
                  {exited,
                   {'EXIT',<0.29733.2>,
                    {case_clause,{{ok,replica},{ok,replica}}}}}}]
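
      The {case_clause,Value} exit means a case expression inside ns_vbucket_mover was handed a pair of vbucket states it has no clause for: both the old and the new chain report {ok,replica}. A minimal Erlang sketch of how an exit of exactly this shape arises (illustrative only, not the actual ns_vbucket_mover source):

      %% Sketch: a case expression that only anticipates the state pairs
      %% expected during a clean vbucket move. A failed-over node whose
      %% replicator re-appears mid-move produces the unexpected pair, and
      %% Erlang raises {case_clause,{{ok,replica},{ok,replica}}}.
      pick_action(OldState, NewState) ->
          case {OldState, NewState} of
              {{ok, active}, {ok, replica}} ->
                  start_takeover;
              {{ok, replica}, {ok, active}} ->
                  already_moved
              %% no clause for {{ok,replica},{ok,replica}}
          end.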


        Activity

        karan Karan Kumar (Inactive) created issue -
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        I've added more logging in the affected area(s). Please retest with the latest build.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee: Aleksey Kondratenko [alkondratenko] → Karan Kumar [karan]
        karan Karan Kumar (Inactive) added a comment -

        Adding the latest logs from 181-832-rel

        karan Karan Kumar (Inactive) made changes -
        Assignee: Karan Kumar [karan] → Aleksey Kondratenko [alkondratenko]
        karan Karan Kumar (Inactive) added a comment -

        Failing test:
        failovertests.FailoverTests.test_failover_firewall,replica=3,load_ratio=10

        dipti Dipti Borkar made changes -
        Sprint Status: Current Sprint
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Good find. We now have evidence that one of the previously firewalled and failed-over nodes is being (seemingly gradually) un-firewalled in the middle of the rebalance, after the rest of the cluster has already ejected it. But the node only discovers it was ejected a minute after its replicator has managed to push-replicate to one of the existing nodes. We're starting to really hit the limits of our naive cluster orchestration approach.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment - edited

        I think a reasonably simple treatment (still partial and naive) is to never restart replication automatically and instead wait for the janitor to restart it. That has potential data-safety implications, though: the janitor, being really conservative, will in some cases not restart replications that previously were restarted automagically. So I'm not sure.
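
        As an illustration, a hypothetical sketch of such a gate (illustrative names, not ns_server code): the replicator merely records that it stopped, and only an explicit janitor call brings replication back.

        %% Hypothetical sketch of the proposed treatment; not ns_server code.
        %% On connection loss the replicator only records that it stopped;
        %% it never reconnects on its own.
        handle_connection_loss(State) ->
            State#{status => stopped_awaiting_janitor}.

        %% Only the janitor, after re-checking vbucket ownership against the
        %% current map, may bring replication back up.
        janitor_restart(#{status := stopped_awaiting_janitor} = State) ->
            State#{status => running};
        janitor_restart(State) ->
            State.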

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Pull-based replication, which is part of branch-18, would help here as well. Sadly, we were not allowed to have it on branch-181.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Need a PM decision here.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee: Aleksey Kondratenko [alkondratenko] → Dipti Borkar [dipti]
        dipti Dipti Borkar made changes -
        Sprint Priority: 0
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        May I ask for testing whether this is a regression? My understanding is that it's not.

        farshid Farshid Ghods (Inactive) added a comment -

        Should we run this against 1.8.0?

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        If you can, please do. I'm pretty sure you'll hit this in 1.8.0 as well, because the failover & replication logic is the same (hint: no quick failover was allowed for 1.8.1).

        karan Karan Kumar (Inactive) added a comment -

        We are hitting this pretty consistently on 1.8.1.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Just stop this questionable practice of abusing the firewall.

        dipti Dipti Borkar added a comment -

        What are our options to fix this on 1.8.1? Given that the likelihood of hitting this is higher, we should try to fix it.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        My understanding is that the probability of hitting this in practice approaches zero. We have had this issue since 1.6.0, yet nobody has reported it.

        farshid Farshid Ghods (Inactive) added a comment -

        This is not about abusing the firewall; it's about a node coming back up, or re-appearing, after it has been failed over.

        If this is purely due to the firewall, then it's OK to defer this.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Your specific way of using the firewall makes this problem likely to be observed. That's my understanding.

        farshid Farshid Ghods (Inactive) added a comment -

        The firewall is our way of simulating a node disappearing and re-appearing. We can also simulate that by shutting down the network interface or pulling the network cable, if that helps.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        I have evidence that you're re-enabling the firewalled traffic in a very specific way: memcached traffic is re-enabled first, and only minutes later is Erlang traffic re-enabled.
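
        For illustration, a hypothetical iptables sequence reproducing that staged un-firewalling (assumptions throughout, not the actual test code: 11210 is memcached's usual data port, 4369 is epmd, 21100 stands in for the Erlang distribution port, and the address is made up):

        PEER=10.1.3.54   # illustrative address of the failed-over node

        # Simulate the network failure: drop all traffic from the peer.
        iptables -A INPUT -s "$PEER" -j DROP

        # Staged recovery that makes this bug likely: memcached first...
        iptables -I INPUT -s "$PEER" -p tcp --dport 11210 -j ACCEPT
        sleep 120
        # ...Erlang traffic only minutes later, so the peer's replicator can
        # push data before it learns it was ejected from the cluster.
        iptables -I INPUT -s "$PEER" -p tcp --dport 4369 -j ACCEPT
        iptables -I INPUT -s "$PEER" -p tcp --dport 21100 -j ACCEPT
        iptables -D INPUT -s "$PEER" -j DROP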

        dipti Dipti Borkar added a comment -

        Will defer to a later release.

        farshid Farshid Ghods (Inactive) made changes -
        Labels: 1.8.1-release-notes
        Fix Version/s: 2.0-developer-preview-5 [10290]
        Fix Version/s: 1.8.1 [10295]
        dipti Dipti Borkar made changes -
        Assignee: Dipti Borkar [dipti] → Aleksey Kondratenko [alkondratenko]
        Sprint Status: Current Sprint
        Sprint Priority: 0
        peter peter made changes -
        Priority: Blocker [1] → Critical [2]
        farshid Farshid Ghods (Inactive) made changes -
        Summary: Rebalance failed with reason {case_clause,{{ok,replica},{ok,replica}}} → Rebalance failed with reason {case_clause,{{ok,replica},{ok,replica}}} when rebalancing a node which was failed over but it appears back (firewall enable/disable) while rebalancing
        farshid Farshid Ghods (Inactive) made changes -
        Summary: Rebalance failed with reason {case_clause,{{ok,replica},{ok,replica}}} when rebalancing a node which was failed over but it appears back (firewall enable/disable) while rebalancing → Rebalance failed with reason {case_clause,{{ok,replica},{ok,replica}}} when rebalancing out a node which was failed over due to network connectivity issues but it re-appears while rebalancing
        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s: 2.0-beta [10113]
        Fix Version/s: 2.0-developer-preview-5 [10290]
        peter peter made changes -
        Fix Version/s: 2.0 [10114]
        Fix Version/s: 2.0-beta [10113]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        This cannot happen with the replicator running on the destination.
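
        Illustrating why (a hypothetical sketch, not the actual 2.0 source): with the replicator on the destination, the destination consults the current vbucket map before pulling, so a stale, re-appearing source cannot force replication onto anyone.

        %% Hypothetical sketch; illustrative names, not the 2.0 source.
        maybe_pull(VBucket, Source, Map) ->
            case maps:get(VBucket, Map, undefined) of
                {active, Source} ->
                    start_pulling;       % source still owns the vbucket
                _Other ->
                    ignore_stale_source  % ejected/stale node: do nothing
            end.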

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Status: Open [1] → Resolved [5]
        Resolution: Fixed [1]
        farshid Farshid Ghods (Inactive) made changes -
        Status: Resolved [5] → Closed [6]

          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            karan Karan Kumar (Inactive)
          • Votes: 0
            Watchers: 2


              Gerrit Reviews

              There are no open Gerrit changes