  1. Couchbase Server
  2. MB-4366

ns_server is reusing tap names unsafely which causes data loss or inconsistency in replication when a node is removed and added back

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.7.2, 1.8.0
    • Fix Version/s: 1.8.1
    • Component/s: ns_server
    • Security Level: Public

      Description

      screenshot attached

      NOTE: we're converting this to the main 'named tap issues' ticket.

      So what's not safe about reusing named taps as of 1.8.0?

      If something happened to the destination node after the tap was disconnected, and that something affected data for vbuckets replicated as part of the named tap, then subsequent reuse of the named tap will incorrectly assume that we can continue sending data instead of re-negotiating which data needs to be resent.
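      The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the ep-engine or ns_server implementation: the `Producer` class, its per-name cursor, the `reuse` flag and the tap name `replication_node2` are all hypothetical names introduced for illustration.

```python
class Producer:
    """Toy stand-in for a TAP producer (hypothetical, not ep-engine code)."""

    def __init__(self, items):
        self.items = list(items)  # mutations in arrival order
        self.cursors = {}         # tap name -> index of the next item to send

    def stream(self, tap_name, reuse=True):
        # Reusing a name resumes from the saved cursor; a fresh
        # registration starts from the beginning (a full re-send).
        start = self.cursors.get(tap_name, 0) if reuse else 0
        batch = self.items[start:]
        self.cursors[tap_name] = len(self.items)
        return batch


def replicate(producer, replica, tap_name, reuse=True):
    """Apply everything the producer streams to the replica dict."""
    for key, value in producer.stream(tap_name, reuse=reuse):
        replica[key] = value


producer = Producer([("a", 1), ("b", 2)])
replica = {}
replicate(producer, replica, "replication_node2")  # replica now holds a and b

replica.clear()                  # node removed and added back: its data is gone
producer.items.append(("c", 3))  # a new mutation arrives in the meantime

# Unsafe: the saved cursor makes the producer skip a and b.
unsafe = dict(replica)
replicate(producer, unsafe, "replication_node2", reuse=True)

# Safe: starting over re-sends everything the destination lost.
safe = dict(replica)
replicate(producer, safe, "replication_node2", reuse=False)
```

      In this model the unsafe replica ends up holding only `c`, while the safe one holds `a`, `b` and `c`. The real fix may negotiate what to resend more precisely than a full restart; the sketch only shows why blindly resuming under the old name loses data.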


        Activity

        farshid Farshid Ghods (Inactive) added a comment -

        another screenshot : 5 minutes after stopping the rebalance

        farshid Farshid Ghods (Inactive) added a comment -

        The tap stream only stops if no items are added to the backlog.
        If the user keeps the load running, this tap stream stays alive forever.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Farshid, I cannot make sense of these screenshots. Can you elaborate?

        farshid Farshid Ghods (Inactive) added a comment -

        Basically, that means there is still one tap_rebalance stream open and running even after the rebalance was stopped.

        We seem to be stopping all of the streams except one.

        farshid Farshid Ghods (Inactive) added a comment -

        Waiting 5 minutes will not work if there are ongoing mutations in the cluster, because this tap stream only times out after 5 minutes of inactivity.
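        The effect of an inactivity-based timeout can be sketched as follows. This is a rough model, not ep-engine code: the `close_time` function and its arguments are hypothetical, and only the 5-minute figure comes from the comment above. Times are in seconds, and at least one activity timestamp is assumed.

```python
IDLE_TIMEOUT = 5 * 60  # seconds; the 5-minute figure is from the ticket


def close_time(activity_times, idle_timeout=IDLE_TIMEOUT):
    """Return when an idle timer would close the stream.

    The timer restarts on every activity, so the stream closes
    idle_timeout seconds after the first activity that is followed
    by a quiet period of at least idle_timeout seconds.
    """
    times = sorted(activity_times)
    for current, following in zip(times, times[1:]):
        if following - current >= idle_timeout:
            return current + idle_timeout
    # No long enough gap: the timer fires after the last activity.
    return times[-1] + idle_timeout


# Rebalance stopped at t=0 with no further load: closes after 5 minutes.
quiet = close_time([0])

# One mutation per minute for an hour: the stream outlives the whole load.
busy = close_time(range(0, 3601, 60))
```

        With a mutation arriving every minute, the timer never gets a full five idle minutes, so the stream only closes five minutes after the load ends.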

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        So it's an ep-engine issue then? I mean, we close tap streams as much as possible in ns_server. Named tap streams are kept alive by ep-engine. If there's anything ns_server can do to really stop those tap producers, I'll be happy to do that.

        steve Steve Yen added a comment -

        this is the main ticket for the named tap approach/fix

        steve Steve Yen added a comment -

        is this a blocker for 1.8.1?

        dipti Dipti Borkar added a comment -

        Yes, because this may be causing data loss in some conditions.

        Farshid, I believe there are a few other tickets where this is the underlying problem. Can you reference them here for completeness? Thanks

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        http://review.couchbase.org/14555 fixes it on 1.8.1.

        1.8 and master have slightly different code in this area, so this work still needs some forward-porting.

        steve Steve Yen added a comment -

        fix is in gerrit (but more work still needed to enable 1.8.2)

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Let's keep this open for now. While adapting it for 1.8.2, I may have to change the 1.8.1 code to enable forward-compatibility with 1.8.2 and master.

        dipti Dipti Borkar added a comment -

        Aliaksey, code complete is Friday and we need to merge everything in by then.
        What changes need to be made to ensure forward-compatibility?

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Minor. I'll be doing that tomorrow as a first priority.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        I've found that no further changes to 1.8.1 are needed. The 1.8.2 implementation is here: http://review.couchbase.org/14827

        thuan Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #333 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/333/)
        only reuse tap name when changing vbucket filter. MB-4366 (Revision 61bf78355e64fff2e807939fea385862ca6919d5)
        reimplemented named tap fix for branch-18. MB-4366 (Revision e3b833480ceb5b7832e22131ed5d3fb532e6ea83)

        Result = SUCCESS
        Aliaksey Artamonau :
        Files :

        • src/ns_server_cluster_sup.erl
        • src/ebucketmigrator_srv.erl
        • src/ns_vbm_sup.erl

        Aliaksey Artamonau :
        Files :

        • src/ns_vbm_new_sup.erl
        • src/ns_vbm_sup.erl
        • src/ebucketmigrator_srv.erl
        • src/ns_server_cluster_sup.erl
        • src/cb_gen_vbm_sup.erl
        thuan Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #337 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/337/)
        fixed typo in start_vbucket_filter_change. MB-4366 (Revision 5db3c35e8a5ff6a5885271df4466b30c5369fa38)

        Result = SUCCESS
        Steve Yen :
        Files :

        • src/ebucketmigrator_srv.erl

          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            farshid Farshid Ghods (Inactive)