Couchbase Server / MB-7262

curr_items after deleting/recreating a bucket on a bidirectional XDCR setup seems inconsistent

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.0
    • Fix Version/s: 3.0
    • Component/s: XDCR
    • Security Level: Public
    • Environment:
      build#1965

      Description

      Steps to reproduce:

      1. Create a 4-node cluster cl1 with a default bucket (1024 vbuckets, 1 replica).
      2. Create a 4-node cluster cl2 with a default bucket (1024 vbuckets, 1 replica).
      3. Set up bidirectional XDCR for the default bucket on cl1 and cl2.
      4. Load 10k items into the cl1 default bucket.
      5. Verify that both cl1 and cl2 have 10k items after a while (XDCR replication).
      6. Delete and recreate the default bucket on cl1.
      7. Add 3 documents to the cl2 default bucket.
      8. The total number of documents in the default bucket on cl1 is 29 and on cl2 is 10003 (expected).

      The curr_items value on cl1 (i.e. 29) seems a little strange.

      The cluster is available if you want to take a look:
      cl1 - 10.3.2.57
      cl2 - 10.3.3.95
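
      A quick way to spot-check the item counts while re-running these steps is the bucket-details REST endpoint, which reports basicStats.itemCount. The sketch below is illustrative only; the port, credentials, and bucket name are assumptions to adjust for the clusters above.

      import requests

      def item_count(host, bucket="default", auth=("Administrator", "password")):
          # GET /pools/default/buckets/<bucket> returns bucket details,
          # including basicStats.itemCount (the current item count).
          url = "http://%s:8091/pools/default/buckets/%s" % (host, bucket)
          resp = requests.get(url, auth=auth)
          resp.raise_for_status()
          return resp.json()["basicStats"]["itemCount"]

      for host in ("10.3.2.57", "10.3.3.95"):  # cl1, cl2
          print(host, item_count(host))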


        Activity

        deepkaran.salooja Deepkaran Salooja created issue -
        ketaki Ketaki Gangal made changes -
        Assignee: Ketaki Gangal [ketaki] → Junyi Xie [junyi]
        steve Steve Yen made changes -
        Fix Version/s: 2.0 [10114] → 2.0.1 [10399]
        Priority: Major [3] → Blocker [1]
        ketaki Ketaki Gangal added a comment -

        Abhinav,

        Can you repro this and add information on this?

        ketaki Ketaki Gangal made changes -
        Assignee: Junyi Xie [junyi] → Abhinav Dangeti [abhinav]
        ketaki Ketaki Gangal made changes -
        Fix Version/s: 2.0.1 [10399] → 2.0 [10114]
        steve Steve Yen made changes -
        Assignee: Abhinav Dangeti [abhinav] → Damien Katz [damien]
        abhinav Abhinav Dangeti added a comment -

        On reproducing this:

        • When I delete and recreate the replication as well after the bucket is deleted and recreated, replication works as expected.
        • However, when I do not recreate the replication, I did notice one thing at step 7 of the original description: upon adding 3 items, 27 items were replicated to c1.
        • After this, when I delete and recreate the replication on c2, the correct number of items was replicated to c1.
        • Moving on, I added 3 items on c1 (currently 10006 items); the 3 items weren't replicated to c2, since replication was not re-set up after the bucket was recreated on c1 (it just shows as "starting-up"). Upon re-setting up the replication on c1, the right number of items was replicated to c2.
        • Deleted the bucket on c2 now: no replication errors seen on c1 (for about 15 min).
        • After recreating the bucket on c2, the replication from c2 stays in the "starting-up" state, while replication on c1 says "replicating", but no items are replicated.
        • Added 3 items on c1; 37 items somehow replicated to c2, against c1's now 10009.
        • 3 XDCR errors on c1 (a decoded form of these targets is sketched after this list):

        2012-11-26 12:37:01 - Error replicating vbucket 537: {bad_return_value, {db_not_found, <<"http://Administrator:*****@10.3.121.127:8092/default%2f537%3baabf41c1fa84ae48124892ea6cf855e7/">>}}
        2012-11-26 12:35:51 - Error replicating vbucket 246: {bad_return_value, {db_not_found, <<"http://Administrator:*****@10.1.3.237:8092/default%2f246%3baabf41c1fa84ae48124892ea6cf855e7/">>}}
        2012-11-26 12:26:13 - Error replicating vbucket 770: {bad_return_value, {db_not_found, <<"http://Administrator:*****@10.3.2.55:8092/default%2f770%3baabf41c1fa84ae48124892ea6cf855e7/">>}}

        • Recreate replication on c1: items start replicating to c2, and replication completes successfully.
        • Replication on c2, however, still says "starting-up", and this changes to "replicating" only upon recreating the replication.
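
        The db_not_found targets in the errors above are URL-encoded vbucket database paths. Decoding one (a small sketch; reading the trailing hex string as the old destination bucket's UUID is an assumption) shows that the stuck replicators are still pointed at the previous bucket instance:

        # Decode the vbucket database path from the first error above.
        # Assumption: the segment after ';' is the UUID of the (old) destination bucket.
        from urllib.parse import unquote

        path = "default%2f537%3baabf41c1fa84ae48124892ea6cf855e7"
        decoded = unquote(path)              # "default/537;aabf41c1fa84ae48124892ea6cf855e7"
        bucket_and_vb, bucket_uuid = decoded.split(";")
        bucket, vbucket = bucket_and_vb.split("/")
        print(bucket, vbucket, bucket_uuid)  # default 537 aabf41c1fa84ae48124892ea6cf855e7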
        junyi Junyi Xie (Inactive) added a comment -

        This is a corner case touching the dark side of bucket identity management in XDCR (probably a generic issue in Couchbase Server).

        This happens only when:

        1) you delete a bucket during an ongoing inbound replication, without deleting the replication first, and immediately create another bucket with the same name;
        and
        2) after you re-create the bucket with the same name, you load some data on the source to wake up some vb replicators.

        IMHO, the expected behavior is that no replication should happen after you re-create the bucket, because even with the same name, the old and new buckets are considered completely different, and thus no replication should resume.

        Today XDCR does not check the UUID to make sure the remote cluster is still the old one when a replicator is initialized. Instead, it just tries to fetch the remote cluster and start replicating. In short, we do not maintain the identity of the remote cluster across XDCR as a whole, although we do so within the replication of a single vb replicator. In this test case filed by Deepkaran, the data loading at step 7 woke up a few vb replicators at C2, and they started replicating without checking that the bucket identity had changed on the other side.

        Some thoughts on fixing the issue:

        First of all, today we should recommend or require users to delete a bucket only AFTER they have deleted all XDCR replications targeting that bucket. It is not clear to me how to identify an inbound XDCR replication for a bucket, since the incoming traffic from XDCR is just a stream of setMeta/getMeta/deleteWithMeta operations.

        Second, we should probably maintain the remote bucket identity for the whole XDCR replication, instead of per vb replicator. Say, we store the UUID of the remote bucket when the XDCR replication is created, and each time a vb replicator is initialized we check that the remote bucket UUID has not changed.

        This fix may involve changes in both XDCR and the remote_cluster_info module. At this time, it does not look like a blocker to me, and I would like to defer the fix to 2.0.1 given the limited time.
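
        As a rough illustration of the second idea (a sketch only, in Python rather than the actual Erlang XDCR code; the bucket-details endpoint and its uuid field are the standard REST API, everything else here is assumed):

        import requests

        def remote_bucket_uuid(host, bucket, auth):
            # Bucket details include a "uuid" field, which changes when a bucket
            # is deleted and recreated under the same name.
            url = "http://%s:8091/pools/default/buckets/%s" % (host, bucket)
            return requests.get(url, auth=auth).json()["uuid"]

        def start_vb_replicator(vbucket, remote, expected_uuid):
            # Hypothetical check at vb replicator startup: compare the UUID stored
            # when the replication was created against the remote bucket's current UUID.
            current = remote_bucket_uuid(remote["host"], remote["bucket"], remote["auth"])
            if current != expected_uuid:
                # Same name, different bucket: do not silently resume replication.
                raise RuntimeError("remote bucket %r was recreated (uuid %s -> %s); "
                                   "the replication must be re-created"
                                   % (remote["bucket"], expected_uuid, current))
            # ... otherwise proceed with normal replication for this vbucket ...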

        junyi Junyi Xie (Inactive) made changes -
        Assignee: Damien Katz [damien] → Junyi Xie [junyi]
        junyi Junyi Xie (Inactive) made changes -
        Fix Version/s: 2.0 [10114] → 2.0.1 [10399]
        Priority: Blocker [1] → Critical [2]
        farshid Farshid Ghods (Inactive) made changes -
        Labels 2.0-release-notes
        kzeller kzeller added a comment -

        Added to RN as:

        Be aware that if you are using XDCR to replicate to a destination bucket and you remove that bucket and create a new one with the same name, the new bucket has a different UUID. Therefore, any replication you had established with the deleted bucket will not apply to the new bucket.

        Will add to XDCR chapter.
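
        For the docs, a short way to demonstrate the UUID change to users (a sketch only; host, credentials, and bucket name are placeholders):

        import requests

        def bucket_uuid(host, bucket="default", auth=("Administrator", "password")):
            # The bucket-details response carries the bucket's UUID.
            url = "http://%s:8091/pools/default/buckets/%s" % (host, bucket)
            return requests.get(url, auth=auth).json()["uuid"]

        uuid_before = bucket_uuid("10.3.2.57")
        # ... delete and recreate the bucket here ...
        uuid_after = bucket_uuid("10.3.2.57")
        assert uuid_before != uuid_after  # same name, different bucket; replications must be re-created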

        junyi Junyi Xie (Inactive) added a comment -

        Karen / Dipti,

        What Karen said is what I wanted, but today Couchbase Server itself does not have a consistent way to identify a bucket. Within XDCR, we use the UUID to identify a bucket and post docs to that bucket, but this does not apply to normal front-end operations. For example, if you are loading some data into a bucket and in the middle of it you delete the bucket and recreate a new one with the same name, you are still able to continue loading your data into the new bucket. That essentially means we still use the name to identify a bucket in some cases.

        Within XDCR, either the UUID or the name is OK with me as the bucket identifier, but the behavior needs to be consistent with the other Couchbase Server components.

        Dipti,

        I am putting this bug on hold and handing it over to you. You may want to talk to the ns_server team to find out which way to go; XDCR will then fix this issue accordingly.

        Thanks,

        junyi Junyi Xie (Inactive) made changes -
        Assignee: Junyi Xie [junyi] → Dipti Borkar [dipti]
        junyi Junyi Xie (Inactive) added a comment -

        We already have a bug tracking the bucket identity issue. This one is a duplicate of MB-7266.

        junyi Junyi Xie (Inactive) added a comment -

        MB-7266
        junyi Junyi Xie (Inactive) made changes -
        Status: Open [1] → Resolved [5]
        Fix Version/s: 2.0.1 [10399] → 2.1 [10414]
        Resolution: Duplicate [3]
        deepkaran.salooja Deepkaran Salooja made changes -
        Status: Resolved [5] → Closed [6]

          People

          • Assignee: dipti Dipti Borkar
          • Reporter: deepkaran.salooja Deepkaran Salooja
          • Votes: 0
          • Watchers: 3


              Gerrit Reviews

              There are no open Gerrit changes