Couchbase Server / MB-6041

XDC replication keeps on replicating even after replication document is removed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0
    • Component/s: XDCR
    • Security Level: Public
    • Labels:
      None

      Description

      • create replication
      • upload some data into the source bucket
      • remove the replication (replication document is not present in _replicator/_all_docs anymore)
      • observe that number of items in the destination bucket keeps growing

Seeing this on current HEAD.
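
For reference, the steps above can also be driven through the cluster REST API rather than the UI. The sketch below is only an illustration of that flow under assumptions about the 2.0-era API: the /pools/default/remoteClusters and /controller/createReplication endpoints on the management port, the _replicator database on the Couch API port, and basicStats.itemCount in the bucket details are how I understand them to work, and every host, port, credential and bucket name is a placeholder.

# Illustrative repro sketch; endpoints and ports are assumptions about the
# 2.0-era REST API, and all hosts/credentials/bucket names are placeholders.
import time
import requests

AUTH = ("Administrator", "password")
SRC_MGMT = "http://10.1.1.1:8091"   # source cluster management port
SRC_CAPI = "http://10.1.1.1:8092"   # source Couch API port (hosts _replicator)
DST_MGMT = "http://10.1.1.2:8091"   # destination cluster management port

# 1. Register the destination cluster on the source.
requests.post(SRC_MGMT + "/pools/default/remoteClusters", auth=AUTH,
              data={"name": "remote", "hostname": "10.1.1.2:8091",
                    "username": AUTH[0], "password": AUTH[1]}).raise_for_status()

# 2. Create a continuous replication: source bucket "default" -> remote "default".
requests.post(SRC_MGMT + "/controller/createReplication", auth=AUTH,
              data={"fromBucket": "default", "toCluster": "remote",
                    "toBucket": "default", "replicationType": "continuous"}
              ).raise_for_status()

# ... load some data into the source bucket here ...

# 3. Remove the replication the way the UI does: find the replication document
#    in _replicator/_all_docs and DELETE it with its current revision.
docs = requests.get(SRC_CAPI + "/_replicator/_all_docs", auth=AUTH).json()
for row in docs.get("rows", []):
    if not row["id"].startswith("_design/"):
        requests.delete(
            SRC_CAPI + "/_replicator/" + requests.utils.quote(row["id"], safe=""),
            params={"rev": row["value"]["rev"]}, auth=AUTH).raise_for_status()

# 4. Watch the destination item count; per this bug it keeps growing even
#    though the replication document is gone.
for _ in range(60):
    bucket = requests.get(DST_MGMT + "/pools/default/buckets/default", auth=AUTH).json()
    print("destination itemCount:", bucket["basicStats"]["itemCount"])
    time.sleep(5)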

      1. ns-diag-20120727231728.txt.xz
        661 kB
        Aliaksey Artamonau
      2. ns-diag-20120823192112.txt.bz2
        1.77 MB
        Aliaksey Artamonau

        Activity

Aliaksey Artamonau created issue -
Junyi Xie (Inactive) added a comment -

There could be some delay between the time you remove the rep doc and the time the XDCR manager gets notified and cancels all replications. Can you please provide the log from the source?

Aliaksey Artamonau added a comment -

It seems that it happens when there is more than one replication (possibly to the same cluster). I initially observed it when I had two replications between two clusters. Then I tried to reproduce it with only one replication, and it worked flawlessly. Then I tried it with two replications again, and again observed the bug. Attaching a diag from the source.

Aliaksey Artamonau made changes -
Attachment: ns-diag-20120727231728.txt.xz [ 14168 ]
Mike Wiederhold made changes -
Component/s: cross-datacenter-replication [ 10136 ]
Junyi Xie (Inactive) added a comment -

        Cannot open xz file on MacOS. Can you upload a .gz or .tar package? Thanks

Junyi Xie (Inactive) made changes -
Sprint Status: Next Sprint
peter added a comment -

Ketaki, can you try this one out before and after Damien's changes?

peter made changes -
Assignee: Junyi Xie [ junyi ] → Ketaki Gangal [ ketaki ]
Sprint Status: Next Sprint → Current Sprint
Ketaki Gangal made changes -
Assignee: Ketaki Gangal [ ketaki ] → Abhinav Dangeti [ abhinav ]
Abhinav Dangeti added a comment -
• Set up a 2:2 unidirectional replication on build 1623.
• Loaded data on the source; replication kicks off to the destination.
• Deleted the replication on the source side.
• Replication does not stop immediately on the destination.
• I expected the replication to stop once the destination item count reached the count the source had when I killed the replication.
• However, the count surpasses that checkpoint, but it does stop at a point much later, with the load on the source still going.
Junyi Xie (Inactive) added a comment -

Locally I created 1-1 clusters, each with two buckets, default and default2. I started two concurrent XDCR replications for default and default2, and then deleted the two replication docs from the UI. Both replications stopped within several seconds after I deleted the replication docs. At least in local testing, I do not see any issue.

Aliaksey, can you please retry with the latest code to see if the issue still exists? Thanks.

Junyi Xie (Inactive) made changes -
Assignee: Abhinav Dangeti [ abhinav ] → Aliaksey Artamonau [ aliaksey artamonau ]
Aliaksey Artamonau added a comment -

I was able to reproduce it by creating two replications from the same bucket on the source to two different buckets on the destination. It's probably not a very realistic scenario, but it might uncover an important issue. Will attach a diag from the source cluster shortly.

Aliaksey Artamonau made changes -
Assignee: Aliaksey Artamonau [ aliaksey artamonau ] → Junyi Xie [ junyi ]
Aliaksey Artamonau made changes -
Attachment: ns-diag-20120823192112.txt.bz2 [ 14563 ]
Aliaksey Artamonau added a comment -

Replications finally stopped several minutes after I removed the corresponding replication documents.

Junyi Xie (Inactive) added a comment - edited

I tried the same setup as yours (1 -> 1 replication, default@node1 -> default@node2 and default@node1 -> default2@node2), and it seems there is nothing wrong.

From the log below, the XDCR replication manager was notified by ns_server instantly after I deleted the replication doc from the UI, and it immediately shut down all ongoing bucket replication processes, with no delay. All XDCR activity stopped at the source right after that. However, there could be some activity on the destination cluster even after XDCR stopped replication on the source side, because it may take a while to persist all in-memory items to storage. I am not sure if there is any delay between the UI stats and the real activity. Also, if both nodes in your test are on the local machine with 1024 vbuckets, it may take longer to finish. I think the delay should be much shorter if we use VMs to conduct the test.

At this time I am not sure what to fix. I merged some extra logging for timing purposes, and will ask Ketaki to do the same test on a VM. If it is really an issue, we will reopen this bug and investigate the logs from the VM.

        [couchdb:info,2012-08-28T14:43:47.255,n_0@127.0.0.1:<0.742.0>:couch_log:info:39]127.0.0.1 - - DELETE /_replicator/1d38c26cdc5c5bb0e6be126e8ae272be%2Fdefault%2Fdefault?rev=1-9ee1a1c9 200
        [xdcr:debug,2012-08-28T14:43:47.257,n_0@127.0.0.1:xdc_rep_manager:xdc_rep_manager:process_update:174]replication doc deleted (docId: <<"1d38c26cdc5c5bb0e6be126e8ae272be/default/default">>), stop all replications
        [xdcr:debug,2012-08-28T14:43:47.258,n_0@127.0.0.1:xdc_rep_manager:xdc_replication_sup:stop_replication:49]all replications for DocId <<"1d38c26cdc5c5bb0e6be126e8ae272be/default/default">> have been stopped

        [ns_server:debug,2012-08-28T14:43:47.259,n_0@127.0.0.1:<0.2113.0>:ns_pubsub:do_subscribe_link:134]Parent process of subscription

        {ns_config_events,<0.2112.0>} exited with reason shutdown
        [ns_server:debug,2012-08-28T14:43:47.260,n_0@127.0.0.1:<0.2113.0>:ns_pubsub:do_subscribe_link:149]Deleting {ns_config_events,<0.2112.0>}

        event handler: ok
[xdcr:debug,2012-08-28T14:43:47.296,n_0@127.0.0.1:<0.11655.0>:xdc_vbucket_rep_worker:find_missing:121]after conflict resolution at target ("http://Administrator:asdasd@127.0.0.1:9501/default%2f87%3b5816f256233b9dffc119c2c32325a512/"), out of all 396 docs the number of docs we need to replicate is: 396
        [couchdb:info,2012-08-28T14:43:47.304,n_0@127.0.0.1:<0.1858.0>:couch_log:info:39]checkpointing view update at seq 5 for _replicator _design/_replicator_info
        [couchdb:info,2012-08-28T14:43:47.320,n_0@127.0.0.1:<0.1852.0>:couch_log:info:39]127.0.0.1 - - GET /replicator/_design/_replicator_info/_view/infos?group_level=1&=1346179427278 200
        [ns_server:debug,2012-08-28T14:44:00.037,n_0@127.0.0.1:compaction_daemon:compaction_daemon:handle_info:269]Starting compaction for the following buckets:
        [<<"default">>]
        [ns_server:info,2012-08-28T14:44:00.074,n_0@127.0.0.1:<0.13612.0>:compaction_daemon:try_to_cleanup_indexes:439]Cleaning up indexes for bucket `default`
        [ns_server:info,2012-08-28T14:44:00.164,n_0@127.0.0.1:<0.13612.0>:compaction_daemon:spawn_bucket_compactor:404]Compacting bucket default with config:
        [{database_fragmentation_threshold,{30,undefined}},

Junyi Xie (Inactive) added a comment -

http://review.couchbase.org/#/c/20196/5
Junyi Xie (Inactive) made changes -
Status: Open [ 1 ] → Resolved [ 5 ]
Fix Version/s: 2.0 [ 10114 ]
Resolution: Fixed [ 1 ]
Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #456 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/456/)
        MB-6041: add logs to time replication stop (Revision 1b1cf1f99f6e84b0baaa90a9ac2504b46e1d583a)

        Result = SUCCESS
        Junyi Xie :
        Files :

        • src/xdc_rep_manager.erl
        • src/xdc_replication_sup.erl
peter made changes -
Sprint Status: Current Sprint

          People

• Assignee:
  Junyi Xie (Inactive)
• Reporter:
  Aliaksey Artamonau
• Votes:
  0
• Watchers:
  0

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes