Couchbase Server / MB-31352

xdcr replication hang


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.2
    • Fix Version/s: 6.5.0
    • Component/s: XDCR
    • Labels: None
    • Triage: Triaged
    • Is this a Regression?: No

    Description

      I was trying to set up XDCR replication from an in-house cluster to EC2. The following things happened:

      1. Set up XDCR from a 4-node (2 data nodes) in-house cluster to a 16-node EC2 cluster (8 data nodes) for 1 bucket (msm).

      2. I didn't provide hostnames for the EC2 nodes initially. That caused the UI to take a long time to respond about what was happening. After it came back, I tried to delete the replication. There was initially no response, then an error was reported on the UI, and multiple attempts to delete the replication were unsuccessful.

      3. I restarted the goxdcr process on one data node in the source cluster (172.23.97.37). This cleaned up the XDCR replication from the UI.

      Restarted at:
      ns_1@172.23.97.37 6:43:52 PM Tue Sep 18, 2018

      4. I fixed the hostnames on the EC2 cluster and set up the XDCR replication on the source cluster again. This time it started replicating.

      5. After replicating 50% of the data, progress stopped. I then killed the goxdcr process on the 2nd data node in the source cluster, which kicked off the replication of the remaining 50% of the data.

      Message in logs before restart:

      2018-09-19T04:09:59.411-07:00 INFO GOXDCR.PipelineMgr: Replication Status = map[a8da6785a5cce7dc20c1f861ba93a500/msm/msm:name={a8da6785a5cce7dc20c1f861ba93a500/msm/msm}, status={Pending}, errors={[]}, progress={Pipeline has been stopped}
      

      Restarted at:
      ns_1@172.23.97.38 12:19:26 PM Wed Sep 19, 2018

      Source Cluster Logs:
      https://s3.amazonaws.com/cb-customers/deepkaran/collectinfo-2018-09-19T192037-ns_1%40172.23.97.37.zip
      https://s3.amazonaws.com/cb-customers/deepkaran/collectinfo-2018-09-19T192037-ns_1%40172.23.97.38.zip
      https://s3.amazonaws.com/cb-customers/deepkaran/collectinfo-2018-09-19T192037-ns_1%40172.23.97.39.zip
      https://s3.amazonaws.com/cb-customers/deepkaran/collectinfo-2018-09-19T192037-ns_1%40172.23.97.40.zip

      Let me know if you need destination cluster logs as well.

      Attachments

        Issue Links


          Activity

            neil.huang Neil Huang added a comment -

            I'm getting "{{AccessDenied}}Access DeniedF62C464CE3F3ABF9PcJ22hG6sVVqaT2dguVscJ8LyUI65eVEs+meBfOqF6JCexFd7oa6a5H+BcgqRvDlAIDbe/FpotQ=" error when trying to download log files.


            raju Raju Suravarjjala added a comment -

            Neil Huang please let me know if this bug needs to be fixed in Alice.
            neil.huang Neil Huang added a comment - - edited

            The first time the replication to the EC2 cluster was created, we see a lot of log entries with network i/o timeout errors. But first, the replication was able to pass the pre-requisite checks and was created:

            On Node37

            2018-09-18T18:31:20.446-07:00 INFO GOXDCR.XDCRFactory: Target topology retrieved. kvVBMap = map[172.31.5.142:11210:[5… 546… 639] 172.31.5.182:11210:[…
            

            Once it got created, though, we started seeing errors contacting the host.

            2018-09-18T18:31:20.156-07:00 ERRO GOXDCR.RemClusterSvc: Failed to refresh remote cluster reference remoteCluster/b8l-T9bSsLzGJhWCfQhVVgnewRpDBaTgoP4RYFbGeDc= since none of the nodes in target node list is accessible. node list = [172.31.9.28:8091 172.31.5.142:8091 172.31.13.49:8091 172.31.5.212:8091 172.31.5.182:8091 172.31.0.239:8091 172.31.15.98:8091 172.31.15.42:8091]
            

            2018-09-18T18:36:23.408-07:00 WARN GOXDCR.Utils: ExponentialBackoffExecutor for GetRemoteMemcachedConnection encountered error (dial tcp 172.31.5.142:11210: i/o timeout). Sleeping 400ms
            

            We also see that GoXDCR backs off, and the wait between tries can last a while:

            2018-09-18T18:35:19.807-07:00 WARN GOXDCR.CheckpointMgr: Failed to construct memcached client for 172.31.5.142:11210, err=dial tcp 172.31.5.142:11210: i/o timeout
            2018-09-18T18:35:19.807-07:00 WARN GOXDCR.Utils: ExponentialBackoffExecutor for GetRemoteMemcachedConnection encountered error (dial tcp 172.31.5.142:11210: i/o timeout). Sleeping 3.2s
            

            XDCR was force-restarted at:

            2018/09/18 18:37:56 child process exited with status 143
            

            Then a replication deletion:

            2018-09-18T18:38:25.617-07:00 INFO GOXDCR.AdminPort: doDeleteReplicationRequest
            2018-09-18T18:38:25.617-07:00 INFO GOXDCR.AdminPort: Request params: replicationId=a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:38:25.623-07:00 INFO GOXDCR.ReplMgr: Deleting replication a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            

            Followed by RC deletion:

            2018-09-18T18:38:58.175-07:00 INFO GOXDCR.AdminPort: doDeleteRemoteClusterRequest
            2018-09-18T18:38:58.186-07:00 INFO GOXDCR.RemoteClusterChangeListener: metakvCallback called on listener RemoteClusterChangeListener with path = /remoteCluster/b8l-T9bSsLzGJhWCfQhVVgnewRpDBaTgoP4RYFbGeDc=
            2018-09-18T18:38:58.186-07:00 INFO GOXDCR.RemClusterSvc: metakvCallback called on path = /remoteCluster/b8l-T9bSsLzGJhWCfQhVVgnewRpDBaTgoP4RYFbGeDc=
            2018-09-18T18:38:58.186-07:00 INFO GOXDCR.RemClusterSvc: Remote cluster Appd_deep deleted from metadata store
            

            Then XDCR on node 37 was killed again before recreating the remote cluster as well as the replication:

            [goport(/opt/couchbase/bin/goxdcr)] 2018/09/18 18:43:52 child process exited with status 143
            2018-09-18T18:44:37.666-07:00 INFO GOXDCR.AdminPort: doCreateRemoteClusterRequest req=&{POST /pools/default/remoteClusters …
            2018-09-18T18:46:10.589-07:00 INFO GOXDCR.PipelineMgr: Starting the pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:46:10.589-07:00 INFO GOXDCR.PipelineMgr: Validating pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:46:10.591-07:00 INFO GOXDCR.ReplMgr: Success adding replication specification a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:46:10.591-07:00 INFO GOXDCR.ReplMgr: Replication specification a8da6785a5cce7dc20c1f861ba93a500/msm/msm is created
            2018-09-18T18:46:10.591-07:00 INFO GOXDCR.AdminPort: Finished doCreateReplicationRequest call
            

            Then a delete:

            2018-09-18T18:46:47.431-07:00 INFO GOXDCR.AdminPort: Finished doDeleteReplicationRequest
            

            Then a create that failed because the EC2 cluster could not be contacted for bucket validation.

            2018-09-18T18:54:35.541-07:00 INFO GOXDCR.AdminPort: doCreateReplicationRequest
            2018-09-18T18:54:35.543-07:00 INFO GOXDCR.AdminPort: Request parameters: justValidate=false, fromBucket=msm, toCluster=deep_aws, toBucket=msm, settings=map[optimistic_replication_threshold:256 source_nozzle_per_node:2 log_level:Info checkpoint_interval:600 replication_type:xmem stats_interval:1000 worker_batch_size:500 failure_restart_interval:10 target_nozzle_per_node:2 doc_batch_size_kb:2048 bandwidth_limit:0 compression_type:3]
            2018-09-18T18:54:39.374-07:00 ERRO GOXDCR.AdminPort: Error creating replication. errorsMap=map[toBucket:Error validating target bucket 'msm'. err=Operation failed after max retries.  Last error: Bucket doesn't exist]
            

            Then finally, a create that works:

            2018-09-18T18:55:02.288-07:00 INFO GOXDCR.AdminPort: doCreateReplicationRequest
            2018-09-18T18:55:02.290-07:00 INFO GOXDCR.AdminPort: Request parameters: justValidate=false, fromBucket=msm, toCluster=deep_aws, toBucket=msm, settings=map[doc_batch_size_kb:2048 stats_interval:1000 optimistic_replication_threshold:256 source_nozzle_per_node:2 log_level:Info worker_batch_size:500 replication_type:xmem checkpoint_interval:600 failure_restart_interval:10 compression_type:3 target_nozzle_per_node:2 bandwidth_limit:0]
            2018-09-18T18:55:02.466-07:00 INFO GOXDCR.ReplSpecSvc: Successfully retrieved target cluster reference 0x803d30. time taken=158.493889ms
            2018-09-18T18:55:02.468-07:00 INFO GOXDCR.ReplSpecSvc: Validated that source bucket and target bucket are not the same. time taken=2.02316ms
            2018-09-18T18:55:02.563-07:00 INFO GOXDCR.ReplSpecSvc: Result from remote bucket look up: connStr=ec2-54-89-185-115.compute-1.amazonaws.com:8091, bucketName=msm, targetBucketType=membase, err_target=<nil>, time taken=94.566424ms
            2018-09-18T18:55:02.573-07:00 INFO GOXDCR.ReplSpecSvc: GetMemcachedConnection serverAddr=172.23.97.37:11210, bucketName=msm
            2018-09-18T18:55:05.705-07:00 INFO GOXDCR.AdminPort: Finished doCreateReplicationRequest call
            2018-09-18T18:55:11.271-07:00 INFO GOXDCR.PipelineMgr: Pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm has been updated successfully
            

            Everything looks fine from this point on, with none of those TCP dial i/o errors. The interesting thing is that the last XDCR restart didn’t “fix” the dial error; it went away on its own (see the previous create that failed at 18:54). The dial error didn’t show up again until the next day:

            2018-09-19T17:00:22.929-07:00 WARN GOXDCR.RemClusterSvc: When refreshing remote cluster reference remoteCluster/Lfhydk6_fIgdxTHlw0EWcNLuZ2Kw5ZYw6BzDOujMKXE=, skipping node ec2-204-236-202-221.compute-1.amazonaws.com:8091 because of error retrieving default pool info from target. err=Failed on calling host=ec2-204-236-202-221.compute-1.amazonaws.com:8091, path=/pools/default, err=Get http://ec2-204-236-202-221.compute-1.amazonaws.com:8091/pools/default: dial tcp 204.236.202.221:8091: i/o timeout, statusCode=0
            

            In the meantime, from the create that works, let’s take a look at the other data node 38:

            The create from node 37 is seen on 38:

            2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplicationSpecChangeListener: metakvCallback called on listener ReplicationSpecChangeListener with path = /replicationSpec/a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplSpecSvc: ReplicationSpecServiceCallback called on path = /replicationSpec/a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplicationSpecChangeListener: specChangedCallback called on id = a8da6785a5cce7dc20c1f861ba93a500/msm/msm, oldSpec=<nil>, newSpec=&{a8da6785a5cce7dc20c1f861ba93a500/msm/msm MdMjrWfLwqNUMAazTpC0XQ== msm ac272b0f8cccb858dbab37ba6a52666c a8da6785a5cce7dc20c1f861ba93a500 msm fcf5e9438f10a2734aeac4da89a0d5bc 0xc42001f220 [131 108 0 0 0 1 104 2 109 0 0 0 32 50 51 57 52 97 51 51 101 49 54 101 52 48 52 97 51 101 53 50 101 56 57 97 48 57 102 57 101 101 101 97 102 104 2 97 6 110 5 0 121 40 22 213 14 106]}
            2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplicationSpecChangeListener: new spec settings=&{xmem  true 600 500 2048 10 256 2 2 0 0 Info 1000 0 3 <nil>}
            2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplicationSpecChangeListener: Starting pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm since the replication spec has been changed to active
            

            But from the pipeline manager message, it seems that the pipeline was not starting, as it is still in a paused state, with the last error from 18:33.

            2018-09-18T18:55:44.407-07:00 INFO GOXDCR.PipelineMgr: Replication Status = map[a8da6785a5cce7dc20c1f861ba93a500/msm/msm:name={a8da6785a5cce7dc20c1f861ba93a500/msm/msm}, status={Pending}, errors={[{"time":"2018-09-18T18:33:26.788282457-07:00","errMsg":"dial tcp 172.31.5.142:11210: i/o timeout"}]}, progress={Pipeline failed to start, err=map[genericPipeline.context.Start:dial tcp 172.31.5.142:11210: i/o timeout]}
            ]
            

            Looking backwards, the last message of note was:

            2018-09-18T18:46:20.394-07:00 INFO GOXDCR.ReplicationSpecChangeListener: Stopping pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm since the replication spec has been changed to inactive
            

            And we don’t see any action taken by the pipeline manager. Looking further back, the last visible action taken was:

            2018-09-18T18:33:36.789-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:33:36.789-07:00 INFO GOXDCR.PipelineMgr: Trying to stop the pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:33:36.789-07:00 INFO GOXDCR.GenericPipeline: Stopping pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm-453901974
            2018-09-18T18:33:36.789-07:00 INFO GOXDCR.CheckpointMgr: Starting checkpointing for pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm before stopping
            2018-09-18T18:33:36.790-07:00 INFO GOXDCR.CheckpointMgr: Start one time checkpointing for replication a8da6785a5cce7dc20c1f861ba93a500/msm/msm
            2018-09-18T18:36:36.789-07:00 INFO GOXDCR.CheckpointMgr: Checkpointing for pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm timed out after 3m0s.
            

            The checkpoint timed out, and the pipeline then proceeded to stop the async listener services, which include the checkpoint manager itself. This is where the pipeline manager is held up, because we never see “Async Listener has stopped”.
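            To make the stop path concrete, here is a rough sketch in Go (hypothetical names, not the actual goxdcr pipeline code) of the sequence described above: the pipeline stop waits for each async listener service to confirm it has stopped, so a single listener blocked in its cleanup stalls the whole stop.

            // Hypothetical sketch: the pipeline stop waits for every async listener
            // service to finish before logging "Async Listener has stopped". If one
            // listener's Stop() blocks inside cleanup (here, the checkpoint manager
            // closing its clients), the pipeline stop never completes.
            package sketch

            import "log"

            type asyncListener interface {
                Stop() // blocks until the listener has fully shut down
            }

            func stopAsyncListenerServices(listeners []asyncListener) {
                for _, l := range listeners {
                    l.Stop() // never returns if this listener is stuck in cleanup
                    log.Println("Async Listener has stopped")
                }
            }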

            If anything, given that the checkpoint manager had problems with checkpointing before, it is most likely stuck closing the clients.
            https://github.com/couchbase/goxdcr/blob/e9ae947df57c37a65ee7776a50324e3981d748aa/pipeline_svc/checkpoint_manager.go#L392

            It is possible that a client had already been established and that, due to the intermittent errors, the checkpoint manager’s connection closer hangs, since it must depend on the client itself to finish closing the TCP connection. There is no concrete evidence of this, because XDCR doesn’t log at that level of detail, for the sake of logging brevity.
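            If the close can indeed block indefinitely on a bad connection, one defensive pattern (purely illustrative, not what goxdcr currently does) is to run the close in its own goroutine and stop waiting after a timeout, at the cost of leaking that goroutine:

            // Illustrative sketch only: bound how long we wait for a client Close()
            // that may never return, leaking the closing goroutine in the worst case.
            package sketch

            import (
                "fmt"
                "io"
                "time"
            )

            func closeWithTimeout(c io.Closer, timeout time.Duration) error {
                done := make(chan error, 1)
                go func() {
                    // May block forever if the underlying TCP teardown hangs.
                    done <- c.Close()
                }()
                select {
                case err := <-done:
                    return err
                case <-time.After(timeout):
                    return fmt.Errorf("close did not complete within %v", timeout)
                }
            }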

            yu Yu Sui (Inactive) added a comment - - edited

            The issue is that the checkpoint manager tried to save a checkpoint before the replication stopped. It repeatedly ran into issues saving the checkpoint, though, because the target was inaccessible. The checkpoint manager has a retry mechanism, a timeout, and a finChan. When the timeout was up, which happened at 2018-09-18T18:36:36, the checkpoint manager closed the finChan, hoping that this would get the checkpointing routine to abort right away. This did not happen, though, because the checkpointing routine was retrying at the target stats retrieval step, where the finChan was not yet effective.
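            To make this concrete, here is a minimal sketch in Go (hypothetical names, not the actual checkpoint manager code) of a retry loop that only sleeps between attempts; closing finChan elsewhere has no immediate effect on such a loop, because nothing in it ever reads from that channel:

            // Sketch of a retry loop that does not observe a finish channel. Closing
            // finChan (as the checkpoint manager did at 18:36:36) does not interrupt
            // it; it keeps retrying and sleeping until the attempts are exhausted.
            package sketch

            import "time"

            func retryIgnoringFinChan(op func() error, attempts int, wait time.Duration) error {
                var err error
                for i := 0; i < attempts; i++ {
                    if err = op(); err == nil {
                        return nil
                    }
                    time.Sleep(wait) // finChan is never consulted here
                    wait *= 2
                }
                return err
            }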

            Checkpoint manager finally gave up on retrying on 2018-09-18T18:46:20.

            2018-09-18T18:46:20.209-07:00 ERRO GOXDCR.CheckpointMgr: a8da6785a5cce7dc20c1f861ba93a500/msm/msm Retrieval of high seqno and vbuuid stats failed after 5 retries.

            At the same time, a manual pause and resume of the replication happened, so it is not clear whether the replication would have proceeded to stop successfully if left to its own devices. However, the replication did not start successfully after the manual resume, which is an issue we need to look into. Will do some more digging.

            As for the issue that the checkpoint manager took longer to abort than planned, there are two things we could do:

            1. Use ExponentialBackoffExecutorWithFinishSignal, which can make use of a passed-in finChan and abort quicker (see the sketch after this list).
            2. Let the checkpoint manager’s one-time-checkpoint function abort without waiting for the actual checkpointing routine to exit.
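            As a rough illustration of option 1 (a sketch with assumed names and signature, not necessarily goxdcr's actual ExponentialBackoffExecutorWithFinishSignal), the key change is to select on the finish channel while waiting, so that a closed finChan aborts the retry promptly:

            // Sketch of a backoff executor that honors a finish signal: the select
            // makes a closed finChan abort the wait immediately instead of only
            // after all retries have been exhausted.
            package sketch

            import (
                "errors"
                "time"
            )

            func retryWithFinishSignal(op func() error, attempts int, wait time.Duration, finChan chan bool) error {
                var err error
                for i := 0; i < attempts; i++ {
                    if err = op(); err == nil {
                        return nil
                    }
                    select {
                    case <-finChan:
                        return errors.New("aborted: finish signal received")
                    case <-time.After(wait):
                        wait *= 2 // exponential backoff between attempts
                    }
                }
                return err
            }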

            This may not be the root cause of the stuck issue, though.


            yu Yu Sui (Inactive) added a comment -

            After a second look at the log files, the checkpoint manager did get stuck after it gave up retrying at 2018-09-18T18:46:20. ckmgr.getHighSeqnoAndVBUuidFromTarget() prints high_seqno_and_vbuuid_map before it exits. high_seqno_and_vbuuid_map was never printed, though, indicating that getHighSeqnoAndVBUuidFromTarget() never completed.

            func (ckmgr *CheckpointManager) getHighSeqnoAndVBUuidFromTarget() map[uint16][]uint64 {
                ...
                for serverAddr, vbnos := range ckmgr.target_kv_vb_map {
                    ckmgr.getHighSeqnoAndVBUuidForServerWithRetry(serverAddr, vbnos, high_seqno_and_vbuuid_map)
                }
                ckmgr.logger.Infof("high_seqno_and_vbuuid_map=%v\n", high_seqno_and_vbuuid_map)
                ...

            The only possible scenario I can think of is that there were two nodes in the target cluster. After the checkpoint manager gave up retrying on one node, it moved on to call getHighSeqnoAndVBUuidForServerWithRetry() on the other node. That call never returned, causing the checkpoint manager to get stuck.

            Fix #2 in the previous comment, letting the checkpoint manager’s one-time-checkpoint function abort without waiting for the actual checkpointing routine to exit, would fix the hang in this scenario. It would still leave a stuck routine behind, though.
            In the longer term, we could look into golang Context to see if there are better ways to fix this without leaving any stuck routines.
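            For the longer-term idea, a hedged sketch of how the context package could bound the whole one-time checkpoint instead of relying on a finChan that is only checked at certain steps (the function names here are illustrative, not goxdcr's):

            // Illustrative sketch: wrap the one-time checkpoint in a context with a
            // deadline. The caller stops waiting once the context expires, even if
            // the checkpointing goroutine is still blocked on a target node, which
            // is the residual "stuck routine" mentioned above.
            package sketch

            import (
                "context"
                "time"
            )

            func checkpointOnceWithTimeout(ctx context.Context, doCheckpoint func() error, timeout time.Duration) error {
                ctx, cancel := context.WithTimeout(ctx, timeout)
                defer cancel()

                done := make(chan error, 1)
                go func() { done <- doCheckpoint() }()

                select {
                case err := <-done:
                    return err // checkpointing finished (successfully or not) in time
                case <-ctx.Done():
                    return ctx.Err()
                }
            }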

             

             


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-1351 contains goxdcr commit a1d21e8 with commit message:
            MB-31352 fix replication hanging problem

            People

              Assignee: pavithra.mahamani Pavithra Mahamani
              Reporter: deepkaran.salooja Deepkaran Salooja
              Votes: 0
              Watchers: 7
