The first time the replication to the EC2 cluster was created, we see a lot of network i/o timeout errors in the logs. First, though, the replication passed the pre-requisite checks and was created:
On node 37:
2018-09-18T18:31:20.446-07:00 INFO GOXDCR.XDCRFactory: Target topology retrieved. kvVBMap = map[172.31.5.142:11210:[5… 546… 639] 172.31.5.182:11210:[…
Once it was created, though, we started seeing errors contacting the target hosts:
2018-09-18T18:31:20.156-07:00 ERRO GOXDCR.RemClusterSvc: Failed to refresh remote cluster reference remoteCluster/b8l-T9bSsLzGJhWCfQhVVgnewRpDBaTgoP4RYFbGeDc= since none of the nodes in target node list is accessible. node list = [172.31.9.28:8091 172.31.5.142:8091 172.31.13.49:8091 172.31.5.212:8091 172.31.5.182:8091 172.31.0.239:8091 172.31.15.98:8091 172.31.15.42:8091]
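For context, the refresh behaviour implied by this message is roughly the following. The sketch below is a minimal illustration under that assumption, not GoXDCR source; the function name, signature, and use of net/http are illustrative only. The reference refresh walks the cached target node list, asks each node for /pools/default on its REST port, and only gives up when no node answers at all.

{code:go}
package sketch

import (
	"fmt"
	"net/http"
	"time"
)

// Sketch only: skip every node that cannot be reached, and fail the refresh
// only when the entire target node list is inaccessible.
func refreshRemoteClusterRef(nodeList []string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	for _, node := range nodeList {
		resp, err := client.Get("http://" + node + "/pools/default")
		if err != nil {
			// e.g. "dial tcp 172.31.5.142:8091: i/o timeout"; try the next node
			continue
		}
		resp.Body.Close()
		return nil // at least one node is reachable; the reference can be refreshed
	}
	return fmt.Errorf("none of the nodes in target node list is accessible. node list = %v", nodeList)
}
{code}

The memcached-port connections hit the same i/o timeouts: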
2018-09-18T18:36:23.408-07:00 WARN GOXDCR.Utils: ExponentialBackoffExecutor for GetRemoteMemcachedConnection encountered error (dial tcp 172.31.5.142:11210: i/o timeout). Sleeping 400ms
We also see that GoXDCR backs off between retries, and the wait in between tries can last a while:
2018-09-18T18:35:19.807-07:00 WARN GOXDCR.CheckpointMgr: Failed to construct memcached client for 172.31.5.142:11210, err=dial tcp 172.31.5.142:11210: i/o timeout
2018-09-18T18:35:19.807-07:00 WARN GOXDCR.Utils: ExponentialBackoffExecutor for GetRemoteMemcachedConnection encountered error (dial tcp 172.31.5.142:11210: i/o timeout). Sleeping 3.2s
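To make the timing concrete, here is a minimal sketch of a generic exponential-backoff retry loop, assuming it behaves like the one named in the log; the function execWithBackoff, its signature, and the retry limit are illustrative, not the actual GOXDCR.Utils.ExponentialBackoffExecutor. Each failed attempt doubles the sleep, which is consistent with the waits growing from 400ms at 18:31 to 3.2s by 18:35 once each dial attempt also has to hit its own i/o timeout.

{code:go}
package main

import (
	"fmt"
	"net"
	"time"
)

// Illustrative only: a generic exponential-backoff executor in the spirit of the
// log messages above, not GoXDCR's implementation.
func execWithBackoff(name string, maxRetries int, initialWait time.Duration, op func() error) error {
	wait := initialWait
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = op(); err == nil {
			return nil
		}
		fmt.Printf("ExponentialBackoffExecutor for %s encountered error (%v). Sleeping %v\n", name, err, wait)
		time.Sleep(wait)
		wait *= 2 // 400ms -> 800ms -> 1.6s -> 3.2s -> ...
	}
	return fmt.Errorf("%s failed after max retries. Last error: %v", name, err)
}

func main() {
	// Hypothetical usage, roughly matching the log: retry a dial to the target's
	// memcached port, where each attempt can itself take a long time to time out.
	err := execWithBackoff("GetRemoteMemcachedConnection", 5, 400*time.Millisecond, func() error {
		conn, err := net.DialTimeout("tcp", "172.31.5.142:11210", 20*time.Second)
		if err != nil {
			return err // e.g. "dial tcp 172.31.5.142:11210: i/o timeout"
		}
		return conn.Close()
	})
	if err != nil {
		fmt.Println(err)
	}
}
{code}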
XDCR was then forcibly restarted at:
2018/09/18 18:37:56 child process exited with status 143
Then a replication deletion:
2018-09-18T18:38:25.617-07:00 INFO GOXDCR.AdminPort: doDeleteReplicationRequest
2018-09-18T18:38:25.617-07:00 INFO GOXDCR.AdminPort: Request params: replicationId=a8da6785a5cce7dc20c1f861ba93a500/msm/msm
2018-09-18T18:38:25.623-07:00 INFO GOXDCR.ReplMgr: Deleting replication a8da6785a5cce7dc20c1f861ba93a500/msm/msm
Followed by the remote cluster (RC) deletion:
2018-09-18T18:38:58.175-07:00 INFO GOXDCR.AdminPort: doDeleteRemoteClusterRequest
2018-09-18T18:38:58.186-07:00 INFO GOXDCR.RemoteClusterChangeListener: metakvCallback called on listener RemoteClusterChangeListener with path = /remoteCluster/b8l-T9bSsLzGJhWCfQhVVgnewRpDBaTgoP4RYFbGeDc=
2018-09-18T18:38:58.186-07:00 INFO GOXDCR.RemClusterSvc: metakvCallback called on path = /remoteCluster/b8l-T9bSsLzGJhWCfQhVVgnewRpDBaTgoP4RYFbGeDc=
2018-09-18T18:38:58.186-07:00 INFO GOXDCR.RemClusterSvc: Remote cluster Appd_deep deleted from metadata store
Then XDCR on node 37 was killed again, before the remote cluster reference and the replication were recreated:
[goport(/opt/couchbase/bin/goxdcr)] 2018/09/18 18:43:52 child process exited with status 143
…
2018-09-18T18:44:37.666-07:00 INFO GOXDCR.AdminPort: doCreateRemoteClusterRequest req=&{POST /pools/default/remoteClusters …
…
2018-09-18T18:46:10.589-07:00 INFO GOXDCR.PipelineMgr: Starting the pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm
2018-09-18T18:46:10.589-07:00 INFO GOXDCR.PipelineMgr: Validating pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm
2018-09-18T18:46:10.591-07:00 INFO GOXDCR.ReplMgr: Success adding replication specification a8da6785a5cce7dc20c1f861ba93a500/msm/msm
2018-09-18T18:46:10.591-07:00 INFO GOXDCR.ReplMgr: Replication specification a8da6785a5cce7dc20c1f861ba93a500/msm/msm is created
2018-09-18T18:46:10.591-07:00 INFO GOXDCR.AdminPort: Finished doCreateReplicationRequest call
Then a delete:
2018-09-18T18:46:47.431-07:00 INFO GOXDCR.AdminPort: Finished doDeleteReplicationRequest
Then a create that failed because the EC2 cluster could not be contacted for target bucket validation:
2018-09-18T18:54:35.541-07:00 INFO GOXDCR.AdminPort: doCreateReplicationRequest
2018-09-18T18:54:35.543-07:00 INFO GOXDCR.AdminPort: Request parameters: justValidate=false, fromBucket=msm, toCluster=deep_aws, toBucket=msm, settings=map[optimistic_replication_threshold:256 source_nozzle_per_node:2 log_level:Info checkpoint_interval:600 replication_type:xmem stats_interval:1000 worker_batch_size:500 failure_restart_interval:10 target_nozzle_per_node:2 doc_batch_size_kb:2048 bandwidth_limit:0 compression_type:3]
…
2018-09-18T18:54:39.374-07:00 ERRO GOXDCR.AdminPort: Error creating replication. errorsMap=map[toBucket:Error validating target bucket 'msm'. err=Operation failed after max retries. Last error: Bucket doesn't exist]
Then finally, a create that works:
2018-09-18T18:55:02.288-07:00 INFO GOXDCR.AdminPort: doCreateReplicationRequest
2018-09-18T18:55:02.290-07:00 INFO GOXDCR.AdminPort: Request parameters: justValidate=false, fromBucket=msm, toCluster=deep_aws, toBucket=msm, settings=map[doc_batch_size_kb:2048 stats_interval:1000 optimistic_replication_threshold:256 source_nozzle_per_node:2 log_level:Info worker_batch_size:500 replication_type:xmem checkpoint_interval:600 failure_restart_interval:10 compression_type:3 target_nozzle_per_node:2 bandwidth_limit:0]
…
2018-09-18T18:55:02.466-07:00 INFO GOXDCR.ReplSpecSvc: Successfully retrieved target cluster reference 0x803d30. time taken=158.493889ms
2018-09-18T18:55:02.468-07:00 INFO GOXDCR.ReplSpecSvc: Validated that source bucket and target bucket are not the same. time taken=2.02316ms
2018-09-18T18:55:02.563-07:00 INFO GOXDCR.ReplSpecSvc: Result from remote bucket look up: connStr=ec2-54-89-185-115.compute-1.amazonaws.com:8091, bucketName=msm, targetBucketType=membase, err_target=<nil>, time taken=94.566424ms
2018-09-18T18:55:02.573-07:00 INFO GOXDCR.ReplSpecSvc: GetMemcachedConnection serverAddr=172.23.97.37:11210, bucketName=msm
…
2018-09-18T18:55:05.705-07:00 INFO GOXDCR.AdminPort: Finished doCreateReplicationRequest call
…
2018-09-18T18:55:11.271-07:00 INFO GOXDCR.PipelineMgr: Pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm has been updated successfully
Everything seems fine from this point on, and none of those TCP dial i/o errors recur. The interesting thing is that the last XDCR restart didn’t “fix” the dial error; it went away on its own (see the earlier create that failed at 18:54). The dial errors didn’t show up again until the next day:
2018-09-19T17:00:22.929-07:00 WARN GOXDCR.RemClusterSvc: When refreshing remote cluster reference remoteCluster/Lfhydk6_fIgdxTHlw0EWcNLuZ2Kw5ZYw6BzDOujMKXE=, skipping node ec2-204-236-202-221.compute-1.amazonaws.com:8091 because of error retrieving default pool info from target. err=Failed on calling host=ec2-204-236-202-221.compute-1.amazonaws.com:8091, path=/pools/default, err=Get http://ec2-204-236-202-221.compute-1.amazonaws.com:8091/pools/default: dial tcp 204.236.202.221:8091: i/o timeout, statusCode=0
In the meantime, starting from the create that works, let’s take a look at the other data node, node 38.
The create from node 37 is seen on node 38:
2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplicationSpecChangeListener: metakvCallback called on listener ReplicationSpecChangeListener with path = /replicationSpec/a8da6785a5cce7dc20c1f861ba93a500/msm/msm
2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplSpecSvc: ReplicationSpecServiceCallback called on path = /replicationSpec/a8da6785a5cce7dc20c1f861ba93a500/msm/msm
2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplicationSpecChangeListener: specChangedCallback called on id = a8da6785a5cce7dc20c1f861ba93a500/msm/msm, oldSpec=<nil>, newSpec=&{a8da6785a5cce7dc20c1f861ba93a500/msm/msm MdMjrWfLwqNUMAazTpC0XQ== msm ac272b0f8cccb858dbab37ba6a52666c a8da6785a5cce7dc20c1f861ba93a500 msm fcf5e9438f10a2734aeac4da89a0d5bc 0xc42001f220 [131 108 0 0 0 1 104 2 109 0 0 0 32 50 51 57 52 97 51 51 101 49 54 101 52 48 52 97 51 101 53 50 101 56 57 97 48 57 102 57 101 101 101 97 102 104 2 97 6 110 5 0 121 40 22 213 14 106]}
2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplicationSpecChangeListener: new spec settings=&{xmem true 600 500 2048 10 256 2 2 0 0 Info 1000 0 3 <nil>}
2018-09-18T18:55:05.709-07:00 INFO GOXDCR.ReplicationSpecChangeListener: Starting pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm since the replication spec has been changed to active
But from the pipeline manager message, it seems the pipeline was not actually starting, as it is still in a pending state, with the last error from 18:33:
2018-09-18T18:55:44.407-07:00 INFO GOXDCR.PipelineMgr: Replication Status = map[a8da6785a5cce7dc20c1f861ba93a500/msm/msm:name={a8da6785a5cce7dc20c1f861ba93a500/msm/msm}, status={Pending}, errors={[{"time":"2018-09-18T18:33:26.788282457-07:00","errMsg":"dial tcp 172.31.5.142:11210: i/o timeout"}]}, progress={Pipeline failed to start, err=map[genericPipeline.context.Start:dial tcp 172.31.5.142:11210: i/o timeout]}
]
Looking backwards, it seems the last message of significance was:
2018-09-18T18:46:20.394-07:00 INFO GOXDCR.ReplicationSpecChangeListener: Stopping pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm since the replication spec has been changed to inactive
And we don’t see any action taken by the pipeline manager in response. Looking further back, the last visible action it took was:
2018-09-18T18:33:36.789-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm
2018-09-18T18:33:36.789-07:00 INFO GOXDCR.PipelineMgr: Trying to stop the pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm
2018-09-18T18:33:36.789-07:00 INFO GOXDCR.GenericPipeline: Stopping pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm-453901974
2018-09-18T18:33:36.789-07:00 INFO GOXDCR.CheckpointMgr: Starting checkpointing for pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm before stopping
2018-09-18T18:33:36.790-07:00 INFO GOXDCR.CheckpointMgr: Start one time checkpointing for replication a8da6785a5cce7dc20c1f861ba93a500/msm/msm
…
2018-09-18T18:36:36.789-07:00 INFO GOXDCR.CheckpointMgr: Checkpointing for pipeline a8da6785a5cce7dc20c1f861ba93a500/msm/msm timed out after 3m0s.
The checkpointing timed out, and the pipeline proceeded to stop the async listener services, which include the checkpoint manager itself. It is here that the pipeline manager is held up from completing, because we never see “Async Listener has stopped”.
If anything, given that the checkpoint manager already had problems checkpointing, it is most likely stuck closing the clients:
https://github.com/couchbase/goxdcr/blob/e9ae947df57c37a65ee7776a50324e3981d748aa/pipeline_svc/checkpoint_manager.go#L392
It is possible that a client had already been established, and that the intermittent errors caused the checkpoint manager’s connection-closing code to hang, since it has to rely on the client itself to finish closing the TCP connection. There is no concrete evidence of this, because XDCR doesn’t log at that level of detail, for the sake of logging brevity.
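To make that suspicion concrete, here is a minimal sketch of the shape such a close path could take. It is an assumption for illustration, not the checkpoint_manager.go code linked above, and the type, field, and method names are made up.

{code:go}
package sketch

import (
	"fmt"
	"io"
)

// checkpointMgrSketch stands in for the checkpoint manager, holding one
// established memcached connection per target KV node.
type checkpointMgrSketch struct {
	kvMemClients map[string]io.Closer // keyed by address, e.g. "172.31.5.142:11210"
}

// closeConnections tears the clients down one by one. Nothing bounds how long each
// Close() may take, so a client that blocks while finishing its TCP shutdown against
// an intermittently unreachable target stalls the whole loop, and with it the
// async-listener shutdown that the pipeline manager is waiting on.
func (ckmgr *checkpointMgrSketch) closeConnections() {
	for addr, client := range ckmgr.kvMemClients {
		if err := client.Close(); err != nil {
			fmt.Printf("error closing connection to %v: %v\n", addr, err)
		}
		delete(ckmgr.kvMemClients, addr)
	}
}
{code}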
I'm getting an S3 "AccessDenied" (Access Denied) error when trying to download the log files (RequestId F62C464CE3F3ABF9, HostId PcJ22hG6sVVqaT2dguVscJ8LyUI65eVEs+meBfOqF6JCexFd7oa6a5H+BcgqRvDlAIDbe/FpotQ=).