Arunkumar Senthilnathan (Inactive)
added a comment - Trond Norbye any updates on this? This is affecting the XDCR regular regression pass percentage pretty badly - it is happening in almost all the jobs now.
Arunkumar Senthilnathan I haven't been able to look at all the logs yet, but I did look at the memcached logs yesterday, and from what I could see things looked as expected:
[ns_server:debug,2017-03-05T11:13:41.233-08:00,ns_1@127.0.0.1:compaction_new_daemon<0.408.0>:compaction_scheduler:schedule_next:60]Finished compaction for compact_views too soon. Next run will be in 30s
[ns_server:debug,2017-03-05T11:14:54.439-08:00,ns_1@127.0.0.1:<0.30315.3>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
[ns_server:debug,2017-03-05T11:16:05.921-08:00,ns_1@172.23.105.180:<0.25046.4>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
[ns_server:debug,2017-03-05T11:18:00.624-08:00,ns_1@172.23.105.181:<0.23235.5>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
[ns_server:debug,2017-03-05T11:12:43.928-08:00,ns_1@172.23.107.120:compaction_new_daemon<0.408.0>:compaction_scheduler:schedule_next:60]Finished compaction for compact_kv too soon. Next run will be in 30s
[ns_server:debug,2017-03-05T11:15:12.942-08:00,ns_1@172.23.107.169:compaction_new_daemon<0.25741.4>:compaction_scheduler:schedule_next:60]Finished compaction for compact_views too soon. Next run will be in 30s
[ns_server:debug,2017-03-05T11:10:28.100-08:00,ns_1@172.23.107.171:compaction_new_daemon<0.5862.5>:compaction_scheduler:schedule_next:60]Finished compaction for compact_kv too soon. Next run will be in 30s
Dave Finlay
added a comment - The logs attached to this ticket only go up to 11:18 at the latest. The bucket deletion problem happened at 11:23.
./mb-23161 $ tail -1 */ns_server.debug.log
==> cbcollect_info_ns_1@172.23.105.178_20170305-191254/ns_server.debug.log <==
[ns_server:debug,2017-03-05T11:13:41.233-08:00,ns_1@127.0.0.1:compaction_new_daemon<0.408.0>:compaction_scheduler:schedule_next:60]Finished compaction for compact_views too soon. Next run will be in 30s
==> cbcollect_info_ns_1@172.23.105.179_20170305-191421/ns_server.debug.log <==
[ns_server:debug,2017-03-05T11:14:54.439-08:00,ns_1@127.0.0.1:<0.30315.3>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
==> cbcollect_info_ns_1@172.23.105.180_20170305-191532/ns_server.debug.log <==
[ns_server:debug,2017-03-05T11:16:05.921-08:00,ns_1@172.23.105.180:<0.25046.4>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
==> cbcollect_info_ns_1@172.23.105.181_20170305-191726/ns_server.debug.log <==
[ns_server:debug,2017-03-05T11:18:00.624-08:00,ns_1@172.23.105.181:<0.23235.5>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
==> cbcollect_info_ns_1@172.23.107.120_20170305-191152/ns_server.debug.log <==
[ns_server:debug,2017-03-05T11:12:43.928-08:00,ns_1@172.23.107.120:compaction_new_daemon<0.408.0>:compaction_scheduler:schedule_next:60]Finished compaction for compact_kv too soon. Next run will be in 30s
==> cbcollect_info_ns_1@172.23.107.169_20170305-191429/ns_server.debug.log <==
[ns_server:debug,2017-03-05T11:15:12.942-08:00,ns_1@172.23.107.169:compaction_new_daemon<0.25741.4>:compaction_scheduler:schedule_next:60]Finished compaction for compact_views too soon. Next run will be in 30s
==> cbcollect_info_ns_1@172.23.107.171_20170305-190929/ns_server.debug.log <==
[ns_server:debug,2017-03-05T11:10:28.100-08:00,ns_1@172.23.107.171:compaction_new_daemon<0.5862.5>:compaction_scheduler:schedule_next:60]Finished compaction for compact_kv too soon. Next run will be in 30s
==> cbcollect_info_ns_1@172.23.107.198_20170305-190640/ns_server.debug.log <==
[ns_server:info,2017-03-05T11:07:57.873-08:00,ns_1@172.23.107.198:ns_config_rep<0.244.0>:ns_config_rep:pull_one_node:349]Pulling config from: 'ns_1@172.23.107.171'
Arunkumar Senthilnathan (Inactive)
added a comment - Trond Norbye - This is the issue:
2017-03-12T23:11:27.884-07:00, ns_orchestrator:4:info:message(ns_1@172.23.105.3) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.3'], EjectNodes = ['ns_1@172.23.105.4'], Failed over and being ejected nodes = []; no delta recovery nodes
2017-03-12T23:11:47.910-07:00, ns_rebalancer:0:critical:message(ns_1@172.23.105.3) - Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.105.3',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["default"]}}}]
2017-03-12T23:11:47.911-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.105.3) - Rebalance exited with reason {buckets_shutdown_wait_failed,
[{'ns_1@172.23.105.3',
{'EXIT',
{old_buckets_shutdown_wait_failed,
["default"]}}}]}
-------------------------------
per_node_processes('ns_1@172.23.105.3') =
{<0.32673.5>,
[{registered_name,ns_heart_slow_status_updater},
{status,waiting},
{initial_call,{proc_lib,init_p,5}},
{backtrace,[<<"Program counter: 0x0000000000891840 (unknown function)">>,
<<"CP: 0x0000000000891848 (<terminate process normally>)">>,
<<"arity = 3">>,<<" ns_heart">>,
<<" slow_updater_loop">>,<<" []">>,<<>>]},
{error_handler,error_handler},
{garbage_collection,[{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,512},
{minor_gcs,4}]},
{heap_size,22},
{total_heap_size,22},
{links,[<0.280.0>]},
{monitors,[]},
{monitored_by,[]},
{memory,1016},
{messages,[]},
{message_queue_len,0},
{reductions,2197192},
{trap_exit,false},
{current_location,{erlang,hibernate,3,[]}},
{dictionary,[{'$ancestors',[ns_heart_sup,ns_server_sup,
ns_server_nodes_sup,<0.170.0>,
ns_server_cluster_sup,<0.88.0>]},
{'$initial_call',{erlang,apply,2}}]}]}
{<0.32668.5>,
[{registered_name,[]},
{status,waiting},
{initial_call,{proc_lib,init_p,5}},
{backtrace,[<<"Program counter: 0x00007f2269886390 (ns_pubsub:do_subscribe_link/4 + 392)">>,
<<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
<<>>,
<<"0x00007f22706cfcd8 Return addr 0x00007f226de8e640 (proc_lib:init_p_do_apply/3 + 56)">>,
<<"y(0) []">>,
<<"y(1) {ns_pubsub,#Ref<0.0.3.94025>}">>,
<<"y(2) <0.32667.5>">>,<<"y(3) buckets_events">>,
<<>>,
<<"0x00007f22706cfd00 Return addr 0x0000000000891848 (<terminate process normally>)">>,
<<"y(0) Catch 0x00007f226de8e660 (proc_lib:init_p_do_apply/3 + 88)">>,
<<>>]},
{error_handler,error_handler},
{garbage_collection,[{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,512},
{minor_gcs,0}]},
{heap_size,233},
{total_heap_size,233},
{links,[<0.32667.5>,<0.275.0>]},
{monitors,[]},
{monitored_by,[]},
{memory,2744},
{messages,[]},
{message_queue_len,0},
{reductions,21},
{trap_exit,true},
{current_location,{ns_pubsub,do_subscribe_link,4,
[{file,"src/ns_pubsub.erl"},{line,125}]}},
{dictionary,[{'$ancestors',[ns_heart,ns_heart_sup,ns_server_sup,
ns_server_nodes_sup,<0.170.0>,
ns_server_cluster_sup,<0.88.0>]},
{'$initial_call',{ns_pubsub,do_subscribe_link,4}}]}]}
{<0.32667.5>,
....
Please refer to the latest 172.23.105.3.zip - the bucket deletion times out, after which the rebalance issue occurs; the bucket eventually gets deleted, but the rebalance failure fails the test and affects the following tests too.
Trond Norbye
added a comment - I looked at the logs, and for some reason two of the nodes start the deletion task ~5 minutes later than the first one. We need to figure out why that happens (when the message is sent from ns_server, and whether there is too much work queued on that worker thread, causing the insanely large delay).
Dave Finlay
added a comment - - edited Trond - Arun did lots of bucket deletes (& recreates) so it's possible you got some of the different traces confused. The problem delete is the one that occurred at 23:10:57. We know it starts because we see the following in the logs:
[ns_server:debug,2017-03-12T23:10:57.809-07:00,ns_1@172.23.105.3:ns_bucket_worker<0.378.0>:ns_bucket_sup:update_children:116]Stopping child for dead bucket: {{single_bucket_kv_sup,"default"},
<0.14280.5>,supervisor,
[single_bucket_kv_sup]}
We also know because 30 seconds later (which is the timeout) we see the REST API request fail with a 500.
172.23.107.140 - Administrator [12/Mar/2017:23:11:27 -0700] "DELETE /pools/default/buckets/default HTTP/1.1" 500 64 - Python-httplib2/$Rev: 259 $
At 23:10:57 there are 2 nodes in the cluster: .105.3 and .105.4. We know this because as soon as the delete bucket fails, Arun attempts to start a rebalance (the one that ultimately fails) and those are the only nodes in the cluster.
[user:info,2017-03-12T23:11:27.884-07:00,ns_1@172.23.105.3:<0.1474.0>:ns_orchestrator:idle:689]Starting rebalance, KeepNodes = ['ns_1@172.23.105.3'], EjectNodes = ['ns_1@172.23.105.4'], Failed over and being ejected nodes = []; no delta recovery nodes
Grepping the memcached log for "Delete bucket" on .105.4 reveals what looks like healthy behavior, i.e. the call is received at 23:10:57 and 8 seconds later the bucket is shut down.
$ fgrep "Delete bucket" memcached.log | fgrep T23:1
2017-03-12T23:10:57.918639-07:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
2017-03-12T23:10:57.944124-07:00 NOTICE 42 Delete bucket [default]. Wait for clients to disconnect
2017-03-12T23:10:57.944147-07:00 NOTICE 581295832 Delete bucket [default]. Still waiting: 14 clients connected
2017-03-12T23:10:58.945203-07:00 NOTICE 581295832 Delete bucket [default]. Still waiting: 1 clients connected
...
2017-03-12T23:11:03.945193-07:00 NOTICE 581295832 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:11:04.841212-07:00 NOTICE 42 Delete bucket [default]. Shut down the bucket
2017-03-12T23:11:05.153639-07:00 NOTICE 42 Delete bucket [default]. Clean up allocated resources
2017-03-12T23:11:05.153761-07:00 NOTICE 42 Delete bucket [default] complete
.105.3 is where the problem is. Grepping for delete bucket there reveals:
2017-03-12T23:10:58.034612-07:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
2017-03-12T23:10:58.053923-07:00 NOTICE 42 Delete bucket [default]. Wait for clients to disconnect
2017-03-12T23:10:58.054038-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 11 clients connected
2017-03-12T23:10:59.055189-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
...
2017-03-12T23:13:22.055160-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:13:23.055196-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:13:24.055193-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:13:25.055153-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:13:25.647730-07:00 NOTICE 42 Delete bucket [default]. Shut down the bucket
2017-03-12T23:13:25.910250-07:00 NOTICE 42 Delete bucket [default]. Clean up allocated resources
2017-03-12T23:13:25.910366-07:00 NOTICE 42 Delete bucket [default] complete
The bucket delete command is received at 23:10:58 (i.e. pretty much right away) but there's still one client hanging on for about 2 1/2 minutes and the bucket shutdown eventually happens at 23:13:25.
Arunkumar Senthilnathan : Is the default bucket empty? You are deleting and recreating it pretty furiously. Are you also loading it with data?
Trond Norbye
added a comment - From what I can see, the entire bucket deletion task takes 2:28 minutes (23:10:58 - 23:13:26)? (How fast do we expect a bucket deletion to be? What if a compaction is running? etc.)
The entire wait is for the client on socket 47 to complete its transfer so that we can disconnect it. For a short while we did try a more aggressive disconnect logic, but that made us crash in other locations. Back to the client on socket 47: the send buffer on that socket is full, which means that the transfer is currently blocked. In the log we can read:
2017-03-12T23:05:36.060781-07:00 NOTICE 47: HELO [Goxdcr PipelineSupervisor SourceBucket:default TargetBucket:default] TCP NODELAY [ 172.23.105.3:33484 - 172.23.105.3:11210 (_admin) ]
And for the connection that is currently blocked:
"socket": 47,
"protocol": "memcached",
"peername": "172.23.105.3:33484",
"sockname": "172.23.105.3:11210",
"state": "conn_write",
"msglist":
{
"size": 5,
"used": 1,
"curr": 0,
"bytes": 3582842
}
,
Once the client (xdcr) closes the connection on the other end (the process was most likely killed by the test framework), memcached is unblocked and the deletion completes.
Trond Norbye
added a comment - We've been talking a bit about this bug, and the fact that a client may block the deletion is a problem we've had for a while.
An interesting question here is why the Goxdcr PipelineSupervisor does not drain its input buffer so that the server can finish transmitting the data to the client.
Dave Finlay
added a comment - - edited Assigning to Yu Sui .
Yu - would you mind taking a look at this ticket? In particular can you look at the xdcr logs and the memcached logs contained in 172.23.105.3 from the 13th of March? (No need to download all the other stuff.) Trond notices that the "pipeline supervisor" connection isn't reading from the socket buffer – which causes memcached to wait and the bucket deletion to get delayed. Memcached should be able to abruptly terminate these connections, but this is not as simple as it sounds for reasons internal to memcached.
Anyway, if you have insight it would be good to know why the XDCR client wasn't reading from the socket.
( Trond Norbye - can you post indications from the memcached log that XDCR wasn't reading from the socket?)
Trond Norbye
added a comment - If you look in the memcached.log for .105.3 you'll see all of the lines starting with: Worker thread 1: {"connection":"0x7f32d8deca80","socket":47, and all of them have the same value for "total_send":4237662 .
Yu Sui (Inactive)
added a comment - The connection set up by xdcr pipeline supervisor timed out, which may be the reason it did not continue to read from the socket.
PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Error getting dcp stats for kv 172.23.105.3:11210. err=read tcp 172.23.105.3:33484->172.23.105.3:11210: i/o timeout
I will check what the timeout setting is in gomemcached, which is used to set up the connection. Maybe I could increase the timeout setting, and also close the connection and set up a new one when a timeout is still seen.
Dave Finlay
added a comment - I see some interesting stuff in the logs. This pipeline supervisor is having a hard time – it continually seems to be unable to get DCP stats due to timeouts. Then it gets torn down when the bucket is deleted, gets rebuilt when the bucket is recreated, and proceeds to time out getting DCP stats again.
Separately, the last incarnation of the pipeline supervisor goes as follows. First it gets created:
2017-03-12T23:03:36.045-07:00 INFO GOXDCR.GenericSupervisor: Starting supervisor PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default.
2017-03-12T23:03:36.045-07:00 INFO GOXDCR.GenericSupervisor: Started supervisor PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default.
2017-03-12T23:03:36.046-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default monitorPipelineHealth started
It takes memcached precisely 2 minutes to figure out that it exists. (Weird)
2017-03-12T23:05:36.060781-07:00 NOTICE 47: HELO [Goxdcr PipelineSupervisor SourceBucket:default TargetBucket:default] TCP NODELAY [ 172.23.105.3:33484 - 172.23.105.3:11210 (_admin) ]
Precisely 2 minutes later XDCR reports a timeout getting the DCP stats. And another timeout 2 minutes after that.
2017-03-12T23:07:36.060-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Error getting dcp stats for kv 172.23.105.3:11210. err=read tcp 172.23.105.3:33484->172.23.105.3:11210: i/o timeout
2017-03-12T23:07:36.060-07:00 ERRO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Failed to get dcp stats. Skipping dcp health check.
2017-03-12T23:09:36.062-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Error getting dcp stats for kv 172.23.105.3:11210. err=write tcp 172.23.105.3:33484->172.23.105.3:11210: i/o timeout
2017-03-12T23:09:36.062-07:00 ERRO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Failed to get dcp stats. Skipping dcp health check.
Then finally, once the bucket is deleted, XDCR finds out more or less immediately that the DCP feed has been closed:
2017-03-12T23:10:58.061-07:00 ERRO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Received error report : map[dcp_615ceec9956939205f5bdc3561f0ccf4/default/default_172.23.105.3:11210_0:DCP upr feed has been closed.]
2017-03-12T23:10:58.061-07:00 INFO GOXDCR.ReplMgr: Supervisor PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default of type *supervisor.GenericSupervisor reported errors map[dcp_615ceec9956939205f5bdc3561f0ccf4/default/default_172.23.105.3:11210_0:DCP upr feed has been closed.]
But according to Trond memcached is trying to send data and is reporting as much in these traces:
./cbcollect_info_ns_1@172.23.105.3_20170313-064501 $ fgrep 0x7f32d8deca80 memcached.log | head -1
2017-03-12T23:10:58.054195-07:00 NOTICE Worker thread 1: {"connection":"0x7f32d8deca80","socket":47,"protocol":"memcached","peername":"172.23.105.3:33484","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32e6148f00","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283130","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d802c000","curr":"0x7f32d802c01b","size":2048,"bytes":0},"write":{"buf":"0x7f32e40ce800","curr":"0x7f32f4a00000","size":2048,"bytes":3582842},"write_and_go":"conn_new_cmd","ritem":"0x7f32d802c01b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f4a9fd58","len":"0x2cae22"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3582842},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":4237662}
Yu Sui : Can you explain why you don't tear down the connection when you find out that "the upr feed has been closed"? When you log that, is it just for one vbucket? Also: can you explain what timeouts are where and how long they are? (My guess is that increasing timeouts is not going to fix anything here.)
Once we have this information, I'll follow up with Trond on the questions I have for him.
Yu Sui (Inactive)
added a comment - - edited When XDCR finds out that "upr feed has been closed", it first performs a checkpoint operation, then shuts down and restarts. From the goxdcr log, the checkpoint op did not complete until 23:13:25, which was also when logs of the connection disappeared from the memcached log.
As mentioned earlier, XDCR uses gomemcached API to set up the connection. I did not find any place in gomemcached (or XDCR) that explicitly sets timeout on such connection.
Trond, does memcached impose default read/write timeout for memcached connections?
Dave Finlay
added a comment - - edited Yu - I think this is where the timeout is coming from: http://src.couchbase.org/source/xref/trunk/goproj/src/github.com/couchbase/goxdcr/base/connection_pool.go#130 (20 seconds).
One note for folks following along at home: the precise 2 minute gaps are due to XDCRs periodic polling of memcached to get the stats info.
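For readers following along, here is a rough Go sketch of how a per-request deadline like the 20-second one in connection_pool.go interacts with the 2-minute polling cadence. This is NOT the goxdcr or gomemcached source; the address, request payload, and function names are placeholders. It only shows how a deadline on a stats connection surfaces as the "i/o timeout" errors quoted above.

package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

// pollDcpStats issues a placeholder stats request every `interval`, with a
// `timeout` deadline on each request (illustrative of the cadence and timeout
// discussed in this ticket).
func pollDcpStats(addr string, interval, timeout time.Duration) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	buf := make([]byte, 64*1024)

	for range ticker.C {
		// The deadline covers both sending the request and reading the reply.
		// If the server is slow, or the connection is stuck, Read/Write fail
		// with an error whose text contains "i/o timeout".
		conn.SetDeadline(time.Now().Add(timeout))
		if _, err := conn.Write([]byte("placeholder stats request")); err != nil {
			fmt.Println("Error getting dcp stats:", err)
			continue
		}
		if _, err := conn.Read(buf); err != nil {
			fmt.Println("Error getting dcp stats:", err)
			continue
		}
	}
}

func main() {
	// Illustrative values only: 2-minute health-check cadence, 20 s deadline.
	pollDcpStats("127.0.0.1:11210", 2*time.Minute, 20*time.Second)
}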
Yu Sui (Inactive)
added a comment - Dave,
You are right. The specified code does set timeout to 20 seconds.
Surprisingly, it is not what is causing the problem. I logged the time it took for the "get dcp stats" request to time out, and it was in the range of milliseconds. Removing the 20 second timeout did not help the situation.
PipelineSupervisor_0c94b5223c86eaa226e1dd02c63461f2/sasl_bucket_1/sasl_bucket_1 making getting dcp stats for kv 10.0.0.10:12000.
goxdcr.log:2017-03-22T10:09:23.574-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_0c94b5223c86eaa226e1dd02c63461f2/sasl_bucket_1/sasl_bucket_1 Error getting dcp stats for kv 10.0.0.10:12000. err=read tcp 10.0.0.10:61742->10.0.0.10:12000: i/o timeout. time=12.949689ms
goxdcr.log:2017-03-22T10:18:39.587-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_4a17125b4ca3ed1119fe08dba7206126/standard_bucket_1/standard_bucket_1 Error getting dcp stats for kv 10.0.0.10:12000. err=write tcp 10.0.0.10:53191->10.0.0.10:12000: i/o timeout. time=24.245µs
Something is seriously wrong with the connections. Need to do some more digging.
Not all "dcp stats" requests fail. The succeeded ones took ~100ms. It is not obvious why some other requests timed out after 10ms. Around the time when timeout happened, there were no relevant logs in memcached log.
Trond, can you shed some light from memcached side on why this could have happened? If there are some experiments we can try to pinpoint the problem, I would be happy to carry them out.
goxdcr.log:2017-03-22T10:31:53.297-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_1347d997312ad52a3c21c3d3acfd2afe/sasl_bucket_1/sasl_bucket_1 making getting dcp stats for kv 10.0.0.10:12000.
goxdcr.log:2017-03-22T10:31:53.422-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_1347d997312ad52a3c21c3d3acfd2afe/sasl_bucket_1/sasl_bucket_1 success getting dcp stats for kv 10.0.0.10:12000. time=125.164374ms
Yu Sui (Inactive)
added a comment - - edited Not all "dcp stats" requests fail. The succeeded ones took ~100ms. It is not obvious why some other requests timed out after 10ms. Around the time the timeouts happened, there were no relevant logs in the memcached log.
Trond, can you shed some light from memcached side on why this could have happened? If there are some experiments we can try to pinpoint the problem, I would be happy to carry them out.
goxdcr.log:2017-03-22T10:31:53.297-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_1347d997312ad52a3c21c3d3acfd2afe/sasl_bucket_1/sasl_bucket_1 making getting dcp stats for kv 10.0.0.10:12000.
goxdcr.log:2017-03-22T10:31:53.422-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_1347d997312ad52a3c21c3d3acfd2afe/sasl_bucket_1/sasl_bucket_1 success getting dcp stats for kv 10.0.0.10:12000. time=125.164374ms
Dave Finlay
added a comment - Yu - these measurements you took - these were on a separate cluster, right? (I don't see these traces in the logs.) Were you able to reproduce the bucket deletion issue on that situation?
Trond Norbye
added a comment - Yu Sui what happens on the client side when an operation times out? Does it close the connection? You could run mctimings with -v to get a histogram of the execution times recorded by memcached.
The "dcp" stat sub-group is handled by the underlying bucket and it's been a while since I looked into that so I don't recall from the top of my head all of locks involved in that stat (and the amount of output it creates). Each of the KV-pair added in the stat adds 24 bytes overhead and all of this is spooled into a temporary buffer which grows (before it is being sent), which could result in a number of reallocations if the output is large. (We need to measure this)
Dave Finlay
added a comment - Yu - separate question.
The bucket in question in these logs ("default") gets deleted and recreated a bunch of times, which results in the addition and removal of the pipeline supervisor.
$ egrep "((Removing)|(Adding)).*PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4.*Master" ns_server.goxdcr.log
2017-03-12T22:36:05.065-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
2017-03-12T22:41:35.826-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
2017-03-12T22:41:35.979-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
2017-03-12T22:46:57.046-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
2017-03-12T22:49:25.281-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
2017-03-12T22:50:06.871-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
2017-03-12T22:50:07.100-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
2017-03-12T22:55:21.597-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
2017-03-12T22:55:21.785-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
2017-03-12T23:00:45.980-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
2017-03-12T23:03:35.692-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
2017-03-12T23:13:25.646-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
When the stats connection comes back after the pipeline supervisor is removed and added back – does it create a new connection, or use an existing one (from the pool)?
Basically I'm wondering if this connection got into a bad state sometime in the past. And now the send buffer is full on the memcached side - so stats requests will always time out.
Yu Sui (Inactive)
added a comment - - edited On the client side, after receiving a timeout error, the connection is kept alive for the next request. The connection is closed only if we receive errors for three consecutive requests.
When the pipeline supervisor is removed and added back, the old connection is closed and a new one is set up.
I will modify the code so that a new connection is set up whenever a timeout error is received. This does not explain or fix the issue of why the first timeout happened after 10ms, though. Yes, the dcp stats could be huge, and the following scenario is within expectation:
1) a request for dcp stats times out after 20 seconds.
2) the connection ends up in a bad state.
3) a second request for dcp stats times out immediately (after 10 ms) because of the bad connection state.
The observation is that the first request timed out after 10ms, which cannot be explained by the server taking too long to assemble the response.
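A minimal Go sketch of the kind of change Yu describes (this is not the actual XDCR patch; the request payload and function names are hypothetical): on a timeout the client closes the stuck socket and dials a fresh one, so memcached's pending write fails instead of holding up the bucket delete.

package sketch

import (
	"net"
	"time"
)

// statsRequest sends a placeholder stats request on conn with a per-request deadline.
func statsRequest(conn net.Conn, timeout time.Duration) error {
	conn.SetDeadline(time.Now().Add(timeout))
	if _, err := conn.Write([]byte("placeholder stats request")); err != nil {
		return err
	}
	buf := make([]byte, 64*1024)
	_, err := conn.Read(buf)
	return err
}

// statsWithReconnect retries on timeout with a brand-new connection instead of
// reusing the old one. Closing the old socket is the important part: the
// server's blocked write then fails ("Connection reset by peer") and the
// bucket deletion is no longer held up by this client.
func statsWithReconnect(conn net.Conn, addr string, timeout time.Duration) (net.Conn, error) {
	err := statsRequest(conn, timeout)
	if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
		conn.Close()
		return net.Dial("tcp", addr)
	}
	return conn, err
}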
Trond Norbye
added a comment - Could you run:
mctimings -u Administrator -P admin-password -h hostname -v STAT
that should give you a histogram of the service times for the stat command on the server. Unfortunately it does not keep track of which stat group you requested, and some of the stats are super-small and fast (but we should get a hint of how slow they might have been).
If you're going to modify the client, could you have it close the socket as part of a timeout?
Trond Norbye
added a comment - From the timings data you have 30 requests slower than 10ms. Unfortunately we can't map those stats commands to the different sub commands... You could alternatively try to run
mcstat -u Administrator -P <password> -b bucket dcp > /dev/zero
a number of times while you've got all of the DCP streams open, and then see if they all fall into the 10ms range.
Trond Norbye
added a comment - I ran a small test on my machine to get a feel for this, and the dcp stat group seems to be "super slow"... On my macbook with a single bucket without any items and no other traffic, I set up 5 different DCP streams (all including all of the vbuckets)...
Looping requesting the dcp stat group results in:
[ 1 - 1]ms ( 79.87%) 998 | ############################################
[ 2 - 2]ms ( 93.12%) 661 | #############################
[ 3 - 3]ms ( 95.47%) 117 | #####
[ 4 - 4]ms ( 97.21%) 87 | ###
[ 5 - 5]ms ( 97.87%) 33 | #
And this is on a macbook with 16GB memory being "completely idle" except for requesting the dcp stat.
The "sad" part of this stat is that it request all the stats for all of the connected DCP streams, and I'm pretty sure no one actually cares about all of them...
Which portion of the dcp stat are you interested in? (I guess being able to filter out a given stream on the server would make your life easier as well?)
Daniel Owen
added a comment - - edited I see:
2017-03-12T23:10:59.055189-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:10:59.055567-07:00 NOTICE Worker thread 1: {"connection":"0x7f32d8deca80","socket":47,"protocol":"memcached","peername":"172.23.105.3:33484","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32e6148f00","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283130","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d802c000","curr":"0x7f32d802c01b","size":2048,"bytes":0},"write":{"buf":"0x7f32e40ce800","curr":"0x7f32f4a00000","size":2048,"bytes":3582842},"write_and_go":"conn_new_cmd","ritem":"0x7f32d802c01b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f4a9fd58","len":"0x2cae22"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3582842},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":4237662}
...
2017-03-12T23:13:25.055153-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:13:25.055381-07:00 NOTICE Worker thread 1: {"connection":"0x7f32d8deca80","socket":47,"protocol":"memcached","peername":"172.23.105.3:33484","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32e6148f00","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283130","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d802c000","curr":"0x7f32d802c01b","size":2048,"bytes":0},"write":{"buf":"0x7f32e40ce800","curr":"0x7f32f4a00000","size":2048,"bytes":3582842},"write_and_go":"conn_new_cmd","ritem":"0x7f32d802c01b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f4a9fd58","len":"0x2cae22"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3582842},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":4237662}
2017-03-12T23:13:25.647373-07:00 WARNING Failed to write, and not due to blocking: Connection reset by peer
2017-03-12T23:13:25.647503-07:00 NOTICE 47 - Hard error, closing connection
2017-03-12T23:13:25.647525-07:00 NOTICE 47 The connected bucket is being deleted.. disconnecting
It was stuck for approx 2.5 minutes and was not closed until we got a hard error, at which point we closed the connection.
Daniel Owen
added a comment - Also happened here:
2017-03-12T23:00:27.503176-07:00 NOTICE 3627417976 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:00:27.503803-07:00 NOTICE Worker thread 3: {"connection":"0x7f32d8dece00","socket":51,"protocol":"memcached","peername":"172.23.105.3:33246","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32d8d12540","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283390","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d7cb4000","curr":"0x7f32d7cb401b","size":8192,"bytes":0},"write":{"buf":"0x7f32db35f800","curr":"0x7f32f4400000","size":2048,"bytes":3456407},"write_and_go":"conn_new_cmd","ritem":"0x7f32d7cb401b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f447fde0","len":"0x2cbfb7"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3456407},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":3980291}
...
2017-03-12T23:00:45.505240-07:00 NOTICE 3627417976 Delete bucket [default]. Still waiting: 1 clients connected
2017-03-12T23:00:45.505858-07:00 NOTICE Worker thread 3: {"connection":"0x7f32d8dece00","socket":51,"protocol":"memcached","peername":"172.23.105.3:33246","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32d8d12540","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283390","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d7cb4000","curr":"0x7f32d7cb401b","size":8192,"bytes":0},"write":{"buf":"0x7f32db35f800","curr":"0x7f32f4400000","size":2048,"bytes":3456407},"write_and_go":"conn_new_cmd","ritem":"0x7f32d7cb401b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f447fde0","len":"0x2cbfb7"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3456407},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":3980291}
2017-03-12T23:00:45.981193-07:00 WARNING Failed to write, and not due to blocking: Connection reset by peer
2017-03-12T23:00:45.981288-07:00 NOTICE 51 - Hard error, closing connection
2017-03-12T23:00:45.981298-07:00 NOTICE 51 The connected bucket is being deleted.. disconnecting
It was stuck for approx 18 seconds, and again was not closed until we got a hard error.
Daniel Owen
added a comment - - edited Both stuck connections are in conn_write state.
Both also have ewouldblock set to false.
The write buffers are also non-empty for both:
socket 47 - "bytes":3582842
socket 51 - "bytes": 3456407
After talking with Dave Rigby, the likelihood is that the TCP/IP write buffer for the connection is full.
Trond Norbye
added a comment - Daniel Owen we've figured that out earlier (we don't try to force-close connections that are blocked on write)... My wild guess here is that the Go client timed out the operation and is no longer trying to read any data from the socket (while there is still data to be sent...)
I am currently looking into (other) delete bucket failures we're seeing on the CI infrastructure, and it could be that we can fix this issue as part of that.
BUT as a separate issue we should probably also fix the client to be more "well behaved" and possibly send less data (and speed up the stats call, because such a slow command will affect all of the other connections bound to the same worker thread).
Dave Finlay
added a comment - Daniel, Trond: if the client changes to have an infinite timeout on this connection, that would help, right? I.e. no matter how long it takes, the response is going to get read? Alternatively, the client could time out and tear down the connection – though it seems like the timeout is going to need to be longer than 20s.
Also: is there any way we can speed up this stats call? No stats call should take 2+ minutes.
Also: Yu - can you answer Trond's question about how you use the information returned in the stats call?
Trond Norbye
added a comment - I doubt the stat call is taking 2 minutes (it would have been logged if it were)... What I think happens is roughly this:
1) The client sends the stats call.
2) The client times out waiting for the reply (this could be for a number of reasons: the same server thread could be busy so the op isn't processed immediately, and we know this stats call is also potentially slow).
3) When the client times out it no longer reads from the socket, but starts doing other stuff.
4) The server starts sending the response... it is a lot of data, so it doesn't fit in the kernel send queue and the write "blocks".
We are now in a state where the client doesn't know that the server is blocked writing to a socket the client no longer cares about, because it timed out the operation but left the socket alone.
If the client closes the socket as part of the timeout, memcached will fail writing to that socket and we won't get into this state.
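For illustration, a minimal Go sketch of the "close the socket on timeout" behaviour described above, assuming a plain net.Conn to memcached (the real goxdcr/gomemcached wiring differs, and the function name is made up):

package xdcrsketch // hypothetical package, not the goxdcr source tree

import (
	"errors"
	"net"
	"time"
)

// readStatsWithTimeout reads the stats response under a deadline. If the
// deadline fires, the socket is closed rather than abandoned, so memcached's
// pending write fails instead of blocking until the bucket is deleted.
func readStatsWithTimeout(conn net.Conn, timeout time.Duration) ([]byte, error) {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return nil, err
	}
	var resp []byte
	buf := make([]byte, 4096)
	for {
		n, err := conn.Read(buf)
		resp = append(resp, buf[:n]...)
		if err != nil {
			var nerr net.Error
			if errors.As(err, &nerr) && nerr.Timeout() {
				conn.Close() // the key part: don't leave the socket half-open
			}
			return resp, err
		}
		// A real client would stop once the terminating STAT packet
		// (empty key) has been parsed; that parsing is omitted here.
	}
}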
Daniel Owen
added a comment - Sorry Trond Norbye, I had not read the history... I have now; however, I don't understand what the XDCR timeout is for the stats call.
From a previous comment by Yu: "The succeeded ones took ~100ms. It is not obvious why some other requests timed out after 10ms."
We are not seeing any slow operations in the memcached logs other than for the bucket deletes (my understanding is that we log if a response takes longer than 1s).
Dave Finlay I will revisit this stats call and see what improvements we can make to speed it up.
Dave Finlay
added a comment - Dan - the timeout on the XDCR side is 20 s.
Trond is also suggesting we may want to think about splitting this command out, which might be a reasonable idea.
For now, though, on this ticket I think the next step should be to fix things on the client side.
Daniel Owen
added a comment - Dave Finlay The bit I'm confused about is that yesterday Yu stated:
"The specified code does set timeout to 20 seconds.
Surprisingly, it is not what is causing the problem. I logged the time it took for the "get dcp stats" request to time out, and it was in the range of milliseconds. Removing the 20 second timeout did not help the situation. "
Dave Finlay
added a comment - Yu had run a local test on his side and saw that (1) he managed to reproduce the "i/o timeout" and (2) it timed out in 12 ms. (Read his comment closely.) Obviously, increasing the 20 s timeout when the observed timeout happens after 12 ms is going to make no difference.
Guys - let's stop spamming the ticket and let's give Yu some time to make progress on his side.
Yu Sui (Inactive)
added a comment - - edited The DCP stat that XDCR needs is the number of remaining items on a DCP connection. An example is as follows:
eq_dcpq:xdcr:dcp_f58e0727200a19771e4459925908dd66/default/target_10.17.2.102:12000_0:items_remaining
Yes, things would be much better if DCP provided a filtering capability so that only the relevant stats are returned.
My current plan is:
1) Leave the current 20-second timeout alone - it is long enough and is not the cause of the timeouts observed.
2) When seeing timeout errors, close the memcached connection and set up a new one. This should fix the bucket deletion issue.
The open question is why the connection timed out in 10ms when the specified timeout is 20 seconds.
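To illustrate why server-side filtering would help, a rough Go sketch (names are assumptions, not the actual goxdcr code) of what the client effectively has to do today: walk the entire DCP stats dump just to pick out the items_remaining entries for its own connections.

package xdcrsketch // hypothetical

import "strings"

// itemsRemaining picks the per-connection items_remaining entries for a
// given XDCR connection-name prefix out of a full DCP stats dump
// (stat key -> value), e.g. "eq_dcpq:xdcr:...:items_remaining" -> "42".
func itemsRemaining(stats map[string]string, connPrefix string) map[string]string {
	out := make(map[string]string)
	for key, val := range stats {
		if strings.HasPrefix(key, connPrefix) && strings.HasSuffix(key, ":items_remaining") {
			out[key] = val
		}
	}
	return out
}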
Aruna Piravi (Inactive)
added a comment - - edited Tests are running fine so far on build 5.0.0-2412; the 18 that ran all passed - http://qa.sc.couchbase.com/view/XDCR/job/cen006-p1-xxdcr-vset07-00-goxdcr-rebalance/543/console
This is a long-running job and will take 8-10 more hours to complete.
Arunkumar Senthilnathan (Inactive)
added a comment - Issue found again in 5.0.0-2412 - please find logs here:
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/23161.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.106.251.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.106.115.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.107.161.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.121.161.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.121.162.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.121.163.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.121.164.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.108.169.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.108.170.zip
Bucket deletion issue happens at test_21 here: http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-00-goxdcr-rebalance/549/consoleText
All the following tests fail due to the rebalance issue observed before. For some reason, when this error occurs the collected logs are corrupted - an error like "http://172.23.105.27:8091/diag IncompleteRead(12721 bytes read)" is thrown - so we let the job run, and when the error occurred we stopped the run and collected logs on the cluster manually. Please note that the bucket deletion error first occurs at timestamp 2017-03-28 07:09:40,395.
Dave Rigby
added a comment - Yu Sui
The DCP stat that XDCR needs is the number of remaining items on a DCP connection. An example is as follows:
eq_dcpq:xdcr:dcp_f58e0727200a19771e4459925908dd66/default/target_10.17.2.102:12000_0:items_remaining
Yes, things would be much better if DCP provided a filtering capability so that only the relevant stats are returned.
Why does XDCR need this statistic? It shouldn't need to care about how much data is remaining - and moreover the stat is not consistent with what's actually remaining as far as the client is concerned (it won't count anything in flight between the two ends of the TCP/IP connection).
This seems to be an extra public interface which I wasn't aware KV-engine needed to support.
Yu Sui (Inactive)
added a comment - This is part of our effort toward self-monitoring and self-healing. The idea is that if there are items remaining in DCP, yet we have not received anything from the DCP streams for a while, we declare DCP to be stuck and re-set up the DCP connection to recover. For this purpose, the stat does not need to be 100% accurate.
This code was put in because we did see DCP stuckness issues before. XDCR sets up DCP connections through gomemcached. It is possible that the issues were caused by bugs in gomemcached. Regardless of whether the issues still exist today, the check on our side can serve as a catch-all mechanism to get us out of the stuck state should we fall into it for whatever reason.
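A minimal sketch of the self-healing rule described above, with made-up names and thresholds (not the actual goxdcr implementation): if the producer still reports items remaining but nothing has arrived on the streams for a grace period, treat DCP as stuck and reset the connection.

package xdcrsketch // hypothetical

import (
	"sync/atomic"
	"time"
)

// dcpWatchdog captures the stuckness heuristic: items remaining on the
// producer plus prolonged silence on the consumer => reset the connection.
type dcpWatchdog struct {
	lastReceivedUnixNano int64         // updated by the stream reader on every message
	silenceLimit         time.Duration // grace period, e.g. a few minutes (assumed value)
	reset                func()        // tears down and re-creates the DCP connection
}

// noteMessage is called whenever a mutation/deletion/snapshot marker arrives.
func (w *dcpWatchdog) noteMessage() {
	atomic.StoreInt64(&w.lastReceivedUnixNano, time.Now().UnixNano())
}

// check is called periodically with the latest items_remaining sample taken
// from the stats call; the stat does not have to be exact for this purpose.
func (w *dcpWatchdog) check(itemsRemaining uint64) {
	last := time.Unix(0, atomic.LoadInt64(&w.lastReceivedUnixNano))
	if itemsRemaining > 0 && time.Since(last) > w.silenceLimit {
		w.reset()
	}
}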
Trond Norbye
added a comment - There is DCP noop support in the server: you can enable it and the server will send out a noop at a fixed interval (which you should reply to). Then you won't need to poll any stats... That support has been in DCP from the beginning, so as long as the remote side supports DCP you should be able to use it.
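For reference, a hedged sketch of what enabling that looks like on the consumer side. The control keys (enable_noop, set_noop_interval) come from the DCP documentation; sendControl is a stand-in for whatever the real client library exposes to send a DCP_CONTROL message.

package xdcrsketch // hypothetical

// enableDcpNoop asks the producer to emit NOOPs when a stream is otherwise
// idle. The consumer must then reply to every DCP_NOOP it receives.
func enableDcpNoop(sendControl func(key, value string) error, intervalSeconds string) error {
	if err := sendControl("enable_noop", "true"); err != nil {
		return err
	}
	// The interval is sent as a string number of seconds.
	return sendControl("set_noop_interval", intervalSeconds)
}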
Yu Sui (Inactive)
added a comment - gomemcached is having issues handling NOOP, MB-21268. Even if gomemcached were able to handle NOOP, that would not solve the DCP stuckness issue.
NOOP - keeps the connection alive when the downstream is slow/blocked.
Stuckness check - detects the situation where the downstream is not blocked, the DCP connection is alive and has items to produce, and yet the downstream is not receiving anything from DCP.
re: "stuckiness" - as Trond said, the NOOP protocol is designed to handle the case where either end of the connect hasn't responded recently. All other DCP clients use that to handle timeouts / stuck connections.
As such, as long as the consumer sees at least one normal message every 2x noop_interval then it can assume the connection is healthy (and it has useful data). If the producer has no "normal" data to produce it sends noops to tell the consumer "I'm still here, all is well". As such, if the consumer doesn't see either a normal or NOOP message by 2x noop_interval it should close the connection.
I don't think we should be adding a separate mechanism just for goXDCR, particularly when the protocol already has consideration for it - indeed the mechanism I described is the one used for node failover. It just feels like unnecessary complexity and the source of more bugs (as we've found here!).
Dave Rigby
added a comment - - edited Re: "stuckness" - as Trond said, the NOOP protocol is designed to handle the case where either end of the connection hasn't responded recently. All other DCP clients use it to handle timeouts / stuck connections.
To expand - the producer will only send NOOPs if it's not already sending useful mutations - see the description at https://github.com/couchbaselabs/dcp-documentation/blob/master/documentation/dead-connections.md
As long as the consumer sees at least one normal message every 2x noop_interval, it can assume the connection is healthy (and it has useful data). If the producer has no "normal" data to produce, it sends NOOPs to tell the consumer "I'm still here, all is well". If the consumer doesn't see either a normal or a NOOP message within 2x noop_interval, it should close the connection.
I don't think we should be adding a separate mechanism just for goXDCR, particularly when the protocol already accounts for this - indeed, the mechanism I described is the one used for node failover. It just feels like unnecessary complexity and a source of more bugs (as we've found here!).
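The dead-connection rule above reduces to a very small check on the consumer side; a sketch with assumed names (lastMessage covers both normal traffic and NOOPs):

package xdcrsketch // hypothetical

import (
	"net"
	"time"
)

// closeIfDead applies the 2x noop_interval rule: if neither a normal message
// nor a NOOP has arrived within twice the negotiated interval, the consumer
// closes the socket so a fresh connection can be set up.
func closeIfDead(conn net.Conn, lastMessage time.Time, noopInterval time.Duration) bool {
	if time.Since(lastMessage) > 2*noopInterval {
		conn.Close()
		return true
	}
	return false
}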
Yu Sui (Inactive)
added a comment - I agree that NOOP can be used for the stuckness check, if gomemcached handles it correctly and passes it on to XDCR. Will think more about this alternative after Spock.
Yu Sui (Inactive)
added a comment - The problem should have been fixed by MB-23423. It is not occurring in the latest QE tests. QE will run another round of tests on a later build for further confirmation. If it still does not occur, we will close the bug.
Arunkumar Senthilnathan (Inactive)
added a comment - Issue found again in 5.0.0-2564 - logs are here:
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.105.27.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.106.251.zip
https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.108.169.zip
The bucket deletion failure happened at test_39:
[2017-04-08 17:32:05,084] - [rest_client:784] ERROR - DELETE http://172.23.105.27:8091/pools/default/buckets/default
body: headers: {'Content-Type': 'application/x-www-form-urlencoded', 'Accept': ' / ', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==\n'} error: 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue.\r\n"} auth: Administrator:password
All the following test cases failed with rebalance issues - logs were collected at the end - the job can be found here:
http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-00-goxdcr-rebalance/565/consoleFull
Yu Sui (Inactive)
added a comment - There are still two offending/waiting connections. The connections are no longer the ones set up by the pipeline supervisors, though. The "cmd" field of the connection shows "DCP_MUTATION", which indicates that it is a DCP connection. In comparison, the "cmd" field was "STAT" for the connections set up by the pipeline supervisor.
Will dig some more.
2017-04-09T21:01:12.891631-07:00 NOTICE Worker thread 3: {"connection":"0x7f18d375a180","socket":50,"protocol":"memcached","peername":"172.23.105.27:46401","sockname":"172.23.105.27:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"@goxdcr","sasl_conn":"0x7f18e15e0100","nodelay":true,"refcount":2,"features":{"mutation_extras":false,"xerror":false},"engine_storage":"0x7f18c588c900","next":"0x0","thread":"0x7f1927483390","priority":"Medium","clustermap_revno":"unknown","sasl_disabled":false,"tap":false,"dcp":true,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0x44002a68","max_reqs_per_event":5,"nevents":0,"state":"conn_mwrite","cmd":"DCP_MUTATION","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded": ["write","persist"] },"which":{"raw":"0x4","decoded": ["write"] }},"read":{"buf":"0x7f18c5815000","curr":"0x7f18c5815000","size":8192,"bytes":0},"write" {"buf":"0x7f18d1e96000","curr":"0x7f18d1e96037","size":2048,"bytes":55},"write_and_go":"conn_ship_log","ritem":"0x7f18c5815048","rlbytes":0,"item":"0x0","iov" {"size":10,"used":3,"vector": [\{"base":"0x7f18d1e96000","len":"0x37"},\{"base":"0x7f18e0c516f9","len":"0xb"},\{"base":"0x7f18e13f402b","len":"0x161"}] },"msglist":{"size":5,"used":1,"curr":0,"bytes":578},"itemlist":{"size":1},"temp_alloc_list":{"size":0},"noreply":(false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":10496,"total_send":1063612,"datatype":"snappy,json"}
Yu Sui (Inactive)
added a comment - There were two replication instances with the same replication id running at the time of the problem. When the test completed and a command was issued to stop the replication, only one instance was stopped and the other was left running. This should be the reason that some DCP connections were left open at the end of the test run.
Will look for ways to prevent multiple instances from occurring.
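One possible shape for that fix, purely illustrative (the names and structure are assumptions, not the actual goxdcr design): a registry keyed by replication id that refuses to start a second instance, so a stop request can never leave an unmanaged twin behind.

package xdcrsketch // hypothetical

import (
	"fmt"
	"sync"
)

// instanceRegistry allows at most one running replication per replication id.
type instanceRegistry struct {
	mu      sync.Mutex
	running map[string]struct{}
}

func newInstanceRegistry() *instanceRegistry {
	return &instanceRegistry{running: make(map[string]struct{})}
}

// start reserves the id, or fails if an instance is already registered.
func (r *instanceRegistry) start(replID string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.running[replID]; exists {
		return fmt.Errorf("replication %s is already running", replID)
	}
	r.running[replID] = struct{}{}
	return nil
}

// stop releases the id once the instance has fully shut down.
func (r *instanceRegistry) stop(replID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.running, replID)
}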
Trond Norbye it looks like MB-16745 has resurfaced in Spock - please take a look