  Couchbase Server / MB-23161

Bucket deletion timeout issue followed by rebalance failures found in XDCR tests


Details

    • Type: Bug
    • Status: Closed
    • Priority: Test Blocker
    • Resolution: Fixed
    • Affects Version: 5.0.0
    • Fix Version: 5.0.0
    • Components: memcached, XDCR
    • Triage: Triaged
    • Is this a Regression?: Yes

    Description

      [2017-03-05 11:23:17,744] - [cluster_helper:78] INFO - waiting for ns_server @ 172.23.105.181:8091
      [2017-03-05 11:23:17,750] - [cluster_helper:80] INFO - ns_server @ 172.23.105.181:8091 is running
      Cluster instance shutdown with force
      [2017-03-05 11:23:17,750] - [xdcrnewbasetests:1149] INFO - removing xdcr/nodes settings
      [2017-03-05 11:23:17,813] - [rest_client:1025] INFO - removing remote cluster name:remote_cluster_C2-C1
      [2017-03-05 11:23:17,848] - [xdcrnewbasetests:1159] INFO - cleanup [ip:172.23.107.120 port:8091 ssh_username:root, ip:172.23.107.169 port:8091 ssh_username:root, ip:172.23.105.180 port:8091 ssh_username:root]
      [2017-03-05 11:23:17,886] - [bucket_helper:138] INFO - deleting existing buckets [u'default'] on 172.23.107.120
      [2017-03-05 11:23:17,886] - [bucket_helper:140] INFO - remove bucket default ...
      [2017-03-05 11:23:47,906] - [rest_client:790] ERROR - http://172.23.107.120:8091/pools/default/buckets/default error 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue.\r\n"}
      [2017-03-05 11:23:47,906] - [rest_client:1898] WARNING - Bucket deletion timed out waiting for all nodes
      [2017-03-05 11:23:47,931] - [bucket_helper:153] ERROR - Unable to get timings for bucket
      [2017-03-05 11:23:47,931] - [bucket_helper:154] INFO - deleted bucket : default from 172.23.107.120
      [2017-03-05 11:23:47,931] - [bucket_helper:230] INFO - waiting for bucket deletion to complete....
      [2017-03-05 11:23:47,955] - [rest_client:134] INFO - node 172.23.107.120 existing buckets : []
      [2017-03-05 11:23:48,005] - [cluster_helper:255] INFO - rebalancing all nodes in order to remove nodes
      [2017-03-05 11:23:48,010] - [rest_client:1322] INFO - rebalance params : password=welcome&ejectedNodes=ns_1%40172.23.107.169%2Cns_1%40172.23.105.180&user=Administrator&knownNodes=ns_1%40172.23.107.169%2Cns_1%40172.23.105.180%2Cns_1%40172.23.107.120
      [2017-03-05 11:23:48,017] - [rest_client:1326] INFO - rebalance operation started
      [2017-03-05 11:23:48,047] - [rest_client:1457] INFO - rebalance percentage : 0.00 %
      [2017-03-05 11:23:58,064] - [rest_client:1457] INFO - rebalance percentage : 0.00 %
      [2017-03-05 11:24:08,083] - [rest_client:1440] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2017-03-05 11:24:08,208] - [rest_client:2726] INFO - Latest logs from UI on 172.23.107.120:
      [2017-03-05 11:24:08,208] - [rest_client:2727] ERROR - {u'node': u'ns_1@172.23.107.120', u'code': 0
      

      Bucket deletion issue found in XDCR test cleanup - same as MB-16745 - attaching logs and console output
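
      For reference, a minimal Go sketch of the cleanup interaction shown in the console output above - DELETE the bucket, treat the 30-second 500 response as a server-side timeout, and poll until the bucket is gone before rebalancing. The endpoint paths are taken from the log; the helper and credentials are placeholders, not the actual testrunner code:

      // Sketch of the cleanup flow in the console output above; not the testrunner code.
      package main

      import (
          "fmt"
          "net/http"
          "time"
      )

      func deleteBucketAndWait(base, bucket, user, pass string) error {
          client := &http.Client{Timeout: 35 * time.Second}

          // Ask ns_server to delete the bucket.
          req, err := http.NewRequest(http.MethodDelete, base+"/pools/default/buckets/"+bucket, nil)
          if err != nil {
              return err
          }
          req.SetBasicAuth(user, pass)
          resp, err := client.Do(req)
          if err != nil {
              return err
          }
          resp.Body.Close()

          // ns_server answers 500 "Bucket deletion not yet complete, but will continue"
          // when deletion does not finish within its ~30s window, as seen above.
          if resp.StatusCode == http.StatusInternalServerError {
              fmt.Println("bucket deletion timed out on the server; polling until it is gone")
          }

          // Poll until the bucket no longer exists before starting the rebalance.
          for i := 0; i < 60; i++ {
              check, _ := http.NewRequest(http.MethodGet, base+"/pools/default/buckets/"+bucket, nil)
              check.SetBasicAuth(user, pass)
              cresp, err := client.Do(check)
              if err != nil {
                  return err
              }
              cresp.Body.Close()
              if cresp.StatusCode == http.StatusNotFound {
                  return nil // deletion complete
              }
              time.Sleep(2 * time.Second)
          }
          return fmt.Errorf("bucket %q still present after polling", bucket)
      }

      func main() {
          if err := deleteBucketAndWait("http://172.23.107.120:8091", "default", "Administrator", "password"); err != nil {
              fmt.Println("cleanup failed:", err)
          }
      }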

      Attachments

        1. 172.23.105.102.zip
          2.67 MB
        2. 172.23.105.178-20170305-1116-diag.zip
          4.92 MB
        3. 172.23.105.179-20170305-1117-diag.zip
          913 kB
        4. 172.23.105.180-20170305-1118-diag.zip
          3.63 MB
        5. 172.23.105.181-20170305-1120-diag.zip
          4.67 MB
        6. 172.23.105.20.zip
          6.53 MB
        7. 172.23.105.3.zip
          6.15 MB
        8. 172.23.105.4.zip
          4.24 MB
        9. 172.23.105.67.zip
          2.64 MB
        10. 172.23.107.120-20170305-1111-diag.zip
          8.73 MB
        11. 172.23.107.169-20170305-1114-diag.zip
          5.69 MB
        12. 172.23.107.171-20170305-1109-diag.zip
          6.68 MB
        13. 172.23.107.198-20170305-1106-diag.zip
          10.12 MB
        14. 172.23.107.53.zip
          2.27 MB
        15. consoleFull.rtf
          91 kB

        Issue Links


          Activity

            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Trond Norbye looks like MB-16745 has resurfaced in spock - please take a look

            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Trond Norbye any updates on this? This is affecting the XDCR regular regression pass percentage pretty badly - it is happening in almost all the jobs now

            trond Trond Norbye added a comment -

            Arunkumar Senthilnathan I haven't been able to look at all the logs yet, but I did look at the memcached logs yesterday and from what I could see things looked as expected:

            trond@ok:1091> egrep 'ON_DELETE_BUCKET|Delete bucket .default. complete' */memcached.log | sed -e s/cbcollect_info_ns_1@//g | sed -e 's,/memcached.log:, - ,g'
            172.23.105.181_20170305-191726 - 2017-03-05T09:50:07.104517-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.105.181_20170305-191726 - 2017-03-05T09:50:13.995638-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.120_20170305-191152 - 2017-03-05T09:54:09.900105-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.120_20170305-191152 - 2017-03-05T09:54:10.661859-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.120_20170305-191152 - 2017-03-05T10:10:27.186148-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.120_20170305-191152 - 2017-03-05T10:10:27.346177-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.120_20170305-191152 - 2017-03-05T10:28:24.862689-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.120_20170305-191152 - 2017-03-05T10:28:25.041071-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.120_20170305-191152 - 2017-03-05T10:46:08.661761-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.120_20170305-191152 - 2017-03-05T10:46:09.674211-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.120_20170305-191152 - 2017-03-05T11:03:39.945438-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.120_20170305-191152 - 2017-03-05T11:03:40.145099-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.169_20170305-191429 - 2017-03-05T09:38:09.689811-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.169_20170305-191429 - 2017-03-05T09:38:09.937632-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.169_20170305-191429 - 2017-03-05T10:10:27.020415-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.169_20170305-191429 - 2017-03-05T10:10:27.535875-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.169_20170305-191429 - 2017-03-05T10:12:52.167146-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.169_20170305-191429 - 2017-03-05T10:12:52.289876-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.169_20170305-191429 - 2017-03-05T10:31:55.729634-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.169_20170305-191429 - 2017-03-05T10:31:56.956868-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.169_20170305-191429 - 2017-03-05T11:03:39.637649-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.169_20170305-191429 - 2017-03-05T11:03:40.005392-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.171_20170305-190929 - 2017-03-05T09:53:32.449055-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.171_20170305-190929 - 2017-03-05T09:53:38.924163-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.171_20170305-190929 - 2017-03-05T09:56:14.575788-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.171_20170305-190929 - 2017-03-05T09:56:14.926609-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.171_20170305-190929 - 2017-03-05T10:27:52.532535-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.171_20170305-190929 - 2017-03-05T10:27:53.635609-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.171_20170305-190929 - 2017-03-05T10:30:28.759280-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.171_20170305-190929 - 2017-03-05T10:30:28.964274-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.171_20170305-190929 - 2017-03-05T10:48:18.508046-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.171_20170305-190929 - 2017-03-05T10:48:18.698352-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.198_20170305-190640 - 2017-03-05T09:53:31.518553-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.198_20170305-190640 - 2017-03-05T09:53:39.079820-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.198_20170305-190640 - 2017-03-05T10:10:25.161377-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.198_20170305-190640 - 2017-03-05T10:10:26.198118-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.198_20170305-190640 - 2017-03-05T10:27:51.570780-08:00 NOTICE 43 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.198_20170305-190640 - 2017-03-05T10:27:52.516228-08:00 NOTICE 43 Delete bucket [default] complete
            172.23.107.198_20170305-190640 - 2017-03-05T10:46:07.076718-08:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.198_20170305-190640 - 2017-03-05T10:46:07.389178-08:00 NOTICE 42 Delete bucket [default] complete
            172.23.107.198_20170305-190640 - 2017-03-05T11:03:38.433386-08:00 NOTICE 41 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            172.23.107.198_20170305-190640 - 2017-03-05T11:03:38.708813-08:00 NOTICE 41 Delete bucket [default] complete

            All delete bucket requests are nicely paired with a completion, and all of them complete within seconds.

            dfinlay Dave Finlay added a comment -

            The logs attached to this ticket only go up to 11:18 at the latest. The bucket deletion problem happened at 11:23.

            ./mb-23161 $ tail -1 */ns_server.debug.log 
            ==> cbcollect_info_ns_1@172.23.105.178_20170305-191254/ns_server.debug.log <==
            [ns_server:debug,2017-03-05T11:13:41.233-08:00,ns_1@127.0.0.1:compaction_new_daemon<0.408.0>:compaction_scheduler:schedule_next:60]Finished compaction for compact_views too soon. Next run will be in 30s
             
            ==> cbcollect_info_ns_1@172.23.105.179_20170305-191421/ns_server.debug.log <==
            [ns_server:debug,2017-03-05T11:14:54.439-08:00,ns_1@127.0.0.1:<0.30315.3>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
             
            ==> cbcollect_info_ns_1@172.23.105.180_20170305-191532/ns_server.debug.log <==
            [ns_server:debug,2017-03-05T11:16:05.921-08:00,ns_1@172.23.105.180:<0.25046.4>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
             
            ==> cbcollect_info_ns_1@172.23.105.181_20170305-191726/ns_server.debug.log <==
            [ns_server:debug,2017-03-05T11:18:00.624-08:00,ns_1@172.23.105.181:<0.23235.5>:diag_handler:collect_diag_per_node_binary:210]Collected data for key system_info
             
            ==> cbcollect_info_ns_1@172.23.107.120_20170305-191152/ns_server.debug.log <==
            [ns_server:debug,2017-03-05T11:12:43.928-08:00,ns_1@172.23.107.120:compaction_new_daemon<0.408.0>:compaction_scheduler:schedule_next:60]Finished compaction for compact_kv too soon. Next run will be in 30s
             
            ==> cbcollect_info_ns_1@172.23.107.169_20170305-191429/ns_server.debug.log <==
            [ns_server:debug,2017-03-05T11:15:12.942-08:00,ns_1@172.23.107.169:compaction_new_daemon<0.25741.4>:compaction_scheduler:schedule_next:60]Finished compaction for compact_views too soon. Next run will be in 30s
             
            ==> cbcollect_info_ns_1@172.23.107.171_20170305-190929/ns_server.debug.log <==
            [ns_server:debug,2017-03-05T11:10:28.100-08:00,ns_1@172.23.107.171:compaction_new_daemon<0.5862.5>:compaction_scheduler:schedule_next:60]Finished compaction for compact_kv too soon. Next run will be in 30s
             
            ==> cbcollect_info_ns_1@172.23.107.198_20170305-190640/ns_server.debug.log <==
            [ns_server:info,2017-03-05T11:07:57.873-08:00,ns_1@172.23.107.198:ns_config_rep<0.244.0>:ns_config_rep:pull_one_node:349]Pulling config from: 'ns_1@172.23.107.171'
            


            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Please take a look at the latest logs attached

            dfinlay Dave Finlay added a comment -

            Trond - apparently the logs should now include the time where the bucket deletion happened.


            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Trond Norbye - This is the issue:

            2017-03-12T23:11:27.884-07:00, ns_orchestrator:4:info:message(ns_1@172.23.105.3) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.3'], EjectNodes = ['ns_1@172.23.105.4'], Failed over and being ejected nodes = []; no delta recovery nodes
             
            2017-03-12T23:11:47.910-07:00, ns_rebalancer:0:critical:message(ns_1@172.23.105.3) - Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.105.3',
                                                                     {'EXIT',
                                                                      {old_buckets_shutdown_wait_failed,
                                                                       ["default"]}}}]
             
            2017-03-12T23:11:47.911-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.105.3) - Rebalance exited with reason {buckets_shutdown_wait_failed,
                                          [{'ns_1@172.23.105.3',
                                            {'EXIT',
                                             {old_buckets_shutdown_wait_failed,
                                              ["default"]}}}]}
            -------------------------------
             
             
            per_node_processes('ns_1@172.23.105.3') =
                 {<0.32673.5>,
                  [{registered_name,ns_heart_slow_status_updater},
                   {status,waiting},
                   {initial_call,{proc_lib,init_p,5}},
                   {backtrace,[<<"Program counter: 0x0000000000891840 (unknown function)">>,
                               <<"CP: 0x0000000000891848 (<terminate process normally>)">>,
                               <<"arity = 3">>,<<"   ns_heart">>,
                               <<"   slow_updater_loop">>,<<"   []">>,<<>>]},
                   {error_handler,error_handler},
                   {garbage_collection,[{min_bin_vheap_size,46422},
                                        {min_heap_size,233},
                                        {fullsweep_after,512},
                                        {minor_gcs,4}]},
                   {heap_size,22},
                   {total_heap_size,22},
                   {links,[<0.280.0>]},
                   {monitors,[]},
                   {monitored_by,[]},
                   {memory,1016},
                   {messages,[]},
                   {message_queue_len,0},
                   {reductions,2197192},
                   {trap_exit,false},
                   {current_location,{erlang,hibernate,3,[]}},
                   {dictionary,[{'$ancestors',[ns_heart_sup,ns_server_sup,
                                               ns_server_nodes_sup,<0.170.0>,
                                               ns_server_cluster_sup,<0.88.0>]},
                                {'$initial_call',{erlang,apply,2}}]}]}
                 {<0.32668.5>,
                  [{registered_name,[]},
                   {status,waiting},
                   {initial_call,{proc_lib,init_p,5}},
                   {backtrace,[<<"Program counter: 0x00007f2269886390 (ns_pubsub:do_subscribe_link/4 + 392)">>,
                               <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                               <<>>,
                               <<"0x00007f22706cfcd8 Return addr 0x00007f226de8e640 (proc_lib:init_p_do_apply/3 + 56)">>,
                               <<"y(0)     []">>,
                               <<"y(1)     {ns_pubsub,#Ref<0.0.3.94025>}">>,
                               <<"y(2)     <0.32667.5>">>,<<"y(3)     buckets_events">>,
                               <<>>,
                               <<"0x00007f22706cfd00 Return addr 0x0000000000891848 (<terminate process normally>)">>,
                               <<"y(0)     Catch 0x00007f226de8e660 (proc_lib:init_p_do_apply/3 + 88)">>,
                               <<>>]},
                   {error_handler,error_handler},
                   {garbage_collection,[{min_bin_vheap_size,46422},
                                        {min_heap_size,233},
                                        {fullsweep_after,512},
                                        {minor_gcs,0}]},
                   {heap_size,233},
                   {total_heap_size,233},
                   {links,[<0.32667.5>,<0.275.0>]},
                   {monitors,[]},
                   {monitored_by,[]},
                   {memory,2744},
                   {messages,[]},
                   {message_queue_len,0},
                   {reductions,21},
                   {trap_exit,true},
                   {current_location,{ns_pubsub,do_subscribe_link,4,
                                                [{file,"src/ns_pubsub.erl"},{line,125}]}},
                   {dictionary,[{'$ancestors',[ns_heart,ns_heart_sup,ns_server_sup,
                                               ns_server_nodes_sup,<0.170.0>,
                                               ns_server_cluster_sup,<0.88.0>]},
                                {'$initial_call',{ns_pubsub,do_subscribe_link,4}}]}]}
                 {<0.32667.5>,
            ....
            

            Please refer to the latest 172.23.105.3.zip - bucket deletion times out, after which the rebalance failure occurs - the bucket eventually gets deleted, but the failed rebalance fails the test and affects the following tests too


            trond Trond Norbye added a comment -

            I looked at the logs and for some reason two of the nodes start on the deletion task ~5 minutes later than the first one. We need to figure out why that happens (when the message is being sent from ns_server, and whether there is too much work blocked on that worker thread causing the insanely large delay).


            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Trond Norbye can you please provide an ETA for this?

            trond Trond Norbye added a comment -

            Arunkumar Senthilnathan Unfortunately not

            dfinlay Dave Finlay added a comment - - edited

            Trond - Arun did lots of bucket deletes (& recreates) so it's possible you got some of the different traces confused. The problem delete is the one that occurred at 23:10:57. We know it starts because we see the following in the logs:

            [ns_server:debug,2017-03-12T23:10:57.809-07:00,ns_1@172.23.105.3:ns_bucket_worker<0.378.0>:ns_bucket_sup:update_children:116]Stopping child for dead bucket: {{single_bucket_kv_sup,"default"},
                                             <0.14280.5>,supervisor,
                                             [single_bucket_kv_sup]}
            

            We also know this because 30 seconds later (which is the timeout) we see the REST API request fail with a 500.

            172.23.107.140 - Administrator [12/Mar/2017:23:11:27 -0700] "DELETE /pools/default/buckets/default HTTP/1.1" 500 64 - Python-httplib2/$Rev: 259 $
            

            At 23:10:57 there are 2 nodes in the cluster: .105.3 and .105.4. We know this because as soon as the delete bucket fails, Arun attempts to start a rebalance (the one that ultimately fails) and those are the only nodes in the cluster.

            [user:info,2017-03-12T23:11:27.884-07:00,ns_1@172.23.105.3:<0.1474.0>:ns_orchestrator:idle:689]Starting rebalance, KeepNodes = ['ns_1@172.23.105.3'], EjectNodes = ['ns_1@172.23.105.4'], Failed over and being ejected nodes = []; no delta recovery nodes
            

            Grepping the memcached log for "Delete bucket" on .105.4 reveals what looks like healthy behavior, i.e. the call is received at 23:10:57 and 8 seconds later the bucket is shut down.

             $ fgrep "Delete bucket" memcached.log | fgrep T23:1
            2017-03-12T23:10:57.918639-07:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            2017-03-12T23:10:57.944124-07:00 NOTICE 42 Delete bucket [default]. Wait for clients to disconnect
            2017-03-12T23:10:57.944147-07:00 NOTICE 581295832 Delete bucket [default]. Still waiting: 14 clients connected
            2017-03-12T23:10:58.945203-07:00 NOTICE 581295832 Delete bucket [default]. Still waiting: 1 clients connected
            ...
            2017-03-12T23:11:03.945193-07:00 NOTICE 581295832 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:11:04.841212-07:00 NOTICE 42 Delete bucket [default]. Shut down the bucket
            2017-03-12T23:11:05.153639-07:00 NOTICE 42 Delete bucket [default]. Clean up allocated resources 
            2017-03-12T23:11:05.153761-07:00 NOTICE 42 Delete bucket [default] complete
            

            .105.3 is where the problem is. Grepping for delete bucket there reveals:

            2017-03-12T23:10:58.034612-07:00 NOTICE 42 Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
            2017-03-12T23:10:58.053923-07:00 NOTICE 42 Delete bucket [default]. Wait for clients to disconnect
            2017-03-12T23:10:58.054038-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 11 clients connected
            2017-03-12T23:10:59.055189-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
            ...
            2017-03-12T23:13:22.055160-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:13:23.055196-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:13:24.055193-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:13:25.055153-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:13:25.647730-07:00 NOTICE 42 Delete bucket [default]. Shut down the bucket
            2017-03-12T23:13:25.910250-07:00 NOTICE 42 Delete bucket [default]. Clean up allocated resources 
            2017-03-12T23:13:25.910366-07:00 NOTICE 42 Delete bucket [default] complete
            

            The bucket delete command is received at 23:10:58 (i.e. pretty much right away) but there's still one client hanging on for about 2 1/2 minutes and the bucket shutdown eventually happens at 23:13:25.

            Arunkumar Senthilnathan: Is the default bucket empty? You are deleting and recreating it pretty furiously. Are you also loading it with data?


            trond Trond Norbye added a comment -

            From what I can see the entire bucket deletion task takes 2:28 minutes (23:10:58 - 23:13:26)? (How fast do we expect a bucket deletion to be? What if a compaction is running? etc.)

            The entire wait is for the client on socket 47 to complete its transfer so that we can disconnect it. For a short while we did try a more aggressive disconnect logic, but that made us crash in other locations. Back to the client on socket 47: the send buffer on that socket is full, which means that the transfer is currently blocked. In the log we can read:

            2017-03-12T23:05:36.060781-07:00 NOTICE 47: HELO [Goxdcr PipelineSupervisor SourceBucket:default TargetBucket:default] TCP NODELAY [ 172.23.105.3:33484 - 172.23.105.3:11210 (_admin) ]

            And for the connection that is currently blocked:

            "socket": 47,
            "protocol": "memcached",
            "peername": "172.23.105.3:33484",
            "sockname": "172.23.105.3:11210",
            "state": "conn_write",
            "msglist":

            { "size": 5, "used": 1, "curr": 0, "bytes": 3582842 }

            ,

            Once the client (xdcr) closes the connection on the other end (the process was most likely killed by the test framework), memcached is unblocked and the deletion completes.
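
            To illustrate the stall described above, a self-contained Go sketch (not memcached code) of a write blocking once the TCP send buffer fills because the peer never reads, and unblocking only when the peer closes:

            // Demonstrates the state socket 47 was in: the server blocks in Write
            // because the non-reading peer has let the send buffer fill up.
            package main

            import (
                "fmt"
                "net"
                "time"
            )

            func main() {
                ln, _ := net.Listen("tcp", "127.0.0.1:0")
                defer ln.Close()

                // "Client" that connects but never reads, like the stalled stats connection.
                client, _ := net.Dial("tcp", ln.Addr().String())

                server, _ := ln.Accept()
                payload := make([]byte, 1<<20)

                done := make(chan struct{})
                go func() {
                    // Keep writing; once both socket buffers are full, Write blocks here -
                    // the same wait the bucket-deletion path sits in.
                    for {
                        if _, err := server.Write(payload); err != nil {
                            fmt.Println("writer unblocked with:", err)
                            close(done)
                            return
                        }
                    }
                }()

                time.Sleep(2 * time.Second) // writer is now blocked
                fmt.Println("closing the non-reading client...")
                client.Close() // analogous to the XDCR process going away
                <-done
            }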

            trond Trond Norbye added a comment -

            We've been talking a bit about this bug, and the fact that a client may block the deletion is a problem we've had for a while.

            An interesting question here is why does the Goxdcr PipelineSupervisor not drain its input buffer so that the server may complete transmitting the data to the client?

            dfinlay Dave Finlay added a comment - - edited

            Assigning to Yu Sui.

            Yu - would you mind taking a look at this ticket? In particular can you look at the xdcr logs and the memcached logs contained in 172.23.105.3 from the 13th of March? (No need to download all the other stuff.) Trond notices that the "pipeline supervisor" connection isn't reading from the socket buffer – which causes memcached to wait and the bucket deletion to get delayed. Memcached should be able to abruptly terminate these connections, but this is not as simple as it sounds for reasons internal to memcached.

            Anyway, if you have insight it would be good to know why the XDCR client wasn't reading from the socket.

            (Trond Norbye - can you post indications from the memcached log that XDCR wasn't reading from the socket?)

            trond Trond Norbye added a comment -

            If you look in the memcached.log for .105.3 you'll see all of the lines starting with: Worker thread 1: {"connection":"0x7f32d8deca80","socket":47, and all of them have the same value for "total_send":4237662.


            yu Yu Sui (Inactive) added a comment -

            The connection set up by the xdcr pipeline supervisor timed out, which may be the reason it did not continue to read from the socket.

            PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Error getting dcp stats for kv 172.23.105.3:11210. err=read tcp 172.23.105.3:33484->172.23.105.3:11210: i/o timeout

            I will check what the timeout setting is in gomemcached, which is used to set up the connection. Maybe I could increase the timeout setting, and also close the connection and set up a new one when a timeout is still seen.
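
            A rough Go sketch of the recovery described above - close the connection and dial a fresh one when an i/o timeout is seen. dialKV and requestStats are hypothetical placeholders, not the gomemcached API:

            // Sketch only: retry a stats request on a fresh connection after an i/o timeout.
            package main

            import (
                "errors"
                "net"
                "time"
            )

            func dialKV(addr string) (net.Conn, error) {
                return net.DialTimeout("tcp", addr, 20*time.Second)
            }

            // requestStats stands in for the "get dcp stats" round trip.
            func requestStats(conn net.Conn, timeout time.Duration) error {
                conn.SetDeadline(time.Now().Add(timeout))
                // ... write the STAT request and read the response here ...
                return nil
            }

            func statsWithReconnect(addr string, conn *net.Conn, timeout time.Duration) error {
                err := requestStats(*conn, timeout)
                var nerr net.Error
                if errors.As(err, &nerr) && nerr.Timeout() {
                    // The old connection may be wedged, so drop it and dial a new one.
                    (*conn).Close()
                    fresh, derr := dialKV(addr)
                    if derr != nil {
                        return derr
                    }
                    *conn = fresh
                    return requestStats(*conn, timeout)
                }
                return err
            }

            func main() {
                conn, err := dialKV("172.23.105.3:11210")
                if err != nil {
                    return
                }
                defer conn.Close()
                _ = statsWithReconnect("172.23.105.3:11210", &conn, 20*time.Second)
            }
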
            dfinlay Dave Finlay added a comment -

            I see some interesting stuff in the logs. This pipeline supervisor is having a hard time – it continually seems to be unable to get DCP stats due to timeouts. Then it gets torn down when the bucket is deleted, gets rebuilt when the bucket is recreated and proceeds to again time out getting DCP stats.

            Separately, the last incarnation of the pipeline supervisor goes as follows. First it gets created:

            2017-03-12T23:03:36.045-07:00 INFO GOXDCR.GenericSupervisor: Starting supervisor PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default.
            2017-03-12T23:03:36.045-07:00 INFO GOXDCR.GenericSupervisor: Started supervisor PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default.
            2017-03-12T23:03:36.046-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default monitorPipelineHealth started
            

            It takes memcached precisely 2 minutes to figure out that it exists. (Weird)

            2017-03-12T23:05:36.060781-07:00 NOTICE 47: HELO [Goxdcr PipelineSupervisor  SourceBucket:default TargetBucket:default] TCP NODELAY [ 172.23.105.3:33484 - 172.23.105.3:11210 (_admin) ]
            

            Precisely 2 minutes later XDCR reports a timeout getting the DCP stats. And another timeout 2 minutes after that.

            2017-03-12T23:07:36.060-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Error getting dcp stats for kv 172.23.105.3:11210. err=read tcp 172.23.105.3:33484->172.23.105.3:11210: i/o timeout
            2017-03-12T23:07:36.060-07:00 ERRO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Failed to get dcp stats. Skipping dcp health check.
            2017-03-12T23:09:36.062-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Error getting dcp stats for kv 172.23.105.3:11210. err=write tcp 172.23.105.3:33484->172.23.105.3:11210: i/o timeout
            2017-03-12T23:09:36.062-07:00 ERRO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Failed to get dcp stats. Skipping dcp health check.
            

            Then finally, once the bucket is deleted, XDCR finds out more or less immediately that the DCP feed has been closed:

            2017-03-12T23:10:58.061-07:00 ERRO GOXDCR.GenericSupervisor: PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default Received error report : map[dcp_615ceec9956939205f5bdc3561f0ccf4/default/default_172.23.105.3:11210_0:DCP upr feed has been closed.]
            2017-03-12T23:10:58.061-07:00 INFO GOXDCR.ReplMgr: Supervisor PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default of type *supervisor.GenericSupervisor reported errors map[dcp_615ceec9956939205f5bdc3561f0ccf4/default/default_172.23.105.3:11210_0:DCP upr feed has been closed.]
            

            But according to Trond memcached is trying to send data and is reporting as much in these traces:

            ./cbcollect_info_ns_1@172.23.105.3_20170313-064501 $ fgrep 0x7f32d8deca80 memcached.log  | head -1
            2017-03-12T23:10:58.054195-07:00 NOTICE Worker thread 1: {"connection":"0x7f32d8deca80","socket":47,"protocol":"memcached","peername":"172.23.105.3:33484","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32e6148f00","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283130","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d802c000","curr":"0x7f32d802c01b","size":2048,"bytes":0},"write":{"buf":"0x7f32e40ce800","curr":"0x7f32f4a00000","size":2048,"bytes":3582842},"write_and_go":"conn_new_cmd","ritem":"0x7f32d802c01b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f4a9fd58","len":"0x2cae22"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3582842},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":4237662}
            

            Yu Sui: Can you explain why you don't tear down the connection when you find out that "the upr feed has been closed"? When you log that, is it just for one vbucket? Also: can you explain what timeouts are where and how much they are? (My guess is that increasing timeouts isn't going to fix anything here.)

            Once we have this information, I'll follow up with Trond on the questions I have for him.

            yu Yu Sui (Inactive) added a comment - - edited

            When XDCR finds out that "upr feed has been closed", it first performs a checkpoint operation, then shuts down and restarts. From goxdcr log, the checkpoint op did not complete until 23:13:25, which was also when logs of the connection disappeared from memcached log.

            As mentioned earlier, XDCR uses gomemcached API to set up the connection. I did not find any place in gomemcached (or XDCR) that explicitly sets timeout on such connection.

            Trond, does memcached impose a default read/write timeout on memcached connections?
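
            For reference, this is roughly what an explicit per-request deadline on a raw memcached connection would look like in Go if one were set anywhere - a minimal sketch using only the standard library; statsWithDeadline and the request framing are hypothetical, not gomemcached code. Without a SetDeadline/SetReadDeadline call of this kind, reads and writes on the socket simply block at the TCP level:

            package sketch

            import (
                "net"
                "time"
            )

            // statsWithDeadline bounds one request/response exchange to `timeout`
            // using the kernel-level socket deadline; hitting it surfaces as the
            // "read tcp ...: i/o timeout" / "write tcp ...: i/o timeout" errors
            // quoted earlier in this ticket.
            func statsWithDeadline(conn net.Conn, req []byte, timeout time.Duration) ([]byte, error) {
                if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
                    return nil, err
                }
                defer conn.SetDeadline(time.Time{}) // clear the deadline when the exchange is done

                if _, err := conn.Write(req); err != nil {
                    return nil, err
                }
                buf := make([]byte, 64*1024)
                n, err := conn.Read(buf)
                return buf[:n], err
            }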

            dfinlay Dave Finlay added a comment - - edited

            Yu - I think this is where the timeout is coming from: http://src.couchbase.org/source/xref/trunk/goproj/src/github.com/couchbase/goxdcr/base/connection_pool.go#130 (20 seconds).

            One note for folks following along at home: the precise 2-minute gaps are due to XDCR's periodic polling of memcached to get the stats info.
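
            To make the timing relationship concrete, here is a minimal sketch (hypothetical structure, not the actual goxdcr code) of a health check that polls every 2 minutes with a 20 second per-request deadline - which would produce exactly the 2-minute-spaced "Error getting dcp stats" lines quoted above:

            package sketch

            import (
                "net"
                "time"
            )

            const (
                statsPollInterval = 2 * time.Minute  // dcp health check cadence
                statsReadTimeout  = 20 * time.Second // per-request deadline (connection_pool.go)
            )

            // monitorDcpStats runs getDcpStats on every tick and reports failures,
            // mirroring the "Failed to get dcp stats. Skipping dcp health check."
            // pattern in goxdcr.log.
            func monitorDcpStats(conn net.Conn, getDcpStats func(net.Conn, time.Duration) error, report func(error)) {
                t := time.NewTicker(statsPollInterval)
                defer t.Stop()
                for range t.C {
                    if err := getDcpStats(conn, statsReadTimeout); err != nil {
                        report(err)
                    }
                }
            }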


            yu Yu Sui (Inactive) added a comment -

            Dave,

            You are right. The specified code does set timeout to 20 seconds.

            Surprisingly, it is not what is causing the problem. I logged the time it took for the "get dcp stats" request to time out, and it was in the range of milliseconds. Removing the 20 second timeout did not help the situation. 

            PipelineSupervisor_0c94b5223c86eaa226e1dd02c63461f2/sasl_bucket_1/sasl_bucket_1 making getting dcp stats for kv 10.0.0.10:12000.
            goxdcr.log:2017-03-22T10:09:23.574-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_0c94b5223c86eaa226e1dd02c63461f2/sasl_bucket_1/sasl_bucket_1 Error getting dcp stats for kv 10.0.0.10:12000. err=read tcp 10.0.0.10:61742->10.0.0.10:12000: i/o timeout. time=12.949689ms

            goxdcr.log:2017-03-22T10:18:39.587-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_4a17125b4ca3ed1119fe08dba7206126/standard_bucket_1/standard_bucket_1 Error getting dcp stats for kv 10.0.0.10:12000. err=write tcp 10.0.0.10:53191->10.0.0.10:12000: i/o timeout. time=24.245µs

            Something is seriously wrong with the connections.  Need to do some more digging.

            yu Yu Sui (Inactive) added a comment - - edited

            Not all "dcp stats" requests fail. The succeeded ones took ~100ms. It is not obvious why some other requests timed out after 10ms. Around the time when timeout happened, there were no relevant logs in memcached log.

            Trond, can you shed some light from memcached side on why this could have happened? If there are some experiments we can try to pinpoint the problem, I would be happy to carry them out.

            goxdcr.log:2017-03-22T10:31:53.297-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_1347d997312ad52a3c21c3d3acfd2afe/sasl_bucket_1/sasl_bucket_1 making getting dcp stats for kv 10.0.0.10:12000.
            goxdcr.log:2017-03-22T10:31:53.422-07:00 INFO GOXDCR.GenericSupervisor: PipelineSupervisor_1347d997312ad52a3c21c3d3acfd2afe/sasl_bucket_1/sasl_bucket_1 success getting dcp stats for kv 10.0.0.10:12000. time=125.164374ms

            dfinlay Dave Finlay added a comment -

            Yu - these measurements you took - these were on a separate cluster, right? (I don't see these traces in the logs.) Were you able to reproduce the bucket deletion issue in that setup?


            yu Yu Sui (Inactive) added a comment -

            Yes, they are on a separate cluster - my local one. I did not see the bucket deletion issue here.


            trond Trond Norbye added a comment -

            Yu Sui what happens on the client side when an operation times out? Does it close the connection? You could run mctimings with -v to get a histogram of the execution times recorded by memcached.

            The "dcp" stat sub-group is handled by the underlying bucket and it's been a while since I looked into that so I don't recall from the top of my head all of locks involved in that stat (and the amount of output it creates). Each of the KV-pair added in the stat adds 24 bytes overhead and all of this is spooled into a temporary buffer which grows (before it is being sent), which could result in a number of reallocations if the output is large. (We need to measure this)

            dfinlay Dave Finlay added a comment -

            Yu - separate question.

            The bucket in question in these logs ("default") gets deleted and recreated a bunch of times, which results in the addition and removal of the pipeline supervisor.

            $ egrep "((Removing)|(Adding)).*PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4.*Master" ns_server.goxdcr.log 
            2017-03-12T22:36:05.065-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
            2017-03-12T22:41:35.826-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
            2017-03-12T22:41:35.979-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
            2017-03-12T22:46:57.046-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
            2017-03-12T22:49:25.281-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
            2017-03-12T22:50:06.871-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
            2017-03-12T22:50:07.100-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
            2017-03-12T22:55:21.597-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
            2017-03-12T22:55:21.785-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
            2017-03-12T23:00:45.980-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
            2017-03-12T23:03:35.692-07:00 INFO GOXDCR.GenericSupervisor: Adding child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default to supervisor PipelineMasterSupervisor
            2017-03-12T23:13:25.646-07:00 INFO GOXDCR.GenericSupervisor: Removing child PipelineSupervisor_615ceec9956939205f5bdc3561f0ccf4/default/default from supervisor PipelineMasterSupervisor
            

            When the stats connection comes back after the pipeline supervisor is removed and added back – does it create a new connection, or use an existing one (from the pool)?

            Basically I'm wondering if this connection got into a bad state sometime in the past. And now the send buffer is full on the memcached side - so stats requests will always time out.

            yu Yu Sui (Inactive) added a comment - - edited

            On the client side, after receiving a timeout error, the connection is kept alive for the next request. The connection is closed only if we receive errors for three consecutive requests.

            When the pipeline supervisor is removed and added back, the old connection is closed and a new one is set up.

            I will modify the code so that a new connection is set up whenever a timeout error is received. This does not explain or fix the issue of why the first timeout happened after 10ms, though. Yes, the dcp stats could be huge, and the following scenario is within expectation:

            1. a request for dcp stats times out after 20 seconds.
            2. the connection ends up in a bad state.
            3. a second request for dcp stats times out immediately (after 10 ms) because of the bad connection state.

            The observation is that the first request timed out after 10ms, which cannot be explained by the server taking too long to assemble the response.
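
            A minimal sketch of the planned change, assuming the underlying connection is a plain net.Conn (the names are hypothetical, not the actual goxdcr/gomemcached API): on a timeout error, close the socket immediately - so memcached's pending write fails instead of blocking - and dial a fresh connection, rather than waiting for three consecutive errors:

            package sketch

            import "net"

            // getStatsWithReset runs one stats request; on a timeout it closes the
            // connection (letting the server's blocked write fail fast) and returns
            // a freshly dialled replacement for the next attempt.
            func getStatsWithReset(conn net.Conn, addr string, doStats func(net.Conn) error) (net.Conn, error) {
                err := doStats(conn)
                if nerr, ok := err.(net.Error); ok && nerr.Timeout() {
                    conn.Close()
                    fresh, dialErr := net.Dial("tcp", addr)
                    if dialErr != nil {
                        return nil, dialErr
                    }
                    return fresh, err // caller still sees the timeout, but on a clean connection
                }
                return conn, err
            }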

            trond Trond Norbye added a comment -

            Could you run:

            mctimings -u Administrator -P admin-password -h hostname -v STAT

            that should give you a histogram of the service times for the stat command on the server. Unfortunately it does not keep track of which stat group you requested, and some of the stats are super-small and fast (but we should get a hint of how slow they might have been).

            If you're going to modify the client, could you have it close the socket as part of a timeout?


            yu Yu Sui (Inactive) added a comment -

            I do plan to close the gomemcached connection on timeout, which should get the socket closed.

            Following is the output of mctimings. I only included the bottom portion, for the slower operations:

            [ 970 -  979]us ( 84.36%)    23 |
            [ 980 -  989]us ( 84.54%)    15 |
            [ 990 -  999]us ( 84.69%)    12 |
            [   1 -    1]ms ( 88.18%)   283 | ###########
            [   2 -    2]ms ( 88.33%)    12 |
            [   3 -    3]ms ( 88.36%)     3 |
            [   4 -    4]ms ( 92.01%)   296 | ###########
            [   5 -    5]ms ( 93.92%)   155 | ######
            [   6 -    6]ms ( 94.96%)    84 | ###
            [   7 -    7]ms ( 96.25%)   105 | ####
            [   8 -    8]ms ( 99.15%)   235 | #########
            [   9 -    9]ms ( 99.63%)    39 | #
            [  10 -   10]ms ( 99.69%)     5 |
            [  11 -   11]ms ( 99.70%)     1 |
            [  12 -   12]ms ( 99.74%)     3 |
            [  15 -   15]ms ( 99.77%)     2 |
            [  16 -   16]ms ( 99.80%)     3 |
            [  22 -   22]ms ( 99.82%)     1 |
            [  23 -   23]ms ( 99.83%)     1 |
            [  27 -   27]ms ( 99.84%)     1 |
            [  28 -   28]ms ( 99.85%)     1 |
            [  30 -   30]ms ( 99.86%)     1 |
            [  31 -   31]ms ( 99.89%)     2 |
            [  32 -   32]ms ( 99.90%)     1 |
            [  42 -   42]ms ( 99.91%)     1 |
            [  50 -  499]ms (100.00%)     7 |


            trond Trond Norbye added a comment -

            From the timings data you have 30 requests slower than 10ms. Unfortunately we can't map those stats commands to the different sub commands... You could alternatively try to run

            mcstat -u Administrator -P <password> -b bucket dcp > /dev/zero

            a number of times while you've got all of the DCP streams open, and then see if they all fall into the 10ms range..

            trond Trond Norbye added a comment -

            I ran a small test on my machine to get a feel for this, and the dcp stat group seems to be "super slow"... On my MacBook with a single bucket without any items and no other traffic, I set up 5 different DCP streams (all including all of the vbuckets)...

            Looping requesting the dcp stat group results in:

            [ 1 - 1]ms ( 79.87%) 998 | ############################################
            [ 2 - 2]ms ( 93.12%) 661 | #############################
            [ 3 - 3]ms ( 95.47%) 117 | #####
            [ 4 - 4]ms ( 97.21%) 87 | ###
            [ 5 - 5]ms ( 97.87%) 33 | #

            And this is on a macbook with 16GB memory being "completely idle" except for requesting the dcp stat.

            The "sad" part of this stat is that it request all the stats for all of the connected DCP streams, and I'm pretty sure no one actually cares about all of them...

            Which portion of the dcp stat are you interested in? (I guess being able to filter out a given stream on the server would make your life easier as well?)

            owend Daniel Owen added a comment - - edited

            I see:

            2017-03-12T23:10:59.055189-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:10:59.055567-07:00 NOTICE Worker thread 1: {"connection":"0x7f32d8deca80","socket":47,"protocol":"memcached","peername":"172.23.105.3:33484","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32e6148f00","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283130","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d802c000","curr":"0x7f32d802c01b","size":2048,"bytes":0},"write":{"buf":"0x7f32e40ce800","curr":"0x7f32f4a00000","size":2048,"bytes":3582842},"write_and_go":"conn_new_cmd","ritem":"0x7f32d802c01b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f4a9fd58","len":"0x2cae22"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3582842},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":4237662}
            ...
             
            2017-03-12T23:13:25.055153-07:00 NOTICE 3627761688 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:13:25.055381-07:00 NOTICE Worker thread 1: {"connection":"0x7f32d8deca80","socket":47,"protocol":"memcached","peername":"172.23.105.3:33484","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32e6148f00","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283130","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d802c000","curr":"0x7f32d802c01b","size":2048,"bytes":0},"write":{"buf":"0x7f32e40ce800","curr":"0x7f32f4a00000","size":2048,"bytes":3582842},"write_and_go":"conn_new_cmd","ritem":"0x7f32d802c01b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f4a9fd58","len":"0x2cae22"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3582842},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":4237662}
            2017-03-12T23:13:25.647373-07:00 WARNING Failed to write, and not due to blocking: Connection reset by peer
            2017-03-12T23:13:25.647503-07:00 NOTICE 47 - Hard error, closing connection
            2017-03-12T23:13:25.647525-07:00 NOTICE 47 The connected bucket is being deleted.. disconnecting
            

            It was stuck for approximately 2.5 minutes, and the connection was not closed until we got a hard error.

            owend Daniel Owen added a comment -

            Also happened here:

            2017-03-12T23:00:27.503176-07:00 NOTICE 3627417976 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:00:27.503803-07:00 NOTICE Worker thread 3: {"connection":"0x7f32d8dece00","socket":51,"protocol":"memcached","peername":"172.23.105.3:33246","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32d8d12540","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283390","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d7cb4000","curr":"0x7f32d7cb401b","size":8192,"bytes":0},"write":{"buf":"0x7f32db35f800","curr":"0x7f32f4400000","size":2048,"bytes":3456407},"write_and_go":"conn_new_cmd","ritem":"0x7f32d7cb401b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f447fde0","len":"0x2cbfb7"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3456407},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":3980291}
            ...
            2017-03-12T23:00:45.505240-07:00 NOTICE 3627417976 Delete bucket [default]. Still waiting: 1 clients connected
            2017-03-12T23:00:45.505858-07:00 NOTICE Worker thread 3: {"connection":"0x7f32d8dece00","socket":51,"protocol":"memcached","peername":"172.23.105.3:33246","sockname":"172.23.105.3:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"_admin","sasl_conn":"0x7f32d8d12540","nodelay":true,"refcount":1,"features":{"datatype":false,"mutation_extras":false},"engine_storage":"0x0","next":"0x0","thread":"0x7f3336283390","priority":"Medium","clustermap_revno":"unknown","tap":false,"dcp":false,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0xde030e00","max_reqs_per_event":20,"nevents":20,"state":"conn_write","cmd":"STAT","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f32d7cb4000","curr":"0x7f32d7cb401b","size":8192,"bytes":0},"write":{"buf":"0x7f32db35f800","curr":"0x7f32f4400000","size":2048,"bytes":3456407},"write_and_go":"conn_new_cmd","ritem":"0x7f32d7cb401b","rlbytes":0,"item":"0x0","iov":{"size":10,"used":1,"vector":[{"base":"0x7f32f447fde0","len":"0x2cbfb7"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":3456407},"itemlist":{"size":0},"temp_alloc_list":{"size":1},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":272,"total_send":3980291}
            2017-03-12T23:00:45.981193-07:00 WARNING Failed to write, and not due to blocking: Connection reset by peer
            2017-03-12T23:00:45.981288-07:00 NOTICE 51 - Hard error, closing connection
            2017-03-12T23:00:45.981298-07:00 NOTICE 51 The connected bucket is being deleted.. disconnecting
            

            It was stuck for approximately 18 seconds and, again, was not closed until we got a hard error.

            owend Daniel Owen added a comment - - edited

            Both stuck connections are in conn_write state.
            Both also have ewouldblock set to false.

            The write buffers are also non-empty for both:
            socket 47 - "bytes":3582842
            socket 51 - "bytes": 3456407

            After talking with Dave Rigby the likelihood is that the TCP/IP write buffer for the connection is full.

            trond Trond Norbye added a comment -

            Daniel Owen we've figured that out earlier (we don't try to force-close connections which are blocked on write)... My wild guess here is that the go client timed out the operation and is no longer trying to read any data from the socket (while there is still data to be sent...)

            I am currently looking into (other) delete bucket failures we're seeing on the CI infrastructure, and it could be that we can fix this issue as part of that.

            BUT as a separate issue we should probably also fix the client to be more "well behaved" and possibly send less data (and speed up the stats call, because such a slow command will affect all of the other connections bound to the same worker thread).

            dfinlay Dave Finlay added a comment -

            Daniel, Trond: if the client changes to have an infinite timeout on this connection, that would help right? I.e. no matter how long it takes, the response is going to get read? Alternatively, the client could time out and tear down the connection – though it seems like the timeout is going to need to be longer than 20s.

            Also: is there any way we can speed up this stats call? No stats call should take 2+ minutes.

            Also: Yu - can you answer Trond's question about how you use the information returned in the stats call?

            trond Trond Norbye added a comment -

            I doubt the stat call is taking 2 minutes (then it would have been logged).. What I think happens is roughly this:

            1) the client sends the stat call
            2) the client times out waiting for the reply (this could be for a number of reasons.. the same server thread could be busy so the op won't be processed immediately.. but we know that this stat call is also potentially slow)
            3) when the client times out it no longer reads from the socket, but starts doing other stuff
            4) the server starts sending data... it is a lot, so it doesn't fit in the kernel send queue and it "blocks"

            We're now in the state where the client doesn't know that the server is blocked on writing to a socket the client no longer cares about, because it timed out the operation but left the socket alone..

            If the client closes the socket as part of the timeout, memcached will fail writing to that socket and we won't get into this state...
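
            The sequence above can be reproduced with a small, self-contained Go toy (not memcached, purely illustrative): the "server" tries to push a multi-megabyte response, the "client" gives up after a short deadline and stops reading, the kernel send/receive buffers fill, and the server's write stays blocked until the client finally closes the socket - at which point the write fails with a reset, much like the "Failed to write ... Connection reset by peer" lines in memcached.log:

            package main

            import (
                "fmt"
                "net"
                "time"
            )

            func main() {
                ln, _ := net.Listen("tcp", "127.0.0.1:0")
                go func() { // "memcached": tries to send a ~64 MB stats response
                    c, _ := ln.Accept()
                    start := time.Now()
                    n, err := c.Write(make([]byte, 64<<20)) // blocks once kernel buffers are full
                    fmt.Printf("server wrote %d bytes after %v, err=%v\n", n, time.Since(start), err)
                }()

                conn, _ := net.Dial("tcp", ln.Addr().String())
                // "goxdcr": times out quickly and simply stops reading the socket.
                conn.SetReadDeadline(time.Now().Add(20 * time.Millisecond))
                conn.Read(make([]byte, 1024))
                time.Sleep(3 * time.Second) // socket left alone; server stays blocked in Write
                conn.Close()                // only now does the server's Write return an error
                time.Sleep(200 * time.Millisecond)
            }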

            owend Daniel Owen added a comment -

            Sorry Trond Norbye, I had not read the history.. I have now; however, I don't understand what the XDCR timeout for the stats call is.

            From a previous comment by Yu - "The succeeded ones took ~100ms. It is not obvious why some other requests timed out after 10ms. "

            We are not seeing any slow operations in the memcached logs - other than for the bucket deletions (my understanding is that we log if a response takes longer than 1s).

            Dave Finlay I will revisit this stat and see what improvements we can make to speed it up.

            dfinlay Dave Finlay added a comment -

            Dan - the timeout on the XDCR side is 20 s.

            Trond is also suggesting we may want to think of splitting this command out, which might be a reasonable idea.

            For now though, on this ticket, I think the next step should be to fix things on the client side.

            owend Daniel Owen added a comment -

            Dave Finlay The bit I'm confused about is yesterday Yu stated:
            "The specified code does set timeout to 20 seconds.
            Surprisingly, it is not what is causing the problem. I logged the time it took for the "get dcp stats" request to time out, and it was in the range of milliseconds. Removing the 20 second timeout did not help the situation. "

            dfinlay Dave Finlay added a comment -

            Yu had run a local test on his side and saw that (1) he managed to reproduce the "i/o timeout" (2) it timed out in 12 ms. (Read his comment closely.) Obviously increasing the 20 s timeout when a timeout after 12 ms is observed is going to make no difference.

            Guys - let's stop spamming the ticket and let's give Yu some time to make progress on his side.

            yu Yu Sui (Inactive) added a comment - - edited

            The dcp stats that xdcr needs is the number of remaining items in a dcp connection. An example is as follows:

            eq_dcpq:xdcr:dcp_f58e0727200a19771e4459925908dd66/default/target_10.17.2.102:12000_0:items_remaining

            Yes, things would be much better if dcp provided a filtering capability so that only the relevant stats are returned.
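
            For reference, the client-side filtering this implies today can be sketched roughly as follows, assuming the full "dcp" stat group has already been fetched into a key -> value map (the function name and prefix argument are hypothetical): only the per-stream items_remaining entries are kept, and everything else in the multi-megabyte response is discarded:

            package sketch

            import (
                "strconv"
                "strings"
            )

            // itemsRemaining sums the ":items_remaining" entries for streams whose
            // key starts with connPrefix, e.g. "eq_dcpq:xdcr:dcp_<id>/default/target_...".
            func itemsRemaining(dcpStats map[string]string, connPrefix string) (total uint64) {
                for key, val := range dcpStats {
                    if !strings.HasPrefix(key, connPrefix) || !strings.HasSuffix(key, ":items_remaining") {
                        continue
                    }
                    if n, err := strconv.ParseUint(val, 10, 64); err == nil {
                        total += n
                    }
                }
                return total
            }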

            My current plan is:

            1. leave the current 20 second timeout alone - it is long enough and is not the cause of the timeouts observed.
            2. when seeing timeout errors, close the memcached connection and set up a new one. This should fix the bucket deletion issue.

            The open question is why the connection timed out in 10ms when the specified timeout is 20 seconds.

             


            build-team Couchbase Build Team added a comment -

            Build 5.0.0-2408 contains goxdcr commit 23ee180316d745d9dd13f32c0ceec22d6062c027 with commit message:
            MB-23161 reset memcached connections when receiving error
            https://github.com/couchbase/goxdcr/commit/23ee180316d745d9dd13f32c0ceec22d6062c027


            yu Yu Sui (Inactive) added a comment -

            Arun,

            A fix has been delivered from XDCR side. Please verify if the problem still exists in 5.0.0-2408.

            apiravi Aruna Piravi (Inactive) added a comment - - edited

            Tests are running fine so far on build 5.0.0-2412 - the 18 that have run so far passed - http://qa.sc.couchbase.com/view/XDCR/job/cen006-p1-xxdcr-vset07-00-goxdcr-rebalance/543/console

            This is a long running job, will take 8-10 more hours to complete.


            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Issue found again in 5.0.0-2412 - please find logs here:

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/23161.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.106.251.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.106.115.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.107.161.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.121.161.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.121.162.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.121.163.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.121.164.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.108.169.zip

            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/172.23.108.170.zip

            Bucket deletion issue happens at test_21 here: http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-00-goxdcr-rebalance/549/consoleText 

            All the following tests fail due to the rebalance issue observed before. For some reason, when this error occurs the collected logs are corrupted - an error like http://172.23.105.27:8091/diag IncompleteRead(12721 bytes read) is thrown - so we let the job run and, when the error occurred, stopped the run and collected logs on the cluster manually. Please note that the bucket deletion error occurs for the first time at timestamp 2017-03-28 07:09:40,395.

             

            drigby Dave Rigby added a comment -

            Yu Sui

            The dcp stats that xdcr needs is the number of remaining items in a dcp connection. An example is as follows:
            eq_dcpq:xdcr:dcp_f58e0727200a19771e4459925908dd66/default/target_10.17.2.102:12000_0:items_remaining

            Yes, things would be much better if dcp provides filtering capability so that only relevant stats is returned.

            Why does XDCR need this statistic? It shouldn't need to care about how much data is remaining - and moreover the stat is not consistent with what's actually remaining as far as the client is concerned (won't count anything in flight between the two ends of the TCP/IP connection).

            This seems to be an extra public interface which I wasn't aware KV-engine needed to support.


            yu Yu Sui (Inactive) added a comment -

            This is part of our effort for self-monitoring and self-healing. The idea is that if there are remaining items in dcp, and yet we have not received anything from the dcp streams for a while, we declare dcp to be stuck and re-set up the dcp connection to recover. For this purpose, the stats do not need to be 100% accurate.

            This code was put in because we did see dcp stuckness issues before. XDCR sets up dcp connections through gomemcached. It is possible that the issues were caused by some bugs in gomemcached. Regardless of whether the issues still exist today, the check on our side can serve as a catch-all mechanism to get us out of the stuck state should we fall into it for whatever reason.
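
            A rough sketch of the check being described (assumed structure, not the actual goxdcr code): the pipeline is declared stuck, and the dcp connection restarted, only when the producer still reports items remaining but nothing has been received within the check interval:

            package sketch

            import "time"

            type dcpHealth struct {
                lastReceived   time.Time     // updated whenever a mutation arrives from dcp
                stuckThreshold time.Duration // e.g. a few poll intervals
            }

            // isStuck: items are reportedly waiting on the producer, yet nothing has
            // arrived for longer than the threshold - time to tear down and re-set up.
            func (h *dcpHealth) isStuck(itemsRemaining uint64, now time.Time) bool {
                return itemsRemaining > 0 && now.Sub(h.lastReceived) > h.stuckThreshold
            }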


            trond Trond Norbye added a comment -

            There is dcp noop support in the server: you can enable it and it'll send out a noop at a fixed interval (which you should reply to).. Then you won't need to poll any stats... That support has been in DCP from the beginning, so as long as the remote side supports DCP you should be able to use it...


            yu Yu Sui (Inactive) added a comment -

            gomemcached is having issues handling NOOP, MB-21268. Even if gomemcached is able to handle NOOP, that won't solve the dcp stuckness issue.

            NOOP - keeps connection alive when downstream is slow/blocked

            stuckness check - detects the situation that downstream is not blocked, dcp connection is alive and has items to produce, and yet downstream is not receiving anything from dcp.

            drigby Dave Rigby added a comment - edited

            re: "stuckness" - as Trond said, the NOOP protocol is designed to handle the case where either end of the connection hasn't responded recently. All other DCP clients use it to handle timeouts / stuck connections.

            To expand - the producer will only send NOOPs if it's not already sending useful mutations - see the description at https://github.com/couchbaselabs/dcp-documentation/blob/master/documentation/dead-connections.md

            As such, as long as the consumer sees at least one normal message every 2x noop_interval, it can assume the connection is healthy (and it has useful data). If the producer has no "normal" data to produce, it sends NOOPs to tell the consumer "I'm still here, all is well". Consequently, if the consumer doesn't see either a normal or a NOOP message within 2x noop_interval, it should close the connection.

            I don't think we should be adding a separate mechanism just for goXDCR, particularly when the protocol already has consideration for it - indeed the mechanism I described is the one used for node failover. It just feels like unnecessary complexity and a source of more bugs (as we've found here!).
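
            The consumer-side rule described above (close the connection if neither a mutation nor a NOOP arrives within 2x noop_interval) can be expressed as a simple watchdog. The message channel and closeConn hook below are placeholders, not real gomemcached plumbing; this is only a sketch of the dead-connection check.

            // Sketch of the dead-connection check: any message (mutation or NOOP)
            // proves liveness; silence for two noop intervals closes the connection.
            package main

            import (
                "fmt"
                "time"
            )

            func watchConnection(msgs <-chan string, noopInterval time.Duration, closeConn func()) {
                timeout := 2 * noopInterval
                timer := time.NewTimer(timeout)
                defer timer.Stop()
                for {
                    select {
                    case m, ok := <-msgs:
                        if !ok {
                            return // stream ended normally
                        }
                        fmt.Println("received:", m) // mutation or NOOP - either resets the deadline
                        if !timer.Stop() {
                            <-timer.C
                        }
                        timer.Reset(timeout)
                    case <-timer.C:
                        fmt.Println("no traffic for 2x noop_interval, closing connection")
                        closeConn()
                        return
                    }
                }
            }

            func main() {
                msgs := make(chan string)
                go func() {
                    msgs <- "DCP_MUTATION"
                    // stop sending: the watcher should time out and close the connection
                }()
                watchConnection(msgs, 50*time.Millisecond, func() { fmt.Println("closed") })
            }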

            yu Yu Sui (Inactive) added a comment -

            I agree that NOOP can be used for the stuckness check, provided gomemcached handles it correctly and passes it on to XDCR. Will think more about this alternative after spock.

            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Yu Sui do we have an ETA on this?

            yu Yu Sui (Inactive) added a comment -

            The problem should have been fixed by MB-23423. It has not occurred in the last QE tests. QE will run another round of tests on a later build for further confirmation. If it still does not occur, we will close the bug.

            dhaikney David Haikney added a comment -

            From Yu's comment above, marking as resolved and assigning back to QE for verification.

            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Issue found again in 5.0.0-2564 - logs are here:
            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.105.27.zip
            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.106.251.zip
            https://s3.amazonaws.com/bugdb/jira/MB23161/23161/collectinfo-2017-04-10T040044-ns_1%40172.23.108.169.zip

            Bucket deletion happened at test_39:

            [2017-04-08 17:32:05,084] - [rest_client:784] ERROR - DELETE http://172.23.105.27:8091/pools/default/buckets/default
            body: headers: {'Content-Type': 'application/x-www-form-urlencoded', 'Accept': '/', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==\n'} error: 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue.\r\n"} auth: Administrator:password

            All the following testcases failed with rebalance issues - logs were collected at the end - the job can be found here:

            http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-00-goxdcr-rebalance/565/consoleFull

            yu Yu Sui (Inactive) added a comment -

            There are still two offending/waiting connections. The connections are no longer the ones set up by pipeline supervisors, though. The "cmd" field of each connection shows "DCP_MUTATION", which indicates that the connection is a dcp connection. In comparison, the "cmd" field was "STAT" for the connections set up by the pipeline supervisor.

            Will dig some more.

            2017-04-09T21:01:12.891631-07:00 NOTICE Worker thread 3: {"connection":"0x7f18d375a180","socket":50,"protocol":"memcached","peername":"172.23.105.27:46401","sockname":"172.23.105.27:11210","parent_port":11210,"bucket_index":1,"internal":true,"username":"@goxdcr","sasl_conn":"0x7f18e15e0100","nodelay":true,"refcount":2,"features":{"mutation_extras":false,"xerror":false},"engine_storage":"0x7f18c588c900","next":"0x0","thread":"0x7f1927483390","priority":"Medium","clustermap_revno":"unknown","sasl_disabled":false,"tap":false,"dcp":true,"dcp_xattr_aware":false,"dcp_no_value":false,"opaque":"0x44002a68","max_reqs_per_event":5,"nevents":0,"state":"conn_mwrite","cmd":"DCP_MUTATION","libevent":{"registered":true,"ev_flags":{"raw":"0x14","decoded":["write","persist"]},"which":{"raw":"0x4","decoded":["write"]}},"read":{"buf":"0x7f18c5815000","curr":"0x7f18c5815000","size":8192,"bytes":0},"write":{"buf":"0x7f18d1e96000","curr":"0x7f18d1e96037","size":2048,"bytes":55},"write_and_go":"conn_ship_log","ritem":"0x7f18c5815048","rlbytes":0,"item":"0x0","iov":{"size":10,"used":3,"vector":[{"base":"0x7f18d1e96000","len":"0x37"},{"base":"0x7f18e0c516f9","len":"0xb"},{"base":"0x7f18e13f402b","len":"0x161"}]},"msglist":{"size":5,"used":1,"curr":0,"bytes":578},"itemlist":{"size":1},"temp_alloc_list":{"size":0},"noreply":false,"DynamicBuffer":{"buffer":"0x0","size":0,"offset":0},"cas":"0x0","aiostat":0,"ewouldblock":false,"ssl":{"enabled":false},"total_recv":10496,"total_send":1063612,"datatype":"snappy,json"}

            yu Yu Sui (Inactive) added a comment -

            There were two replication instances with the same replication id running at the time of the problem. When the test completed and a command was issued to stop the replication, only one replication instance was stopped and the other was left running. This should be the reason that some dcp connections were left open at the end of the test run.

            Will look for ways to prevent multiple instances from occurring.
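
            One way to prevent duplicate instances is a registry keyed by replication id that refuses to start a second pipeline and remembers how to stop the one already running. The sketch below is only an illustration of that idea, not the actual goXDCR fix referenced in the following comment.

            // Hypothetical guard against two pipelines running under one replication id.
            package main

            import (
                "fmt"
                "sync"
            )

            type replicationRegistry struct {
                mu      sync.Mutex
                running map[string]func() // replication id -> stop function
            }

            func newRegistry() *replicationRegistry {
                return &replicationRegistry{running: make(map[string]func())}
            }

            // start refuses to launch a second instance for the same replication id.
            func (r *replicationRegistry) start(id string, stop func()) error {
                r.mu.Lock()
                defer r.mu.Unlock()
                if _, exists := r.running[id]; exists {
                    return fmt.Errorf("replication %s is already running", id)
                }
                r.running[id] = stop
                return nil
            }

            // stop tears down whatever instance is registered for the id, if any.
            func (r *replicationRegistry) stop(id string) {
                r.mu.Lock()
                defer r.mu.Unlock()
                if stopFn, ok := r.running[id]; ok {
                    stopFn()
                    delete(r.running, id)
                }
            }

            func main() {
                reg := newRegistry()
                _ = reg.start("default/target", func() { fmt.Println("stopped") })
                if err := reg.start("default/target", func() {}); err != nil {
                    fmt.Println(err) // second instance rejected
                }
                reg.stop("default/target")
            }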

            build-team Couchbase Build Team added a comment -

            Build 5.0.0-2629 contains goxdcr commit ac368d248c34024045f4c5a2fc64393e2a9fcbdc with commit message:
            MB-23161 fix replication stop sequence when replication spec is deleted
            https://github.com/couchbase/goxdcr/commit/ac368d248c34024045f4c5a2fc64393e2a9fcbdc

            arunkumar Arunkumar Senthilnathan (Inactive) added a comment -

            Issue not seen in 5.0.0-2634 regression - closing it out.

            People

              yu Yu Sui (Inactive)
              arunkumar Arunkumar Senthilnathan (Inactive)
              Votes: 0
              Watchers: 12
