Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-17042

memcached not terminated properly during shutdown

    XMLWordPrintable

Details

    • Untriaged
    • Unknown
    • KV: Mar 30 - April 12

    Description

      XDCR testcases failing as memcached is not terminated properly during shutdown

      http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-01-goxdcr-rebalance-ssl/122/consoleFull failed on Dec 5 - the first test case failed in cleanup throwing errors as follows:

      [2015-12-05 11:38:34,686] - [rest_client:746] ERROR - http://172.23.105.20:8091/pools/default/buckets/default error 500 reason: unknown

      {"_":"Bucket deletion not yet complete, but will continue.\r\n"}

      [2015-12-05 11:37:41,818] - [xdcrnewbasetests:2330] INFO - ====  XDCRNewbasetests cleanup is started for test #2 async_rebalance_in ====
      [2015-12-05 11:37:41,818] - [xdcrnewbasetests:1090] INFO - removing xdcr/nodes settings
      [2015-12-05 11:37:41,851] - [rest_client:936] INFO - Deleting replication /controller/cancelXDCR/5d97799bac38d87794867d8e4d8146de%2Fdefault%2Fdefault
      [2015-12-05 11:37:41,903] - [rest_client:884] INFO - removing remote cluster name:remote_cluster_C1-C2
      [2015-12-05 11:37:41,934] - [xdcrnewbasetests:1100] INFO - cleanup [ip:172.23.105.3 port:8091 ssh_username:root, ip:172.23.105.4 port:8091 ssh_username:root, ip:172.23.105.102 port:8091 ssh_username:root]
      [2015-12-05 11:37:41,965] - [bucket_helper:137] INFO - deleting existing buckets [u'default'] on 172.23.105.3
      [2015-12-05 11:37:41,965] - [bucket_helper:139] INFO - remove bucket default ...
      [2015-12-05 11:37:44,297] - [bucket_helper:153] INFO - deleted bucket : default from 172.23.105.3
      [2015-12-05 11:37:44,297] - [bucket_helper:226] INFO - waiting for bucket deletion to complete....
      [2015-12-05 11:37:44,306] - [rest_client:124] INFO - node 172.23.105.3 existing buckets : []
      [2015-12-05 11:37:44,345] - [cluster_helper:255] INFO - rebalancing all nodes in order to remove nodes
      [2015-12-05 11:37:44,350] - [rest_client:1181] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.105.102%2Cns_1%40172.23.105.4&user=Administrator&knownNodes=ns_1%40172.23.105.102%2Cns_1%40172.23.105.3%2Cns_1%40172.23.105.4
      [2015-12-05 11:37:44,357] - [rest_client:1185] INFO - rebalance operation started
      [2015-12-05 11:37:44,364] - [rest_client:1303] INFO - rebalance percentage : 0.00 %
      [2015-12-05 11:37:54,388] - [rest_client:1258] INFO - rebalance progress took 10.0304489136 seconds 
      [2015-12-05 11:37:54,388] - [rest_client:1259] INFO - sleep for 10 seconds after rebalance...
      [2015-12-05 11:38:04,420] - [cluster_helper:295] INFO - removed all the nodes from cluster associated with ip:172.23.105.3 port:8091 ssh_username:root ? [(u'ns_1@172.23.105.102', 8091), (u'ns_1@172.23.105.4', 8091)]
      [2015-12-05 11:38:04,427] - [cluster_helper:78] INFO - waiting for ns_server @ 172.23.105.3:8091
      [2015-12-05 11:38:04,434] - [cluster_helper:80] INFO - ns_server @ 172.23.105.3:8091 is running
      [2015-12-05 11:38:04,443] - [bucket_helper:137] INFO - deleting existing buckets [] on 172.23.105.4
      [2015-12-05 11:38:04,475] - [cluster_helper:78] INFO - waiting for ns_server @ 172.23.105.4:8091
      [2015-12-05 11:38:04,480] - [cluster_helper:80] INFO - ns_server @ 172.23.105.4:8091 is running
      [2015-12-05 11:38:04,488] - [bucket_helper:137] INFO - deleting existing buckets [] on 172.23.105.102
      [2015-12-05 11:38:04,510] - [cluster_helper:78] INFO - waiting for ns_server @ 172.23.105.102:8091
      [2015-12-05 11:38:04,518] - [cluster_helper:80] INFO - ns_server @ 172.23.105.102:8091 is running
      Cluster instance shutdown with force
      [2015-12-05 11:38:04,518] - [xdcrnewbasetests:1090] INFO - removing xdcr/nodes settings
      [2015-12-05 11:38:04,552] - [xdcrnewbasetests:1100] INFO - cleanup [ip:172.23.105.20 port:8091 ssh_username:root, ip:172.23.105.67 port:8091 ssh_username:root, ip:172.23.107.53 port:8091 ssh_username:root]
      [2015-12-05 11:38:04,663] - [bucket_helper:137] INFO - deleting existing buckets [u'default'] on 172.23.105.20
      [2015-12-05 11:38:04,663] - [bucket_helper:139] INFO - remove bucket default ...
      [2015-12-05 11:38:34,686] - [rest_client:746] ERROR - http://172.23.105.20:8091/pools/default/buckets/default error 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue.\r\n"}
      [2015-12-05 11:38:34,687] - [rest_client:1666] WARNING - Bucket deletion timed out waiting for all nodes
      [2015-12-05 11:38:34,700] - [remote_util:162] INFO - connecting to 172.23.105.20 with username : root password : couchbase ssh_key: 
      [2015-12-05 11:38:34,863] - [remote_util:195] INFO - Connected to 172.23.105.20
      [2015-12-05 11:38:35,422] - [remote_util:1997] INFO - running command.raw on 172.23.105.20: ls -LR '/opt/couchbase/var/lib/couchbase/data'
      [2015-12-05 11:38:35,506] - [remote_util:2034] INFO - command executed successfully
      [2015-12-05 11:38:35,506] - [remote_util:1936] INFO - /opt/couchbase/var/lib/couchbase/data:
      [2015-12-05 11:38:35,506] - [remote_util:1936] INFO - _replicator.couch.1
      [2015-12-05 11:38:35,506] - [remote_util:1936] INFO - _users.couch.1
      [2015-12-05 11:38:35,521] - [data_helper:295] INFO - creating direct client 172.23.105.20:11210 default
      

      This created an orphan memcached process in 172.23.105.67 which has been there since Dec 5. I have attached a screenshot of ps -ef on this box showing the orphan process. All the subsequent tests in the same build and all the builds that have run since Dec 5 on this cluster have all failed as they are not able to reach 172.23.105.67:

      ======================================================================
      ERROR: async_rebalance_out_master (xdcr.rebalanceXDCR.Rebalance)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "pytests/xdcr/rebalanceXDCR.py", line 9, in setUp
          super(Rebalance, self).setUp()
        File "pytests/xdcr/xdcrnewbasetests.py", line 2278, in setUp
          self.__init_parameters()
        File "pytests/xdcr/xdcrnewbasetests.py", line 2403, in __init_parameters
          self.__initialize_error_count_dict()
        File "pytests/xdcr/xdcrnewbasetests.py", line 2456, in __initialize_error_count_dict
          self.__error_count_dict[node.ip][error] = NodeHelper.check_goxdcr_log(node, error)
        File "pytests/xdcr/xdcrnewbasetests.py", line 396, in check_goxdcr_log
          goxdcr_log = NodeHelper.get_goxdcr_log_dir(server) + '/goxdcr.log*'
        File "pytests/xdcr/xdcrnewbasetests.py", line 387, in get_goxdcr_log_dir
          _, dir = RestConnection(node).diag_eval('filename:absname(element(2, application:get_env(ns_server,error_logger_mf_dir))).')
        File "lib/membase/api/rest_client.py", line 285, in __init__
          raise ServerUnavailableException(self.ip)
      ServerUnavailableException: unable to reach the host @ 172.23.105.67
       
      ----------------------------------------------------------------------
      

      These are the builds that have run after Dec 5 on the same cluster and have failed with same error above:

      http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-01-goxdcr-rebalance-ssl/123/consoleFull
      http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-01-goxdcr-rebalance-ssl/124/consoleFull
      http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-01-goxdcr-rebalance-ssl/125/consoleFull

      I am attaching logs from the first job which created the orphan memcached process. This is a live cluster with the orphan process still in there and jobs still failing. You can take a look the boxes if needed:

      1:172.23.105.3
      2:172.23.105.4
      3:172.23.105.20
      4:172.23.105.67
      5:172.23.107.53
      6:172.23.105.102

      Attachments

        1. 172.23.105.102-20151205-1146-diag.zip
          2.54 MB
        2. 172.23.105.178-20160211-2211-diag.zip
          3.97 MB
        3. 172.23.105.178-20160305-0620-diag.zip
          867 kB
        4. 172.23.105.179-20160211-2213-diag.zip
          842 kB
        5. 172.23.105.179-20160305-0621-diag.zip
          803 kB
        6. 172.23.105.180-20160211-2214-diag.zip
          4.12 MB
        7. 172.23.105.180-20160305-0622-diag.zip
          3.63 MB
        8. 172.23.105.181-20160211-2216-diag.zip
          3.72 MB
        9. 172.23.105.181-20160305-0624-diag.zip
          3.52 MB
        10. 172.23.105.20-20151205-1143-diag.zip
          4.30 MB
        11. 172.23.105.3-20151205-1141-diag.zip
          3.01 MB
        12. 172.23.105.4-20151205-1142-diag.zip
          2.11 MB
        13. 172.23.105.67-20151205-1144-diag.zip
          2.84 MB
        14. 172.23.107.118-20160211-2202-diag.zip
          9.53 MB
        15. 172.23.107.118-20160305-0612-diag.zip
          9.11 MB
        16. 172.23.107.120-20160211-2206-diag.zip
          9.90 MB
        17. 172.23.107.120-20160305-0615-diag.zip
          9.30 MB
        18. 172.23.107.169-20160211-2209-diag.zip
          5.82 MB
        19. 172.23.107.169-20160305-0618-diag.zip
          5.49 MB
        20. 172.23.107.171-20160211-2204-diag.zip
          4.67 MB
        21. 172.23.107.171-20160305-0614-diag.zip
          5.37 MB
        22. 172.23.107.53-20151205-1145-diag.zip
          1.41 MB
        23. 172.23.121.109-20160214-1420-diag.zip
          5.27 MB
        24. 172.23.121.110-20160214-1417-diag.zip
          4.93 MB
        25. 172.23.121.111-20160214-1418-diag.zip
          5.68 MB
        26. 172.23.121.119-20160214-1415-diag.zip
          8.85 MB
        27. consoleFull.rtf
          146 kB
        28. consoleFull.rtf
          148 kB
        29. consoleFull2.rtf
          148 kB
        30. Screen Shot 2015-12-10 at 5.21.36 pm.png
          Screen Shot 2015-12-10 at 5.21.36 pm.png
          292 kB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              arunkumar Arunkumar Senthilnathan (Inactive)
              arunkumar Arunkumar Senthilnathan (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty