Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
4.5.0
-
Security Level: Public
-
Untriaged
-
Unknown
-
KV: Mar 30 - April 12
Description
XDCR testcases failing as memcached is not terminated properly during shutdown
http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-01-goxdcr-rebalance-ssl/122/consoleFull failed on Dec 5 - the first test case failed in cleanup throwing errors as follows:
[2015-12-05 11:38:34,686] - [rest_client:746] ERROR - http://172.23.105.20:8091/pools/default/buckets/default error 500 reason: unknown
{"_":"Bucket deletion not yet complete, but will continue.\r\n"}
[2015-12-05 11:37:41,818] - [xdcrnewbasetests:2330] INFO - ==== XDCRNewbasetests cleanup is started for test #2 async_rebalance_in ====
|
[2015-12-05 11:37:41,818] - [xdcrnewbasetests:1090] INFO - removing xdcr/nodes settings
|
[2015-12-05 11:37:41,851] - [rest_client:936] INFO - Deleting replication /controller/cancelXDCR/5d97799bac38d87794867d8e4d8146de%2Fdefault%2Fdefault
|
[2015-12-05 11:37:41,903] - [rest_client:884] INFO - removing remote cluster name:remote_cluster_C1-C2
|
[2015-12-05 11:37:41,934] - [xdcrnewbasetests:1100] INFO - cleanup [ip:172.23.105.3 port:8091 ssh_username:root, ip:172.23.105.4 port:8091 ssh_username:root, ip:172.23.105.102 port:8091 ssh_username:root]
|
[2015-12-05 11:37:41,965] - [bucket_helper:137] INFO - deleting existing buckets [u'default'] on 172.23.105.3
|
[2015-12-05 11:37:41,965] - [bucket_helper:139] INFO - remove bucket default ...
|
[2015-12-05 11:37:44,297] - [bucket_helper:153] INFO - deleted bucket : default from 172.23.105.3
|
[2015-12-05 11:37:44,297] - [bucket_helper:226] INFO - waiting for bucket deletion to complete....
|
[2015-12-05 11:37:44,306] - [rest_client:124] INFO - node 172.23.105.3 existing buckets : []
|
[2015-12-05 11:37:44,345] - [cluster_helper:255] INFO - rebalancing all nodes in order to remove nodes
|
[2015-12-05 11:37:44,350] - [rest_client:1181] INFO - rebalance params : password=password&ejectedNodes=ns_1%40172.23.105.102%2Cns_1%40172.23.105.4&user=Administrator&knownNodes=ns_1%40172.23.105.102%2Cns_1%40172.23.105.3%2Cns_1%40172.23.105.4
|
[2015-12-05 11:37:44,357] - [rest_client:1185] INFO - rebalance operation started
|
[2015-12-05 11:37:44,364] - [rest_client:1303] INFO - rebalance percentage : 0.00 %
|
[2015-12-05 11:37:54,388] - [rest_client:1258] INFO - rebalance progress took 10.0304489136 seconds
|
[2015-12-05 11:37:54,388] - [rest_client:1259] INFO - sleep for 10 seconds after rebalance...
|
[2015-12-05 11:38:04,420] - [cluster_helper:295] INFO - removed all the nodes from cluster associated with ip:172.23.105.3 port:8091 ssh_username:root ? [(u'ns_1@172.23.105.102', 8091), (u'ns_1@172.23.105.4', 8091)]
|
[2015-12-05 11:38:04,427] - [cluster_helper:78] INFO - waiting for ns_server @ 172.23.105.3:8091
|
[2015-12-05 11:38:04,434] - [cluster_helper:80] INFO - ns_server @ 172.23.105.3:8091 is running
|
[2015-12-05 11:38:04,443] - [bucket_helper:137] INFO - deleting existing buckets [] on 172.23.105.4
|
[2015-12-05 11:38:04,475] - [cluster_helper:78] INFO - waiting for ns_server @ 172.23.105.4:8091
|
[2015-12-05 11:38:04,480] - [cluster_helper:80] INFO - ns_server @ 172.23.105.4:8091 is running
|
[2015-12-05 11:38:04,488] - [bucket_helper:137] INFO - deleting existing buckets [] on 172.23.105.102
|
[2015-12-05 11:38:04,510] - [cluster_helper:78] INFO - waiting for ns_server @ 172.23.105.102:8091
|
[2015-12-05 11:38:04,518] - [cluster_helper:80] INFO - ns_server @ 172.23.105.102:8091 is running
|
Cluster instance shutdown with force
|
[2015-12-05 11:38:04,518] - [xdcrnewbasetests:1090] INFO - removing xdcr/nodes settings
|
[2015-12-05 11:38:04,552] - [xdcrnewbasetests:1100] INFO - cleanup [ip:172.23.105.20 port:8091 ssh_username:root, ip:172.23.105.67 port:8091 ssh_username:root, ip:172.23.107.53 port:8091 ssh_username:root]
|
[2015-12-05 11:38:04,663] - [bucket_helper:137] INFO - deleting existing buckets [u'default'] on 172.23.105.20
|
[2015-12-05 11:38:04,663] - [bucket_helper:139] INFO - remove bucket default ...
|
[2015-12-05 11:38:34,686] - [rest_client:746] ERROR - http://172.23.105.20:8091/pools/default/buckets/default error 500 reason: unknown {"_":"Bucket deletion not yet complete, but will continue.\r\n"}
|
[2015-12-05 11:38:34,687] - [rest_client:1666] WARNING - Bucket deletion timed out waiting for all nodes
|
[2015-12-05 11:38:34,700] - [remote_util:162] INFO - connecting to 172.23.105.20 with username : root password : couchbase ssh_key:
|
[2015-12-05 11:38:34,863] - [remote_util:195] INFO - Connected to 172.23.105.20
|
[2015-12-05 11:38:35,422] - [remote_util:1997] INFO - running command.raw on 172.23.105.20: ls -LR '/opt/couchbase/var/lib/couchbase/data'
|
[2015-12-05 11:38:35,506] - [remote_util:2034] INFO - command executed successfully
|
[2015-12-05 11:38:35,506] - [remote_util:1936] INFO - /opt/couchbase/var/lib/couchbase/data:
|
[2015-12-05 11:38:35,506] - [remote_util:1936] INFO - _replicator.couch.1
|
[2015-12-05 11:38:35,506] - [remote_util:1936] INFO - _users.couch.1
|
[2015-12-05 11:38:35,521] - [data_helper:295] INFO - creating direct client 172.23.105.20:11210 default
|
This created an orphan memcached process in 172.23.105.67 which has been there since Dec 5. I have attached a screenshot of ps -ef on this box showing the orphan process. All the subsequent tests in the same build and all the builds that have run since Dec 5 on this cluster have all failed as they are not able to reach 172.23.105.67:
======================================================================
|
ERROR: async_rebalance_out_master (xdcr.rebalanceXDCR.Rebalance)
|
----------------------------------------------------------------------
|
Traceback (most recent call last):
|
File "pytests/xdcr/rebalanceXDCR.py", line 9, in setUp
|
super(Rebalance, self).setUp()
|
File "pytests/xdcr/xdcrnewbasetests.py", line 2278, in setUp
|
self.__init_parameters()
|
File "pytests/xdcr/xdcrnewbasetests.py", line 2403, in __init_parameters
|
self.__initialize_error_count_dict()
|
File "pytests/xdcr/xdcrnewbasetests.py", line 2456, in __initialize_error_count_dict
|
self.__error_count_dict[node.ip][error] = NodeHelper.check_goxdcr_log(node, error)
|
File "pytests/xdcr/xdcrnewbasetests.py", line 396, in check_goxdcr_log
|
goxdcr_log = NodeHelper.get_goxdcr_log_dir(server) + '/goxdcr.log*'
|
File "pytests/xdcr/xdcrnewbasetests.py", line 387, in get_goxdcr_log_dir
|
_, dir = RestConnection(node).diag_eval('filename:absname(element(2, application:get_env(ns_server,error_logger_mf_dir))).')
|
File "lib/membase/api/rest_client.py", line 285, in __init__
|
raise ServerUnavailableException(self.ip)
|
ServerUnavailableException: unable to reach the host @ 172.23.105.67
|
|
----------------------------------------------------------------------
|
These are the builds that have run after Dec 5 on the same cluster and have failed with same error above:
http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-01-goxdcr-rebalance-ssl/123/consoleFull
http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-01-goxdcr-rebalance-ssl/124/consoleFull
http://qa.sc.couchbase.com/job/cen006-p1-xxdcr-vset07-01-goxdcr-rebalance-ssl/125/consoleFull
I am attaching logs from the first job which created the orphan memcached process. This is a live cluster with the orphan process still in there and jobs still failing. You can take a look the boxes if needed:
1:172.23.105.3
2:172.23.105.4
3:172.23.105.20
4:172.23.105.67
5:172.23.107.53
6:172.23.105.102
Attachments
Issue Links
- is duplicated by
-
MB-18135 intermittent failure of test_xdcr_with_filter
- Closed
- is triggered by
-
MB-16745 [XDCR] Rebalance fails due to incomplete bucket deletion in cleanup
- Closed
- relates to
-
MB-19125 Rebalance test cases - bucket deletion issue
- Closed
-
MB-18691 Bucket deletion issue seen in goxdcr upgrade jenkins job
- Closed
-
MB-18688 Bucket deletion issue seen in bixdcr jenkins job
- Closed
-
MB-18689 Bucket deletion issue seen in goxdcr rebalance ssl jenkins job
- Closed
-
MB-18690 Bucket deletion issue seen in goxdcr pause resume P1 jenkins job
- Closed
-
MB-19354 TSan: data race in ~FlowControl during test_mb17042_duplicate_name_consumer_connections
- Closed