Couchbase Server / MB-7863

node is healthy in nodeStatuses even though its firewall is enabled and the cluster already saw that the node went down


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 2.1.0
    • Affects Version/s: 2.0, 2.0.1
    • Component/s: ns_server
    • Security Level: Public
    • Labels: None

    Description

      http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/566/consoleFull
      ./testrunner -i resources/jenkins/centos-64-5node-failover.ini get-logs=True,GROUP=BAT,get-cbcollect-info=True -t failovertests.FailoverTests.test_failover_firewall,replica=2,items=20000,GROUP=BAT
      version=2.0.0-1982-rel

      Steps:
      1. Enable the firewall on cluster node 10.1.3.117.
      2. Wait up to 5 minutes; the node is expected to become unhealthy in the /nodeStatuses API (a sketch of this wait loop is shown below).

      In case of failure, the test logs the output of "iptables --list", the UI log entries showing that the node went down, and the /nodeStatuses response (see below).

      I saw the same behavior on versions 2.0.0 and 2.0.1, but filed a bug only after adding extra logging to the test to identify the problem.
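      For clarity, here is a minimal sketch of the wait loop from step 2. It is not the actual rest_client.wait_for_node_status implementation; the 8091 REST port and the /nodeStatuses endpoint come from the output below, while the Administrator/password credentials and the helper's shape are assumptions.

      # Minimal sketch (Python 2, like the testrunner of that era) of polling
      # /nodeStatuses until a node stops reporting as healthy.
      import base64, json, time, urllib2

      def wait_for_unhealthy(rest_host, otp_node, timeout_sec=300):
          auth = base64.b64encode("Administrator:password")  # assumed credentials
          end = time.time() + timeout_sec
          while time.time() < end:
              req = urllib2.Request("http://{0}:8091/nodeStatuses".format(rest_host))
              req.add_header("Authorization", "Basic " + auth)
              statuses = json.load(urllib2.urlopen(req))
              for info in statuses.values():
                  if info.get("otpNode") == otp_node and info["status"] != "healthy":
                      return True  # node finally reported as unhealthy
              time.sleep(5)  # "sleep for 5 seconds before reading the node.status again"
          return False  # timed out: node still reported as healthy

      # Expected after enabling the firewall on 10.1.3.117:
      #   wait_for_unhealthy("10.1.3.114", "ns_1@10.1.3.117") -> True
      # Observed in this run: False after ~5 minutes, i.e. /nodeStatuses still
      # reports the node as healthy even though ns_node_disco on the other nodes
      # already logged that it went down.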

      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.enable_firewall] enabled firewall on ip:10.1.3.117 port:8091 ssh_username:root
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw /sbin/iptables --list
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain INPUT (policy ACCEPT)
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] REJECT tcp -- anywhere anywhere tcp dpts:cadlock2:60000 reject-with icmp-port-unreachable
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain FORWARD (policy ACCEPT)
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain OUTPUT (policy ACCEPT)
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain RH-Firewall-1-INPUT (0 references)
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [rest_client.wait_for_node_status] node ns_1@10.1.3.117 status : healthy
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [rest_client.wait_for_node_status] sleep for 5 seconds before reading the node.status again
      ....
      2013-03-04 05:45:17 | INFO | MainProcess | MainThread | [rest_client.wait_for_node_status] sleep for 5 seconds before reading the node.status again
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [rest_client.wait_for_node_status] node ns_1@10.1.3.117 status_reached : False
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.__init__] connecting to 10.1.3.117 with username : root password : couchbase ssh_key:
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.__init__] Connected
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw sudo cat /proc/cpuinfo
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw sudo df -Th
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw sudo cat /proc/meminfo
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw hostname
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw /sbin/iptables --list
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain INPUT (policy ACCEPT)
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] REJECT tcp -- anywhere anywhere tcp dpts:cadlock2:60000 reject-with icmp-port-unreachable
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain FORWARD (policy ACCEPT)
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain OUTPUT (policy ACCEPT)
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain RH-Firewall-1-INPUT (0 references)
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.115', u'code': 5, u'text': u"Node 'ns_1@10.1.3.115' saw that node 'ns_1@10.1.3.117' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404495905, u'type': u'warning'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.118', u'code': 5, u'text': u"Node 'ns_1@10.1.3.118' saw that node 'ns_1@10.1.3.117' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482733, u'type': u'warning'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.117', u'code': 5, u'text': u"Node 'ns_1@10.1.3.117' saw that node 'ns_1@10.1.3.116' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482729, u'type': u'warning'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.117', u'code': 5, u'text': u"Node 'ns_1@10.1.3.117' saw that node 'ns_1@10.1.3.115' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482729, u'type': u'warning'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.117', u'code': 5, u'text': u"Node 'ns_1@10.1.3.117' saw that node 'ns_1@10.1.3.118' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482728, u'type': u'warning'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.116', u'code': 5, u'text': u"Node 'ns_1@10.1.3.116' saw that node 'ns_1@10.1.3.117' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482110, u'type': u'warning'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.114', u'code': 1, u'text': u'Rebalance completed successfully.\n', u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1362404388568, u'type': u'info'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.114', u'code': 0, u'text': u'Bucket "default" rebalance does not seem to be swap rebalance', u'shortText': u'message', u'module': u'ns_vbucket_mover', u'tstamp': 1362404080697, u'type': u'info'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.118', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.1.3.118\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1362404080100, u'type': u'info'}
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body] {u'node': u'ns_1@10.1.3.117', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.1.3.117\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1362404079864, u'type': u'info'}
      2013-03-04 05:45:23 | INFO | MainProcess | MainThread | [failovertests.common_test_body] nodeStatuses: {u'10.1.3.114:8091': {u'status': u'healthy', u'replication': 1.0, u'otpNode': u'ns_1@10.1.3.114'}, u'10.1.3.117:8091': {u'status': u'healthy', u'replication': 0.0, u'otpNode': u'ns_1@10.1.3.117'}, u'10.1.3.115:8091': {u'status': u'healthy', u'replication': 1.0, u'otpNode': u'ns_1@10.1.3.115'}, u'10.1.3.118:8091': {u'status': u'healthy', u'replication': 0.0, u'otpNode': u'ns_1@10.1.3.118'}, u'10.1.3.116:8091': {u'status': u'healthy', u'replication': 0.0, u'otpNode': u'ns_1@10.1.3.116'}}
      FAIL

      Cbcollect logs are not available for this run because:

      IMPOSSIBLE TO GRAB CBCOLLECT FROM 10.1.3.118
      Traceback (most recent call last):
        File "testrunner", line 362, in <module>
          xunit.write("{0}/report-{1}".format(logs_folder, str_time))
        File "lib/xunit.py", line 102, in write
          report_xml_file.close()
      IOError: [Errno 28] No space left on device

    People

      Assignee: iryna
      Reporter: Andrei Baranouski
      Votes: 0
      Watchers: 6
