Details
Description
this problem is one to one as MB-7863
but don't reopen it because see the issue only on 2.2.0-837 build
http://qa.sc.couchbase.com/job/centos_x64--03_01--failover_tests-P0/33/consoleFull
1) [2014-02-12 01:16:37,211] - [rest_client:969] INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.1.3.117%2Cns_1%4010.1.3.141%2Cns_1%4010.1.3.115%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116
2) [2014-02-12 01:33:05,889] - [rest_client:1438] INFO - settings/autoFailover params : enabled=true&timeout=30
[2014-02-12 01:33:05,896] - [autofailovertests:129] INFO - sleep for 5 secs. ...
2) failover node 10.1.3.115
[2014-02-12 01:33:11,784] - [remote_util:1437] INFO - running command.raw on 10.1.3.115: /sbin/iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
[2014-02-12 01:33:11,928] - [remote_util:1466] INFO - command executed successfully
[2014-02-12 01:33:11,930] - [remote_util:2250] INFO - enabled firewall on ip:10.1.3.115 port:8091 ssh_username:root
[2014-02-12 01:33:11,930] - [remote_util:1437] INFO - running command.raw on 10.1.3.115: /sbin/iptables --list
[2014-02-12 01:33:12,096] - [remote_util:1466] INFO - command executed successfully
[2014-02-12 01:33:12,097] - [remote_util:1388] INFO - Chain INPUT (policy ACCEPT)
[2014-02-12 01:33:12,098] - [remote_util:1388] INFO - target prot opt source destination
[2014-02-12 01:33:12,098] - [remote_util:1388] INFO - REJECT tcp – anywhere anywhere tcp dpts:cadlock2:60000 reject-with icmp-port-unreachable
[2014-02-12 01:33:12,098] - [remote_util:1388] INFO -
[2014-02-12 01:33:12,098] - [remote_util:1388] INFO - Chain FORWARD (policy ACCEPT)
[2014-02-12 01:33:12,099] - [remote_util:1388] INFO - target prot opt source destination
[2014-02-12 01:33:12,099] - [remote_util:1388] INFO -
[2014-02-12 01:33:12,099] - [remote_util:1388] INFO - Chain OUTPUT (policy ACCEPT)
[2014-02-12 01:33:12,099] - [remote_util:1388] INFO - target prot opt source destination
[2014-02-12 01:33:12,100] - [remote_util:1388] INFO -
[2014-02-12 01:33:12,100] - [remote_util:1388] INFO - Chain RH-Firewall-1-INPUT (0 references)
[2014-02-12 01:33:12,100] - [remote_util:1388] INFO - target prot opt source destination
wait till until 10.1.3.115 becomes inactive
[2014-02-12 01:33:12,231] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.141 is active
[2014-02-12 01:33:12,232] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.117 is active
[2014-02-12 01:33:12,232] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.116 is active
[2014-02-12 01:33:12,233] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.115 is active
....
[2014-02-12 01:35:40,830] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.115 is active
[2014-02-12 01:35:40,831] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.117 is active
[2014-02-12 01:35:42,845] - [autofailovertests:74] INFO - Latest logs from UI:
[2014-02-12 01:35:43,192] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.118', u'code': 5, u'text': u"Node 'ns_1@10.1.3.118' saw that node 'ns_1@10.1.3.115' went down. Details: [
[2014-02-12 01:35:43,193] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.117', u'code': 5, u'text': u"Node 'ns_1@10.1.3.117' saw that node 'ns_1@10.1.3.115' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694042, u'type': u'warning'}
[2014-02-12 01:35:43,193] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.116', u'code': 5, u'text': u"Node 'ns_1@10.1.3.116' saw that node 'ns_1@10.1.3.115' went down. Details: [{nodedown_reason,n connection_closed}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694022, u'type': u'warning'}
[2014-02-12 01:35:43,194] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.115', u'code': 5, u'text': u"Node 'ns_1@10.1.3.115' saw that node 'ns_1@10.1.3.116' went down. Details: [{nodedown_reason,n net_tick_timeout}
]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694020, u'type': u'warning'}
[2014-02-12 01:35:43,194] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.115', u'code': 5, u'text': u"Node 'ns_1@10.1.3.115' saw that node 'ns_1@10.1.3.117' went down. Details: [
[2014-02-12 01:35:43,195] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.115', u'code': 5, u'text': u"Node 'ns_1@10.1.3.115' saw that node 'ns_1@10.1.3.118' went down. Details: [{nodedown_reason,n net_tick_timeout}
]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694019, u'type': u'warning'}
[2014-02-12 01:35:43,195] - [autofailovertests:75] ERROR -
[2014-02-12 01:35:43,196] - [autofailovertests:75] ERROR -
{u'node': u'ns_1@10.1.3.141', u'code': 1, u'text': u'Rebalance completed successfully.\n', u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1392196970008, u'type': u'info'}[2014-02-12 01:35:43,196] - [autofailovertests:75] ERROR -
{u'node': u'ns_1@10.1.3.115', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.1.3.115\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1392196633035, u'type': u'info'}[2014-02-12 01:35:43,197] - [autofailovertests:75] ERROR -
{u'node': u'ns_1@10.1.3.118', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.1.3.118\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1392196632540, u'type': u'info'}[2014-02-12 01:35:43,219] - [autofailovertests:76] WARNING - pools/default from 10.1.3.141 : {u'rebalanceProgressUri': u'/pools/default/rebalanceProgress', u'maxBucketCount': 10, u'name': u'default', u'fastWarmupSettings':
{u'fastWarmupEnabled': True, u'minItemsThreshold': 10, u'minMemoryThreshold': 10}, u'buckets':
{u'uri': u'/pools/default/buckets?v=131509077&uuid=5f6c347bf2415d227157dffe1b91bc4f'}, u'alerts': [], u'counters':
{u'rebalance_success': 26, u'rebalance_start': 26, u'failover_node': 4}, u'remoteClusters':
{u'validateURI': u'/pools/default/remoteClusters?just_validate=1', u'uri': u'/pools/default/remoteClusters?uuid=5f6c347bf2415d227157dffe1b91bc4f'}, u'controllers': {u'setAutoCompaction':
{u'validateURI': u'/controller/setAutoCompaction?just_validate=1', u'uri': u'/controller/setAutoCompaction?uuid=5f6c347bf2415d227157dffe1b91bc4f'}, u'addNode':
{u'uri': u'/controller/addNode?uuid=5f6c347bf2415d227157dffe1b91bc4f'}, u'setFastWarmup':
{u'validateURI': u'/controller/setFastWarmup?just_validate=1', u'uri': u'/controller/setFastWarmup?uuid=5f6c347bf2415d227157dffe1b91bc4f'}, u'failOver':
{u'uri': u'/controller/failOver?uuid=5f6c347bf2415d227157dffe1b91bc4f'}, u'reAddNode':
{u'uri': u'/controller/reAddNode?uuid=5f6c347bf2415d227157dffe1b91bc4f'}, u'replication':
{u'createURI': u'/controller/createReplication?uuid=5f6c347bf2415d227157dffe1b91bc4f', u'validateURI': u'/controller/createReplication?just_validate=1'}, u'ejectNode':
{u'uri': u'/controller/ejectNode?uuid=5f6c347bf2415d227157dffe1b91bc4f'}, u'rebalance': {u'uri': u'/controller/rebalance?uuid=5f6c347bf2415d227157dffe1b91bc4f'}}, u'tasks':
{u'uri': u'/pools/default/tasks?v=133172395'}, u'stopRebalanceUri': u'/controller/stopRebalance?uuid=5f6c347bf2415d227157dffe1b91bc4f', u'autoCompactionSettings': {u'viewFragmentationThreshold':
{u'percentage': 30, u'size': u'undefined'}, u'parallelDBAndViewCompaction': False, u'databaseFragmentationThreshold': {u'percentage': 30, u'size': u'undefined'}}, u'nodes': [{u'status': u'healthy', u'couchApiBase': u'http://10.1.3.116:8092/', u'uptime': u'11667', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1450782720, u'memoryTotal': 4040077312, u'hostname': u'10.1.3.116:8091', u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.116', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':
{u'swap_total': 6140452864, u'mem_free': 1450782720, u'swap_used': 102400, u'cpu_utilization_rate': 1.256281407035176, u'mem_total': 4040077312}, u'os': u'x86_64-unknown-linux-gnu', u'ports':
{u'proxy': 11211, u'direct': 11210}, u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1091907614, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 200121, u'cmd_get': 0.0, u'mem_used': 1055470912, u'get_hits': 0.0, u'curr_items': 200338, u'ep_bg_fetched': 0.0, u'curr_items_tot': 400459, u'couch_docs_data_size': 942368284}}, {u'status': u'healthy', u'couchApiBase': u'http://10.1.3.115:8092/', u'uptime': u'11666', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1516736512, u'memoryTotal': 4040077312, u'hostname': u'10.1.3.115:8091', u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.115', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':
{u'swap_total': 6140452864, u'mem_free': 1516736512, u'swap_used': 102400, u'cpu_utilization_rate': 1.2594458438287153, u'mem_total': 4040077312}, u'os': u'x86_64-unknown-linux-gnu', u'ports':
{u'proxy': 11211, u'direct': 11210}, u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1301962560, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 200221, u'cmd_get': 0.0, u'mem_used': 1055105712, u'get_hits': 0.0, u'curr_items': 200189, u'ep_bg_fetched': 0.0, u'curr_items_tot': 400410, u'couch_docs_data_size': 947419810}}, {u'status': u'healthy', u'couchApiBase': u'http://10.1.3.118:8092/', u'uptime': u'11667', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1519489024, u'memoryTotal': 4040077312, u'hostname': u'10.1.3.118:8091', u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.118', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':
{u'swap_total': 6140452864, u'mem_free': 1519489024, u'swap_used': 102400, u'cpu_utilization_rate': 1.0075566750629723, u'mem_total': 4040077312}, u'os': u'x86_64-unknown-linux-gnu', u'ports':
{u'proxy': 11211, u'direct': 11210}, u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1257213978, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 200315, u'cmd_get': 0.0, u'mem_used': 1056399872, u'get_hits': 0.0, u'curr_items': 200119, u'ep_bg_fetched': 0.0, u'curr_items_tot': 400434, u'couch_docs_data_size': 946168592}}, {u'status': u'healthy', u'couchApiBase': u'http://10.1.3.141:8092/', u'uptime': u'11667', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1348734976, u'memoryTotal': 4040048640, u'hostname': u'10.1.3.141:8091', u'thisNode': True, u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.141', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':
{u'swap_total': 6140452864, u'mem_free': 1348734976, u'swap_used': 102400, u'cpu_utilization_rate': 2.0050125313283207, u'mem_total': 4040048640}, u'os': u'x86_64-unknown-linux-gnu', u'ports':
{u'proxy': 11211, u'direct': 11210}, u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1202101515, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 199287, u'cmd_get': 0.0, u'mem_used': 1043292760, u'get_hits': 0.0, u'curr_items': 199223, u'ep_bg_fetched': 0.0, u'curr_items_tot': 398510, u'couch_docs_data_size': 941191994}}, {u'status': u'healthy', u'couchApiBase': u'http://10.1.3.117:8092/', u'uptime': u'11667', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1532370944, u'memoryTotal': 4040077312, u'hostname': u'10.1.3.117:8091', u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.117', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':
{u'swap_total': 6140452864, u'mem_free': 1532370944, u'swap_used': 102400, u'cpu_utilization_rate': 1.0075566750629723, u'mem_total': 4040077312}, u'os': u'x86_64-unknown-linux-gnu', u'ports':
{u'proxy': 11211, u'direct': 11210}, u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1074577442, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 200056, u'cmd_get': 0.0, u'mem_used': 1055051720, u'get_hits': 0.0, u'curr_items': 200131, u'ep_bg_fetched': 0.0, u'curr_items_tot': 400187, u'couch_docs_data_size': 941555446}}], u'nodeStatusesUri': u'/nodeStatuses', u'rebalanceStatus': u'none', u'alertsSilenceURL': u'/controller/resetAlerts?token=0&uuid=5f6c347bf2415d227157dffe1b91bc4f', u'storageTotals': {u'hdd':
{u'free': 174615122330, u'total': 281637294080, u'quotaTotal': 281637294080, u'usedByData': 5927763109, u'used': 51821262109}, u'ram':
{u'used': 19726376960, u'total': 20200357888, u'quotaTotal': 16158556160, u'usedByData': 5265320976, u'quotaUsed': 10768875520}}}
I don't see in UI logs that 10.1.3.115 was autofailover
I recall that after the update autofailover settings, we need a slight delay to the settings applied but we set them on 10.1.3.141 server and we have sleep in 5 seconds before firewall on(10.1.3.115)
https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.115-2122014-139-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.116-2122014-140-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.117-2122014-141-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.118-2122014-142-diag.zip
https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.141-2122014-136-diag.zip