Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-10212

node is healthy in nodeStatuses and was not autofaliver even it was with firewall on and cluster already saw that node went down

    XMLWordPrintable

Details

    • Bug
    • Resolution: Incomplete
    • Major
    • 3.0
    • 2.2.0, 2.5.1
    • ns_server
    • Security Level: Public
    • None
    • 2.2.0-837
    • Untriaged
    • Centos 64-bit

    Description

      this problem is one to one as MB-7863

      but don't reopen it because see the issue only on 2.2.0-837 build

      http://qa.sc.couchbase.com/job/centos_x64--03_01--failover_tests-P0/33/consoleFull

      1) [2014-02-12 01:16:37,211] - [rest_client:969] INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.1.3.117%2Cns_1%4010.1.3.141%2Cns_1%4010.1.3.115%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116

      2) [2014-02-12 01:33:05,889] - [rest_client:1438] INFO - settings/autoFailover params : enabled=true&timeout=30
      [2014-02-12 01:33:05,896] - [autofailovertests:129] INFO - sleep for 5 secs. ...

      2) failover node 10.1.3.115

      [2014-02-12 01:33:11,784] - [remote_util:1437] INFO - running command.raw on 10.1.3.115: /sbin/iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
      [2014-02-12 01:33:11,928] - [remote_util:1466] INFO - command executed successfully
      [2014-02-12 01:33:11,930] - [remote_util:2250] INFO - enabled firewall on ip:10.1.3.115 port:8091 ssh_username:root
      [2014-02-12 01:33:11,930] - [remote_util:1437] INFO - running command.raw on 10.1.3.115: /sbin/iptables --list
      [2014-02-12 01:33:12,096] - [remote_util:1466] INFO - command executed successfully
      [2014-02-12 01:33:12,097] - [remote_util:1388] INFO - Chain INPUT (policy ACCEPT)
      [2014-02-12 01:33:12,098] - [remote_util:1388] INFO - target prot opt source destination
      [2014-02-12 01:33:12,098] - [remote_util:1388] INFO - REJECT tcp – anywhere anywhere tcp dpts:cadlock2:60000 reject-with icmp-port-unreachable
      [2014-02-12 01:33:12,098] - [remote_util:1388] INFO -
      [2014-02-12 01:33:12,098] - [remote_util:1388] INFO - Chain FORWARD (policy ACCEPT)
      [2014-02-12 01:33:12,099] - [remote_util:1388] INFO - target prot opt source destination
      [2014-02-12 01:33:12,099] - [remote_util:1388] INFO -
      [2014-02-12 01:33:12,099] - [remote_util:1388] INFO - Chain OUTPUT (policy ACCEPT)
      [2014-02-12 01:33:12,099] - [remote_util:1388] INFO - target prot opt source destination
      [2014-02-12 01:33:12,100] - [remote_util:1388] INFO -
      [2014-02-12 01:33:12,100] - [remote_util:1388] INFO - Chain RH-Firewall-1-INPUT (0 references)
      [2014-02-12 01:33:12,100] - [remote_util:1388] INFO - target prot opt source destination

      wait till until 10.1.3.115 becomes inactive

      [2014-02-12 01:33:12,231] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.141 is active
      [2014-02-12 01:33:12,232] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.117 is active
      [2014-02-12 01:33:12,232] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.116 is active
      [2014-02-12 01:33:12,233] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.115 is active
      ....

      [2014-02-12 01:35:40,830] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.115 is active
      [2014-02-12 01:35:40,831] - [autofailovertests:106] INFO - 'clusterMembership' for node ns_1@10.1.3.117 is active
      [2014-02-12 01:35:42,845] - [autofailovertests:74] INFO - Latest logs from UI:
      [2014-02-12 01:35:43,192] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.118', u'code': 5, u'text': u"Node 'ns_1@10.1.3.118' saw that node 'ns_1@10.1.3.115' went down. Details: [

      {nodedown_reason,\n net_tick_timeout}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694813, u'type': u'warning'}
      [2014-02-12 01:35:43,193] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.117', u'code': 5, u'text': u"Node 'ns_1@10.1.3.117' saw that node 'ns_1@10.1.3.115' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694042, u'type': u'warning'}
      [2014-02-12 01:35:43,193] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.116', u'code': 5, u'text': u"Node 'ns_1@10.1.3.116' saw that node 'ns_1@10.1.3.115' went down. Details: [{nodedown_reason,n connection_closed}]
      ", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694022, u'type': u'warning'}
      [2014-02-12 01:35:43,194] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.115', u'code': 5, u'text': u"Node 'ns_1@10.1.3.115' saw that node 'ns_1@10.1.3.116' went down. Details: [{nodedown_reason,n net_tick_timeout}

      ]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694020, u'type': u'warning'}
      [2014-02-12 01:35:43,194] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.115', u'code': 5, u'text': u"Node 'ns_1@10.1.3.115' saw that node 'ns_1@10.1.3.117' went down. Details: [

      {nodedown_reason,\n net_tick_timeout}]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694020, u'type': u'warning'}
      [2014-02-12 01:35:43,195] - [autofailovertests:75] ERROR - {u'node': u'ns_1@10.1.3.115', u'code': 5, u'text': u"Node 'ns_1@10.1.3.115' saw that node 'ns_1@10.1.3.118' went down. Details: [{nodedown_reason,n net_tick_timeout}

      ]", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1392197694019, u'type': u'warning'}
      [2014-02-12 01:35:43,195] - [autofailovertests:75] ERROR -

      {u'node': u'ns_1@10.1.3.141', u'code': 0, u'text': u'Updating auto-failover timeout to 30', u'shortText': u'message', u'module': u'auto_failover', u'tstamp': 1392197619112, u'type': u'info'}

      [2014-02-12 01:35:43,196] - [autofailovertests:75] ERROR -

      {u'node': u'ns_1@10.1.3.141', u'code': 1, u'text': u'Rebalance completed successfully.\n', u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1392196970008, u'type': u'info'}

      [2014-02-12 01:35:43,196] - [autofailovertests:75] ERROR -

      {u'node': u'ns_1@10.1.3.115', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.1.3.115\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1392196633035, u'type': u'info'}

      [2014-02-12 01:35:43,197] - [autofailovertests:75] ERROR -

      {u'node': u'ns_1@10.1.3.118', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.1.3.118\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1392196632540, u'type': u'info'}

      [2014-02-12 01:35:43,219] - [autofailovertests:76] WARNING - pools/default from 10.1.3.141 : {u'rebalanceProgressUri': u'/pools/default/rebalanceProgress', u'maxBucketCount': 10, u'name': u'default', u'fastWarmupSettings':

      {u'fastWarmupEnabled': True, u'minItemsThreshold': 10, u'minMemoryThreshold': 10}

      , u'buckets':

      {u'uri': u'/pools/default/buckets?v=131509077&uuid=5f6c347bf2415d227157dffe1b91bc4f'}

      , u'alerts': [], u'counters':

      {u'rebalance_success': 26, u'rebalance_start': 26, u'failover_node': 4}

      , u'remoteClusters':

      {u'validateURI': u'/pools/default/remoteClusters?just_validate=1', u'uri': u'/pools/default/remoteClusters?uuid=5f6c347bf2415d227157dffe1b91bc4f'}

      , u'controllers': {u'setAutoCompaction':

      {u'validateURI': u'/controller/setAutoCompaction?just_validate=1', u'uri': u'/controller/setAutoCompaction?uuid=5f6c347bf2415d227157dffe1b91bc4f'}

      , u'addNode':

      {u'uri': u'/controller/addNode?uuid=5f6c347bf2415d227157dffe1b91bc4f'}

      , u'setFastWarmup':

      {u'validateURI': u'/controller/setFastWarmup?just_validate=1', u'uri': u'/controller/setFastWarmup?uuid=5f6c347bf2415d227157dffe1b91bc4f'}

      , u'failOver':

      {u'uri': u'/controller/failOver?uuid=5f6c347bf2415d227157dffe1b91bc4f'}

      , u'reAddNode':

      {u'uri': u'/controller/reAddNode?uuid=5f6c347bf2415d227157dffe1b91bc4f'}

      , u'replication':

      {u'createURI': u'/controller/createReplication?uuid=5f6c347bf2415d227157dffe1b91bc4f', u'validateURI': u'/controller/createReplication?just_validate=1'}

      , u'ejectNode':

      {u'uri': u'/controller/ejectNode?uuid=5f6c347bf2415d227157dffe1b91bc4f'}

      , u'rebalance': {u'uri': u'/controller/rebalance?uuid=5f6c347bf2415d227157dffe1b91bc4f'}}, u'tasks':

      {u'uri': u'/pools/default/tasks?v=133172395'}

      , u'stopRebalanceUri': u'/controller/stopRebalance?uuid=5f6c347bf2415d227157dffe1b91bc4f', u'autoCompactionSettings': {u'viewFragmentationThreshold':

      {u'percentage': 30, u'size': u'undefined'}

      , u'parallelDBAndViewCompaction': False, u'databaseFragmentationThreshold': {u'percentage': 30, u'size': u'undefined'}}, u'nodes': [{u'status': u'healthy', u'couchApiBase': u'http://10.1.3.116:8092/', u'uptime': u'11667', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1450782720, u'memoryTotal': 4040077312, u'hostname': u'10.1.3.116:8091', u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.116', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':

      {u'swap_total': 6140452864, u'mem_free': 1450782720, u'swap_used': 102400, u'cpu_utilization_rate': 1.256281407035176, u'mem_total': 4040077312}

      , u'os': u'x86_64-unknown-linux-gnu', u'ports':

      {u'proxy': 11211, u'direct': 11210}

      , u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1091907614, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 200121, u'cmd_get': 0.0, u'mem_used': 1055470912, u'get_hits': 0.0, u'curr_items': 200338, u'ep_bg_fetched': 0.0, u'curr_items_tot': 400459, u'couch_docs_data_size': 942368284}}, {u'status': u'healthy', u'couchApiBase': u'http://10.1.3.115:8092/', u'uptime': u'11666', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1516736512, u'memoryTotal': 4040077312, u'hostname': u'10.1.3.115:8091', u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.115', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':

      {u'swap_total': 6140452864, u'mem_free': 1516736512, u'swap_used': 102400, u'cpu_utilization_rate': 1.2594458438287153, u'mem_total': 4040077312}

      , u'os': u'x86_64-unknown-linux-gnu', u'ports':

      {u'proxy': 11211, u'direct': 11210}

      , u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1301962560, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 200221, u'cmd_get': 0.0, u'mem_used': 1055105712, u'get_hits': 0.0, u'curr_items': 200189, u'ep_bg_fetched': 0.0, u'curr_items_tot': 400410, u'couch_docs_data_size': 947419810}}, {u'status': u'healthy', u'couchApiBase': u'http://10.1.3.118:8092/', u'uptime': u'11667', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1519489024, u'memoryTotal': 4040077312, u'hostname': u'10.1.3.118:8091', u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.118', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':

      {u'swap_total': 6140452864, u'mem_free': 1519489024, u'swap_used': 102400, u'cpu_utilization_rate': 1.0075566750629723, u'mem_total': 4040077312}

      , u'os': u'x86_64-unknown-linux-gnu', u'ports':

      {u'proxy': 11211, u'direct': 11210}

      , u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1257213978, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 200315, u'cmd_get': 0.0, u'mem_used': 1056399872, u'get_hits': 0.0, u'curr_items': 200119, u'ep_bg_fetched': 0.0, u'curr_items_tot': 400434, u'couch_docs_data_size': 946168592}}, {u'status': u'healthy', u'couchApiBase': u'http://10.1.3.141:8092/', u'uptime': u'11667', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1348734976, u'memoryTotal': 4040048640, u'hostname': u'10.1.3.141:8091', u'thisNode': True, u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.141', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':

      {u'swap_total': 6140452864, u'mem_free': 1348734976, u'swap_used': 102400, u'cpu_utilization_rate': 2.0050125313283207, u'mem_total': 4040048640}

      , u'os': u'x86_64-unknown-linux-gnu', u'ports':

      {u'proxy': 11211, u'direct': 11210}

      , u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1202101515, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 199287, u'cmd_get': 0.0, u'mem_used': 1043292760, u'get_hits': 0.0, u'curr_items': 199223, u'ep_bg_fetched': 0.0, u'curr_items_tot': 398510, u'couch_docs_data_size': 941191994}}, {u'status': u'healthy', u'couchApiBase': u'http://10.1.3.117:8092/', u'uptime': u'11667', u'clusterMembership': u'active', u'mcdMemoryReserved': 3082, u'memoryFree': 1532370944, u'memoryTotal': 4040077312, u'hostname': u'10.1.3.117:8091', u'clusterCompatibility': 131072, u'version': u'2.2.0-837-rel-enterprise', u'otpNode': u'ns_1@10.1.3.117', u'mcdMemoryAllocated': 3082, u'otpCookie': u'arpvtvqtckjhcyil', u'systemStats':

      {u'swap_total': 6140452864, u'mem_free': 1532370944, u'swap_used': 102400, u'cpu_utilization_rate': 1.0075566750629723, u'mem_total': 4040077312}

      , u'os': u'x86_64-unknown-linux-gnu', u'ports':

      {u'proxy': 11211, u'direct': 11210}

      , u'interestingStats': {u'couch_views_data_size': 0, u'couch_docs_actual_disk_size': 1074577442, u'couch_views_actual_disk_size': 0, u'ops': 0.0, u'vb_replica_curr_items': 200056, u'cmd_get': 0.0, u'mem_used': 1055051720, u'get_hits': 0.0, u'curr_items': 200131, u'ep_bg_fetched': 0.0, u'curr_items_tot': 400187, u'couch_docs_data_size': 941555446}}], u'nodeStatusesUri': u'/nodeStatuses', u'rebalanceStatus': u'none', u'alertsSilenceURL': u'/controller/resetAlerts?token=0&uuid=5f6c347bf2415d227157dffe1b91bc4f', u'storageTotals': {u'hdd':

      {u'free': 174615122330, u'total': 281637294080, u'quotaTotal': 281637294080, u'usedByData': 5927763109, u'used': 51821262109}

      , u'ram':

      {u'used': 19726376960, u'total': 20200357888, u'quotaTotal': 16158556160, u'usedByData': 5265320976, u'quotaUsed': 10768875520}

      }}

      I don't see in UI logs that 10.1.3.115 was autofailover

      I recall that after the update autofailover settings, we need a slight delay to the settings applied but we set them on 10.1.3.141 server and we have sleep in 5 seconds before firewall on(10.1.3.115)

      https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.115-2122014-139-diag.zip
      https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.116-2122014-140-diag.zip
      https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.117-2122014-141-diag.zip
      https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.118-2122014-142-diag.zip
      https://s3.amazonaws.com/bugdb/jira/MB-10212/9a4f28e3/10.1.3.141-2122014-136-diag.zip

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            Aliaksey Artamonau Aliaksey Artamonau (Inactive)
            andreibaranouski Andrei Baranouski
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty