Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-7863

node is healthy in nodeStatuses even it was with firewall on and cluster already saw that node went down

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0, 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/566/consoleFull
      ./testrunner -i resources/jenkins/centos-64-5node-failover.ini get-logs=True,GROUP=BAT,get-cbcollect-info=True -t failovertests.FailoverTests.test_failover_firewall,replica=2,items=20000,GROUP=BAT
      version=2.0.0-1982-rel

      Steps:
      1. in cluster enabled firewall on ip:10.1.3.117
      2. then wait for 5 minutes, the node becomes unhealthy in /nodeStatuses API

      if in case of falling output "iptables --list", UI logs that shows node went down and /nodeStatuses

      I saw the same on different versions 2.0.0 & 2.0.1 but filled a bug only after add to test additional logs to identify the problem

      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.enable_firewall] enabled firewall on ip:10.1.3.117 port:8091 ssh_username:root
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw /sbin/iptables --list
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain INPUT (policy ACCEPT)
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] REJECT tcp – anywhere anywhere tcp dpts:cadlock2:60000 reject-with icmp-port-unreachable
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain FORWARD (policy ACCEPT)
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain OUTPUT (policy ACCEPT)
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain RH-Firewall-1-INPUT (0 references)
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [rest_client.wait_for_node_status] node ns_1@10.1.3.117 status : healthy
      2013-03-04 05:40:20 | INFO | MainProcess | MainThread | [rest_client.wait_for_node_status] sleep for 5
      seconds before reading the node.status again
      ....
      2013-03-04 05:45:17 | INFO | MainProcess | MainThread | [rest_client.wait_for_node_status] sleep for 5 seconds before reading the node.status again
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [rest_client.wait_for_node_status] node ns_1@10.1.3.117 status_reached : False
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.__init__] connecting to 10.1.3.117 with username : root password : couchbase ssh_key:
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.__init__] Connected
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw sudo cat /proc/cpuinfo
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw sudo df -Th
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw sudo cat /proc/meminfo
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw hostname
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] running command.raw /sbin/iptables --list
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.execute_command_raw] command executed successfully
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain INPUT (policy ACCEPT)
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] REJECT tcp – anywhere anywhere tcp dpts:cadlock2:60000 reject-with icmp-port-unreachable
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain FORWARD (policy ACCEPT)
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain OUTPUT (policy ACCEPT)
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output]
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] Chain RH-Firewall-1-INPUT (0 references)
      2013-03-04 05:45:22 | INFO | MainProcess | MainThread | [remote_util.log_command_output] target prot opt source destination
      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.115', u'code': 5, u'text': u"Node 'ns_1@10.1.3.115' saw that node 'ns_1@10.1.3.117' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404495905, u'type': u'warning'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.118', u'code': 5, u'text': u"Node 'ns_1@10.1.3.118' saw that node 'ns_1@10.1.3.117' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482733, u'type': u'warning'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.117', u'code': 5, u'text': u"Node 'ns_1@10.1.3.117' saw that node 'ns_1@10.1.3.116' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482729, u'type': u'warning'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.117', u'code': 5, u'text': u"Node 'ns_1@10.1.3.117' saw that node 'ns_1@10.1.3.115' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482729, u'type': u'warning'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.117', u'code': 5, u'text': u"Node 'ns_1@10.1.3.117' saw that node 'ns_1@10.1.3.118' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482728, u'type': u'warning'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.116', u'code': 5, u'text': u"Node 'ns_1@10.1.3.116' saw that node 'ns_1@10.1.3.117' went down.", u'shortText': u'node down', u'module': u'ns_node_disco', u'tstamp': 1362404482110, u'type': u'warning'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.114', u'code': 1, u'text': u'Rebalance completed successfully.\n', u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1362404388568, u'type': u'info'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.114', u'code': 0, u'text': u'Bucket "default" rebalance does not seem to be swap rebalance', u'shortText': u'message', u'module': u'ns_vbucket_mover', u'tstamp': 1362404080697, u'type': u'info'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.118', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.1.3.118\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1362404080100, u'type': u'info'}

      2013-03-04 05:45:23 | ERROR | MainProcess | MainThread | [failovertests.common_test_body]

      {u'node': u'ns_1@10.1.3.117', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.1.3.117\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1362404079864, u'type': u'info'}

      2013-03-04 05:45:23 | INFO | MainProcess | MainThread | [failovertests.common_test_body] nodeStatuses: {u'10.1.3.114:8091':

      {u'status': u'healthy', u'replication': 1.0, u'otpNode': u'ns_1@10.1.3.114'}

      , u'10.1.3.117:8091':

      {u'status': u'healthy', u'replication': 0.0, u'otpNode': u'ns_1@10.1.3.117'}

      , u'10.1.3.115:8091':

      {u'status': u'healthy', u'replication': 1.0, u'otpNode': u'ns_1@10.1.3.115'}

      , u'10.1.3.118:8091':

      {u'status': u'healthy', u'replication': 0.0, u'otpNode': u'ns_1@10.1.3.118'}

      , u'10.1.3.116:8091': {u'status': u'healthy', u'replication': 0.0, u'otpNode': u'ns_1@10.1.3.116'}}
      FAIL

      now don't have logs due to:

      IMPOSSIBLE TO GRAB CBCOLLECT FROM 10.1.3.118
      Traceback (most recent call last):
      File "testrunner", line 362, in <module>
      xunit.write("

      {0}

      /report-

      {1}

      ".format(logs_folder, str_time))
      File "lib/xunit.py", line 102, in write
      report_xml_file.close()
      IOError: [Errno 28] No space left on device

      # Subject Project Status CR V
      For Gerrit Dashboard: &For+MB-7863=message:MB-7863

        Activity

        Hide
        thuan Thuan Nguyen added a comment -

        Integrated in win-ui-testing-P0 #62 (See http://qa.hq.northscale.net/job/win-ui-testing-P0/62/)

        Result = SUCCESS

        Show
        thuan Thuan Nguyen added a comment - Integrated in win-ui-testing-P0 #62 (See http://qa.hq.northscale.net/job/win-ui-testing-P0/62/ ) Result = SUCCESS
        Hide
        thuan Thuan Nguyen added a comment -

        Integrated in windows32_rebalance-kv #12 (See http://qa.hq.northscale.net/job/windows32_rebalance-kv/12/)
        MB-7863: minor fix in logs and timeouts (Revision 8308f5bd88f75b0606094899050dfcf9b453fce7)

        Result = UNSTABLE
        andrei :
        Files :

        • pytests/autofailovertests.py
        Show
        thuan Thuan Nguyen added a comment - Integrated in windows32_rebalance-kv #12 (See http://qa.hq.northscale.net/job/windows32_rebalance-kv/12/ ) MB-7863 : minor fix in logs and timeouts (Revision 8308f5bd88f75b0606094899050dfcf9b453fce7) Result = UNSTABLE andrei : Files : pytests/autofailovertests.py
        Hide
        thuan Thuan Nguyen added a comment -

        Integrated in windows32_view_P0 #10 (See http://qa.hq.northscale.net/job/windows32_view_P0/10/)
        MB-7863: minor fix in logs and timeouts (Revision 8308f5bd88f75b0606094899050dfcf9b453fce7)

        Result = UNSTABLE
        andrei :
        Files :

        • pytests/autofailovertests.py
        Show
        thuan Thuan Nguyen added a comment - Integrated in windows32_view_P0 #10 (See http://qa.hq.northscale.net/job/windows32_view_P0/10/ ) MB-7863 : minor fix in logs and timeouts (Revision 8308f5bd88f75b0606094899050dfcf9b453fce7) Result = UNSTABLE andrei : Files : pytests/autofailovertests.py
        Hide
        thuan Thuan Nguyen added a comment -

        Integrated in windows_cas_P1 #60 (See http://qa.hq.northscale.net/job/windows_cas_P1/60/)
        MB-7863: minor fix in logs and timeouts (Revision 8308f5bd88f75b0606094899050dfcf9b453fce7)

        Result = UNSTABLE
        andrei :
        Files :

        • pytests/autofailovertests.py
        Show
        thuan Thuan Nguyen added a comment - Integrated in windows_cas_P1 #60 (See http://qa.hq.northscale.net/job/windows_cas_P1/60/ ) MB-7863 : minor fix in logs and timeouts (Revision 8308f5bd88f75b0606094899050dfcf9b453fce7) Result = UNSTABLE andrei : Files : pytests/autofailovertests.py
        Hide
        thuan Thuan Nguyen added a comment -

        Integrated in windows32_sanity_P0 #32 (See http://qa.hq.northscale.net/job/windows32_sanity_P0/32/)
        MB-7863: minor fix in logs and timeouts (Revision 8308f5bd88f75b0606094899050dfcf9b453fce7)

        Result = SUCCESS
        andrei :
        Files :

        • pytests/autofailovertests.py
        Show
        thuan Thuan Nguyen added a comment - Integrated in windows32_sanity_P0 #32 (See http://qa.hq.northscale.net/job/windows32_sanity_P0/32/ ) MB-7863 : minor fix in logs and timeouts (Revision 8308f5bd88f75b0606094899050dfcf9b453fce7) Result = SUCCESS andrei : Files : pytests/autofailovertests.py

          People

          • Assignee:
            iryna iryna
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes