Couchbase Server / MB-55879

[System Test] Autofailover did not go through because of safety check failure

Details

    • Untriaged
    • 0
    • Unknown

    Description

      Build :7.2.0-5232
      Test : -test tests/2i/neo/test_neo_idx_clusterops_recovery.yml -scope tests/2i/neo/scope_neo_plasma_idx_dgm.yml
      Scale : 3

      An auto-failover of a node was attempted (the trigger is not yet clear), but it did not go through because the safety check for the index service failed.

      The node that could not be failed over is 172.23.97.109. The entries below were logged by ns_1@172.23.96.198 -

      /opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[user:info,2023-03-07T08:25:00.948-08:00,ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.
      /opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[user:info,2023-03-07T08:25:08.964-08:00,ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.
      /opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[ns_server:info,2023-03-07T08:25:08.965-08:00,ns_1@172.23.96.198:ns_log<0.25245.0>:ns_log:is_duplicate_log:156]suppressing duplicate log auto_failover:0([<<"Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.">>]) because it's been seen 1 times in the past 8.016095 secs (last seen 8.016095 secs ago
      /opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[user:info,2023-03-07T08:25:15.977-08:00,ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.
      /opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[ns_server:info,2023-03-07T08:25:15.977-08:00,ns_1@172.23.96.198:ns_log<0.25245.0>:ns_log:is_duplicate_log:156]suppressing duplicate log auto_failover:0([<<"Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.">>]) because it's been seen 2 times in the past 15.028912 secs (last seen 7.012817 secs ago
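
      For triage, the rejected failover attempts can be pulled out of the grep output with a small script. This is only a sketch: the regular expression is written against the three `log_unsafe_node` lines quoted above and is an assumption about the message format, not something taken from ns_server. The names `UNSAFE_RE`, `unsafe_failover_events`, and `SAMPLE_LINES` are made up for illustration.

```python
import re
from datetime import datetime

# Pattern derived from the log_unsafe_node lines quoted in this ticket (assumed format).
UNSAFE_RE = re.compile(
    r"\[user:info,(?P<ts>[^,]+),(?P<reporter>ns_1@[\d.]+):[^\]]*"
    r"auto_failover:log_unsafe_node:\d+\]"
    r"Could not automatically fail over node \('(?P<node>[^']+)'\) "
    r"due to operation being unsafe for service (?P<service>\w+)\."
)


def unsafe_failover_events(lines):
    """Return (timestamp, reporting node, target node, service) for each rejected attempt."""
    events = []
    for line in lines:
        m = UNSAFE_RE.search(line)  # search() tolerates the "path:" prefix added by grep
        if m:
            events.append(
                (datetime.fromisoformat(m["ts"]), m["reporter"], m["node"], m["service"])
            )
    return events


# The three user:info entries from the excerpt above, grep prefix included.
_PREFIX = "/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:"
_MSG = (
    "ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]"
    "Could not automatically fail over node ('ns_1@172.23.97.109') due to "
    "operation being unsafe for service index. Safety check failed."
)
SAMPLE_LINES = [
    f"{_PREFIX}[user:info,2023-03-07T08:25:00.948-08:00,{_MSG}",
    f"{_PREFIX}[user:info,2023-03-07T08:25:08.964-08:00,{_MSG}",
    f"{_PREFIX}[user:info,2023-03-07T08:25:15.977-08:00,{_MSG}",
]

events = unsafe_failover_events(SAMPLE_LINES)
# Seconds between consecutive rejected attempts.
gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]

for ts, reporter, node, service in events:
    print(ts.isoformat(), reporter, "->", node, "service:", service)
print("gaps (s):", gaps)
```

      The "suppressing duplicate log" lines are skipped on purpose, since they only repeat the same message under an `ns_server:info` header.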
      

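      The info.log on 172.23.97.109 shows the index service agent terminating abnormally, repeated getIndexStatus timeouts, and a low index resident percentage warning -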

      [ns_server:error,2023-03-07T08:24:46.533-08:00,ns_1@172.23.97.109:service_agent-index<0.30705.77>:service_agent:terminate:259]Terminating abnormally
      [ns_server:error,2023-03-07T08:24:53.409-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                         "61fe7b1db8796333"}] failed: {error,
                                                                                       timeout}
      [ns_server:error,2023-03-07T08:24:53.410-08:00,ns_1@172.23.97.109:service_status_keeper-index<0.13786.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
      [ns_server:error,2023-03-07T08:25:08.413-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                         "61fe7b1db8796333"}] failed: {error,
                                                                                       timeout}
      [ns_server:error,2023-03-07T08:25:08.414-08:00,ns_1@172.23.97.109:service_status_keeper-index<0.13786.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
      [user:info,2023-03-07T08:25:21.336-08:00,ns_1@172.23.97.109:<0.5258.78>:menelaus_web_alerts_srv:global_alert:178]Warning: approaching low index resident percentage. Indexer RAM percentage on node "172.23.97.109" is 7%, which is under the threshold of 10%.
      [ns_server:info,2023-03-07T08:25:23.328-08:00,ns_1@172.23.97.109:ns_config_rep<0.13579.0>:ns_config_rep:pull_one_node:421]Pulling config from: 'ns_1@172.23.97.66'
      [ns_server:error,2023-03-07T08:25:23.417-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                         "61fe7b1db8796333"}] failed: {error,
                                                                                       timeout}
      [ns_server:error,2023-03-07T08:25:23.418-08:00,ns_1@172.23.97.109:service_status_keeper-index<0.13786.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
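
      The getIndexStatus timeouts above span several physical lines each, so a line-by-line grep misses the `timeout` part. A rough sketch for folding continuation lines back into one record and listing the timeout timestamps follows. The header pattern is an assumption based on the entries quoted in this ticket, and `HEADER_RE`, `split_records`, `index_status_timeouts`, and `SAMPLE` are illustrative names only.

```python
import re
from datetime import datetime

# "[component:level,timestamp,node:" prefix, as seen on the entries quoted in this ticket.
HEADER_RE = re.compile(
    r"^\s*\[(?P<component>\w+):(?P<level>\w+),(?P<ts>[^,]+),(?P<node>[^:]+):"
)


def split_records(text):
    """Join wrapped continuation lines onto the log entry they belong to."""
    records = []
    for line in text.splitlines():
        if HEADER_RE.match(line):
            records.append(line.strip())
        elif records and line.strip():
            records[-1] += " " + line.strip()
    return records


def index_status_timeouts(text):
    """Timestamps of getIndexStatus requests that failed with {error, timeout}."""
    times = []
    for rec in split_records(text):
        if "getIndexStatus" in rec and re.search(r"\{error,\s*timeout\}", rec):
            times.append(datetime.fromisoformat(HEADER_RE.match(rec)["ts"]))
    return times


# Entries copied from the excerpt above (one non-timeout entry kept as a negative case).
SAMPLE = """
[ns_server:error,2023-03-07T08:24:53.409-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                   "61fe7b1db8796333"}] failed: {error,
                                                                                 timeout}
[ns_server:error,2023-03-07T08:24:53.410-08:00,ns_1@172.23.97.109:service_status_keeper-index<0.13786.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
[ns_server:error,2023-03-07T08:25:08.413-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                   "61fe7b1db8796333"}] failed: {error,
                                                                                 timeout}
[ns_server:error,2023-03-07T08:25:23.417-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                   "61fe7b1db8796333"}] failed: {error,
                                                                                 timeout}
"""

times = index_status_timeouts(SAMPLE)
# Seconds between consecutive timeouts.
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(len(times), "timeouts; gaps (s):", gaps)
```

      On the entries quoted here, the three timeouts come out roughly 15 seconds apart. Running the same function over the full info.log from the cbcollect for 172.23.97.109 would show when the timeouts started relative to the first failover attempt.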
      

      cbcollect ->

               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.105.122.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.106.171.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.106.176.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.106.30.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.96.198.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.96.230.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.96.245.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.100.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.108.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.109.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.66.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.67.zip
      

      Attachments

        1. N97_109_indexer_cprof.svg
          170 kB
        2. N97_109_indexer_mprof.svg
          126 kB
        3. N97_109_CPU_utilisation_mortimer.png
          836 kB
        4. N97_109_memoryRss_vs_Quota.png
          588 kB
        5. 720_5304-mut_queue_size.png
          68 kB
        6. 720_5304-ts_queue_size.png
          79 kB
        7. 720_5298-mutation_queue_size.png
          70 kB
        8. 720_5298-ts_queue_size.png
          72 kB

        Activity

          People

            shivansh.rustagi Shivansh Rustagi
            pavan.pb Pavan PB
            Votes: 0
            Watchers: 12
