Couchbase Server / MB-51221

[System Test] One index node is unhealthy but cannot be auto failed over


Details

    Description

      Build : 7.1.0-2375
      Test : -test tests/integration/neo/test_neo_magma_milestone4.yml -scope tests/integration/neo/scope_neo_magma.yml
      Scale : 3
      Iteration : 1st

      This is the same test as in MB-51200. Around the time MB-51200 was encountered, the indexer node 172.23.105.39 appears to have become unhealthy.

      The following can be seen in the debug.log of 172.23.105.39:

      [error_logger:error,2022-02-24T18:06:18.167-08:00,ns_1@172.23.105.39:service_agent-index<0.23829.74>:ale_error_logger_handler:do_log:101]
      =========================ERROR REPORT=========================
      ** Generic server 'service_agent-index' terminating
      ** Last message in was {'DOWN',#Ref<0.2197605375.1262223364.88040>,process,
                              <32344.11108.364>,
                              {worker_died,
                               {'EXIT',<32344.11252.364>,
                                {rebalance_failed,
                                 {service_error,
                                  <<"Post \"http://127.0.0.1:9102/createIndexRebalance\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)">>}}}}}
      ** When Server state == {state,index,
      ...
      ...
      [error_logger:error,2022-02-24T18:06:18.170-08:00,ns_1@172.23.105.39:service_agent-index<0.23829.74>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: service_agent:init/1
          pid: <0.23829.74>
          registered_name: 'service_agent-index'
          exception exit: {timeout,
                              {gen_server,call,
                                  [<0.24218.74>,
                                   {call,"ServiceAPI.CancelTask",
                                       #Fun<json_rpc_connection.0.86436583>,
                                       #{timeout => 60000}},
                                   60000]}}
            in function  gen_server:call/3 (gen_server.erl, line 247)
            in call from service_api:perform_call/4 (src/service_api.erl, line 76)
            in call from service_agent:cancel_task/2 (src/service_agent.erl, line 469)
            in call from lists:foreach/2 (lists.erl, line 1342)
            in call from service_agent:cleanup_service/1 (src/service_agent.erl, line 497)
            in call from service_agent:'-handle_unset_rebalancer/1-fun-1-'/1 (src/service_agent.erl, line 393)
            in call from service_agent:handle_unset_rebalancer/1 (src/service_agent.erl, line 387)
            in call from service_agent:handle_info/2 (src/service_agent.erl, line 235)
          ancestors: [service_agent_children_sup,service_agent_sup,ns_server_sup,
                        ns_server_nodes_sup,<0.23048.0>,ns_server_cluster_sup,
                        root_sup,<0.144.0>]
          message_queue_len: 10
      ...
      ...
      [ns_server:debug,2022-02-24T18:06:18.171-08:00,ns_1@172.23.105.39:<0.23851.74>:ns_pubsub:do_subscribe_link_continue:150]Parent process of subscription {ns_config_events,<0.23829.74>} exited with reason {timeout,
                                                                                         {gen_server,
                                                                                          call,
                                                                                          [<0.24218.74>,
                                                                                           {call,
                                                                                            "ServiceAPI.CancelTask",
                                                                                            #Fun<json_rpc_connection.0.86436583>,
                                                                                            #{timeout =>
                                                                                               60000}},
                                                                                           60000]}}
      [ns_server:error,2022-02-24T18:06:18.171-08:00,ns_1@172.23.105.39:ns_heart_slow_status_updater<0.23319.0>:ns_heart:grab_one_service_status:409]Failed to grab service index status: {exit,
                                            {{timeout,
                                              {gen_server,call,
                                               [<0.24218.74>,
                                                {call,"ServiceAPI.CancelTask",
      ...
      ...
      [ns_server:debug,2022-02-24T18:06:18.443-08:00,ns_1@172.23.105.39:<0.23651.74>:service_api:health_check:53]Exception while calling HealthCheck: exit:{noproc,
                                                 {gen_server,call,
                                                  ['json_rpc_connection-index-service_api',
                                                   {call,"ServiceAPI.HealthCheck",
                                                    #Fun<json_rpc_connection.0.86436583>,
                                                    #{silent => true,
                                                      timeout => 60000}},
                                                   60000]}}
      [ns_server:debug,2022-02-24T18:06:18.984-08:00,ns_1@172.23.105.39:index_monitor<0.23852.74>:index_monitor:handle_call:103]Detected health check error "Connection to the service is lost"
      [ns_server:debug,2022-02-24T18:06:19.987-08:00,ns_1@172.23.105.39:index_monitor<0.23852.74>:index_monitor:handle_call:103]Detected health check error "Connection to the service is lost"
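      For triage, the repeated health-check errors above can be isolated with a simple grep. A minimal sketch — the two log lines are embedded inline for illustration; on a live node the input would instead be the node's debug.log (its path depends on the install):

```shell
#!/bin/sh
# Count index_monitor health-check errors in a debug.log excerpt.
# The sample input is embedded in a heredoc for illustration; on a real
# node, replace the heredoc with the actual debug.log file.
grep -c 'Detected health check error' <<'EOF'
[ns_server:debug,2022-02-24T18:06:18.984-08:00] Detected health check error "Connection to the service is lost"
[ns_server:debug,2022-02-24T18:06:19.987-08:00] Detected health check error "Connection to the service is lost"
EOF
# prints: 2
```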
      

      This triggers auto failover, but the failover does not proceed, for the following reason (as seen in the info.log of 172.23.108.139):

      [root@localhost logs]# zgrep -i "Could not automatically fail over node" info.log.4.gz | more
      [user:info,2022-02-24T18:06:24.934-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:42:08.425-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 3 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:42:11.438-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0 bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 3 4
      [user:info,2022-02-24T18:42:16.607-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 3 4 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:42:17.438-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0 bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 3
      [user:info,2022-02-24T18:42:23.445-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 2 3 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:42:24.441-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0 bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 2 3
      [user:info,2022-02-24T18:42:25.455-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0 bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 2 3 4
      [user:info,2022-02-24T18:42:40.465-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 3 4 2 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:42:53.529-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 2 3 4 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:43:07.495-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0 bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 3 4 2
      [user:info,2022-02-24T18:43:56.586-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 2 3 4 bucket4.scope_5.coll_2.idx6_rE0yLzBR 1 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:43:57.568-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 2 3 bucket4.scope_5.coll_2.idx6_rE0yLzBR 1 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:43:58.581-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0 bucket4.scope_5.coll_2.idx6_rE0yLzBR 1 bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 2 3
      [user:info,2022-02-24T18:43:59.571-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx6_rE0yLzBR 1 bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 2 3 4 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:44:00.571-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 3 4 2 bucket4.scope_5.coll_2.idx6_rE0yLzBR 1 bucket7._default._default.sg_allDocs_x1 0
      [user:info,2022-02-24T18:44:02.563-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 2 3 bucket7._default._default.sg_allDocs_x1 0 bucket4.scope_5.coll_2.idx6_rE0yLzBR 1
      [user:info,2022-02-24T18:44:03.571-08:00,ns_1@172.23.108.139:<0.26410.0>:auto_failover:log_unsafe_node:663]Could not automatically fail over node ('ns_1@172.23.105.39') due to operation being unsafe for service index. Failing over nodes 172.23.105.39:9102(be6ed18f0a850f9a
      d04d4eea8c477ed3) would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0 bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 2 3 bucket4.scope_5.coll_2.idx6_rE0yLzBR 1
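      Each log line above names the indexes/partitions that would become unavailable, in varying order. To summarize which indexes are blocking the failover, the distinct bucket.scope.collection.index names can be extracted from these lines. A rough sketch over an inline sample (on the orchestrator node the input would be the info.log* files containing the log_unsafe_node entries):

```shell
#!/bin/sh
# List the distinct index names that auto failover reports as at risk.
# Sample lines are embedded inline for illustration; the pattern matches
# four dot-separated identifier segments (bucket.scope.collection.index).
grep -o '[A-Za-z0-9_]*\.[A-Za-z0-9_]*\.[A-Za-z0-9_]*\.[A-Za-z0-9_]*' <<'EOF' | sort -u
would lose the following indexes/partitions: bucket7._default._default.sg_allDocs_x1 0
would lose the following indexes/partitions: bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 4 3 bucket7._default._default.sg_allDocs_x1 0
EOF
```

      This prints each blocking index once (here, bucket4.scope_5.coll_2.idx1_ZSOx2Dn2 and bucket7._default._default.sg_allDocs_x1), which makes it easier to see when new indexes join the unsafe set, as idx6_rE0yLzBR does around 18:43.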
      

      The attached logs were collected at around 7 PM PST on 2/25. Let me know if you need logs from a different timestamp.

      Attachments

        1. failedOverStatus.log
          5 kB
        2. keepStatus.log
          25 kB
        3. safeDetails.log
          829 kB

          People

            sai.teja Sai Krishna Teja
            mihir.kamdar Mihir Kamdar (Inactive)