  1. Couchbase Server
  2. MB-52622

[Cloud] - During (7.0.3 -> 7.0.4 upgrade)/(Scaling post upgrade) one of the nodes gets into an unhealthy state and does not recover from it.

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • 7.0.4
    • secondary-index
    • None
    • Cloud

    Description

      Script to Repro

      AWS_PROFILE=cbc-main go run . scenario --password "$CP_CLI_PASSWORD" --scenario scenarios/hdbaas/aws/ch16-upgrades-with-horizontal-vertical-scaling.yaml 
      

      Copying over from the mail.

      Mail

      Background:
      As part of the Control Plane upgrade to Server v7.0.4, the team has been testing on both AWS and GCP. They found one issue: upon upgrade, one of the nodes gets into an unhealthy state and does not recover from it.
      Investigation on the Control Plane side suggests an issue with the indexer service: the indexer on the scaled-up node gets into a bad state.
      Timing:
      GCP goes GA on 6/28. We are trying to upgrade both GCP and AWS to 7.0.4 before GA, so we need a resolution or a good understanding of the scope of the issue by 6/24. Time is of the essence.
      Ask:
      I’m going to send over the logs and have the QE folks create a CBSE issue; some of the research is captured here. But I wanted to send this over to you since it’s time critical. Who can partner with the team to research this issue?

       

      Copying over from https://couchbasecloud.atlassian.net/browse/AV-37514

      From Rangoli

      From @Fabian Feary

      The issue is definitely this nugget picked up in Supportal: Indexer crash loop with pointer to unallocated span

      7:51

      The indexer in the scaled-up node falls over. Judging by the linked ticket, I think upgrading the Couchbase Server version should fix it.

      The first upgraded node was repeatedly failing to reach its own indexer

      10:58

      From what I could tell, that was then preventing the server itself from reaching a healthy state. What I don’t know though is why it threw the entire cluster into an unhealthy state

      11:02

      Even after restarting the couchbase-server service in the node, these lines were still being spammed in the error log:

      [ns_server:error,2022-06-08T13:30:33.158Z,ns_1@oqsdk0vkxml4yzk5.4u5swg--yzjkhdea.nonprod-project-avengers.com:service_status_keeper-index<0.3635.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
      [ns_server:error,2022-06-08T13:30:48.159Z,ns_1@oqsdk0vkxml4yzk5.4u5swg--yzjkhdea.nonprod-project-avengers.com:service_status_keeper_worker<0.3634.0>:rest_utils:get_json:57]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                         "5b9a70672b9f659c"}] failed: {error,
                                                                                       timeout}
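
      The two entries above can be tallied per node when triaging a longer error log. The following is a minimal sketch in Python, not an official Couchbase tool: it assumes entries follow the `[component:level,timestamp,node:...]message` shape seen in the excerpt, and that wrapped lines (such as the `{error, timeout}` term) belong to the preceding entry. The names `split_entries` and `summarize` are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Assumed entry header, based only on the excerpt above:
# [component:level,timestamp,node:process:module:function:line]message
ENTRY_START = re.compile(r"^\[(\w+):(\w+),([^,]+),([^:\s]+):")
INCORRECT_STATUS = re.compile(r"Service (\w+) returned incorrect status")
REQUEST_FAILED = re.compile(
    r"Request to \((\w+)\) (\w+) .*failed: \{error,\s*(\w+)\}"
)

# The two entries quoted in this ticket, including the wrapped lines.
SAMPLE = """
[ns_server:error,2022-06-08T13:30:33.158Z,ns_1@oqsdk0vkxml4yzk5.4u5swg--yzjkhdea.nonprod-project-avengers.com:service_status_keeper-index<0.3635.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
[ns_server:error,2022-06-08T13:30:48.159Z,ns_1@oqsdk0vkxml4yzk5.4u5swg--yzjkhdea.nonprod-project-avengers.com:service_status_keeper_worker<0.3634.0>:rest_utils:get_json:57]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                   "5b9a70672b9f659c"}] failed: {error,
                                                                                 timeout}
"""


def split_entries(lines):
    """Group raw lines into entries; wrapped lines join the previous entry."""
    entries = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if ENTRY_START.match(stripped) or not entries:
            entries.append(stripped)
        else:
            entries[-1] += " " + stripped
    return entries


def summarize(lines):
    """Count (node, service, symptom) for the two error shapes shown above."""
    counts = Counter()
    for entry in split_entries(lines):
        header = ENTRY_START.match(entry)
        if header is None:
            continue
        node = header.group(4)
        status = INCORRECT_STATUS.search(entry)
        if status:
            counts[(node, status.group(1), "incorrect status")] += 1
            continue
        failed = REQUEST_FAILED.search(entry)
        if failed:
            symptom = f"{failed.group(2)} {failed.group(3)}"
            counts[(node, failed.group(1), symptom)] += 1
    return counts


if __name__ == "__main__":
    for (node, service, symptom), n in summarize(SAMPLE.splitlines()).items():
        print(f"{node}  {service}  {symptom}  x{n}")
```

      Run against a full error log (for example `summarize(open(path))`, with `path` pointing at whichever log file is being inspected), this would show whether the `getIndexStatus` timeouts are confined to the upgraded node or appear across the cluster.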
      

       

      From Fabian

      To follow on from the quote below (@Rangoli Mathur): it's not entirely clear that it is that nugget, but what is clear is that the indexer is unreachable. The indexer logs did not indicate any fault, but the main ns_server error logs indicated that it was completely unreachable. Other server logs across the cluster indicate that some nodes' server instances are themselves unreachable.

      Relevant server logs can be found:

      https://supportal.couchbase.com/customer/cbc-stage/cluster/c6e25cde76408dc3f603d4bafd219adc

      https://supportal.couchbase.com/customer/cbc-stage/cluster/0b08676c590156e61fc2620ce52ba831

      Note that these are logged in Supportal as two distinct clusters but are in fact one cluster.

      Nothing in control plane or dp-agent logs to indicate any failures on those sides.

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal