Details

Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: 7.0.4
Component/s: secondary-index
Labels: None
Environment: Cloud
Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump:
https://supportal.couchbase.com/customer/cbc-stage/cluster/c6e25cde76408dc3f603d4bafd219adc
https://supportal.couchbase.com/customer/cbc-stage/cluster/0b08676c590156e61fc2620ce52ba831
Story Points: 1
Is this a Regression?: Unknown

Description

Script to Repro

AWS_PROFILE=cbc-main go run . scenario --password "$CP_CLI_PASSWORD" --scenario scenarios/hdbaas/aws/ch16-upgrades-with-horizontal-vertical-scaling.yaml

Copying over from the mail.

Mail

Background:
As part of the Control Plane upgrade to server v7.0.4, the team has been testing on both AWS and GCP. They found one issue: upon upgrade, one of the nodes gets into an unhealthy state and does not recover from it.
Research on the Control Plane suggests there's an issue with the indexer service, where the indexer in the scaled-up node gets into a bad state.
Timing:
GCP goes GA on 6/28. We are trying to upgrade both GCP and AWS to 7.0.4 before GA, so we need a resolution or a good understanding of the scope of the issue by 6/24. Time is of the essence.
Ask:
I'm going to send over the logs and have the QE folks create a CBSE issue; some of the research is captured here. I wanted to send this over to you since it's time-critical. Who can partner with the team to research this issue?

Copying over from https://couchbasecloud.atlassian.net/browse/AV-37514

From Rangoli

From @Fabian Feary

The issue is definitely this nugget picked up in Supportal: Indexer crash loop with pointer to unallocated span

7:51

The indexer in the scaled-up node falls over. Judging by the linked ticket, I think upgrading the Couchbase Server version should fix it.

The first upgraded node was repeatedly failing to reach its own indexer

10:58

From what I could tell, that was then preventing the server itself from reaching a healthy state. What I don’t know though is why it threw the entire cluster into an unhealthy state

11:02

Even after restarting the couchbase-server service on the node, these lines were still being spammed in the error log:

[ns_server:error,2022-06-08T13:30:33.158Z,ns_1@oqsdk0vkxml4yzk5.4u5swg--yzjkhdea.nonprod-project-avengers.com:service_status_keeper-index<0.3635.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status

[ns_server:error,2022-06-08T13:30:48.159Z,ns_1@oqsdk0vkxml4yzk5.4u5swg--yzjkhdea.nonprod-project-avengers.com:service_status_keeper_worker<0.3634.0>:rest_utils:get_json:57]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                   "5b9a70672b9f659c"}] failed: {error,
                                                                                 timeout}
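
To gauge how often these two errors recur in an ns_server error log, the entries can be counted with a small script. The following is a minimal sketch, not part of the original investigation: it assumes each entry starts on its own line with a `[component:level,timestamp,origin]` header as in the lines quoted above, and that wrapped continuation lines belong to the preceding entry. The function names (`split_entries`, `classify`, `summarize`) and the category labels are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Entry header as it appears in the quoted lines:
# [component:level,timestamp,node:process:module:function:line]
ENTRY_HEADER = re.compile(
    r"^\[(?P<component>\w+):(?P<level>\w+),(?P<timestamp>[^,\]]+),(?P<origin>[^\]\n]+)\]",
    re.MULTILINE,
)

# The two entries quoted in this ticket, used as sample input.
SAMPLE = """[ns_server:error,2022-06-08T13:30:33.158Z,ns_1@oqsdk0vkxml4yzk5.4u5swg--yzjkhdea.nonprod-project-avengers.com:service_status_keeper-index<0.3635.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
[ns_server:error,2022-06-08T13:30:48.159Z,ns_1@oqsdk0vkxml4yzk5.4u5swg--yzjkhdea.nonprod-project-avengers.com:service_status_keeper_worker<0.3634.0>:rest_utils:get_json:57]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                   "5b9a70672b9f659c"}] failed: {error,
                                                                                 timeout}
"""


def split_entries(text):
    """Return (header fields, message) pairs; wrapped lines stay with their entry."""
    matches = list(ENTRY_HEADER.finditer(text))
    entries = []
    for i, match in enumerate(matches):
        # The message runs until the next header, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = " ".join(text[match.end():end].split())
        entries.append((match.groupdict(), message))
    return entries


def classify(message):
    """Map a message to one of the illustrative (hypothetical) categories."""
    if "returned incorrect status" in message:
        return "incorrect_status"
    if "getIndexStatus" in message and "timeout" in message:
        return "get_index_status_timeout"
    return "other"


def summarize(text):
    """Count error-level entries per category."""
    counts = Counter()
    for fields, message in split_entries(text):
        if fields["level"] == "error":
            counts[classify(message)] += 1
    return counts


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running `summarize` over the full error log from each node, rather than the two sample entries, would show whether the timeouts are confined to the upgraded node or spread across the cluster.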

From Fabian

To follow on from the quote below (@Rangoli Mathur): it's not entirely clear that it is that nugget, but what is clear is that the indexer is unreachable. The indexer logs did not indicate any fault, but the main ns_server error logs indicated that it was completely unreachable. Other server logs across the cluster indicate that some nodes' server instances are themselves unreachable.

Relevant server logs can be found:

https://supportal.couchbase.com/customer/cbc-stage/cluster/c6e25cde76408dc3f603d4bafd219adc

https://supportal.couchbase.com/customer/cbc-stage/cluster/0b08676c590156e61fc2620ce52ba831

Note that these are logged in supportal as 2 distinct clusters but are in fact 1 cluster.

Nothing in control plane or dp-agent logs to indicate any failures on those sides.

Attachments

Screenshot 2022-06-22 at 7.15.58 AM.png
214 kB
21/Jun/22 6:46 PM

People

Assignee: Balakumaran Gopal
Reporter: Balakumaran Gopal
Votes: 0
Watchers: 9

Dates

Created: 21/Jun/22 1:38 AM
Updated: 01/Jul/22 2:10 PM
Resolved: 26/Jun/22 11:21 PM

[Cloud] - During (7.0.3 -> 7.0.4 upgrade)/(Scaling post upgrade) one of the nodes gets into an unhealthy state and does not recover from it.
