[BP to 7.1.4] - Rebalance has been hung on a dataplane for >1 hour.

Description

It looks like many operations were triggered at the same time:

  1. Sample indexes were being built for a database, and the database was deleted before the build completed.

  2. Within the same timeframe, a new bucket was created and its width was changed to 2, which triggered a rebalance that is now hung.

  3. A new bucket creation request came in on the CP and was redirected to this dataplane.

Environment

7.5.0-3129

Link to Log File, atop/blg, CBCollectInfo, Core dump

http://supportal.couchbase.com/snapshot/0a8bdd32fb034ab1864e9196fa49bff5::1
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@0bmqykytylmvpk3m.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@7etvbw3xfq-36yx.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@elvi7ulwh-hcyo4c.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@f1jepjwmbbbothe.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@g-q7m5fyjygfu1p.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@giyyh2oo0zjeykdb.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@lzinjqkfkvbfgjg.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@rdy1gwlp2mscjicx.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@rn7xwjznbysp3dix.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@srip-efhpcipny2f.o3keyaim9hvjszu.nonprod-project-avengers.com.zip
s3://cb-customers-secure/rebl_hung/2022-10-28/collectinfo-2022-10-28t202325-ns_1@uj1vimm712jn2xgx.o3keyaim9hvjszu.nonprod-project-avengers.com.zip

Release Notes Description

None

Activity

Varun Velamuri February 10, 2023 at 12:17 PM

CB robot January 27, 2023 at 3:24 PM

Build couchbase-server-7.1.4-3570 contains indexing commit cc9df15 with commit message:
Notify flush observer before cleaning up keyspace

Varun Velamuri January 27, 2023 at 10:59 AM

Reproduced the issue using the steps mentioned in https://couchbasecloud.atlassian.net/browse/MB-54328?focusedCommentId=838321&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel: created 3 index instances on a bucket, dropped one index instance while flush was paused, and then deleted the bucket.

Before the fix: handleKeyspaceNotFound skipped cleaning up indexes

StorageSnapDone has cleaned up all index instances

This led to the lifecycle manager getting stuck: its incoming channels started to queue up requests

After the fix:

StorageSnapDone got called

but the index which got dropped was skipped

No incoming requests were seen queued in the lifecycle manager's channels
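
The commit message quoted earlier is "Notify flush observer before cleaning up keyspace". As a rough, language-neutral illustration of why that ordering can matter, the Python sketch below models a waiter blocked on a flush-done event for a keyspace that is being deleted. It is not the indexer's code: `FlushObserverRegistry`, `notify_flush_done`, `cleanup_keyspace`, `waiter_released`, and the keyspace name `bucket1` are all hypothetical names made up for this example, and the actual fix in commit cc9df15 may be structured quite differently.

```python
import threading


class FlushObserverRegistry:
    """Hypothetical stand-in: maps a keyspace name to a flush-done event."""

    def __init__(self):
        self._lock = threading.Lock()
        self._observers = {}

    def register(self, keyspace):
        # A waiter registers interest and later blocks on the returned event.
        event = threading.Event()
        with self._lock:
            self._observers[keyspace] = event
        return event

    def notify_flush_done(self, keyspace):
        # Wake the waiter, but only if an observer is still registered.
        with self._lock:
            event = self._observers.get(keyspace)
        if event is not None:
            event.set()

    def cleanup_keyspace(self, keyspace):
        # Drop all bookkeeping for the keyspace, including its observer.
        with self._lock:
            self._observers.pop(keyspace, None)


def waiter_released(notify_first, timeout=0.1):
    """Return True if a waiter on a deleted keyspace is woken before timeout."""
    registry = FlushObserverRegistry()
    event = registry.register("bucket1")  # assumed keyspace name
    if notify_first:
        # Ordering named in the commit message: notify, then clean up.
        registry.notify_flush_done("bucket1")
        registry.cleanup_keyspace("bucket1")
    else:
        # Assumed problematic ordering: the observer is gone before notify runs.
        registry.cleanup_keyspace("bucket1")
        registry.notify_flush_done("bucket1")
    # A real waiter might block indefinitely; the timeout only keeps this
    # sketch from hanging.
    return event.wait(timeout)
```

Under these assumptions, `waiter_released(True)` reports that the waiter was woken, while `waiter_released(False)` times out. A caller stuck like that would stop draining its own request queues, which is one plausible way to end up with the backlog described in the comment above.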

Fixed

Details

Assignee

Reporter

Is this a Regression?

Unknown

Triage

Untriaged

Story Points

Priority

Created October 31, 2022 at 9:48 PM
Updated October 10, 2024 at 7:32 PM
Resolved January 27, 2023 at 11:44 AM