A rebalance failed after a graceful failover and delta-node recovery

Description

Disclaimer: The hardware specifications may be insufficient given this particular cluster sizing.

A rebalance fails after a graceful failover and delta-node recovery, with the following error message:

172.23.100.11: ns_server.debug.log (lines 2257931 to 2257941)

Cluster configuration:

A 32-node (KV-only) cluster with a single Magma bucket using full ejection and replicas=1, holding ~160 million randomised documents of 256 bytes each.
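For reference, a bucket with this configuration could be created along the following lines (a hedged sketch against the documented bucket-create REST endpoint; the bucket name, credentials, and RAM quota are illustrative assumptions):

{code:python}
import requests

BASE = "http://172.23.100.11:8091"
AUTH = ("Administrator", "password")  # credentials are an assumption

# Create a Magma bucket with full ejection and one replica, matching the
# configuration above (ramQuota is in MB and purely illustrative).
requests.post(f"{BASE}/pools/default/buckets", auth=AUTH, data={
    "name": "default",                 # bucket name is an assumption
    "bucketType": "couchbase",
    "storageBackend": "magma",
    "evictionPolicy": "fullEviction",
    "replicaNumber": 1,
    "ramQuota": 1024,
}).raise_for_status()
{code}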

What does the test do before the rebalance failure?

The test performs roughly 13 steps (listed in the appendix) and finally, while data is being loaded, performs the following three steps (a hedged sketch of the equivalent REST calls follows the list):

1. A graceful failover of node 172.23.104.217, which succeeds.
2. Node 172.23.104.217 is selected for delta-node recovery.
3. A rebalance operation is performed.
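A sketch of those three steps against ns_server's documented REST endpoints (credentials and the lack of completion polling are illustrative simplifications):

{code:python}
import requests

BASE = "http://172.23.100.11:8091"    # orchestrator
AUTH = ("Administrator", "password")  # assumption
NODE = "ns_1@172.23.104.217"

# 1. Graceful failover (asynchronous; a real test would poll
#    /pools/default/rebalanceProgress until it completes).
requests.post(f"{BASE}/controller/startGracefulFailover",
              data={"otpNode": NODE}, auth=AUTH).raise_for_status()

# 2. Mark the failed-over node for delta recovery.
requests.post(f"{BASE}/controller/setRecoveryType",
              data={"otpNode": NODE, "recoveryType": "delta"},
              auth=AUTH).raise_for_status()

# 3. Rebalance: knownNodes must name every node; nothing is ejected.
nodes = requests.get(f"{BASE}/pools/default", auth=AUTH).json()["nodes"]
requests.post(f"{BASE}/controller/rebalance",
              data={"knownNodes": ",".join(n["otpNode"] for n in nodes),
                    "ejectedNodes": ""},
              auth=AUTH).raise_for_status()
{code}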

What happens?

The rebalance operation fails, leaving the vbuckets unevenly distributed across the nodes.
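The uneven distribution can be confirmed by tallying active vbuckets per node from the bucket's vBucketServerMap (a minimal sketch; the bucket name and credentials are assumptions):

{code:python}
from collections import Counter

import requests

BASE = "http://172.23.100.11:8091"
AUTH = ("Administrator", "password")  # assumption

vbmap = requests.get(f"{BASE}/pools/default/buckets/default",
                     auth=AUTH).json()["vBucketServerMap"]
servers = vbmap["serverList"]

# Each vBucketMap entry is a chain [active_node_idx, replica_node_idx, ...];
# index 0 is the active copy.
active = Counter(servers[chain[0]] for chain in vbmap["vBucketMap"])
for node, count in sorted(active.items()):
    print(f"{node}: {count} active vbuckets")
{code}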


The most recent rebalance report for the failed rebalance (orchestrator: 172.23.100.11):

rebalance_report_20211007T160509.json (7 Oct 09:05)

Logs:

Logs from the orchestrator (172.23.100.11):
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.11.zip

ns_server.debug.log (lines 2260596 to 2260636)

(xref: http://src.couchbase.org/source/xref/trunk/ns_server/src/ns_janitor.erl#294)

There should be no hanging vbuckets (the sum of move-start and move-end events is even for each vbucket):

Command output
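That check could be scripted along these lines (a sketch; it assumes the master_events.log from the cbcollect holds one JSON event per line, and the event type names vbucketMoveStart/vbucketMoveDone are from memory, so verify them against the actual log):

{code:python}
import json
from collections import Counter

starts, ends = Counter(), Counter()
with open("master_events.log") as f:
    for line in f:
        try:
            ev = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines
        if ev.get("type") == "vbucketMoveStart":
            starts[ev["vbucket"]] += 1
        elif ev.get("type") == "vbucketMoveDone":
            ends[ev["vbucket"]] += 1

# A vbucket "hangs" if a move started but never finished.
hanging = [vb for vb in starts if starts[vb] != ends[vb]]
print("hanging vbuckets:", hanging or "none")
{code}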

Logs from the node that was gracefully failed over (172.23.104.217):
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.217.zip

ns_server.debug.log (lines 365897 to 366008)

This may indicate that something timed out on the KV side.

A brief look at memcached.log shows some slow-runtime warnings; however, these do not appear to be close to the start time of the rebalance (09:02:54).
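A quick filter for those warnings (a sketch; it assumes the warning lines contain the substring "Slow runtime" and that memcached.log has been extracted from the cbcollect):

{code:python}
# Print every slow-runtime warning so its timestamp can be compared
# against the rebalance start at 09:02:54.
with open("memcached.log") as f:
    for line in f:
        if "Slow runtime" in line:
            print(line.rstrip())
{code}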

Appendix:
Remaining logs:
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.13.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.14.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.15.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.16.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.17.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.20.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.22.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.28.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.109.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.134.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.179.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.211.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.221.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.222.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.226.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.231.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.234.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.235.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.236.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.237.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.238.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.241.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.243.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.246.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.248.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.249.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.250.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.251.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.73.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.86.zip

The steps the test performed before the rebalance failure (a loader sketch for steps 3-6 follows the list):

  1. Create a 32-node cluster.

  2. Create the required buckets and collections.

  3. Create 4,000,000 items sequentially.

  4. Update 4,000,000 random keys to create 50 percent fragmentation.

  5. Create 4,000,000 items sequentially.

  6. Update 4,000,000 random keys to create 50 percent fragmentation.

  7. Multiple sequential auto-failovers of 5 nodes, each followed by a rebalance-in.

  8. Rebalance in a single node with document loading.

  9. Rebalance out a single node with document loading.

  10. Rebalance in and out with document loading.

  11. Swap rebalance with document loading.

  12. Fail over a node and rebalance that node out with loading in parallel.

  13. Fail over a node and perform full recovery of that node.

  14. Fail over a node and perform delta recovery of that node with loading in parallel, followed by a rebalance (note: the test failed here).
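For context, steps 3-6 amount to a loader along these lines (a sketch using the Couchbase Python SDK; the connection string, bucket name, and document shape are assumptions):

{code:python}
import random

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://172.23.100.11",  # connection string is an assumption
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)
coll = cluster.bucket("default").default_collection()

N = 4_000_000
doc = {"body": "x" * 256}  # ~256-byte documents, as in the test

# Steps 3/5: create items sequentially.
for i in range(N):
    coll.upsert(f"key-{i:09d}", doc)

# Steps 4/6: update random keys to drive ~50% fragmentation.
for _ in range(N):
    coll.upsert(f"key-{random.randrange(N):09d}", doc)
{code}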

Some (Promtimer) graphs from node 172.23.104.217 (the vertical red bar at 09:02:54 marks the start of the failed rebalance):

The graphs show that the node in question may not be as resource-constrained as I initially thought.

Fix versions

None

Environment

Enterprise Edition 7.1.0 build 1430 | 32 KV nodes | CentOS 7 | 4 GB RAM | 4 CPUs | virtualised | ~4.8 GB total disk space per node

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

4

Activity

Dave Rigby November 5, 2021 at 11:59 AM

I believe the DCP_OPEN failing with ETMPFAIL is the same issue as seen on - see this comment from specifically.

Note that as part of that MB, Ben merged the following patch, which should avoid ns_server seeing this issue: http://review.couchbase.org/c/kv_engine/+/163886

As such, I believe this issue is a duplicate of . Please could you re-run on build 7.1.0-1529 or newer to confirm?

Bryan Mccoid October 8, 2021 at 9:26 PM
Edited

This looks like an error we pass directly from memcached. So what looks like it's happening is:

1. Start rebalance.

2. dcp_proxy:open_connection:

mc_client_binary:cmd_vocal (DCP_OPEN) is called (dcp_commands.erl:75)

the command is sent (mc_binary.erl:251)

the response is received (mc_binary.erl:255)

errors get transformed (dcp_commands:process_response/2, line 31) -> mc_client_binary:map_status

And we return that error. This seems like the root cause of the rebalance failure from ns_server's perspective. I think maybe the KV team can illuminate more.
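For illustration, a minimal Python sketch of what that mapping boils down to (not the actual Erlang source): decode the status field of a memcached binary response header and translate status 0x86 (ETMPFAIL) into a symbolic error, analogous to mc_client_binary:map_status.

{code:python}
import struct

# Memcached binary-protocol status codes (values from the protocol spec;
# only a few are shown here).
STATUS_TO_ERROR = {
    0x00: "success",
    0x07: "not_my_vbucket",
    0x86: "etmpfail",
}

def response_status(header: bytes) -> str:
    """Decode the 2-byte big-endian status field (offset 6) of a
    24-byte memcached binary response header."""
    if len(header) < 24 or header[0] != 0x81:  # 0x81 = response magic
        raise ValueError("not a memcached binary response header")
    (status,) = struct.unpack_from(">H", header, 6)
    return STATUS_TO_ERROR.get(status, f"unknown_status_0x{status:02x}")

# Example: a DCP_OPEN (opcode 0x50) response carrying status 0x86, which
# ns_server would surface as an etmpfail error, failing the rebalance.
hdr = struct.pack(">BBHBBHIIQ", 0x81, 0x50, 0, 0, 0, 0x86, 0, 0, 0)
print(response_status(hdr))  # -> etmpfail
{code}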

Resolution: Duplicate
Details

Is this a Regression?

Unknown

Triage

Triaged

Created October 8, 2021 at 8:18 AM
Updated March 16, 2022 at 5:01 AM
Resolved November 5, 2021 at 11:59 AM