A rebalance failed after a graceful failover and delta-node recovery

Description

Disclaimer: The hardware specifications may be insufficient given this particular cluster sizing.

A rebalance fails after a graceful failover and delta-node recovery, with the following error message:

172.23.100.11: ns_server.debug.log (lines 2257931 to 2257941)

Cluster configuration:

A 32-node (KV-only) cluster with a single Magma bucket using full ejection and replicas=1, holding ~160 million randomised documents of 256 bytes each.
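For reference, a bucket with this configuration could be created along the following lines (a hedged sketch against the documented bucket-create REST endpoint; the bucket name, credentials, and RAM quota are illustrative assumptions):

{code:python}
import requests

BASE = "http://172.23.100.11:8091"
AUTH = ("Administrator", "password")  # credentials are an assumption

# Create a Magma bucket with full ejection and one replica, matching the
# configuration above (ramQuota is in MB and purely illustrative).
requests.post(f"{BASE}/pools/default/buckets", auth=AUTH, data={
    "name": "default",                 # bucket name is an assumption
    "bucketType": "couchbase",
    "storageBackend": "magma",
    "evictionPolicy": "fullEviction",
    "replicaNumber": 1,
    "ramQuota": 1024,
}).raise_for_status()
{code}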

What does the test do before the rebalance failure?

The test performs roughly 13 steps (listed in the appendix) and finally, while data is being loaded, performs the following three steps (a hedged sketch of the equivalent REST calls follows the list):

1. A graceful failover of node 172.23.104.217, which succeeds.
2. Node 172.23.104.217 is selected for delta-node recovery.
3. A rebalance operation is performed.
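A sketch of those three steps against ns_server's documented REST endpoints (credentials and the lack of completion polling are illustrative simplifications):

{code:python}
import requests

BASE = "http://172.23.100.11:8091"    # orchestrator
AUTH = ("Administrator", "password")  # assumption
NODE = "ns_1@172.23.104.217"

# 1. Graceful failover (asynchronous; a real test would poll
#    /pools/default/rebalanceProgress until it completes).
requests.post(f"{BASE}/controller/startGracefulFailover",
              data={"otpNode": NODE}, auth=AUTH).raise_for_status()

# 2. Mark the failed-over node for delta recovery.
requests.post(f"{BASE}/controller/setRecoveryType",
              data={"otpNode": NODE, "recoveryType": "delta"},
              auth=AUTH).raise_for_status()

# 3. Rebalance: knownNodes must name every node; nothing is ejected.
nodes = requests.get(f"{BASE}/pools/default", auth=AUTH).json()["nodes"]
requests.post(f"{BASE}/controller/rebalance",
              data={"knownNodes": ",".join(n["otpNode"] for n in nodes),
                    "ejectedNodes": ""},
              auth=AUTH).raise_for_status()
{code}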

What happens?

The rebalance operation fails, leaving the vbuckets unevenly distributed across the nodes.
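The uneven distribution can be confirmed by tallying active vbuckets per node from the bucket's vBucketServerMap (a minimal sketch; the bucket name and credentials are assumptions):

{code:python}
from collections import Counter

import requests

BASE = "http://172.23.100.11:8091"
AUTH = ("Administrator", "password")  # assumption

vbmap = requests.get(f"{BASE}/pools/default/buckets/default",
                     auth=AUTH).json()["vBucketServerMap"]
servers = vbmap["serverList"]

# Each vBucketMap entry is a chain [active_node_idx, replica_node_idx, ...];
# index 0 is the active copy.
active = Counter(servers[chain[0]] for chain in vbmap["vBucketMap"])
for node, count in sorted(active.items()):
    print(f"{node}: {count} active vbuckets")
{code}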


The most recent rebalance report for the failed rebalance (orchestrator: 172.23.100.11):

rebalance_report_20211007T160509.json (7 Oct 09:05)

Logs:

Logs from the orchestrator (172.23.100.11):
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.11.zip

ns_server.debug.log (lines 2260596 to 2260636)

(xref: http://src.couchbase.org/source/xref/trunk/ns_server/src/ns_janitor.erl#294)

There should be no hanging vbuckets (the sum of move-start and move-end events is even for each vbucket):

Command output
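That check could be scripted along these lines (a sketch; it assumes the master_events.log from the cbcollect holds one JSON event per line, and the event type names vbucketMoveStart/vbucketMoveDone are from memory, so verify them against the actual log):

{code:python}
import json
from collections import Counter

starts, ends = Counter(), Counter()
with open("master_events.log") as f:
    for line in f:
        try:
            ev = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines
        if ev.get("type") == "vbucketMoveStart":
            starts[ev["vbucket"]] += 1
        elif ev.get("type") == "vbucketMoveDone":
            ends[ev["vbucket"]] += 1

# A vbucket "hangs" if a move started but never finished.
hanging = [vb for vb in starts if starts[vb] != ends[vb]]
print("hanging vbuckets:", hanging or "none")
{code}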

Logs from the node that was gracefully failed over (172.23.104.217):
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.217.zip

ns_server.debug.log (lines 365897 to 366008)

This may indicate that something timed out on the KV side.

A brief look at memcached.log shows some slow-runtime warnings; however, these do not appear to be close to the start time of the rebalance (09:02:54).
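A quick filter for those warnings (a sketch; it assumes the warning lines contain the substring "Slow runtime" and that memcached.log has been extracted from the cbcollect):

{code:python}
# Print every slow-runtime warning so its timestamp can be compared
# against the rebalance start at 09:02:54.
with open("memcached.log") as f:
    for line in f:
        if "Slow runtime" in line:
            print(line.rstrip())
{code}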

Appendix:
Remaining logs:
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.13.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.14.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.15.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.16.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.17.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.20.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.22.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.28.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.109.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.134.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.179.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.211.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.221.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.222.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.226.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.231.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.234.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.235.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.236.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.237.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.238.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.241.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.243.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.246.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.248.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.249.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.250.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.251.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.73.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.86.zip

The steps the test performed before the rebalance failure (a loader sketch for steps 3-6 follows the list):

  1. Create a 32-node cluster.

  2. Create the required buckets and collections.

  3. Create 4,000,000 items sequentially.

  4. Update 4,000,000 random keys to create 50 percent fragmentation.

  5. Create 4,000,000 items sequentially.

  6. Update 4,000,000 random keys to create 50 percent fragmentation.

  7. Multiple sequential auto-failovers of 5 nodes, each followed by a rebalance-in.

  8. Rebalance in a single node with document loading.

  9. Rebalance out a single node with document loading.

  10. Rebalance in and out with document loading.

  11. Swap rebalance with document loading.

  12. Fail over a node and rebalance that node out with loading in parallel.

  13. Fail over a node and perform full recovery of that node.

  14. Fail over a node and perform delta recovery of that node with loading in parallel, followed by a rebalance (note: the test failed here).
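For context, steps 3-6 amount to a loader along these lines (a sketch using the Couchbase Python SDK; the connection string, bucket name, and document shape are assumptions):

{code:python}
import random

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://172.23.100.11",  # connection string is an assumption
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)
coll = cluster.bucket("default").default_collection()

N = 4_000_000
doc = {"body": "x" * 256}  # ~256-byte documents, as in the test

# Steps 3/5: create items sequentially.
for i in range(N):
    coll.upsert(f"key-{i:09d}", doc)

# Steps 4/6: update random keys to drive ~50% fragmentation.
for _ in range(N):
    coll.upsert(f"key-{random.randrange(N):09d}", doc)
{code}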

Some (Promtimer) graphs from node 172.23.104.217 (the vertical red bar at 09:02:54 marks the start of the failed rebalance):

The graphs show that the node in question may not be as resource-constrained as I initially thought.

Fix versions

None

Environment

Enterprise Edition 7.1.0 build 1430 | 32 KV nodes | CentOS 7 | 4 GB RAM | 4 CPUs | virtualised | ~4.8 GB total disk space per node

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

4

Activity

Dave Rigby November 5, 2021 at 11:59 AM

I believe the DCP_OPEN failing with ETMPFAIL is the same issue as seen on - see this comment from specifically.

Note that as part of that MB, Ben merged the following patch, which should avoid ns_server seeing this issue: http://review.couchbase.org/c/kv_engine/+/163886

As such, I believe this issue is a duplicate of . Please could you re-run on build 7.1.0-1529 or newer to confirm?

Bryan Mccoid October 8, 2021 at 9:26 PM
Edited

This looks like an error we pass directly from memcached. So what looks like it's happening is:

1. Start rebalance.

2. dcp_proxy:open_connection:

mc_client_binary:cmd_vocal (DCP_OPEN) is called (dcp_commands.erl:75)

the command is sent (mc_binary.erl:251)

the response is received (mc_binary.erl:255)

errors get transformed (dcp_commands:process_response/2, line 31) -> mc_client_binary:map_status

And we return that error. This seems like the root cause of the rebalance failure from ns_server's perspective. I think maybe the KV team can illuminate more.
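For illustration, a minimal Python sketch of what that mapping boils down to (not the actual Erlang source): decode the status field of a memcached binary response header and translate status 0x86 (ETMPFAIL) into a symbolic error, analogous to mc_client_binary:map_status.

{code:python}
import struct

# Memcached binary-protocol status codes (values from the protocol spec;
# only a few are shown here).
STATUS_TO_ERROR = {
    0x00: "success",
    0x07: "not_my_vbucket",
    0x86: "etmpfail",
}

def response_status(header: bytes) -> str:
    """Decode the 2-byte big-endian status field (offset 6) of a
    24-byte memcached binary response header."""
    if len(header) < 24 or header[0] != 0x81:  # 0x81 = response magic
        raise ValueError("not a memcached binary response header")
    (status,) = struct.unpack_from(">H", header, 6)
    return STATUS_TO_ERROR.get(status, f"unknown_status_0x{status:02x}")

# Example: a DCP_OPEN (opcode 0x50) response carrying status 0x86, which
# ns_server would surface as an etmpfail error, failing the rebalance.
hdr = struct.pack(">BBHBBHIIQ", 0x81, 0x50, 0, 0, 0, 0x86, 0, 0, 0)
print(response_status(hdr))  # -> etmpfail
{code}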

Resolution: Duplicate
Details

Is this a Regression?

Unknown

Triage

Triaged

Created October 8, 2021 at 8:18 AM
Updated March 16, 2022 at 5:01 AM
Resolved November 5, 2021 at 11:59 AM