Details
Assignee: Ritesh Agarwal
Reporter: Asad Zaidi
Is this a Regression?: Unknown
Triage: Triaged
Story Points: 1
Priority: Major
Created October 8, 2021 at 8:18 AM
Updated March 16, 2022 at 5:01 AM
Resolved November 5, 2021 at 11:59 AM
Disclaimer: The hardware specifications may be insufficient given this particular cluster sizing.
Description:
A rebalance fails after a graceful failover and delta-node recovery with the following error message:
172.23.100.11: ns_server.debug.log (lines 2257931 to 2257941)
Cluster configuration:
A 32-node (kv-only) cluster with a single magma bucket using full ejection and replicas=1, holding ~160 million (randomised) documents of 256 bytes each.
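For reference, a bucket with this configuration could be created through the cluster REST API roughly as follows. This is a minimal sketch; the endpoint, bucket name, RAM quota and credentials are illustrative, not values recorded in this ticket.

import requests

# Illustrative values; the actual bucket name, quota and credentials used by
# the test are not recorded here.
CLUSTER = "http://172.23.100.11:8091"
AUTH = ("Administrator", "password")

# Create a magma bucket with full ejection and one replica, matching the
# configuration described above.
resp = requests.post(
    f"{CLUSTER}/pools/default/buckets",
    auth=AUTH,
    data={
        "name": "default",
        "bucketType": "couchbase",
        "storageBackend": "magma",
        "evictionPolicy": "fullEviction",
        "replicaNumber": 1,
        "ramQuotaMB": 1024,
    },
)
resp.raise_for_status()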
What does the test do before the rebalance failure?
The test performs roughly 13 steps (listed in the appendix) and finally performs the following 3 steps while data is being loaded (a REST sketch of these steps follows the list):
1. A graceful failover of node 172.23.104.217, which succeeds.
2. Node 172.23.104.217 is selected for delta-node recovery.
3. A rebalance operation is performed.
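Sketched against the ns_server REST API, these three steps look roughly like the following. This is a hedged sketch; the orchestrator endpoint and credentials are illustrative, and the test itself drives these steps through its own framework rather than this script.

import requests

CLUSTER = "http://172.23.100.11:8091"    # orchestrator; illustrative endpoint
AUTH = ("Administrator", "password")     # illustrative credentials
NODE = "ns_1@172.23.104.217"

# 1. Graceful failover of the node (poll /pools/default/rebalanceProgress
#    afterwards until it completes).
requests.post(f"{CLUSTER}/controller/startGracefulFailover",
              auth=AUTH, data={"otpNode": NODE}).raise_for_status()

# 2. Mark the failed-over node for delta recovery.
requests.post(f"{CLUSTER}/controller/setRecoveryType",
              auth=AUTH,
              data={"otpNode": NODE, "recoveryType": "delta"}).raise_for_status()

# 3. Rebalance, keeping every currently known node (the list is built from
#    /pools/default rather than written out for all 32 nodes).
nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
known_nodes = ",".join(node["otpNode"] for node in nodes)
requests.post(f"{CLUSTER}/controller/rebalance",
              auth=AUTH,
              data={"knownNodes": known_nodes, "ejectedNodes": ""}).raise_for_status()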
What happens?
The rebalance operation fails, leaving the vbuckets unevenly distributed across the nodes:
The (most recent) rebalance report of the rebalance failure (rebalance_report_20211007T160509.json, 7 Oct 09:05; orchestrator: 172.23.100.11):
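One way to confirm the uneven distribution is to count active vbuckets per server from the bucket's vBucketServerMap. A minimal sketch, assuming an illustrative bucket name and credentials:

import collections
import requests

CLUSTER = "http://172.23.100.11:8091"
AUTH = ("Administrator", "password")
BUCKET = "default"  # illustrative; the actual bucket name is not recorded here

info = requests.get(f"{CLUSTER}/pools/default/buckets/{BUCKET}", auth=AUTH).json()
vb_map = info["vBucketServerMap"]
servers = vb_map["serverList"]

# Entry 0 of each replica chain is the active copy; count actives per server.
active_counts = collections.Counter(
    servers[chain[0]] for chain in vb_map["vBucketMap"] if chain[0] >= 0
)
for server, count in sorted(active_counts.items()):
    print(f"{server}: {count} active vbuckets")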
Logs:
Logs from the orchestrator (172.23.100.11):
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.11.zip
ns_server.debug.log (lines 2260596 to 2260636)
(xref: http://src.couchbase.org/source/xref/trunk/ns_server/src/ns_janitor.erl#294)
There should be no hanging vbuckets (the sum of move-start and move-end events is even for each vbucket):
Command output
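The check can be reproduced roughly as follows, assuming the cbcollect bundle's master_events.log contains one JSON vbucketMoveStart/vbucketMoveDone event per line; the exact command used to produce the output above is not recorded in this ticket.

import collections
import json

# Assumption: master_events.log holds one JSON object per line with "type"
# (vbucketMoveStart / vbucketMoveDone) and "vbucket" fields.
starts = collections.Counter()
ends = collections.Counter()

with open("master_events.log") as log:
    for line in log:
        try:
            event = json.loads(line)
        except ValueError:
            continue
        if event.get("type") == "vbucketMoveStart":
            starts[event["vbucket"]] += 1
        elif event.get("type") == "vbucketMoveDone":
            ends[event["vbucket"]] += 1

# A vbucket is "hanging" if it has more move starts than move ends,
# i.e. the total number of start+end events for it is odd.
hanging = [vb for vb in starts if starts[vb] != ends[vb]]
print("hanging vbuckets:", hanging or "none")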
Logs from the node that was gracefully failed over (172.23.104.217):
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.217.zip
ns_server.debug.log (lines 365897 to 366008)
This may indicate that something timed out on the KV side.
A brief look at memcached.log shows some slow-runtime warnings; however, these do not appear to be close to the start time of the rebalance (09:02:54).
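One way to narrow this down is to filter memcached.log for slow-runtime warnings within a window around the rebalance start. A rough sketch, assuming ISO-8601 timestamps at the start of each log line and that the warnings contain the phrase "Slow runtime":

from datetime import datetime, timedelta

# Assumptions: memcached.log lines start with an ISO-8601 timestamp and
# slow-runtime warnings contain the phrase "Slow runtime".
REBALANCE_START = datetime(2021, 10, 7, 9, 2, 54)
WINDOW = timedelta(minutes=5)

with open("memcached.log") as log:
    for line in log:
        if "Slow runtime" not in line:
            continue
        try:
            ts = datetime.fromisoformat(line.split()[0][:19])
        except ValueError:
            continue
        if abs(ts - REBALANCE_START) <= WINDOW:
            print(line.rstrip())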
Appendix:
Remaining logs:
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.13.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.14.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.15.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.16.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.17.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.20.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.22.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.28.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.109.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.134.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.179.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.211.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.221.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.222.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.226.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.231.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.234.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.235.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.236.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.237.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.238.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.241.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.243.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.246.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.248.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.249.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.250.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.251.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.73.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.86.zip
The steps the test performed before the rebalance failure:
1. Create a 32-node cluster.
2. Create the required buckets and collections.
3. Create 4000000 items sequentially.
4. Update 4000000 random keys to create 50 percent fragmentation.
5. Create 4000000 items sequentially.
6. Update 4000000 random keys to create 50 percent fragmentation.
7. Multiple sequential auto-failovers of 5 nodes, followed by rebalance in.
8. Rebalance in (a single node) with document loading.
9. Rebalance out (a single node) with document loading.
10. Rebalance in and out with document loading.
11. Swap rebalance with document loading.
12. Failover a node and RebalanceOut that node with loading in parallel.
13. Failover a node and FullRecovery that node.
14. Failover a node and DeltaRecovery that node with loading in parallel, followed by rebalance (note: the test failed here).
Some (Promtimer) graphs from node 172.23.104.217 (the vertical red bar at 09:02:54 marks the start of the failed rebalance):
The graphs show that the node in question may not be as resource-constrained as I initially thought.