Details
Bug
Resolution: Fixed
Critical
Cheshire-Cat
Untriaged
1
Unknown
Description
Build : 7.0.0-5117
Test : -test tests/2i/cheshirecat/test_idx_clusterops_cheshire_cat_recovery.yml -scope tests/2i/cheshirecat/scope_idx_cheshire_cat_dgm.yml (new test with more recovery)
Scale : 2
Iteration : 1
This test is the same as the GSI component test, but with more recovery steps. The test has 2 parts -
1. Steady state - lasts almost 2 hrs. In this state, mutations on collections are ongoing, queries are running, alter index operations are in progress, and on 2 buckets indexes are created and dropped, along with scopes and collections.
2. Cluster ops - After the steady state, all workloads are stopped except for ongoing mutations and running scans. Rebalance operations are initiated in this phase.
The changes made with respect to recovery are -
1. There is a new step that randomly kills the indexer process on any indexer node in the steady state.
2. For rebalance, we have enabled retry on failed rebalance. After initiating a rebalance, the indexer process is killed on an index node after a few mins. This causes the rebalance to fail, and it is retried automatically after the set duration.
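For reference, retry on failed rebalance can be enabled through the cluster manager's `/settings/retryRebalance` REST endpoint. The sketch below only builds the request and does not send it. The host name, retry delay and attempt count are hypothetical placeholders, not the values used in this test run, and the test framework may configure this setting differently.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_retry_rebalance_request(host, after_seconds=300, max_attempts=1):
    """Build (but do not send) a POST that enables retry on failed rebalance.

    host, after_seconds and max_attempts are illustrative placeholders.
    """
    body = urlencode(
        {
            "enabled": "true",
            "afterTimePeriod": after_seconds,  # seconds to wait before retrying
            "maxAttempts": max_attempts,       # number of automatic retries
        }
    ).encode()
    url = f"http://{host}:8091/settings/retryRebalance"
    return Request(url, data=body, method="POST")


# Hypothetical host name; admin credentials would also be required to send it.
req = build_retry_rebalance_request("cluster-node.example")
print(req.get_method(), req.full_url)
print(req.data.decode())
```

Sending the request would additionally need administrator credentials (for example via a basic auth header), which are left out here.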
While the test was running the retried rebalance, we hit https://issues.couchbase.com/browse/MB-45903?focusedCommentId=499483&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-499483. One index was stuck in "Moving" state for 14 hrs.
During this time, disk usage on 172.23.97.77 reached 100%. Later, disk usage on 172.23.121.165 also reached 100%.
The following data from indexer_stats.log shows how the disk size grew from 2021-05-07T01:00 to 2021-05-07T09:40, by which time the disk was almost full.
Timestamp | total_disk_size |
---|---|
2021-05-07T01:00:01 | 23384368503 |
2021-05-07T02:00:04 | 30680803527 |
2021-05-07T03:00:16 | 43064247227 |
2021-05-07T04:00:30 | 45901589749 |
2021-05-07T05:00:47 | 51404093644 |
2021-05-07T06:00:06 | 55880208049 |
2021-05-07T07:00:26 | 61842746552 |
2021-05-07T08:00:56 | 66335009451 |
2021-05-07T09:00:15 | 75660402162 |
2021-05-07T09:40:19 | 95151809213 |
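To make the growth pattern easier to read, the samples above can be turned into per-interval deltas and rates. This is a minimal sketch that assumes `total_disk_size` is reported in bytes (the log excerpt does not state the unit); the function and variable names are illustrative only.

```python
from datetime import datetime

# (timestamp, total_disk_size) samples copied from indexer_stats.log above.
SAMPLES = [
    ("2021-05-07T01:00:01", 23384368503),
    ("2021-05-07T02:00:04", 30680803527),
    ("2021-05-07T03:00:16", 43064247227),
    ("2021-05-07T04:00:30", 45901589749),
    ("2021-05-07T05:00:47", 51404093644),
    ("2021-05-07T06:00:06", 55880208049),
    ("2021-05-07T07:00:26", 61842746552),
    ("2021-05-07T08:00:56", 66335009451),
    ("2021-05-07T09:00:15", 75660402162),
    ("2021-05-07T09:40:19", 95151809213),
]


def growth_intervals(samples):
    """Return (start, end, delta, delta_per_second) for each consecutive pair."""
    rows = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        seconds = (datetime.fromisoformat(t1) - datetime.fromisoformat(t0)).total_seconds()
        delta = s1 - s0
        rows.append((t0, t1, delta, delta / seconds))
    return rows


intervals = growth_intervals(SAMPLES)
total_growth = SAMPLES[-1][1] - SAMPLES[0][1]
fastest = max(intervals, key=lambda row: row[3])

for start, end, delta, rate in intervals:
    print(f"{start} -> {end}: +{delta} ({rate:.0f} per second)")
print(f"total growth: {total_growth}")
print(f"fastest interval: {fastest[0]} -> {fastest[1]}")
```

With these samples, the steepest growth falls in the final interval (09:00:15 to 09:40:19), just before the disk filled up.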
On the other index nodes (172.23.97.83, 172.23.96.31, 172.23.96.30), disk usage is under 10%, and on 172.23.97.82 it is 38%.
For Gerrit Dashboard: MB-46190
# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
153312,1 | Disable indexer.plasma.AutoTuneLSSCleaner as a workaround for MB-46190 | master | sequoia | Status: NEW | 0 | 0 |
153184,2 | MB-46190: Allow 50% free disk space for operational use | unstable | plasma | Status: ABANDONED | 0 | -1 |
153516,2 | MB-46190: Disable Frag Auto Tuner | unstable | indexing | Status: MERGED | +2 | +1 |
153569,9 | MB-46190: Track execution time for log cleaner | unstable | plasma | Status: MERGED | +2 | +1 |
153570,2 | MB-46190: Check LSSPressure before running mvcc purger | unstable | plasma | Status: ABANDONED | 0 | -1 |
153582,6 | MB-46190: Check new rp sn before proceeding in mvcc purger | unstable | plasma | Status: MERGED | +2 | +1 |
153583,6 | MB-46190: Add aggregated stats for purges | unstable | plasma | Status: MERGED | +2 | +1 |
153589,5 | MB-46190: Check LSS pressure wihle running MVCC purger | unstable | plasma | Status: MERGED | +2 | +1 |