Couchbase Server / MB-46190

[System Test] [New Test]: Disk full on 1 index node and almost full on 1 more index node


Details

    Description

      Build : 7.0.0-5117
      Test : -test tests/2i/cheshirecat/test_idx_clusterops_cheshire_cat_recovery.yml -scope tests/2i/cheshirecat/scope_idx_cheshire_cat_dgm.yml (new test with more recovery)
      Scale : 2
      Iteration : 1

      This test is the same as the GSI component test, but with more recovery steps. The test has 2 parts:
      1. Steady state - this lasts almost 2 hrs. In this state, mutations on collections are ongoing, queries are running, indexes are being altered, and on 2 buckets indexes are created and dropped, along with scopes and collections.
      2. Cluster ops - After the steady state, all workloads are stopped except for the ongoing mutations and running scans. Rebalance operations are initiated in this phase.

      The changes made for recovery are:
      1. There is a new step that randomly kills the indexer process on any index node during the steady state.
      2. For rebalance, we have activated retry on failed rebalance. After a rebalance is initiated, the indexer process is killed on an index node after a few mins. This causes the rebalance to fail, and it is retried automatically after the set duration.
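
      For reference, retry on failed rebalance can be enabled through the cluster REST endpoint /settings/retryRebalance. The sketch below only builds the request and does not send it. The host, the 300-second wait and the 3 attempts are illustrative assumptions; the values actually used by this test are not stated here, and authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_retry_rebalance_request(host, after_seconds, max_attempts):
    """Build (but do not send) a POST that enables automatic rebalance retry.

    host, after_seconds and max_attempts are hypothetical inputs chosen for
    illustration; they are not taken from the test configuration.
    """
    body = urlencode(
        {
            "enabled": "true",
            "afterTimePeriod": after_seconds,  # seconds to wait before a retry
            "maxAttempts": max_attempts,       # number of retries to attempt
        }
    ).encode()
    url = f"http://{host}:8091/settings/retryRebalance"
    return Request(url, data=body, method="POST")


# Hypothetical values, for illustration only.
req = build_retry_rebalance_request("127.0.0.1", 300, 3)
print(req.get_method(), req.full_url)
print(req.data.decode())
```

      To send the request, add admin credentials and pass it to urllib.request.urlopen.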

      While the retried rebalance was running, we hit https://issues.couchbase.com/browse/MB-45903?focusedCommentId=499483&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-499483. One index was stuck in the "Moving" state for 14 hrs.

      During this time, the disk on 172.23.97.77 reached 100% usage. Later, the disk on 172.23.121.165 also reached 100%.

      The following, from indexer_stats.log, shows how the disk size grew from 2021-05-07T01:00 to 2021-05-07T09:40, by which time the disk was almost full.

      Timestamp total_disk_size
      2021-05-07T01:00:01 23384368503
      2021-05-07T02:00:04 30680803527
      2021-05-07T03:00:16 43064247227
      2021-05-07T04:00:30 45901589749
      2021-05-07T05:00:47 51404093644
      2021-05-07T06:00:06 55880208049
      2021-05-07T07:00:26 61842746552
      2021-05-07T08:00:56 66335009451
      2021-05-07T09:00:15 75660402162
      2021-05-07T09:40:19 95151809213
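
      To make the growth easier to read, the samples above can be turned into per-interval growth rates. This is a minimal sketch that assumes total_disk_size is reported in bytes (the log excerpt does not state the unit); the helper names are illustrative.

```python
from datetime import datetime

# Samples copied from the indexer_stats.log excerpt above.
SAMPLES = """\
2021-05-07T01:00:01 23384368503
2021-05-07T02:00:04 30680803527
2021-05-07T03:00:16 43064247227
2021-05-07T04:00:30 45901589749
2021-05-07T05:00:47 51404093644
2021-05-07T06:00:06 55880208049
2021-05-07T07:00:26 61842746552
2021-05-07T08:00:56 66335009451
2021-05-07T09:00:15 75660402162
2021-05-07T09:40:19 95151809213
"""

GIB = 1024 ** 3  # assumes total_disk_size is in bytes


def parse_samples(text):
    """Return a list of (timestamp, total_disk_size) tuples."""
    rows = []
    for line in text.splitlines():
        ts, size = line.split()
        rows.append((datetime.fromisoformat(ts), int(size)))
    return rows


def growth_rates(rows):
    """Return (start, end, bytes_per_hour) for each consecutive pair of samples."""
    rates = []
    for (t0, s0), (t1, s1) in zip(rows, rows[1:]):
        hours = (t1 - t0).total_seconds() / 3600
        rates.append((t0, t1, (s1 - s0) / hours))
    return rates


rows = parse_samples(SAMPLES)
rates = growth_rates(rows)
total_growth = rows[-1][1] - rows[0][1]
fastest = max(rates, key=lambda r: r[2])

for start, end, rate in rates:
    print(f"{start:%H:%M} -> {end:%H:%M}: {rate / GIB:.1f} GiB/hr")
print(f"total growth: {total_growth / GIB:.1f} GiB")
print(f"fastest interval starts at {fastest[0]:%H:%M}")
```

      The fastest growth falls in the last interval, 09:00 to 09:40, just before the disk was almost full.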

      On the other index nodes (172.23.97.83, 172.23.96.31, 172.23.96.30), disk usage is under 10%, and on 172.23.97.82 it is 38%.

          People

            mihir.kamdar Mihir Kamdar (Inactive)