  Couchbase Server / MB-51027

[Magma][Windows] - Rebalance doesn't seem to make any visible progress for long periods of time(~15mins)


Details

    • Untriaged
    • Windows 64-bit
    • 1
    • Yes
    • KV 2022-Feb

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops-temp_rebalance_magma_win.ini rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,bucket_storage=magma,kv_quota_percent=30,services_init=kv:index:n1ql:fts:eventing:cbas-kv:index:n1ql:fts:eventing:cbas-kv:index:n1ql:fts:eventing:cbas-kv:index:n1ql:fts:eventing:cbas-kv:index:n1ql:fts:eventing:cbas-kv:index:n1ql:fts:eventing:cbas -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_rebalance_in,nodes_init=4,nodes_in=2,bucket_spec=magma_dgm.10_percent_dgm.4_node_1_replica_magma_512,doc_size=512,randomize_value=True,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=during,skip_validations=False,GROUP=rebalance_in'
      

      I was running a bunch of tests to validate MB-50941 when I noticed this.

      Steps to Repro
      1. Create a 4-node cluster.

       
      2022-02-16 23:32:41,542 | test  | INFO    | MainThread | [table_view:display:72] Cluster statistics
      +----------------+--------------------------------------+-----------------+-----------+----------+---------------------+-------------------+-----------------------+
      | Node           | Services                             | CPU_utilization | Mem_total | Mem_free | Swap_mem_used       | Active / Replica  | Version               |
      +----------------+--------------------------------------+-----------------+-----------+----------+---------------------+-------------------+-----------------------+
      | 172.23.136.138 | cbas, eventing, fts, index, kv, n1ql | 5.33833333333   | 5.99 GiB  | 3.36 GiB | 3.33 GiB / 7.62 GiB | 0 / 0             | 7.1.0-2325-enterprise |
      | 172.23.136.142 | cbas, eventing, fts, index, kv, n1ql | 1.35497741704   | 5.99 GiB  | 3.42 GiB | 3.27 GiB / 7.62 GiB | 0 / 0             | 7.1.0-2325-enterprise |
      | 172.23.136.141 | cbas, eventing, fts, index, kv, n1ql | 4.97174952916   | 5.99 GiB  | 3.27 GiB | 3.33 GiB / 7.62 GiB | 0 / 0             | 7.1.0-2325-enterprise |
      | 172.23.136.137 | cbas, eventing, fts, index, kv, n1ql | 1.25            | 5.99 GiB  | 3.34 GiB | 3.32 GiB / 7.62 GiB | 0 / 0             | 7.1.0-2325-enterprise |
      +----------------+--------------------------------------+-----------------+-----------+----------+---------------------+-------------------+-----------------------+
      

      2. Create the bucket, scopes, collections, and data (a rough REST sketch of this step follows the bucket statistics below).

      2022-02-16 23:36:56,727 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +------------------------------------------------------------------------+-----------+-----------------+----------+------------+-----+---------+-----------+----------+-----------+---------------+
      | Bucket                                                                 | Type      | Storage Backend | Replicas | Durability | TTL | Items   | RAM Quota | RAM Used | Disk Used | ARR           |
      +------------------------------------------------------------------------+-----------+-----------------+----------+------------+-----+---------+-----------+----------+-----------+---------------+
      | UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000 | couchbase | magma           | 2        | none       | 0   | 3000000 | 2.97 GiB  | 2.04 GiB | 4.25 GiB  | 27.5666666667 |
      +------------------------------------------------------------------------+-----------+-----------------+----------+------------+-----+---------+-----------+----------+-----------+---------------+
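
      For reference, here is a minimal sketch of step 2 driven directly against the ns_server REST API (the test itself does this through testrunner's bucket_spec). Host, credentials, quota, and names are placeholders, and the endpoint/parameter names are to the best of my knowledge rather than taken from the test code:

      import requests

      HOST = "http://172.23.136.138:8091"   # any cluster node; placeholder
      AUTH = ("Administrator", "password")  # placeholder credentials

      def create_magma_bucket(name, ram_mb, replicas=1):
          # storageBackend=magma selects the Magma storage engine (7.1+).
          requests.post(f"{HOST}/pools/default/buckets", auth=AUTH, data={
              "name": name,
              "bucketType": "couchbase",
              "storageBackend": "magma",
              "ramQuotaMB": ram_mb,
              "replicaNumber": replicas,
          }).raise_for_status()

      def create_scope(bucket, scope):
          requests.post(f"{HOST}/pools/default/buckets/{bucket}/scopes",
                        auth=AUTH, data={"name": scope}).raise_for_status()

      def create_collection(bucket, scope, collection):
          requests.post(f"{HOST}/pools/default/buckets/{bucket}/scopes/{scope}/collections",
                        auth=AUTH, data={"name": collection}).raise_for_status()

      create_magma_bucket("test_bucket", ram_mb=3072)
      create_scope("test_bucket", "scope_1")
      for i in range(10):
          create_collection("test_bucket", "scope_1", f"collection_{i}")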
      

      3. Start CRUD on data + scopes/collections.
      4. Add 2 nodes (172.23.136.136 and 172.23.136.139) and rebalance in (a rough REST sketch of this step follows the Rebalance Overview below).

      2022-02-16 23:37:15,099 | test  | INFO    | pool-3-thread-10 | [table_view:display:72] Rebalance Overview
      +----------------+--------------------------------------+-----------------------+---------------+--------------+
      | Nodes          | Services                             | Version               | CPU           | Status       |
      +----------------+--------------------------------------+-----------------------+---------------+--------------+
      | 172.23.136.138 | cbas, eventing, fts, index, kv, n1ql | 7.1.0-2325-enterprise | 15.5480741988 | Cluster node |
      | 172.23.136.142 | cbas, eventing, fts, index, kv, n1ql | 7.1.0-2325-enterprise | 60.9883333333 | Cluster node |
      | 172.23.136.141 | cbas, eventing, fts, index, kv, n1ql | 7.1.0-2325-enterprise | 11.0416666667 | Cluster node |
      | 172.23.136.137 | cbas, eventing, fts, index, kv, n1ql | 7.1.0-2325-enterprise | 17.24         | Cluster node |
      | 172.23.136.136 | None                                 |                       |               | <--- IN ---  |
      | 172.23.136.139 | None                                 |                       |               | <--- IN ---  |
      +----------------+--------------------------------------+-----------------------+---------------+--------------+
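
      Here is a sketch of step 4 against the REST API (hosts/credentials are placeholders; endpoint and field names are to the best of my knowledge, not taken from the framework):

      import requests

      HOST = "http://172.23.136.138:8091"   # existing cluster node (orchestrator)
      AUTH = ("Administrator", "password")  # placeholder credentials

      def add_node(new_host, services="kv,index,n1ql,fts,eventing,cbas"):
          # Joins new_host to the cluster; it only takes traffic after a rebalance.
          requests.post(f"{HOST}/controller/addNode", auth=AUTH, data={
              "hostname": new_host,
              "user": AUTH[0],
              "password": AUTH[1],
              "services": services,
          }).raise_for_status()

      def start_rebalance():
          # knownNodes must list every node's otpNode name (e.g. ns_1@172.23.136.138).
          nodes = requests.get(f"{HOST}/pools/default", auth=AUTH).json()["nodes"]
          known = ",".join(n["otpNode"] for n in nodes)
          requests.post(f"{HOST}/controller/rebalance", auth=AUTH, data={
              "knownNodes": known,
              "ejectedNodes": "",
          }).raise_for_status()

      add_node("172.23.136.136")
      add_node("172.23.136.139")
      start_rebalance()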
      

      As can be seen, not a single vBucket movement, incoming doc, or outgoing doc happens for the first ~15 minutes. However, the rebalance does start making progress after 15 minutes and finishes around the 18-minute mark.

      In all fairness, this might not be a bug. But our tests/automation frameworks currently use timeouts anywhere between 5 and 35 minutes (depending on the amount of data) before a rebalance is considered hung and the test case is failed. So I was wondering if something has changed recently that could cause this. If this issue occurs on a weekly run, we could be looking at hundreds of test cases failing, as the default timeouts we have won't work anymore.
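
      For context, here is a minimal sketch of the kind of hang detection our framework applies (not the actual testrunner code): poll rebalance progress and fail if the reported progress has not advanced within the configured timeout. Host/credentials are placeholders, and the response handling is a best guess at the /pools/default/rebalanceProgress shape:

      import time
      import requests

      HOST = "http://172.23.136.138:8091"
      AUTH = ("Administrator", "password")  # placeholder credentials

      def wait_for_rebalance(stuck_timeout=15 * 60, poll_interval=10):
          """Block until rebalance completes; raise if progress stalls for stuck_timeout seconds."""
          last_progress, last_change = 0.0, time.time()
          while True:
              status = requests.get(f"{HOST}/pools/default/rebalanceProgress",
                                    auth=AUTH).json()
              if status.get("status") == "none":
                  return                      # rebalance finished (or never ran)
              # Average the per-node progress fractions reported by ns_server.
              per_node = [v["progress"] for v in status.values()
                          if isinstance(v, dict) and "progress" in v]
              progress = sum(per_node) / len(per_node) if per_node else 0.0
              if progress > last_progress:
                  last_progress, last_change = progress, time.time()
              elif time.time() - last_change > stuck_timeout:
                  raise TimeoutError(f"Rebalance stuck at {last_progress:.1%} "
                                     f"for {stuck_timeout}s")
              time.sleep(poll_interval)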

      cbcollect_info attached.

      Attachments

        Issue Links


          Activity

            drigby Dave Rigby added a comment -

            Note that the workload which is running here is somewhat contributing to the problem - we see that 78 collections are dropped just before the rebalance starts:

            $ rg -N 'drop collection' memcached.log
            2022-02-16T23:37:00.472471-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) drop collection:id:0x12, scope:0x0, manifest:0xd1
            2022-02-16T23:37:00.628501-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) drop collection:id:0x29, scope:0x8, manifest:0xd3
            2022-02-16T23:37:00.628544-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) drop collection:id:0x8, scope:0x0, manifest:0xd3
            ... cut ...
            2022-02-16T23:37:14.341639-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) drop collection:id:0x82, scope:0xd, manifest:0x104
            2022-02-16T23:37:14.341645-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) drop collection:id:0x80, scope:0xd, manifest:0x104
            2022-02-16T23:37:14.341651-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) drop collection:id:0x7f, scope:0xd, manifest:0x104
            

            If these had all occurred at the same time they would only trigger one compaction per vBucket; however, they are spread out across multiple collection manifest updates - 48 unique manifest IDs:

            $ rg -N 'drop collection' memcached.log | cut -d, -f 3|sort| uniq | wc -l
                  48
            

            As such we end up scheduling (and re-scheduling) compaction of each vBucket multiple times (to purge the items from the now-dropped collections). For example, for vb:792 we schedule 3 times:

            $ rg 'Compaction of vb:792,' memcached.log 
            18120:2022-02-16T23:37:00.931734-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) Compaction of vb:792, task:2854, purge_before_ts:0, purge_before_seq:0, drop_deletes:false, internal:true created with delay:2147483647s (awaiting completion).
            18852:2022-02-16T23:37:03.505150-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) Compaction of vb:792, task:2854, purge_before_ts:0, purge_before_seq:0, drop_deletes:false, internal:true rescheduled with delay:5s (awaiting completion).
            21625:2022-02-16T23:38:16.418058-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) Compaction of vb:792, task:3885, purge_before_ts:0, purge_before_seq:0, drop_deletes:false, internal:true created with delay:2147483647s (awaiting completion).
            

            We don't actually compact 3 times, just twice, as two of the schedules were within 5s of each other (we delay starting compaction for collection purging by 5s in case more collections needing purging are identified) - so we "just" see 2 compactions:

             $ rg 'Compaction of vb:792 done' memcached.log
            21439:2022-02-16T23:37:54.906100-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) Compaction of vb:792 done (ok). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:154,deleted:0, collections_erased:11, size/items/tombstones/purge_seqno pre{3391488, 2997, 0, 0}, post{1261568, 2843, 0, 0}
            26751:2022-02-16T23:55:17.075566-08:00 INFO (UoQtU7JR99ysgk0PaDwPg8rWZom%U0eqtDTErgiiYijKCHCZBHE62BiI0Dr--41-860000) Compaction of vb:792 done (ok). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:938,deleted:0, collections_erased:67, size/items/tombstones/purge_seqno pre{700416, 2695, 0, 0}, post{663552, 1757, 0, 0}
            

            This isn't to downplay the underlying problem here (compaction is consuming all AuxIO threads), but just to highlight that the issue is amplified by this particular workload.
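
            For illustration only (the real logic lives in KV-Engine, not Python), here is a minimal sketch of the delay-and-reschedule behaviour described above: each collection drop (re)arms a 5s timer for the affected vBucket, so drops arriving close together are coalesced into one compaction, while drops spread over a longer window trigger additional compactions:

            import threading

            COLLECTION_PURGE_DELAY = 5.0  # seconds, matching the delay:5s in the logs

            class PurgeCompactionScheduler:
                def __init__(self, run_compaction):
                    self._run_compaction = run_compaction  # callable(vbid, dropped_cids)
                    self._lock = threading.Lock()
                    self._pending = {}                     # vbid -> (Timer, set of dropped cids)

                def collection_dropped(self, vbid, cid):
                    with self._lock:
                        timer, cids = self._pending.get(vbid, (None, set()))
                        if timer is not None:
                            timer.cancel()                 # push the compaction back again
                        cids.add(cid)
                        timer = threading.Timer(COLLECTION_PURGE_DELAY, self._fire, args=(vbid,))
                        self._pending[vbid] = (timer, cids)
                        timer.start()

                def _fire(self, vbid):
                    with self._lock:
                        entry = self._pending.pop(vbid, None)
                    if entry is not None:
                        _, cids = entry
                        self._run_compaction(vbid, cids)   # one compaction purges all of them

            With the drops here spread between 23:37:00 and 23:37:14 across many manifest updates, some vBuckets inevitably fall outside the 5s window and get compacted more than once.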

            drigby Dave Rigby added a comment -

            Note this issue has similar underlying problems to MB-49512 - in that case it was a bucket deletion that was delayed, but in both cases the cause was long-running Compaction tasks stopping other things from running.

            There is already an MB (MB-50988) tracking the issue that compactions scheduled after a collection purge are not limited. This was targeted at Morpheus, but if this issue is sufficiently widespread / problematic then we should consider moving it into Neo.
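
            To illustrate the kind of limit being discussed (this is not the actual MB-50988 design), purge compactions could be funnelled through a small bounded pool so they cannot occupy every AuxIO thread at once; the pool size of 2 below is arbitrary:

            from concurrent.futures import ThreadPoolExecutor

            def compact_vbucket(vbid, dropped_collections):
                """Placeholder for the actual per-vBucket purge compaction."""
                ...

            purge_compaction_pool = ThreadPoolExecutor(max_workers=2)

            def schedule_purge_compaction(vbid, dropped_collections):
                # Excess compactions queue here instead of competing for the shared
                # AuxIO threads, leaving capacity for rebalance backfill and other work.
                return purge_compaction_pool.submit(compact_vbucket, vbid, dropped_collections)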

            drigby Dave Rigby added a comment -

            I also observed that Magma's disk write throughput was much lower on the source node (.142) at the start of the run (when compaction was running), but it appears to recover later on.

            (Chart not reproduced here; note its left Y axis is log10.) Throughput was roughly 100x lower in the earlier part of the run - less than 10 MB/s.

            I highlight this for two reasons: (a) the node/environment could only sustain very low disk throughput earlier on (which increases the problems seen with AuxIO thread contention), and (b) for some reason performance changed halfway through the run.

            drigby Dave Rigby added a comment -

            Closing this as a duplicate of MB-50988, given that it has been brought into Neo and will address this issue.


            Balakumaran.Gopal Balakumaran Gopal added a comment -

            Closing as a dup. Will validate the tagged bug with these tests.

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal
              Votes: 0
              Watchers: 5

