Couchbase Server / MB-49096

[Magma] - Delta recovery hangs at 1 % DGM + 512 MB bucket with 1 replica

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Duplicate
    • 7.1.0
    • 7.1.0
    • couchbase-bucket
    • Enterprise Edition 7.1.0 build 1548
    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops-temp_rebalance_magma.ini rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,nodes_init=5,nodes_failover=1,recovery_type=delta,bucket_spec=magma_dgm.1_percent_dgm.5_node_1_replica_magma_512_single_bucket,doc_size=512,randomize_value=True,data_load_stage=during,skip_validations=False'
      

      Steps to Repro
      1. Create a 5 node cluster.

      2021-10-21 06:44:19,450 | test  | INFO    | pool-3-thread-6 | [table_view:display:72] Rebalance Overview
      +----------------+----------+-----------------------+----------------+--------------+
      | Nodes          | Services | Version               | CPU            | Status       |
      +----------------+----------+-----------------------+----------------+--------------+
      | 172.23.105.164 | kv       | 7.1.0-1548-enterprise | 0.739348370927 | Cluster node |
      | 172.23.105.206 | None     |                       |                | <--- IN ---  |
      | 172.23.105.33  | None     |                       |                | <--- IN ---  |
      | 172.23.105.36  | None     |                       |                | <--- IN ---  |
      | 172.23.106.177 | None     |                       |                | <--- IN ---  |
      +----------------+----------+-----------------------+----------------+--------------+
      

      2. Create a bucket with 512 MB of RAM and 1 replica. Create scopes/collections and push the bucket to 1% DGM.

      2021-10-21 11:32:26,931 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+----------------+
      | Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items     | RAM Quota | RAM Used | Disk Used  | ARR            |
      +---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+----------------+
      | default | couchbase | magma           | 1        | none       | 0   | 244312500 | 2.50 GiB  | 1.82 GiB | 200.74 GiB | 0.558682834485 |
      +---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+----------------+
      

      3. Hard failover a node

      2021-10-21 11:32:30,204 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:721] failing over nodes [ip:172.23.106.177 port:8091 ssh_username:root]
      

      4. Perform a delta recovery. The recovery hangs.

      cbcollect_info attached.

          Activity

            rohan.suri Rohan Suri added a comment -

            Balakumaran Gopal, could you re-run this test after https://issues.couchbase.com/browse/MB-47318 is fixed? Ben Huddleston and I suspect it has the same cause.

            Another way to verify is to re-run the test without any delete mutations, since that bug arises only when deletes are present.

            ben.huddleston Ben Huddleston added a comment -

            Expirations can manifest the bug too!
            rohan.suri Rohan Suri added a comment -

            Agreed, but I remember Bala mentioning that this test only has creates, updates, reads, and deletes.
            rohan.suri Rohan Suri added a comment - - edited

            Closing as a duplicate of MB-47318.

            Comparing unacked_bytes on the producer side (node 33) and the consumer side (node 177) shows that the two disagree:

            33's stats.log:
            eq_dcpq:replication:ns_1@172.23.105.33->ns_1@172.23.106.177:default:unacked_bytes: 10485858

            177's stats.log:
            eq_dcpq:replication:ns_1@172.23.105.33->ns_1@172.23.106.177:default:unacked_bytes: 0

            Since the consumer's max buffer size is 10 MB and the producer thinks ~10 MB is still unacked, the producer doesn't send any more items, even though the consumer has acknowledged everything sent so far. Jim's fixes for MB-47318 will fix this too.
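
            The comparison above can be sketched as a small script. This is only an illustration, not a Couchbase tool: the function names are hypothetical, the stat lines are assumed to appear in stats.log in the form quoted above, and the "10MB" buffer is assumed to mean 10 MiB (10485760 bytes). It flags a connection as stalled when the producer reports a full window of unacked bytes while the consumer reports none.

```python
import re

# Assumption: the 10 MB consumer buffer mentioned in the comment is 10 MiB.
ASSUMED_BUFFER_BYTES = 10 * 1024 * 1024

# Matches lines such as "eq_dcpq:<connection name>:unacked_bytes: <number>".
STAT_RE = re.compile(
    r"^\s*(?P<conn>eq_dcpq:\S+):unacked_bytes:\s*(?P<value>\d+)\s*$"
)


def parse_unacked(lines):
    """Map each DCP connection name to its reported unacked_bytes."""
    result = {}
    for line in lines:
        match = STAT_RE.match(line)
        if match:
            result[match.group("conn")] = int(match.group("value"))
    return result


def stalled_connections(producer_lines, consumer_lines,
                        buffer_bytes=ASSUMED_BUFFER_BYTES):
    """Return connections where the producer sees a full window of
    unacked bytes but the consumer reports nothing outstanding."""
    producer = parse_unacked(producer_lines)
    consumer = parse_unacked(consumer_lines)
    return [
        conn
        for conn, unacked in producer.items()
        if conn in consumer
        and unacked >= buffer_bytes
        and consumer[conn] == 0
    ]


# The two stat lines quoted in this comment (node 33 and node 177).
CONN = "eq_dcpq:replication:ns_1@172.23.105.33->ns_1@172.23.106.177:default"
producer_log = [CONN + ":unacked_bytes: 10485858"]
consumer_log = [CONN + ":unacked_bytes: 0"]

print(stalled_connections(producer_log, consumer_log))
```

            With the two lines from this ticket, the replication connection from node 33 to node 177 is reported as stalled, which matches the hang seen during delta recovery.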

            Thanks, Ben Huddleston, for the help!

            People

              rohan.suri Rohan Suri
              Balakumaran.Gopal Balakumaran Gopal