Couchbase Server / MB-49486

[System Test] KV rebalance stuck: DCP producer has unacked_bytes: 35152085


Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Duplicate
    • 7.1.0
    • 7.1.0
    • couchbase-bucket
    • Triaged
    • 1
    • Unknown
    • KV 2021-Nov

    Description

      Build : 7.1.0-1667
      Test : -test tests/integration/neo/test_neo_couchstore_milestone3.yml -scope tests/integration/neo/scope_couchstore.yml
      Scale : 3
      Iteration : 2nd

      A rebalance operation to remove KV node 172.23.106.100 started at 2021-11-10T10:05:12

      [2021-11-10T10:05:12-08:00, sequoiatools/couchbase-cli:7.1:f68763] rebalance -c 172.23.108.103:8091 --server-remove 172.23.106.100:8091 -u Administrator -p password
      

      This rebalance has been stuck for 11 hours on the default bucket after moving 138 of 171 vBuckets. See screenshot.

      Supportal snapshot : https://supportal.couchbase.com/snapshot/1ea9326f31534c931687f1c5021adddc::0

          Activity

            owend Daniel Owen added a comment - - edited

            The problem is with vb:305 on stream eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default. Once the backfill completes, the stream does not transition to in-memory; i.e. we do not see the log message:
            ActiveStream::transitionState: Transitioning from backfilling to in-memory

            owend Daniel Owen added a comment - - edited

            It looks like only vb:305 is affected.

            The stats on .103 show the following streams; however, according to memcached.log on the producer (.25), the only one created recently (during the last rebalance) is vb:305.

             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_305_last_received_seqno:                                                         0
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_309_last_received_seqno:                                                         153735
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_310_last_received_seqno:                                                         178847
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_957_last_received_seqno:                                                         441474
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_958_last_received_seqno:                                                         278045
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_959_last_received_seqno:                                                         278685
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_960_last_received_seqno:                                                         278098
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_961_last_received_seqno:                                                         279407
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:stream_962_last_received_seqno:                                                         442168
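
To pick the stalled stream out of a stats dump like the one above, a minimal Python sketch follows. It assumes the stats are available as plain text in the `name: value` layout shown here; the helper name `streams_with_no_data` and the "seqno is 0" criterion are illustrative assumptions, not part of any Couchbase tool. A seqno of 0 can also simply mean an empty vBucket, so treat the result as a hint only.

```python
import re

# Matches lines such as
#   <connection name>:stream_305_last_received_seqno:      0
STAT_RE = re.compile(r":stream_(\d+)_last_received_seqno:\s+(\d+)\s*$")

def streams_with_no_data(stats_text):
    """Return the vBucket ids whose last_received_seqno is 0."""
    stuck = []
    for line in stats_text.splitlines():
        match = STAT_RE.search(line)
        if match and int(match.group(2)) == 0:
            stuck.append(int(match.group(1)))
    return stuck

# Connection name copied from the stats above.
CONN = "eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default"
sample = f"""
 {CONN}:stream_305_last_received_seqno:      0
 {CONN}:stream_309_last_received_seqno:      153735
 {CONN}:stream_310_last_received_seqno:      178847
"""
print(streams_with_no_data(sample))  # [305]
```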
            

            owend Daniel Owen added a comment -

            The connection stats for eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default on .25 are shown below:

             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:total_acked_bytes:                                                                     10232989837
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:total_bytes_sent:                                                                      10268141922
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:total_uncompressed_data_size:                                                          10268313088
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:type:                                                                                  producer
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:unacked_bytes:                                                                         35152085
            

            The producer has ~35 MB of unacked bytes.
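
As a quick sanity check on the figures above, the reported unacked_bytes equals total_bytes_sent minus total_acked_bytes. The Python sketch below only verifies that arithmetic for the values in this ticket; reading the stat as that difference is an observation about these numbers, not a statement of how the stat is computed internally.

```python
# Values copied from the producer-side stats on .25
total_bytes_sent = 10268141922
total_acked_bytes = 10232989837
reported_unacked_bytes = 35152085

computed_unacked = total_bytes_sent - total_acked_bytes

print(computed_unacked)                            # 35152085
print(computed_unacked == reported_unacked_bytes)  # True
print(round(computed_unacked / 1e6, 1))            # 35.2 (decimal MB)
```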

            owend Daniel Owen added a comment -

            However, on the consumer side (.103) we see:

             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:supports_ack:                                                                           false
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:synchronous_replication:                                                                true
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:total_acked_bytes:                                                                      10232989837
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:total_backoffs:                                                                         13987726
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:type:                                                                                   consumer
             eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default:unacked_bytes:                                                                          0
            

            The consumer reports 0 unacked bytes, while the producer still has ~35 MB outstanding.
            This is very similar to the behaviour seen in MB-49096, which was on a magma backend. This MB is for a couchstore bucket, so the issue appears to be backend-agnostic.
            It is therefore also believed to be a duplicate of MB-47318.
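
The producer/consumer disagreement described in this comment can be surfaced mechanically by diffing the two stats dumps. The Python sketch below is an illustration under assumptions: the helpers `parse_stats` and `mismatches` are hypothetical names, and the input is assumed to be plain text in the `name:stat: value` layout shown in this ticket.

```python
def parse_stats(text):
    """Parse 'conn:stat:   value' lines into {stat: value} (values kept as strings)."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.rpartition(":")
        stats[key.rsplit(":", 1)[-1]] = value.strip()
    return stats

def mismatches(producer, consumer):
    """Return {stat: (producer_value, consumer_value)} for shared numeric stats that differ."""
    out = {}
    for stat in sorted(producer.keys() & consumer.keys()):
        p, c = producer[stat], consumer[stat]
        if p.isdigit() and c.isdigit() and p != c:
            out[stat] = (int(p), int(c))
    return out

# Connection name and values copied from the stats in this ticket.
CONN = "eq_dcpq:replication:ns_1@172.23.99.25->ns_1@172.23.108.103:default"
producer_text = f"""
 {CONN}:total_acked_bytes:   10232989837
 {CONN}:total_bytes_sent:    10268141922
 {CONN}:type:                producer
 {CONN}:unacked_bytes:       35152085
"""
consumer_text = f"""
 {CONN}:supports_ack:        false
 {CONN}:total_acked_bytes:   10232989837
 {CONN}:type:                consumer
 {CONN}:unacked_bytes:       0
"""

diff = mismatches(parse_stats(producer_text), parse_stats(consumer_text))
print(diff)  # {'unacked_bytes': (35152085, 0)}
```

Both sides agree on total_acked_bytes, so only unacked_bytes is reported as differing.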


            ritam.sharma Ritam Sharma added a comment -

            Closing out duplicates.

            People

              mihir.kamdar Mihir Kamdar (Inactive)
              mihir.kamdar Mihir Kamdar (Inactive)