Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-60938

Takeover might never complete when source node is under memory pressure

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • Morpheus
    • 7.6.0
    • couchbase-bucket
    • None
    • Triaged
    • 0
    • No
    • March-June 24

    Description

      The following is enforced for in-memory streams in their TakeoverSend phase:

              "dcp_takeover_max_time": {
                  "default": "60",
                  "descr": "Max amount of time for takeover send (in seconds) after which front end ops would return ETMPFAIL",
                  "dynamic": true,
                  "type": "size_t"
              },
      

      But, we fail to backup frontend loads if any memory-pressure state on the node pushes the stream back to backfill continuously.

      Eg, this the ActiveStream on a local cluster_run node:

      024-02-26T15:56:35.951749+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Creating takeover stream with start seqno 651 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, flags:1, snapshot:{651,651} filter:{passthrough, size:0}, sid:none
      2024-02-26T15:56:35.951792+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example" lastProcessedSeqno:651, result{tryBackfill:true, op:empty, seqno:708, nextSeqno:7
      08}
      2024-02-26T15:56:35.951858+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Scheduling backfill from 652 to 707, uid:6, reschedule flag : False
      2024-02-26T15:56:35.952259+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::markDiskSnapshot: Sending disk snapshot with start:651, end:707, flags:0x6, flagsDecoded:[DISK,CHK], hcs:0, mvs:707, lastBackfilledSeqno:0
      2024-02-26T15:56:35.952287+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::registerCursor name "eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example", lastProcessedSeqno:707, pendingBackfill:false, result{tryBackfill:false, op:empty, seqno:708, nextSeqno:708}
      2024-02-26T15:56:36.031031+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Handling slow stream; state_ : backfilling, lastReadSeqno : 651, lastSentSeqno : 678, vBucketHighSeqno : 708, curChkSeqno: 708, isBackfillTaskRunning : True
      2024-02-26T15:56:36.245266+01:00 INFO (example) DCP backfilling task temporarily suspended because the current memory usage is too high
      2024-02-26T15:56:37.273116+01:00 INFO (example) DCP backfilling task temporarily suspended because the current memory usage is too high
      2024-02-26T15:56:40.666761+01:00 INFO (example) DCP backfilling task temporarily suspended because the current memory usage is too high
      2024-02-26T15:56:40.791046+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Backfill complete. 10 items consisting of 201010720 bytes read from disk, 0 items from memory, lastReadSeqno:707, lastSentSeqnoAdvance:0, lastSentSnapStartSeqno:651, lastSentSnapEndSeqno:707, pendingBackfill:True
      , numBackfillPauses:10. Total runtime 315 ms (31 item/s, 608 MB/s)
      2024-02-26T15:56:40.850688+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example" lastProcessedSeqno:707, result{tryBackfill:true, op:empty, seqno:749, nextSeqno:7
      49}
      2024-02-26T15:56:40.850930+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Scheduling backfill from 708 to 748, uid:7, reschedule flag : True
      2024-02-26T15:56:40.851306+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::markDiskSnapshot: Sending disk snapshot with start:708, end:749, flags:0x6, flagsDecoded:[DISK,CHK], hcs:0, mvs:749, lastBackfilledSeqno:707
      2024-02-26T15:56:40.851335+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::registerCursor name "eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example", lastProcessedSeqno:749, pendingBackfill:false, result{tryBackfill:false, op:mutation, seqno:749, nextSeqno:750}
      2024-02-26T15:56:40.929715+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Handling slow stream; state_ : backfilling, lastReadSeqno : 707, lastSentSeqno : 707, vBucketHighSeqno : 750, curChkSeqno: 750, isBackfillTaskRunning : True
      2024-02-26T15:56:50.817072+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Backfill complete. 20 items consisting of 201010720 bytes read from disk, 0 items from memory, lastReadSeqno:749, lastSentSeqnoAdvance:0, lastSentSnapStartSeqno:708, lastSentSnapEndSeqno:749, pendingBackfill:True
      , numBackfillPauses:10. Total runtime 357 ms (55 item/s, 536 MB/s)
      2024-02-26T15:56:50.902340+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example" lastProcessedSeqno:749, result{tryBackfill:true, op:empty, seqno:836, nextSeqno:8
      36}
      2024-02-26T15:56:50.902547+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Scheduling backfill from 750 to 835, uid:8, reschedule flag : True
      2024-02-26T15:56:50.903441+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::markDiskSnapshot: Sending disk snapshot with start:750, end:836, flags:0x6, flagsDecoded:[DISK,CHK], hcs:0, mvs:836, lastBackfilledSeqno:749
      2024-02-26T15:56:50.903538+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) ActiveStream::registerCursor name "eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example", lastProcessedSeqno:836, pendingBackfill:false, result{tryBackfill:false, op:mutation, seqno:836, nextSeqno:837}
      2024-02-26T15:56:51.067382+01:00 INFO 69: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.100.104->n_1@127.0.0.1:example - (vb:1) Handling slow stream; state_ : backfilling, lastReadSeqno : 749, lastSentSeqno : 749, vBucketHighSeqno : 837, curChkSeqno: 837, isBackfillTaskRunning : True
      2024-02-26T15:56:51.972161+01:00 INFO (example) DCP backfilling task temporarily suspended because the current memory usage is too high
      2024-02-26T15:56:52.978802+01:00 INFO (example) DCP backfilling task temporarily suspended because the current memory usage is too high
      2024-02-26T15:56:53.978557+01:00 INFO (example) DCP backfilling task temporarily suspended because the current memory usage is too high
      2024-02-26T15:56:55.782781+01:00 INFO (example) DCP backfilling task temporarily suspended because the current memory usage is too high
      ..
      <endless CursorDrop + Backfill>
      ..
      

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            paolo.cocchi Paolo Cocchi
            paolo.cocchi Paolo Cocchi
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty