  1. Couchbase Server
  2. MB-58914

[CBM] Point in time backup never completes if snapshot ends with system events

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.6.0
    • master
    • tools
    • Untriaged
    • 0
    • Unknown

    Description

      What is the problem?
      When doing a point-in-time backup with the latest master of both the server and the backup tool, the backup never completes. This appears to be because we don't count system events (e.g. collections being created) when bumping the highseqno of the sink. It only happens if the last seqno increase in a snapshot is due to a system event.

      Technical detail
      We see in the logs that we are repeatedly streaming the same snapshot:

      > tail -f ~/bk/logs/backup-0.log | grep "vb 941"
      2023-10-03T08:32:01.683+00:00 (DCP) (default) (vb 941) Creating DCP stream | {"uuid":0,"start_seqno":0,"end_seqno":3,"snap_start":0,"snap_end":0,"retries":0}
      2023-10-03T08:32:01.766+00:00 (DCP) (default) (vb 941) Stream closed because all items were streamed | {"uuid":74448152038113,"snap_start":0,"snap_end":3,"last_seqno":0,"retries":0}
      2023-10-03T08:32:01.766+00:00 (DCP) (default) (vb 941) PiTR Streaming next snapshot
      2023-10-03T08:32:01.766+00:00 (DCP) (default) (vb 941) Creating DCP stream | {"uuid":74448152038113,"start_seqno":0,"end_seqno":3,"snap_start":0,"snap_end":3,"retries":0}
      2023-10-03T08:32:01.914+00:00 (DCP) (default) (vb 941) Stream closed because all items were streamed | {"uuid":74448152038113,"snap_start":0,"snap_end":3,"last_seqno":0,"retries":0}
      2023-10-03T08:32:01.916+00:00 (DCP) (default) (vb 941) PiTR Streaming next snapshot
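
The pattern in the log lines above can also be checked mechanically. The following is a minimal sketch, not part of the backup tool: it assumes the "Stream closed" line format is exactly as in this excerpt, and the names `CLOSED`, `SAMPLE` and `stuck_vbuckets` are hypothetical. It reports vBuckets that closed the same snapshot more than once while `last_seqno` was still below `snap_end`.

```python
import json
import re
from collections import Counter

# Line format taken from the log excerpt in this issue only (assumption:
# other versions of the tool may log differently).
CLOSED = re.compile(
    r"\(vb (\d+)\) Stream closed because all items were streamed \| (\{.*\})"
)

SAMPLE = [
    '2023-10-03T08:32:01.766+00:00 (DCP) (default) (vb 941) Stream closed because all items were streamed | {"uuid":74448152038113,"snap_start":0,"snap_end":3,"last_seqno":0,"retries":0}',
    "2023-10-03T08:32:01.766+00:00 (DCP) (default) (vb 941) PiTR Streaming next snapshot",
    '2023-10-03T08:32:01.914+00:00 (DCP) (default) (vb 941) Stream closed because all items were streamed | {"uuid":74448152038113,"snap_start":0,"snap_end":3,"last_seqno":0,"retries":0}',
]


def stuck_vbuckets(lines, min_repeats=2):
    """Return vBucket ids whose snapshot was closed at least `min_repeats`
    times with last_seqno still below snap_end."""
    repeats = Counter()
    for line in lines:
        match = CLOSED.search(line)
        if not match:
            continue
        vb = int(match.group(1))
        fields = json.loads(match.group(2))
        if fields["last_seqno"] < fields["snap_end"]:
            repeats[(vb, fields["snap_start"], fields["snap_end"])] += 1
    return sorted({vb for (vb, _, _), n in repeats.items() if n >= min_repeats})


print(stuck_vbuckets(SAMPLE))
```

Run against the full `backup-0.log`, a non-empty result would point at the vBuckets that are looping.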
      

      As the log shows, each stream ends without last_seqno being bumped. A pcap suggests that these seqnos belong to system events (i.e. creation of scopes/collections), and couch_dbdump confirms it:

      > ../install/bin/couch_dbdump  --vbucket ../ns_server/data/n_0/data/default/941.couch.1
      Dumping "../ns_server/data/n_0/data/default/941.couch.1":
      Doc seq: 1
           id: (system-event-key:scope:0x8)_scope
           rev: 1
           content_meta: 0x83
           size (on disk): 48
           cas: 1696318773518008320, expiry: 0, flags: 16777216, datatype: 0x00 (raw)
           size: 40
           data: (snappy)
      Doc seq: 2
           id: (system-event-key:collection:0x9)_collection
           rev: 1
           content_meta: 0x83
           size (on disk): 57
           cas: 1696318773518204928, expiry: 0, flags: 0, datatype: 0x00 (raw)
           size: 52
           data: (snappy)
      Doc seq: 3
           id: (system-event-key:collection:0x8)_collection
           rev: 1
           content_meta: 0x83
           size (on disk): 60
           cas: 1696318773518336000, expiry: 0, flags: 0, datatype: 0x00 (raw)
           size: 64
           data: (snappy)
       
      Total docs: 3
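
To check whether a vBucket file matches the failing condition, the dump can be scanned for the document with the highest seqno. This is a rough sketch that assumes the `couch_dbdump` text layout and the `(system-event-key:` id prefix are as shown above; `last_doc_is_system_event` and `DUMP` are hypothetical names used only for illustration.

```python
import re

# Abbreviated copy of the dump above: only the lines the parser needs.
DUMP = """\
Doc seq: 1
     id: (system-event-key:scope:0x8)_scope
Doc seq: 2
     id: (system-event-key:collection:0x9)_collection
Doc seq: 3
     id: (system-event-key:collection:0x8)_collection
"""


def last_doc_is_system_event(dump_text):
    """Return True if the document with the highest seqno is a system event."""
    docs = []  # (seqno, document id) pairs
    seq = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"Doc seq: (\d+)", line)
        if match:
            seq = int(match.group(1))
        elif line.startswith("id: ") and seq is not None:
            docs.append((seq, line[len("id: "):]))
            seq = None
    if not docs:
        return False
    _, doc_id = max(docs)  # tuples compare by seqno first
    return doc_id.startswith("(system-event-key:")


print(last_doc_is_system_event(DUMP))
```

For vBucket 941 the highest seqno (3) is a collection creation event, which is the case described in this issue.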
      

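      Looking at the code, we can see that in PiTR mode we open a new stream whenever the highseqno of the sink does not match that of the source. This works around the fact that DCP will not stream all the snapshots in a range for PiTR (MB-46854). Because system events do not appear to bump the sink's highseqno, a snapshot that ends with one leaves the two values permanently unequal, so the same snapshot is requested again and again.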
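
The restream condition described above can be modelled in a few lines. This is a simplified simulation, not the backup tool's actual code: `Item`, `stream_until_caught_up`, `count_system_events` and the `max_streams` cap are all hypothetical, and counting system events is shown only as one possible remedy. Each pass stands for one DCP stream, and the loop stops once the sink's high seqno equals the source's.

```python
from dataclasses import dataclass


@dataclass
class Item:
    seqno: int
    is_system_event: bool


def stream_until_caught_up(items, source_high_seqno, count_system_events,
                           max_streams=5):
    """Return how many streams were opened before the sink caught up,
    or None if it never did within `max_streams` attempts."""
    sink_high_seqno = 0
    for streams_opened in range(1, max_streams + 1):
        for item in items:
            if item.seqno <= sink_high_seqno:
                continue  # already accounted for by an earlier stream
            if item.is_system_event and not count_system_events:
                continue  # the suspected bug: seqno is not recorded
            sink_high_seqno = item.seqno
        if sink_high_seqno == source_high_seqno:
            return streams_opened
    return None  # the real tool has no cap, so it would loop forever


# vBucket 941 from this issue: seqnos 1-3 are all system events.
vb941 = [Item(1, True), Item(2, True), Item(3, True)]
print(stream_until_caught_up(vb941, 3, count_system_events=False))
print(stream_until_caught_up(vb941, 3, count_system_events=True))
```

In this model a snapshot that ends with an ordinary mutation still completes even when system events are ignored, which matches the observation that the hang only occurs when the last seqno increase comes from a system event.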

      Is this a regression?
      No. 7.6 just makes it more likely, because the first snapshot will probably contain some or all of the system events that create the _system scope.

      Reproduction

      1. Create a bucket on a 7.6 cluster with PiTR enabled
      2. Create multiple scopes/collections
      3. Load fewer than 1024 documents
      4. Try to create a backup

            People

              gilad.kalchheim Gilad Kalchheim
              Matt.Hall Matt Hall