Couchbase Server / MB-61421

Mitigating operational race-iness between CopyTo & Merge/Persist

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: 7.6.2
    • Affects Versions: 7.6.0, 7.2.5
    • Component: fts
    • 0

    Description

      With large data sets and (very likely) a slow merger owing to large segment sizes, CopyTo during a file-transfer rebalance can race with merge/persist activity, causing repeated CopyPartitions failures with errors like this:

      cbcollect_info_ns_1@wdp-mts1cbindex-7.ftscc.net_20240312-204215/ns_server.fts.log:2024-02-08T16:46:32.818-05:00 [ERRO] rest: error code: 500, msg: rest_pindex_streamer: WriteTo err: error copying index metadata: error backing up index snapshot: segment: /opt/couchbase/var/lib/couchbase/index/@fts/mts_gwbt_fts_792d0c0b0e9bc1f7_2d569425.pindex/store/000000019694.zap copy err: stat /opt/couchbase/var/lib/couchbase/index/@fts/mts_gwbt_fts_792d0c0b0e9bc1f7_2d569425.pindex/store/000000019694.zap: no such file or directory -- rest.ShowErrorBody() at rest.go:62
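
      The failure mode in the log above can be reproduced in miniature. The sketch below is not cbft or bleve code: copySegments, copyFile, isVanishedSegment and simulateRace are hypothetical helpers, and the file names are made up. It only shows that when a file listed for copying is removed before the copier reaches it, os.Stat fails with an error that matches fs.ErrNotExist ("no such file or directory").

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// copySegments copies each named file from srcDir to dstDir.
// The names stand in for the segment files listed by an index snapshot
// (hypothetical helper, not the real CopyTo implementation).
func copySegments(srcDir, dstDir string, names []string) error {
	for _, name := range names {
		src := filepath.Join(srcDir, name)
		if _, err := os.Stat(src); err != nil {
			return fmt.Errorf("segment: %s copy err: %w", src, err)
		}
		if err := copyFile(src, filepath.Join(dstDir, name)); err != nil {
			return fmt.Errorf("segment: %s copy err: %w", src, err)
		}
	}
	return nil
}

// copyFile copies the contents of src into a newly created dst.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// isVanishedSegment reports whether err means a source file no longer exists.
func isVanishedSegment(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

// simulateRace lists two files, removes one (as a cleanup pass might),
// and then attempts the copy with the now-stale list.
func simulateRace() error {
	srcDir, err := os.MkdirTemp("", "src")
	if err != nil {
		return err
	}
	defer os.RemoveAll(srcDir)
	dstDir, err := os.MkdirTemp("", "dst")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dstDir)

	names := []string{"000000000001.zap", "000000000002.zap"}
	for _, n := range names {
		if err := os.WriteFile(filepath.Join(srcDir, n), []byte("segment"), 0o600); err != nil {
			return err
		}
	}
	// The file disappears after it was listed but before it is copied.
	if err := os.Remove(filepath.Join(srcDir, names[1])); err != nil {
		return err
	}
	return copySegments(srcDir, dstDir, names)
}
```

      Calling simulateRace() returns an error for which isVanishedSegment is true, which is the same class of error as the stat failure reported by rest_pindex_streamer.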

      When CopyPartition fails, the operation is retried a finite number of times with exponential back-off, but it has been observed that all of these retries can sometimes fail because the merge operation takes a long time.

      Once the maximum number of retries is exhausted, file transfer for the partition is aborted and rebalance falls back to a regular DCP rebuild, which is much slower by comparison.

      This is long-standing behavior that we need to improve, perhaps by not allowing the merger to delete any files that are part of the index snapshot that CopyTo is working with.
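
      One possible shape for the idea of protecting files that a copy is still using is a pin count per file, with deletions deferred while the count is non-zero. The sketch below is only an illustration of that technique: fileGuard, Pin and Remove are hypothetical names and do not describe the API of bleve/scorch or the fix that was eventually made.

```go
package main

import "sync"

// fileGuard tracks how many in-flight copies reference each file and defers
// removal of any file that is still pinned (hypothetical type).
type fileGuard struct {
	mu      sync.Mutex
	pins    map[string]int   // file name -> number of active pins
	pending map[string]bool  // removals requested while the file was pinned
	remove  func(name string) error
}

// newFileGuard returns a guard that uses remove to actually delete files.
func newFileGuard(remove func(name string) error) *fileGuard {
	return &fileGuard{
		pins:    make(map[string]int),
		pending: make(map[string]bool),
		remove:  remove,
	}
}

// Pin marks the named files as in use. The returned function releases the
// pins and performs any removals that were deferred in the meantime; calling
// it more than once has no further effect.
func (g *fileGuard) Pin(names []string) (unpin func() error) {
	g.mu.Lock()
	for _, n := range names {
		g.pins[n]++
	}
	g.mu.Unlock()

	var once sync.Once
	var err error
	return func() error {
		once.Do(func() { err = g.release(names) })
		return err
	}
}

// release drops one pin per name and deletes files whose removal was deferred.
func (g *fileGuard) release(names []string) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	var firstErr error
	for _, n := range names {
		g.pins[n]--
		if g.pins[n] > 0 {
			continue
		}
		delete(g.pins, n)
		if g.pending[n] {
			delete(g.pending, n)
			if err := g.remove(n); err != nil && firstErr == nil {
				firstErr = err
			}
		}
	}
	return firstErr
}

// Remove deletes name immediately if it is not pinned. Otherwise it records
// the request and reports deleted == false; the file is removed on unpin.
func (g *fileGuard) Remove(name string) (deleted bool, err error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.pins[name] > 0 {
		g.pending[name] = true
		return false, nil
	}
	return true, g.remove(name)
}
```

      In this sketch a copier would call Pin with the file names of the snapshot it is about to copy and invoke the returned function when done, while the cleanup path would go through Remove. For the protection to be effective, pinning would have to happen atomically with obtaining the snapshot's file list; otherwise a file could still be removed between listing and pinning.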


      We must first establish whether this is a race with the merger, with the persister, or possibly even with how cbft uses bleve/scorch during rollback.

      Attachments

        1. goodLOGS.log (6 kB)
        2. image-2024-05-09-19-30-21-593.png (59 kB)
        3. image-2024-05-09-19-46-06-929.png (45 kB)
        4. image-2024-05-14-19-53-35-485.png (97 kB)
        5. realBig.log (724 kB)

            People

              sarthak.dua Sarthak Dua
              abhinav Abhi Dangeti
              Votes: 0
              Watchers: 9