Details
- Improvement
- Resolution: Fixed
- Critical
- 7.2.5
- 0
Description
With large data sets and (very likely) a slow merger owing to large segment sizes, the CopyTo used by file-transfer rebalance can race, causing repeated CopyPartitions failures with errors of this kind:
cbcollect_info_ns_1@wdp-mts1cbindex-7.ftscc.net_20240312-204215/ns_server.fts.log:2024-02-08T16:46:32.818-05:00 [ERRO] rest: error code: 500, msg: rest_pindex_streamer: WriteTo err: error copying index metadata: error backing up index snapshot: segment: /opt/couchbase/var/lib/couchbase/index/@fts/mts_gwbt_fts_792d0c0b0e9bc1f7_2d569425.pindex/store/000000019694.zap copy err: stat /opt/couchbase/var/lib/couchbase/index/@fts/mts_gwbt_fts_792d0c0b0e9bc1f7_2d569425.pindex/store/000000019694.zap: no such file or directory -- rest.ShowErrorBody() at rest.go:62
When CopyPartition fails, the operation is retried a finite number of times with exponential back-off, but it has been observed that all of these retries can sometimes fail because the merge operation takes a long time.
Once the maximum number of retries is exhausted, file transfer for the partition is aborted and rebalance falls back to a regular DCP rebuild, which is much slower by comparison.
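The retry-then-fall-back behavior described above can be sketched roughly as follows. This is a minimal illustration, not cbft's actual code: `copyWithRetry`, `transferPartition`, `errSegmentMissing`, and the attempt and back-off parameters are all hypothetical names and values chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSegmentMissing stands in for the "no such file or directory" failure
// seen when a segment file disappears mid-copy. Hypothetical name.
var errSegmentMissing = errors.New("segment file missing during copy")

// copyWithRetry calls copyFn up to maxAttempts times, doubling the wait
// after each failure. It returns the number of attempts made and the last
// error (nil on success). Names and parameters are illustrative only.
func copyWithRetry(copyFn func() error, maxAttempts int, initialBackoff time.Duration) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	backoff := initialBackoff
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = copyFn(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return maxAttempts, err
}

// transferPartition tries the file copy first and falls back to a rebuild
// only once every retry has failed. It reports which path was taken.
func transferPartition(copyFn, rebuildFn func() error, maxAttempts int, initialBackoff time.Duration) (string, error) {
	if _, err := copyWithRetry(copyFn, maxAttempts, initialBackoff); err == nil {
		return "file-transfer", nil
	}
	return "dcp-rebuild", rebuildFn()
}

func main() {
	calls := 0
	// Simulated copy that fails twice (merge still running), then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errSegmentMissing
		}
		return nil
	}
	path, err := transferPartition(flaky, func() error { return nil }, 5, time.Millisecond)
	fmt.Println(path, calls, err)
}
```

If the merge outlasts the total back-off window, every attempt sees the same missing file and the slow path is taken, which is the failure mode this issue describes.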
This is long-standing behavior that we need to improve, perhaps by not allowing the merger to delete any files that are part of the index snapshot that CopyTo is working with.
We must first establish whether this is a race with the merger, with the persister, or perhaps with how cbft uses bleve/scorch during rollback.
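One way to picture the proposed mitigation (keeping files alive while a copy is using them) is a small reference-counting guard that defers deletion until the copy releases its files. This is only a sketch under assumptions: `fileGuard` and its methods are hypothetical and are not the bleve/scorch API, and the root cause still needs to be confirmed as noted above.

```go
package main

import (
	"fmt"
	"sync"
)

// fileGuard tracks how many in-flight copies reference each file and which
// files have had their deletion deferred. Hypothetical type, not a scorch API.
type fileGuard struct {
	mu      sync.Mutex
	refs    map[string]int
	pending map[string]bool
	remove  func(name string) // actual deletion, e.g. a wrapper around os.Remove
}

func newFileGuard(remove func(string)) *fileGuard {
	return &fileGuard{refs: map[string]int{}, pending: map[string]bool{}, remove: remove}
}

// Acquire pins the files of a snapshot for the duration of a copy.
func (g *fileGuard) Acquire(files []string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for _, f := range files {
		g.refs[f]++
	}
}

// Release unpins the files and performs any deletions that were deferred
// while they were pinned.
func (g *fileGuard) Release(files []string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for _, f := range files {
		if g.refs[f] == 0 {
			continue // unbalanced Release; ignore
		}
		g.refs[f]--
		if g.refs[f] == 0 {
			delete(g.refs, f)
			if g.pending[f] {
				delete(g.pending, f)
				g.remove(f)
			}
		}
	}
}

// RequestRemove is what a merger would call instead of deleting directly.
// It reports whether the file was removed immediately.
func (g *fileGuard) RequestRemove(name string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.refs[name] > 0 {
		g.pending[name] = true
		return false
	}
	g.remove(name)
	return true
}

func main() {
	var removed []string
	g := newFileGuard(func(name string) { removed = append(removed, name) })
	snapshot := []string{"000000019694.zap"}

	g.Acquire(snapshot)                                    // copy starts
	fmt.Println(g.RequestRemove("000000019694.zap"), removed) // deletion deferred
	g.Release(snapshot)                                    // copy ends, deletion runs
	fmt.Println(removed)
}
```

The trade-off in such a design is that obsolete segment files occupy disk for as long as a copy is running, so a slow transfer delays space reclamation.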
Issue Links
- is a backport of
  - MB-61421 Mitigating operational race-iness between CopyTo & Merge/Persist (Closed)
- is duplicated by
  - MB-61605 [System Test] :- WriteTo err: error copying index metadata: error backing up index snapshot (Closed)
  - MB-61682 [FTS] Error copying index metadata, error backing up index snapshot during rebalance (Closed)
- relates to
  - MB-61905 [System Test] FTS service exited with OOM error while running queries (Closed)