Details

Type: Improvement
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.2.6
Affects Version/s: 7.2.5
Component/s: fts
Labels: approved-for-7.2.6
Story Points: 0

Description

With large data sets and (very likely) slow merges owing to large segment sizes, file transfer rebalance's CopyTo can race, causing repeated failures of CopyPartitions with errors of this kind:

cbcollect_info_ns_1@wdp-mts1cbindex-7.ftscc.net_20240312-204215/ns_server.fts.log:2024-02-08T16:46:32.818-05:00 [ERRO] rest: error code: 500, msg: rest_pindex_streamer: WriteTo err: error copying index metadata: error backing up index snapshot: segment: /opt/couchbase/var/lib/couchbase/index/@fts/mts_gwbt_fts_792d0c0b0e9bc1f7_2d569425.pindex/store/000000019694.zap copy err: stat /opt/couchbase/var/lib/couchbase/index/@fts/mts_gwbt_fts_792d0c0b0e9bc1f7_2d569425.pindex/store/000000019694.zap: no such file or directory -- rest.ShowErrorBody() at rest.go:62

On CopyPartition failures, the operation is retried a finite number of times with exponential back-off, but it has been observed that all of these retries can sometimes fail because the merge operation takes a long time.

Upon exhausting the maximum number of retries, file transfer for the partition is aborted and rebalance falls back to a regular DCP rebuild, which is much slower by comparison.

This is long-standing behavior that we need to try to improve, perhaps by not allowing the merger to delete any files that are part of the index snapshot CopyTo is working with.
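
One way to picture the proposed mitigation is a guard that pins the files of a snapshot while a copy is in flight and defers their deletion until the copy finishes. The sketch below is an assumption-laden illustration, not the bleve/scorch implementation or the fix that was merged: `fileGuard`, `Pin`, `Unpin`, and `TryRemove` are hypothetical names invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// fileGuard is a hypothetical tracker (not a bleve/scorch type).
type fileGuard struct {
	mu      sync.Mutex
	pinned  map[string]int  // file name -> number of in-flight copies using it
	pending map[string]bool // files whose deletion was deferred
}

func newFileGuard() *fileGuard {
	return &fileGuard{pinned: map[string]int{}, pending: map[string]bool{}}
}

// Pin marks every file of a snapshot as in use by a copy.
func (g *fileGuard) Pin(files []string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for _, f := range files {
		g.pinned[f]++
	}
}

// TryRemove reports whether the file may be deleted right now.
// If a copy still holds it, the deletion is recorded as deferred.
func (g *fileGuard) TryRemove(file string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.pinned[file] > 0 {
		g.pending[file] = true
		return false
	}
	return true
}

// Unpin releases a snapshot's files and returns those whose deferred
// deletion can now proceed, in the order they were passed in.
func (g *fileGuard) Unpin(files []string) []string {
	g.mu.Lock()
	defer g.mu.Unlock()
	var removable []string
	for _, f := range files {
		g.pinned[f]--
		if g.pinned[f] <= 0 {
			delete(g.pinned, f)
			if g.pending[f] {
				delete(g.pending, f)
				removable = append(removable, f)
			}
		}
	}
	return removable
}

func main() {
	g := newFileGuard()
	snapshot := []string{"000000019694.zap", "000000019695.zap"}

	g.Pin(snapshot) // copy starts
	fmt.Println("merger may delete now:", g.TryRemove("000000019694.zap"))
	fmt.Println("deletable after copy:", g.Unpin(snapshot)) // copy ends
}
```

With a guard of this shape, the copy never observes a segment file disappearing underneath it, at the cost of keeping obsolete files on disk slightly longer.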

We must first establish whether this is a race with the merger or the persister, or perhaps with how cbft uses bleve/scorch during rollback.

Attachments

Issue Links

is a backport of

MB-61421 Mitigating operational race-iness between CopyTo & Merge/Persist

Closed

is duplicated by

MB-61605 [System Test] :- WriteTo err: error copying index metadata: error backing up index snapshot

Closed

MB-61682 [FTS] Error copying index metadata, error backing up index snapshot during rebalance

Closed

relates to

MB-61905 [System Test] FTS service exited with OOM error while running queries

Closed

Activity

People

Assignee: Sarthak Dua

Reporter: Abhi Dangeti

Votes: 0

Watchers: 3

Dates

Due: 17/May/24

Created: 03/Jun/24 8:46 AM

Updated: 20/Aug/24 9:04 PM

Resolved: 25/Jun/24 11:32 AM

Gerrit Reviews

There are no open Gerrit changes

There are 4 closed Gerrit changes

MB-62156: Upgrade bleve/v2@7.2-couchbase; also add missing return: Gerrit Review:

MB-62156: go mod tidy: Gerrit Review:

MB-62156: go mod tidy: Gerrit Review:

MB-62156: go mod tidy: Gerrit Review:

[Backport] Mitigating operational race-iness between CopyTo & Merge/Persist
