XDCR - "panic: Unable to establish starting manifests" should be revisited

Description

https://couchbasecloud.atlassian.net/browse/MB-41379 introduced panic'ing as a last resort for an error scenario that couldn't otherwise be handled.
We should revisit the error handling for the case where manifests can't be established and replication can't start.

Update 6/23:
Most of the time, the panic occurs because the target cluster/nodes are very busy and slow to respond to the source nodes with the target bucket manifest.
When this happens, XDCR cannot establish the starting manifest and cannot hold up the replication spec creation callback, which is why the panic was introduced.

In the customer situation, that is likely the case.
One way to solve this would be to avoid each node's RPC call to the target to retrieve the target manifest, and instead do the following:

When a node (i.e. the chosen master) is tasked with creating a replication, it retrieves the target manifest as part of replication creation.
Also as part of replication creation, it sends the manifest (tagged with the internal replication ID for uniqueness) via p2p to all the peer nodes.
Since the peer nodes are not processing the replication add, they can receive the target manifest and save it to temporary storage in the collection manifest service.
If the p2p send fails, we need to decide whether the spec creation should fail too for correctness, or think about other error handling here.
The master node then persists the replication spec. The other peer nodes receive the spec via metakv.
By then, the peer nodes will already have received the target bucket manifest and will not need to pull it from the target node(s).
(Currently, without this proposal, each source node needs to pull its own copy of the target bucket manifest independently during the repl spec callback. So if a timeout occurs, the callback hangs and a panic needs to be induced.)

This design takes into account that source nodes are local to each other while target nodes are far away with high latency. The trade-off is more complexity during replication creation.

We need to take into account mixed mode, etc.

One situation is where the source cluster has a network partition. Replication creation could then fail since P2P will fail. That's probably not a huge issue because with a network partition, we don't want to create a replication anyway (and have metakv unable to deliver the spec to a subset of nodes, leading to missed data/replication).
Another situation is P2P being slow to respond because it is busy, but that should sort itself out over time.

Labels

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Linked issues

causes

MB-58494

XDCR - panic when replicating to legacy target cluster

Activity

Ayush Nayyar February 4, 2024 at 3:07 PM

Verified on 7.6.0-2054.

CB robot August 31, 2023 at 1:40 PM

Build couchbase-server-8.0.0-1391 contains goxdcr commit bd57ea9 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-57459#icft=MB-57459: Share manfiests as part of replication creation

CB robot August 31, 2023 at 8:44 AM

Build capella-analytics-1.0.0-1006 contains goxdcr commit bd57ea9 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-57459#icft=MB-57459: Share manfiests as part of replication creation

CB robot August 31, 2023 at 1:17 AM

Build couchbase-server-7.6.0-1436 contains goxdcr commit bd57ea9 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-57459#icft=MB-57459: Share manfiests as part of replication creation

CB robot August 31, 2023 at 12:26 AM

Build couchbase-server-7.5.0-4713 contains goxdcr commit bd57ea9 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-57459#icft=MB-57459: Share manfiests as part of replication creation

Fixed

Details
Assignee
Ayush Nayyar
Reporter
Neil Huang
Is this a Regression?
Unknown
Triage
Untriaged
Issue Impact
external
Story Points
0
Priority
Critical
Created June 20, 2023 at 8:36 PM

Updated March 21, 2025 at 2:50 AM

Resolved August 31, 2023 at 3:24 AM
