XDCR - "panic: Unable to establish starting manifests" should be revisited

Description

https://couchbasecloud.atlassian.net/browse/MB-41379 introduced panicking as a last resort for an error scenario that could not otherwise be handled.
We should revisit this error handling for the case where starting manifests cannot be established and replication cannot start.

Update 6/23:
Most of the time, the panic occurs because the target cluster/nodes are very busy and slow to respond to the source nodes with the target bucket manifest.
When this happens, XDCR cannot establish the starting manifest, and it cannot hold up the replication spec creation callback indefinitely, so the panic was introduced.

In the customer situation, that is likely the case.
One way to solve this would be to avoid the RPC call to the target to retrieve the target manifest during the callback, and instead:

  1. When a node (i.e. the chosen master) is tasked with creating a replication, it performs the manifest retrieval as part of the replication creation.

  2. As part of the replication creation, it sends the manifest (tagged with the internal replication ID for uniqueness) to all the peer nodes via p2p.

  3. Since the peer nodes are not processing the replication add, they can receive the target manifest and save it to the collections manifest service's temporary storage.

  4. If the p2p push fails, we should consider whether the spec creation should also fail for correctness, or design appropriate error handling here.

  5. The master node then persists the replication spec. The other peer nodes receive the spec via metakv.

  6. The peer nodes will already have received the target bucket manifest, so they do not need to pull it from the target node(s).

  7. (Currently, without this proposal, each source node pulls its own target bucket manifest independently during the replication spec callback. If a timeout occurs, the callback hangs and the panic has to be induced.)
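The steps above can be sketched roughly as follows. This is a hedged illustration, not the goxdcr implementation; the types and methods (manifestCache, receiveManifest, onSpecFromMetakv) are invented for the sketch:

```go
package main

import (
	"fmt"
	"sync"
)

// Manifest stands in for a target bucket collections manifest.
type Manifest struct{ UID uint64 }

// manifestCache models the collections manifest service's temporary
// storage on each peer, keyed by the internal replication ID.
type manifestCache struct {
	mu   sync.Mutex
	byID map[string]*Manifest
}

func newManifestCache() *manifestCache {
	return &manifestCache{byID: make(map[string]*Manifest)}
}

func (c *manifestCache) Put(replID string, m *Manifest) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.byID[replID] = m
}

func (c *manifestCache) Get(replID string) (*Manifest, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	m, ok := c.byID[replID]
	return m, ok
}

// peer models a source node that receives the manifest over p2p before
// the replication spec arrives via metakv.
type peer struct {
	name  string
	cache *manifestCache
}

// receiveManifest is the p2p push handler (step 3).
func (p *peer) receiveManifest(replID string, m *Manifest) {
	p.cache.Put(replID, m)
}

// onSpecFromMetakv is the metakv callback (steps 5-6): the manifest is
// already local, so no RPC to the target is needed.
func (p *peer) onSpecFromMetakv(replID string) error {
	m, ok := p.cache.Get(replID)
	if !ok {
		return fmt.Errorf("%s: no cached manifest for %s", p.name, replID)
	}
	fmt.Printf("%s: starting replication %s with manifest UID %d\n", p.name, replID, m.UID)
	return nil
}

func main() {
	// Master fetches the target manifest once (step 1)...
	manifest := &Manifest{UID: 7}
	replID := "repl-internal-1234" // hypothetical internal replication ID

	peers := []*peer{
		{name: "peerA", cache: newManifestCache()},
		{name: "peerB", cache: newManifestCache()},
	}
	// ...and pushes it to all peers via p2p (step 2).
	for _, p := range peers {
		p.receiveManifest(replID, manifest)
	}
	// Master persists the spec; peers learn of it via metakv (step 5)
	// and find the manifest already cached (step 6).
	for _, p := range peers {
		if err := p.onSpecFromMetakv(replID); err != nil {
			panic(err)
		}
	}
}
```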

This design takes into account that source nodes are local to each other, while target nodes are remote with high latency. The trade-off is more complexity introduced during replication creation.

We also need to take mixed-mode clusters, etc., into account.

One situation is where the source cluster has a network partition, so replication creation could fail because p2p will fail. That is probably not a huge issue: with a network partition, we don't want to create a replication anyway (otherwise metakv may be unable to deliver the spec to a subset of nodes, leading to missed data/replication).
Another situation is a peer being too busy to respond to p2p, but that should sort itself out over time.
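For the busy-peer case, a bounded retry with backoff on the p2p push, falling back to failing the spec creation (step 4), could look like the following sketch. This is an assumption-laden illustration, not actual goxdcr behavior; pushWithBackoff and makeFlakyPush are invented names:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// makeFlakyPush returns a stand-in for the p2p send of the manifest to
// one peer; it fails a fixed number of times to simulate a busy peer.
func makeFlakyPush(failures int) func() error {
	return func() error {
		if failures > 0 {
			failures--
			return errors.New("peer busy")
		}
		return nil
	}
}

// pushWithBackoff retries the p2p push with exponential backoff; if all
// attempts fail, the caller can fail the spec creation for correctness.
func pushWithBackoff(push func() error, attempts int, base time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = push(); err == nil {
			return nil
		}
		time.Sleep(base << i) // base, 2x base, 4x base, ...
	}
	return fmt.Errorf("p2p push failed after %d attempts: %w", attempts, err)
}

func main() {
	// Busy peer that recovers after two failed attempts: delivery succeeds.
	if err := pushWithBackoff(makeFlakyPush(2), 4, time.Millisecond); err != nil {
		panic(err)
	}
	fmt.Println("manifest delivered")
	// Partitioned peer: all attempts fail, so spec creation should fail too.
	if err := pushWithBackoff(makeFlakyPush(10), 3, time.Millisecond); err != nil {
		fmt.Println("spec creation rejected:", err)
	}
}
```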

Components

Fix versions

Labels

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity

Ayush Nayyar February 4, 2024 at 3:07 PM

Verified on 7.6.0-2054.

CB robot August 31, 2023 at 1:40 PM

Build couchbase-server-8.0.0-1391 contains goxdcr commit bd57ea9 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-57459#icft=MB-57459: Share manfiests as part of replication creation

CB robot August 31, 2023 at 8:44 AM

Build capella-analytics-1.0.0-1006 contains goxdcr commit bd57ea9 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-57459#icft=MB-57459: Share manfiests as part of replication creation

CB robot August 31, 2023 at 1:17 AM

Build couchbase-server-7.6.0-1436 contains goxdcr commit bd57ea9 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-57459#icft=MB-57459: Share manfiests as part of replication creation

CB robot August 31, 2023 at 12:26 AM

Build couchbase-server-7.5.0-4713 contains goxdcr commit bd57ea9 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-57459#icft=MB-57459: Share manfiests as part of replication creation

Fixed

Details

Assignee

Reporter

Is this a Regression?

Unknown

Triage

Untriaged

Issue Impact

external

Story Points

Priority

Created June 20, 2023 at 8:36 PM
Updated March 21, 2025 at 2:50 AM
Resolved August 31, 2023 at 3:24 AM