https://issues.couchbase.com/browse/MB-41379 introduced panic'ing as a last resort when an error scenario couldn't be handled.
We should revisit this error handling scenario in the case where manifests can't be established and replication can't start.
Most of the time, the panic happens because the target cluster/nodes are very busy and slow to respond to the source nodes with the target bucket manifest.
When this happens, XDCR cannot establish a starting manifest and cannot hold up the replication spec creation callback, so the panic was introduced.
In the customer situation, that is likely the case.
One way to solve this would be to avoid having each node make the RPC call to the target to retrieve the target manifest, and instead do the following:
- When a node (i.e. chosen master) is tasked to create a replication, it performs manifest retrieval as part of the replication creation.
- As part of the replication creation, it actually sends the manifest (tagged with internal replication ID for uniqueness), via p2p, to all the peer nodes.
- The peer nodes, since they are not processing the replication add, can receive the target manifest and save it to the collection manifest service's temporary storage.
- If the p2p fails, we can consider if the spec creation should fail too for correctness or think about error handling here.
- The master node then persists the replication spec. The other peer nodes receive the spec via metakv.
- The peer nodes would have received the target bucket manifest, and wouldn't have needed to pull it from the target node(s).
- (Currently, without this proposal, each source node would need to pull its own target bucket manifest independently during the repl spec callback. So if timeout occurs, the callback hangs and panic needs to be induced)
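The proposed flow can be sketched roughly as below. This is only an illustration of the ordering (fetch once on the master, push to peers over p2p, fail spec creation if a push fails, then persist the spec). All names here (`Manifest`, `Peer`, `ReceiveManifest`, `OnSpecCreated`, `CreateReplication`) are hypothetical and are not existing goxdcr APIs; p2p and metakv are modeled as plain function calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Manifest is a hypothetical stand-in for a target bucket collection manifest.
type Manifest struct {
	UID uint64
}

// Peer is a hypothetical model of a source-cluster peer node.
type Peer struct {
	Name      string
	Reachable bool // false simulates a p2p failure (e.g. network partition)
	// tempManifests models the manifest service's temporary storage,
	// keyed by the internal replication ID.
	tempManifests map[string]Manifest
}

func NewPeer(name string, reachable bool) *Peer {
	return &Peer{Name: name, Reachable: reachable, tempManifests: map[string]Manifest{}}
}

// ReceiveManifest models the p2p push from the master to a peer.
func (p *Peer) ReceiveManifest(replID string, m Manifest) error {
	if !p.Reachable {
		return fmt.Errorf("p2p send to %s failed", p.Name)
	}
	p.tempManifests[replID] = m
	return nil
}

// OnSpecCreated models the repl spec callback on a peer. It uses the
// pre-distributed manifest instead of pulling from the target node(s).
func (p *Peer) OnSpecCreated(replID string) (Manifest, error) {
	m, ok := p.tempManifests[replID]
	if !ok {
		return Manifest{}, errors.New("no pre-distributed manifest for " + replID)
	}
	delete(p.tempManifests, replID) // temporary storage is no longer needed
	return m, nil
}

// CreateReplication models the master-side flow. The spec is persisted only
// after every peer has received the target manifest.
func CreateReplication(replID string, fetchTarget func() (Manifest, error),
	peers []*Peer, persistSpec func(replID string) error) error {
	m, err := fetchTarget() // the only RPC to the (far, slow) target
	if err != nil {
		return fmt.Errorf("target manifest retrieval failed: %w", err)
	}
	for _, p := range peers {
		if err := p.ReceiveManifest(replID, m); err != nil {
			// Assumption: fail spec creation for correctness.
			return fmt.Errorf("aborting spec creation: %w", err)
		}
	}
	return persistSpec(replID) // peers then see the spec via metakv
}

func main() {
	peers := []*Peer{NewPeer("node1", true), NewPeer("node2", true)}
	fetch := func() (Manifest, error) { return Manifest{UID: 7}, nil }
	persist := func(string) error { return nil }

	if err := CreateReplication("repl-1", fetch, peers, persist); err != nil {
		fmt.Println("create failed:", err)
		return
	}
	for _, p := range peers {
		m, err := p.OnSpecCreated("repl-1")
		fmt.Println(p.Name, m.UID, err)
	}
}
```

The sketch aborts on the first p2p failure, which is one of the two options raised above. It does not show cleanup of temporary manifests already stored on peers that were reached before the failure, nor any mixed-mode handling.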
This design takes into account that source nodes are local to each other, while target nodes are far away with high latency. The trade-off is more complexity introduced during replication creation.
We need to take into account mixed mode, etc.
One situation is where the source cluster has a network partition. In that case, replication creation could fail since P2P will fail. That's probably not a huge issue because, with a network partition, we don't want to create a replication anyway (and have metakv unable to send the spec to a subset of nodes, leading to missed data/replication).
Another situation is P2P being too busy to respond promptly, but that should sort itself out over time.