Details
Description
See CBSE-703 for an actual (but rare) occurrence in production.
As pointed out in CBSE-703, there is a workaround:
<quote>
Regarding design documents not being propagated. It seems that there's a rare race condition there. Workaround for now is restarting responsible processes with this snippet:
curl -X POST -u Administrator:<password> http://<host>:8091/diag/eval -d 'rpc:eval_everywhere(erlang, apply, [fun () -> [exit(whereis(list_to_atom("capi_set_view_manager-" ++ B)), kill) || B <- ns_bucket:get_bucket_names(membase)] end, []]).'
It's enough to run it against one of the nodes.
</quote>
We think the issue itself happens because ns_node_disco may combine multiple node {up,down} events into a single "cumulative" ns_node_disco_events event. And if the list of nodes before and after is the same, it will not send any event at all.
That property of swallowing the aggregated event completely causes capi_set_view_manager (and its companion process that replicates replication documents) to lose nodeup events. Because those processes monitor remote processes, they always observe the down event; but if down+up collapses to nothing, they never see the corresponding up event and so never re-establish their monitors.
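The failure mode described above can be sketched in a few lines. This is an illustrative Python model, not Couchbase code: all names (`coalesce_events`, the event tuples) are hypothetical, and the point is only that comparing the before/after node *sets* swallows a down+up bounce that happened inside one batch.

```python
# Hypothetical sketch of ns_node_disco-style event coalescing.
# A batch of (node, "up"/"down") events is folded into the node set,
# and a notification is emitted only if the resulting set differs
# from the starting set.

def coalesce_events(nodes_before, events):
    """Apply a batch of (node, 'up'|'down') events; return the new node
    set plus a notification, or None if the set is unchanged."""
    nodes = set(nodes_before)
    for node, kind in events:
        if kind == "up":
            nodes.add(node)
        else:
            nodes.discard(node)
    if nodes == set(nodes_before):
        return nodes, None  # set unchanged -> event swallowed
    return nodes, ("nodes_changed", sorted(nodes))

# A node bounces (down then up) within one coalesced batch:
before = {"n0", "n1"}
_, event = coalesce_events(before, [("n1", "down"), ("n1", "up")])
print(event)  # None: the nodeup is lost, so monitors are never re-armed
```

Under this model, any listener that relies on the nodeup notification to re-monitor a remote process (as capi_set_view_manager does) stays blind after the bounce, which matches the symptom of design documents not being propagated until the process is restarted.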