Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Critical
- Affects Version: 2.5.1
- Security Level: Public
Description
Production cluster scenario that led to data loss:
- 4-node cluster running CB 2.5.1.
- Nodes A, B, and C failed over node D, but the failover message never reached node D, so node D does not know it was failed over.
- Unfortunately, SDK clients were still connecting to node D for the vb-map config, which is now incorrect.
- At this stage a set of vbuckets is active in two different places, and once a rebalance is initiated, mutations mapping to the vbuckets still active on node D are very likely to be lost.
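The dangerous state described above is two nodes each believing they hold the active copy of the same vbuckets. A minimal sketch of detecting that split, in Python (node names and the vbucket ranges below are hypothetical, chosen only to mimic a failover that node D never learned about):

```python
# Hypothetical sketch: find vbuckets that two or more nodes both claim as active.
# Node names and vbucket assignments are illustrative, not taken from the logs.

def find_double_active(vb_maps):
    """vb_maps: {node_name: set of vbucket ids the node believes it owns as active}.

    Returns {vbucket_id: set of nodes claiming it} for every contested vbucket."""
    owners = {}
    conflicts = {}
    for node, vbs in vb_maps.items():
        for vb in vbs:
            if vb in owners and owners[vb] != node:
                conflicts.setdefault(vb, {owners[vb]}).add(node)
            else:
                owners[vb] = node
    return conflicts

# Node D was failed over but never heard about it, so it still claims
# vbuckets 512-1023 that nodes A and B have since taken over.
cluster_view = {
    "node-A": set(range(0, 768)),
    "node-B": set(range(768, 1024)),
    "node-D": set(range(512, 1024)),  # stale view: believes it was never failed over
}
conflicts = find_double_active(cluster_view)
```

Every vbucket in 512-1023 ends up contested; a client that happens to fetch its vb-map from node D will route mutations to copies that the rest of the cluster no longer considers active.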
Sample data from the logs to back this up (entries elided in the originals are marked with `...`):
World according to 10.100.0.46 (node D, which was failed over):
```
{incoming_replications_conf_hashes,
 [{"services.z3",
   [{'ns_1@10.100.0.42',74343578},
    {'ns_1@10.100.0.43',66117516}]},
  {"default",
   [{'ns_1@10.100.0.41',115455469},
    ...]},
  {"indigo-session",
   [{'ns_1@10.100.0.42',42364148},
    {'ns_1@10.100.0.43',122578952}]},
  {"services.z2",
   [{'ns_1@10.100.0.41',124324038},
    {'ns_1@10.100.0.43',122578952}]},
  {"indigo",
   [{'ns_1@10.100.0.41',48623555},
    {'ns_1@10.100.0.42',124811468},
    {'ns_1@10.100.0.43',104172491}]},
  {"services",
   [{'ns_1@10.100.0.41',124324038},
    {'ns_1@10.100.0.42',42364148},
    ...]}]},
```
World according to 10.100.0.42 (the view is similar on nodes A, B, and C):
```
{incoming_replications_conf_hashes,
 [{"services.z3",
   [...]},
  {"default",
   [{'ns_1@10.100.0.41',63506121},
    {'ns_1@10.100.0.43',117606290}]},
  {"indigo-session",
   [...]},
  {"services.z2",
   [{'ns_1@10.100.0.41',101019889},
    {'ns_1@10.100.0.43',114028661}]},
  {"indigo",
   [{'ns_1@10.100.0.41',51937030},
    {'ns_1@10.100.0.43',131953060}]},
  {"services",
   [{'ns_1@10.100.0.41',101019889},
    {'ns_1@10.100.0.43',114028661}]}]},
```
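The point of the two excerpts is that node D and the rest of the cluster no longer agree on any bucket's replication configuration. Diffing the per-bucket hashes makes that concrete; the sketch below transcribes two buckets' hashes from the excerpts above (elided entries omitted) and compares the views:

```python
# Compare per-bucket incoming_replications_conf_hashes between two node views.
# Hash values are transcribed from the log excerpts above; elided entries omitted.

node_d_view = {
    "services.z2": {"ns_1@10.100.0.41": 124324038, "ns_1@10.100.0.43": 122578952},
    "indigo": {"ns_1@10.100.0.41": 48623555,
               "ns_1@10.100.0.42": 124811468,
               "ns_1@10.100.0.43": 104172491},
}
node_42_view = {
    "services.z2": {"ns_1@10.100.0.41": 101019889, "ns_1@10.100.0.43": 114028661},
    "indigo": {"ns_1@10.100.0.41": 51937030, "ns_1@10.100.0.43": 131953060},
}

def disagreements(a, b):
    """Return the buckets whose replication-config hashes differ between two views."""
    return sorted(bucket for bucket in set(a) & set(b) if a[bucket] != b[bucket])

print(disagreements(node_d_view, node_42_view))
```

Both shared buckets disagree: node D still counts 10.100.0.42 among its replication sources and carries pre-failover hashes, while 10.100.0.42's view reflects the post-failover config. Any client reading its vb-map from node D is working from a topology the rest of the cluster has already abandoned.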