I'm filing this as a bug because my current thinking is that it needs to be fixed, but it could also be considered an improvement.
In any case, I hit the issue this ticket describes when I was running some tests that created and dropped scopes and collections in quick succession.
I tried to create collection c_0 in scope s1 in bucket b_3487 and it failed, stating that c_0 already exists:
You can see it's operating on manifest 34.
However, collection c_0 was dropped from the manifest 90 milliseconds earlier on n_1. (Note that these nodes are all running on the same machine, so the timestamps are pretty comparable.)
This was also against manifest 34 and it clearly succeeded.
This is a 3-node cluster, so a collection manifest update only needs to reach 2 nodes before the change is considered committed. That means it's possible for the third node not to have received the update by the time the next manifest change request arrives at it.
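To make the race concrete, here is a toy Python model of it. This is only a sketch and not how ns_server replicates the manifest: the `Node` class, the `commit_drop` helper, the node names n_0 and n_2, and the assumption that a successful drop bumps the manifest uid by one are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical per-node view of the collection manifest."""
    name: str
    uid: int = 34  # every node starts on manifest 34, as in this ticket
    collections: set = field(default_factory=lambda: {"c_0"})


def commit_drop(nodes, collection, reachable):
    """Apply a drop on the reachable nodes only.

    Returns True when a majority of the cluster acknowledged the change,
    which is all that is needed for it to count as committed.
    """
    majority = len(nodes) // 2 + 1
    acked = 0
    for node in nodes:
        if node.name in reachable:
            node.collections = node.collections - {collection}
            node.uid += 1  # assumption: a drop produces the next manifest uid
            acked += 1
    return acked >= majority


nodes = [Node("n_0"), Node("n_1"), Node("n_2")]

# The drop reaches 2 of the 3 nodes, so it commits, but n_2 lags behind.
committed = commit_drop(nodes, "c_0", reachable={"n_0", "n_1"})

# A create request that lands on n_2 is checked against the stale manifest.
stale = nodes[2]
create_rejected = "c_0" in stale.collections

print(committed, stale.uid, create_rejected)
```

In this model the drop is committed, yet a create that lands on the lagging node is still rejected because that node's local manifest continues to list c_0.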
This wouldn't have happened if the client I used (the Java client) sent the collection changes to the same node every time. However, on occasion the client will need to switch servers for these kinds of requests due to failover, etc., so I don't believe changing the client to always target the same server node is a principled fix for this issue.
My current view is that the way to address this behavior is to do a quorum read on the manifest before performing the checks. I think this is quite a bit nicer for users, and changes to the manifest should in general be infrequent enough that we can afford the quorum read.
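As a rough illustration of what I mean by a quorum read, here is a minimal Python sketch. It assumes that manifests carry a monotonically increasing uid and that the freshest manifest among a majority of nodes reflects the latest committed change (any majority overlaps with the majority that acknowledged the write). The names `Manifest`, `quorum_read` and `can_create` are hypothetical and are not existing ns_server functions.

```python
from typing import FrozenSet, List, NamedTuple


class Manifest(NamedTuple):
    """Hypothetical snapshot of a node's collection manifest."""
    uid: int
    collections: FrozenSet[str]


def quorum_read(replies: List[Manifest], cluster_size: int) -> Manifest:
    """Return the freshest manifest among replies from a majority of nodes."""
    majority = cluster_size // 2 + 1
    if len(replies) < majority:
        raise RuntimeError("not enough replies for a quorum read")
    # The highest uid wins; a stale local copy can no longer decide the check.
    return max(replies, key=lambda m: m.uid)


def can_create(collection: str, replies: List[Manifest], cluster_size: int) -> bool:
    """Run the 'already exists' check against the quorum-read manifest."""
    return collection not in quorum_read(replies, cluster_size).collections


# The node handling the request still has manifest 34 with c_0 in it,
# while one other node has already applied the drop.
local = Manifest(uid=34, collections=frozenset({"c_0"}))
peer = Manifest(uid=35, collections=frozenset())

print(can_create("c_0", [local, peer], cluster_size=3))
```

With only the local manifest the check would reject the create; once the reply from a second node is included, the check runs against the newer manifest and the create is allowed.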
Alternatively, we could add some kind of read-consistency option to the collection / scope management REST APIs, though even in this case I think the default should be a quorum read.
I am interested in people's opinions on this topic.
For Gerrit Dashboard: MB-48063
MB-48063: Do quorum read on manifest for collection update