Review Input: Sergey:
Hi Karen. Good work, but I have some comments:
1. Could you please make sure that the code snippets are well formatted
(I'm talking about indentation, e.g. on page 60)?
[FIXED] 2. page 61
The advantage of doing sequential read is that your client
only sends a single get request over the network and will
only need to store one response in memory. The disadvantage
is that during rebalance, replicated data can move to
another node, which means your client then has to reload
cluster topology if it needs to reattempt the replica read.
I'd rather replace "sends a single get request over the network" with
"makes a single API call and lets the library handle failures",
because what libcouchbase (and therefore the libraries derived from it)
effectively does can be described with the following pseudocode:
    N = get_number_of_replicas_in_the_cluster()
    value = NULL
    idx = 0
    while idx < N do
        ret = get_replica(key, idx)
        if ret == OK
            value = get_value(ret)
            break
        else if ret == NOT_MY_VBUCKET
            /* configuration has been changed: start over from replica 0 */
            idx = 0
        else
            idx = idx + 1
    /* will return OK and value or last error and NULL */
The code above demonstrates a sequential read. It is clearly the
handiest approach from the perspective of the user's application, and
also the most reliable one. Its main disadvantage is that it can
involve a lot of calls to the server. Consider the following log for a
cluster with 3 replicas:
get_replica("foo", 0) --> NOT_FOUND
get_replica("foo", 1) --> NOT_FOUND
/* rebalance occurred and the replica has been moved */
get_replica("foo", 2) --> NOT_MY_VBUCKET
get_replica("foo", 0) --> OK
You can see here that even though N = 3, there may be cases
where the library needs to issue more requests to make sure that it tried
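To make the request count concrete, here is a minimal Python sketch that replays the log above against a scripted fake cluster. It is not libcouchbase code: `read_first_replica`, `scripted_cluster`, the status strings, and the `max_requests` safety cap are all hypothetical names introduced for illustration.

```python
OK, NOT_FOUND, NOT_MY_VBUCKET = "OK", "NOT_FOUND", "NOT_MY_VBUCKET"


def read_first_replica(get_replica, key, num_replicas, max_requests=16):
    """Try replicas in order; restart from index 0 when the topology changes.

    get_replica(key, idx) is a hypothetical callable returning (status, value).
    Returns (status, value, log), where log lists every request issued.
    max_requests is an assumed cap so that repeated rebalances cannot loop forever.
    """
    log = []
    status, value = NOT_FOUND, None
    idx = 0
    while idx < num_replicas and len(log) < max_requests:
        status, value = get_replica(key, idx)
        log.append((idx, status))
        if status == OK:
            break
        if status == NOT_MY_VBUCKET:
            idx = 0  # configuration changed: start over from the first replica
        else:
            idx += 1
    return status, (value if status == OK else None), log


def scripted_cluster(responses):
    """Fake cluster that replays a fixed list of (status, value) replies."""
    replies = iter(responses)
    return lambda key, idx: next(replies)


# Replay the log: two misses, a rebalance, then a hit on replica 0.
cluster = scripted_cluster(
    [(NOT_FOUND, None), (NOT_FOUND, None), (NOT_MY_VBUCKET, None), (OK, "bar")]
)
status, value, log = read_first_replica(cluster, "foo", num_replicas=3)
print(status, value, len(log))  # OK bar 4
```

With three replicas the caller still makes one API call, but the library ends up issuing four requests.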
[FIXED. Changed to "Because this approach takes the first instance of replicated data it finds on a node, it may not be the most current version in the cluster."] 3. page 61
Also, because this approach takes the first instance of
replicated data it finds on a node, it can also mean that
this instance is the most current instance in the cluster.
I don't think we should recommend that people rely on the cluster
replicating sequentially, because in the future replication could be
parallelized, and then you cannot say that the first replica is the
most recent (for example, when the third hasn't been updated yet).
4. page 61
[FIXED] Changed to " The advantage of this approach is you can control the number of replica reads with this method. For example if you know there are three nodes with replica data you can only ask the first two and do so in parallel from your client. The disadvantage is that your code needs to check the return codes from each node and handle them. "
The advantage of this approach is that you can get a
specific instance of the replicated data by specifying the
node. Like a sequential replica read, your client only
sends a single request and will only need to store a single
response in memory. The disadvantage is also the same as
sequential replica read; if the replicated item moves to
another node during rebalance, your client must get the
cluster topology again.
The real advantage here is that the user can control the number of
replica requests. For example, they may know that there are three
replicas in the cluster but ask only the first two, and those requests
can also be pipelined by the SDK. The disadvantage is that the caller
must check the return codes and handle them. Of the three strategies,
SELECT is the most basic: it can be used to implement the others
(FIRST and ALL) and more besides, such as "only one or two replicas".
This strategy is about controlling latency when you are handling the
exceptional situation where the master node isn't able to serve your query.
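As a rough illustration of SELECT being the building block, the Python sketch below layers FIRST and ALL on top of a single select-style helper. The names `read_selected`, `read_first`, `read_all`, and `fake_get_replica` are hypothetical and do not correspond to any SDK API. The sketch deliberately leaves NOT_MY_VBUCKET handling out, which is exactly the return-code checking the caller takes on with this strategy.

```python
OK, NOT_FOUND = "OK", "NOT_FOUND"


def read_selected(get_replica, key, indexes):
    """SELECT: query exactly the replicas the caller names, one request each."""
    results = []
    for idx in indexes:
        status, value = get_replica(key, idx)
        results.append((idx, status, value))
    return results


def read_first(get_replica, key, num_replicas):
    """FIRST built on SELECT: stop at the first replica that answers OK."""
    status = NOT_FOUND
    for idx in range(num_replicas):
        [(_, status, value)] = read_selected(get_replica, key, [idx])
        if status == OK:
            return status, value
    return status, None  # last error and no value


def read_all(get_replica, key, num_replicas):
    """ALL built on SELECT: ask every replica and return every response."""
    return read_selected(get_replica, key, range(num_replicas))


def fake_get_replica(key, idx):
    """Fake three-replica cluster in which only replica 1 holds the key."""
    return (OK, "bar") if idx == 1 else (NOT_FOUND, None)


# "Only the first two replicas": something neither FIRST nor ALL can express.
first_two = read_selected(fake_get_replica, "foo", [0, 1])
print(first_two)
print(read_first(fake_get_replica, "foo", 3))
```

The caller decides how many requests go out, and pays for that control by inspecting each status itself.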
5. page 62
[FIXED] replaced with "and you only need to perform a single API call for this request"
The requests are all made as a single network roundtrip, and
may require less round-trips than if you iterate through all
This isn't exactly true: the replica read requests are scattered
over the cluster to several nodes, and those packets are sent
independently of each other. Again, a caller using the ALL strategy
is controlling latency. They are effectively saying "OK, I know it
isn't as safe and secure as the FIRST strategy, but I just need to pull
all replicas, and I will decide which one to use".
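A small Python sketch of that idea, under loud assumptions: threads stand in for the independent per-node packets, and each fake response carries a made-up `rev` field so the caller has something to choose by. Neither `read_all_parallel`, `pick_newest`, nor `rev` comes from the SDK. All of them are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

OK, NOT_FOUND = "OK", "NOT_FOUND"


def read_all_parallel(get_replica, key, num_replicas):
    """ALL: send one independent request per replica and collect every reply."""
    with ThreadPoolExecutor(max_workers=max(1, num_replicas)) as pool:
        futures = [
            pool.submit(get_replica, key, idx) for idx in range(num_replicas)
        ]
        # Results come back in replica order, not in completion order.
        return [future.result() for future in futures]


def pick_newest(responses):
    """Caller-side policy: keep the OK response with the highest 'rev'."""
    ok = [r for r in responses if r[0] == OK]
    return max(ok, key=lambda r: r[1]["rev"], default=None)


# Fake cluster: replica 0 is stale, replica 1 is newer, replica 2 has no copy.
FAKE_DATA = {
    0: (OK, {"rev": 7, "body": "old"}),
    1: (OK, {"rev": 8, "body": "new"}),
    2: (NOT_FOUND, None),
}


def fake_get_replica(key, idx):
    return FAKE_DATA[idx]


responses = read_all_parallel(fake_get_replica, "foo", 3)
print(pick_newest(responses))
```

Here the first replica is not the most recent one, so a caller that pulls everything and applies its own rule ends up with a different answer than FIRST would give.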