Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Version: 3.0.6
Description
Hi Couchbase team,
Could you clarify a new behaviour of libcouchbase 3.x that we encountered in our unit tests at Amadeus?
We observed that after failing over all replica nodes, getfromreplica returned success (by reading from the master internally):
2020/11/20 14:34:25.792613 test MDW INFO confmon <confmon.cc#168 TID#0> Setting new configuration. Received via CCCP |
2020/11/20 14:34:25.792643 test MDW INFO bootstrap <bootstrap.cc#151 TID#0> Got new config (source=CCCP, bucket=mock, rev=9). Will refresh asynchronously |
2020/11/20 14:34:25.792690 test MDW DBG TST <MultigetOperations.cpp#147 TID#0> VBucket node #1 (replica) blacklisted for key ['BlacklistMasterWithReplicasInFailover'] - reading replica #2 from server with index 0 |
2020/11/20 14:34:25.792762 test MDW INFO newconfig <newconfig.cc#170 TID#0> Config Diff: [ vBuckets Modified=4096 ], [Sequence Changed=1] |
2020/11/20 14:34:25.792772 test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:52997(Data=52997, Index=0, Query=49829) removed |
2020/11/20 14:34:25.792774 test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:52823(Data=52823, Index=0, Query=49829) removed |
2020/11/20 14:34:25.792776 test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:49401(Data=49401, Index=0, Query=49829) removed |
2020/11/20 14:34:25.792782 test MDW INFO newconfig <newconfig.cc#269 TID#0> Reusing server 127.0.0.1:47221 (0x437fa60). OldIndex=2. NewIndex=0 |
2020/11/20 14:34:25.792788 test MDW DBG newconfig <newconfig.cc#234 TID#0> Remapped packet 0x4430530 (SEQ=21) from 127.0.0.1:52997 (0x4387020) to 127.0.0.1:47221 (0x437fa60) |
2020/11/20 14:34:25.792795 test MDW DBG server <mcserver.cc#1150 TID#0> <127.0.0.1:52997> (CTX=0x4422530,memcached,SRV=0x4387020,IX=0) Finalizing context |
2020/11/20 14:34:25.792800 test MDW DBG ioctx <ctx.c#140 TID#0> <127.0.0.1:52997> (CTX=0x4422530,memcached) Destroying context. Pending Writes=0, Entered=false, Socket Refcount=1 |
2020/11/20 14:34:25.792832 test MDW DBG ioctx <ctx.c#140 TID#0> <127.0.0.1:52823> (CTX=0x442e870,sasl) Destroying context. Pending Writes=0, Entered=false, Socket Refcount=1 |
2020/11/20 14:34:25.792858 test MDW DBG server <mcserver.cc#1150 TID#0> <127.0.0.1:49401> (CTX=0x4423900,memcached,SRV=0x4388c50,IX=3) Finalizing context |
2020/11/20 14:34:25.792869 test MDW DBG ioctx <ctx.c#140 TID#0> <127.0.0.1:49401> (CTX=0x4423900,memcached) Destroying context. Pending Writes=0, Entered=false, Socket Refcount=1 |
2020/11/20 14:34:25.793204 test MDW DBG TST <MultigetOperations.cpp#293 TID#0> Received response (Get) for key "BlacklistMasterWithReplicasInFailover" ; lcb error: LCB_SUCCESS (0): Success (Not an error), global elapsed time: [526us] |
It seems lcb 3.x remaps the request to the master (the only node remaining healthy for that key).
The reproducer can be done against the Couchbase mock (4 nodes, 3 replicas); a minimal sketch of the replica read follows the list:
- set the key and get its master node
- fail over all nodes except the master
- get replica index 1
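For reference, here is a minimal sketch of the last step using the lcb 3.x public API. The connection string, bucket name, and key are placeholders, and the store of the key plus the failover of the replica nodes are assumed to have been performed already (e.g. through the mock's control interface); this is only meant to show where the unexpected LCB_SUCCESS shows up.

```c
#include <libcouchbase/couchbase.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Prints the status of the get-from-replica response. In the failover
 * scenario described above we would expect an error, but lcb 3.x reports
 * LCB_SUCCESS because the packet was remapped to the master. */
static void getreplica_callback(lcb_INSTANCE *instance, int cbtype, const lcb_RESPBASE *rb)
{
    const lcb_RESPGETREPLICA *resp = (const lcb_RESPGETREPLICA *)rb;
    printf("getreplica status: %s\n", lcb_strerror_short(lcb_respgetreplica_status(resp)));
    (void)instance;
    (void)cbtype;
}

int main(void)
{
    /* Placeholder connection string: point it at the mock or a test cluster
     * where the replica nodes have already been failed over. */
    const char *connstr = "couchbase://127.0.0.1/mock";
    const char *key = "BlacklistMasterWithReplicasInFailover";

    lcb_CREATEOPTS *options = NULL;
    lcb_INSTANCE *instance = NULL;
    lcb_CMDGETREPLICA *cmd = NULL;

    lcb_createopts_create(&options, LCB_TYPE_BUCKET);
    lcb_createopts_connstr(options, connstr, strlen(connstr));
    lcb_create(&instance, options);
    lcb_createopts_destroy(options);

    lcb_connect(instance);
    lcb_wait(instance, LCB_WAIT_DEFAULT);
    if (lcb_get_bootstrap_status(instance) != LCB_SUCCESS) {
        fprintf(stderr, "bootstrap failed\n");
        return EXIT_FAILURE;
    }

    lcb_install_callback(instance, LCB_CALLBACK_GETREPLICA, getreplica_callback);

    /* LCB_REPLICA_MODE_IDX1 targets replica #1 explicitly, matching the
     * reproducer above. */
    lcb_cmdgetreplica_create(&cmd, LCB_REPLICA_MODE_IDX1);
    lcb_cmdgetreplica_key(cmd, key, strlen(key));
    lcb_getreplica(instance, NULL, cmd);
    lcb_cmdgetreplica_destroy(cmd);
    lcb_wait(instance, LCB_WAIT_DEFAULT);

    lcb_destroy(instance);
    return EXIT_SUCCESS;
}
```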
libcouchbase remaps that command because the JSON configuration tells it to do so:
2020/11/20 14:34:25.792772 test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:52997(Data=52997, Index=0, Query=49829) removed
2020/11/20 14:34:25.792774 test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:52823(Data=52823, Index=0, Query=49829) removed
2020/11/20 14:34:25.792776 test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:49401(Data=49401, Index=0, Query=49829) removed
2020/11/20 14:34:25.792782 test MDW INFO newconfig <newconfig.cc#269 TID#0> Reusing server 127.0.0.1:47221 (0x437fa60). OldIndex=2. NewIndex=0
As I see it, the problem with getReplica is that, as a packet inside libcouchbase, it should never be remapped. You said that the libcouchbase 2 behaviour is different; this is strange, because libcouchbase 2 did not look at the operation code of the packet during remapping either.
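To illustrate the idea only: the following is a standalone sketch with hypothetical types and helper names, not the actual newconfig.cc/mcreq code. It shows the decision point when a server has disappeared from the new config and its pending packets must be dealt with: a replica read is pinned to a specific replica, so remapping it to the new master silently changes its semantics, and failing it back to the caller is the alternative argued for above.

```c
#include <stdio.h>

/* Hypothetical packet representation, for illustration only. */
typedef struct {
    unsigned char opcode; /* memcached opcode carried by the packet */
    int vbucket;
} packet_t;

enum {
    CMD_GET = 0x00,
    CMD_GET_REPLICA = 0x83 /* GET_REPLICA opcode in the Couchbase binary protocol */
};

/* Assumed decision point when the packet's original server was removed by
 * the new configuration. */
static void relocate_packet(packet_t *pkt)
{
    if (pkt->opcode == CMD_GET_REPLICA) {
        /* Do not remap: complete the operation with an error so the caller
         * can see that the targeted replica is gone. */
        printf("packet vb=%d failed back to caller (replica gone)\n", pkt->vbucket);
        return;
    }
    /* Ordinary operations can safely follow the new vbucket map. */
    printf("packet vb=%d remapped to new owner\n", pkt->vbucket);
}

int main(void)
{
    packet_t get = {CMD_GET, 42};
    packet_t rget = {CMD_GET_REPLICA, 42};
    relocate_packet(&get);
    relocate_packet(&rget);
    return 0;
}
```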
Anyway, I've updated your patch, turning it into a fix for this issue. I have also updated your test so that it now captures the actual behaviour:
After failover