Couchbase C client library libcouchbase
CCBC-1339

getreplica succeeds after failing over all replica nodes


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.6
    • Fix Version/s: 3.0.7
    • None
    • None
    • 1

    Description

      Hi Couchbase team,

      Could you clarify a new behaviour in libcouchbase 3.x that we encountered in our unit tests at Amadeus?

      We observed that after failing over all replica nodes, getfromreplica was returning success (by internally reading from the master):

      2020/11/20 14:34:25.792613  test MDW INFO confmon <confmon.cc#168 TID#0> Setting new configuration. Received via CCCP
      2020/11/20 14:34:25.792643  test MDW INFO bootstrap <bootstrap.cc#151 TID#0> Got new config (source=CCCP, bucket=mock, rev=9). Will refresh asynchronously
      2020/11/20 14:34:25.792690  test MDW DBG TST <MultigetOperations.cpp#147 TID#0> VBucket node #1 (replica) blacklisted for key ['BlacklistMasterWithReplicasInFailover'] - reading replica #2 from server with index 0
      2020/11/20 14:34:25.792762  test MDW INFO newconfig <newconfig.cc#170 TID#0> Config Diff: [ vBuckets Modified=4096 ], [Sequence Changed=1]
      2020/11/20 14:34:25.792772  test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:52997(Data=52997, Index=0, Query=49829) removed
      2020/11/20 14:34:25.792774  test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:52823(Data=52823, Index=0, Query=49829) removed
      2020/11/20 14:34:25.792776  test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:49401(Data=49401, Index=0, Query=49829) removed
      2020/11/20 14:34:25.792782  test MDW INFO newconfig <newconfig.cc#269 TID#0> Reusing server 127.0.0.1:47221 (0x437fa60). OldIndex=2. NewIndex=0
      2020/11/20 14:34:25.792788  test MDW DBG newconfig <newconfig.cc#234 TID#0> Remapped packet 0x4430530 (SEQ=21) from 127.0.0.1:52997 (0x4387020) to 127.0.0.1:47221 (0x437fa60)
      2020/11/20 14:34:25.792795  test MDW DBG server <mcserver.cc#1150 TID#0> <127.0.0.1:52997> (CTX=0x4422530,memcached,SRV=0x4387020,IX=0) Finalizing context
      2020/11/20 14:34:25.792800  test MDW DBG ioctx <ctx.c#140 TID#0> <127.0.0.1:52997> (CTX=0x4422530,memcached) Destroying context. Pending Writes=0, Entered=false, Socket Refcount=1
      2020/11/20 14:34:25.792832  test MDW DBG ioctx <ctx.c#140 TID#0> <127.0.0.1:52823> (CTX=0x442e870,sasl) Destroying context. Pending Writes=0, Entered=false, Socket Refcount=1
      2020/11/20 14:34:25.792858  test MDW DBG server <mcserver.cc#1150 TID#0> <127.0.0.1:49401> (CTX=0x4423900,memcached,SRV=0x4388c50,IX=3) Finalizing context
      2020/11/20 14:34:25.792869  test MDW DBG ioctx <ctx.c#140 TID#0> <127.0.0.1:49401> (CTX=0x4423900,memcached) Destroying context. Pending Writes=0, Entered=false, Socket Refcount=1
      2020/11/20 14:34:25.793204  test MDW DBG TST <MultigetOperations.cpp#293 TID#0> Received response (Get) for key "BlacklistMasterWithReplicasInFailover" ; lcb error: LCB_SUCCESS (0): Success (Not an error), global elapsed time: [526us]
      

       

      It seems lcb 3.x is remapping the request to the master (the only node remaining healthy for the key).

      The reproducer can be done against the Couchbase mock (4 nodes, 3 replicas); a minimal sketch follows the list below:

      • set the key and get the master node
      • fail over all nodes except the master
      • get replica index 1
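
      For reference, a minimal sketch of the last step against the libcouchbase 3.x public API could look roughly like the following (the connection string and key below are placeholders, the initial store of the key plus the failovers are assumed to have been performed separately, e.g. through the mock's control channel, and error checking is omitted for brevity):

      /* build with: cc reproducer.c -lcouchbase */
      #include <stdio.h>
      #include <string.h>
      #include <libcouchbase/couchbase.h>

      static void getreplica_cb(lcb_INSTANCE *instance, int cbtype, const lcb_RESPGETREPLICA *resp)
      {
          lcb_STATUS rc = lcb_respgetreplica_status(resp);
          printf("getreplica status: %s\n", lcb_strerror_short(rc));
          if (rc == LCB_SUCCESS) {
              const char *value;
              size_t nvalue;
              lcb_respgetreplica_value(resp, &value, &nvalue);
              printf("value: %.*s\n", (int)nvalue, value);
          }
          (void)instance;
          (void)cbtype;
      }

      int main(void)
      {
          lcb_INSTANCE *instance;
          lcb_CREATEOPTS *options = NULL;
          const char *connstr = "couchbase://127.0.0.1/mock"; /* placeholder connection string */
          const char *key = "BlacklistMasterWithReplicasInFailover";
          lcb_CMDGETREPLICA *cmd;

          lcb_createopts_create(&options, LCB_TYPE_BUCKET);
          lcb_createopts_connstr(options, connstr, strlen(connstr));
          lcb_create(&instance, options);
          lcb_createopts_destroy(options);
          lcb_connect(instance);
          lcb_wait(instance, LCB_WAIT_DEFAULT);

          lcb_install_callback(instance, LCB_CALLBACK_GETREPLICA, (lcb_RESPCALLBACK)getreplica_cb);

          /* step 3: read from replica index 1 after the failovers */
          lcb_cmdgetreplica_create(&cmd, LCB_REPLICA_MODE_IDX1);
          lcb_cmdgetreplica_key(cmd, key, strlen(key));
          lcb_getreplica(instance, NULL, cmd);
          lcb_cmdgetreplica_destroy(cmd);
          lcb_wait(instance, LCB_WAIT_DEFAULT);

          lcb_destroy(instance);
          return 0;
      }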

      Attachments


        Activity

          libcouchbase is remapping that command because the JSON configuration told it to do so:

          2020/11/20 14:34:25.792772  test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:52997(Data=52997, Index=0, Query=49829) removed
          2020/11/20 14:34:25.792774  test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:52823(Data=52823, Index=0, Query=49829) removed
          2020/11/20 14:34:25.792776  test MDW INFO newconfig <newconfig.cc#178 TID#0> Detected server 127.0.0.1:49401(Data=49401, Index=0, Query=49829) removed
          2020/11/20 14:34:25.792782  test MDW INFO newconfig <newconfig.cc#269 TID#0> Reusing server 127.0.0.1:47221 (0x437fa60). OldIndex=2. NewIndex=0
          

          As I see it, the problem with getReplica is that, as a packet inside libcouchbase, it should never be remapped. You said that the libcouchbase 2 behaviour is different, which is strange, because libcouchbase 2 did not look at the operation code of the packet during remapping either.

          Anyway, I've updated your patch, turning it into a fix for this issue. I have also updated your test, so that it now captures the actual behaviour:

          After failover

          • if the new configuration has been received after the operation has been scheduled, and libcouchbase sees that the node is no longer available, it should not relocate GET_REPLICA packets to the new location and should return the LCB_ERR_MAP_CHANGED code
          • if the new configuration has been received before the operation has been scheduled, libcouchbase returns the LCB_ERR_NO_MATCHING_SERVER error code from the lcb_getreplica() call and does not schedule the operation (see the sketch after this list)
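
           To illustrate how a caller might observe these two outcomes, here is a rough, hypothetical sketch (the function names and the fallback comments are placeholders, not part of the patch itself):

           #include <string.h>
           #include <libcouchbase/couchbase.h>

           /* installed with lcb_install_callback(instance, LCB_CALLBACK_GETREPLICA,
            * (lcb_RESPCALLBACK)getreplica_handler) */
           static void getreplica_handler(lcb_INSTANCE *instance, int cbtype, const lcb_RESPGETREPLICA *resp)
           {
               switch (lcb_respgetreplica_status(resp)) {
               case LCB_SUCCESS:
                   /* a replica answered */
                   break;
               case LCB_ERR_MAP_CHANGED:
                   /* the config changed after the command was scheduled; the GET_REPLICA
                    * packet was not relocated, so decide here whether to retry or fall
                    * back to a plain get on the active node */
                   break;
               default:
                   break;
               }
               (void)instance;
               (void)cbtype;
           }

           /* scheduling side: if the new config arrived before scheduling, the call
            * itself fails and nothing is dispatched */
           static lcb_STATUS read_from_replica(lcb_INSTANCE *instance, const char *key)
           {
               lcb_CMDGETREPLICA *cmd;
               lcb_STATUS rc;

               lcb_cmdgetreplica_create(&cmd, LCB_REPLICA_MODE_IDX1);
               lcb_cmdgetreplica_key(cmd, key, strlen(key));
               rc = lcb_getreplica(instance, NULL, cmd);
               lcb_cmdgetreplica_destroy(cmd);
               if (rc == LCB_ERR_NO_MATCHING_SERVER) {
                   /* no replica is available for this key under the current config */
               }
               return rc;
           }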
          avsej Sergey Avseyev added a comment

          thx Sergey 

           

          Alexis Deltour added a comment

          Build couchbase-server-7.0.0-4433 contains libcouchbase commit 766a041 with commit message:
          CCBC-1339: do not relocate get with replica on failover

          build-team Couchbase Build Team added a comment

          People

            Assignee: Sergey Avseyev
            Reporter: Alexis Deltour
            Votes: 0
            Watchers: 3

