
MB-50874: Replica to active promotion after receiving a SnapshotMarker(CHK) with de-duplicated seqno(s) crashes on next mutation


Details

    • Untriaged
    • 1
    • Unknown
    • KV 2022-Feb, KV March-22

    Description

      Summary

      If a replica vBucket is promoted to active, and the last DCP message it received was a Snapshot Marker whose first mutation(s) had been de-duplicated, then the snapshot start of the newly-promoted active's open checkpoint ends up greater than its high seqno.
      Upon the next Flusher run (i.e. the next mutation to the vBucket), the Flusher throws an exception when trying to fetch items, which terminates KV-Engine (as the exception is thrown on a BG thread).

      Confirmed to affect 6.6.1 and 6.6.5.
      Confirmed to affect 7.0.0 - 7.0.3 (inclusive) - see https://review.couchbase.org/c/kv_engine/+/170373
      Confirmed to not affect Neo.

      Details

      When streaming data from an Active to a Replica vBucket, the extent of each Checkpoint is sent via DCP using a SnapshotMarker message, followed by N Mutation / Deletion messages. A snapshot marker may be discontinuous relative to the previous one if any de-duplication occurred within the Checkpoint - for example, if document "key" was written enough times in quick succession, one could end up with the following two Checkpoints on the active, and the subsequent DCP SnapshotMarkers sent to the replica:

      CheckpointManager[0x108a03080] with numItems:6 checkpoints:2
          Checkpoint[0x10891f000] with id:2 seqno:{1,10} snap:{0,10, visible:10} state:CHECKPOINT_CLOSED numCursors:1 type:Memory hcs:-- items:[
      	{1,empty,cid:0x1:empty,110,[m]}
      	{1,checkpoint_start,cid:0x1:checkpoint_start,121,[m]}
      	{1,set_vbucket_state,cid:0x1:set_vbucket_state,245,[m]}
      	{10,mutation,cid:0x0: deduplicated_key,119,}
      	{11,checkpoint_end,cid:0x1:checkpoint_end,119,[m]}
      ]
          Checkpoint[0x10891fa00] with id:3 seqno:{11,12} snap:{10,12, visible:12} state:CHECKPOINT_OPEN numCursors:1 type:Memory hcs:-- items:[
      	{11,empty,cid:0x1:empty,110,[m]}
      	{11,checkpoint_start,cid:0x1:checkpoint_start,121,[m]}
      	{12,mutation,cid:0x0:deduplicated_key,130,}
      ]
      

      Note how there are just two mutations remaining (at seqnos 10 and 12), and that there is a seqno "gap" at 11 (ignoring meta-items, which are not sent over DCP).
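      As a toy illustration of why de-duplication produces such gaps (a sketch only, not kv_engine's actual Checkpoint code; names are made up): within a Checkpoint only the latest item for a given key is kept, so earlier seqnos for that key simply never appear on the stream:

      #include <cstdint>
      #include <cstdio>
      #include <map>
      #include <string>

      int main() {
          uint64_t nextSeqno = 10; // last seqno of the previous snapshot
          // key -> seqno of the single item kept for that key in the open Checkpoint
          std::map<std::string, uint64_t> keyIndex;

          // Two writes to the same key within the same Checkpoint: the second
          // write replaces (de-duplicates) the first, so seqno 11 vanishes and
          // only seqno 12 is ever sent over DCP.
          keyIndex["deduplicated_key"] = ++nextSeqno; // seqno 11 - later dropped
          keyIndex["deduplicated_key"] = ++nextSeqno; // seqno 12 - kept

          for (const auto& [key, seqno] : keyIndex) {
              std::printf("%s kept at seqno %llu\n", key.c_str(),
                          (unsigned long long)seqno);
          }
          return 0;
      }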

      When this is replicated over DCP it will be sent as:

      • DCP_SNAPSHOT_MARKER(start:0, end:10, flags=CHK)
      • DCP_MUTATION(seqno:10, ...)
      • DCP_SNAPSHOT_MARKER(start:12, end:12, flags=CHK)
      • DCP_MUTATION(seqno:12, ...)

      Once these messages are replicated over DCP, the replica vBucket should have state equivalent to the active.

      However, if the last DCP_MUTATION is not received - for example if the active node is being failed over and the stream is closed before the DCP_MUTATION - then the state of the replica, crucially the open Checkpoint, is as follows:

          Checkpoint[0x10cecde00] with id:2 seqno:{11,11} snap:{12,12, visible:12} state:CHECKPOINT_OPEN numCursors:0 type:Memory hcs:-- items:[
      	{11,empty,cid:0x1:empty,110,[m]}
      	{11,checkpoint_start,cid:0x1:checkpoint_start,121,[m]}
      ]
      

      Note that the second SnapshotMarker being flagged as "CHK" (Checkpoint) is essential - we need the replica to end up creating a new Checkpoint with the start and end controlled by the active - a SnapshotMarker without that flag is insufficient as it just extends the existing checkpoint, increasing the checkpoint end but leaving start unaffected.
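      As a rough sketch of that distinction (hypothetical types and names, not the actual PassiveStream / CheckpointManager code; the flag bit value is illustrative):

      #include <cstdint>

      struct OpenCheckpoint {
          uint64_t snapStart;
          uint64_t snapEnd;
      };

      constexpr uint32_t MARKER_FLAG_CHK = 0x1; // illustrative bit, not the wire value

      // With the CHK flag the replica opens a brand-new Checkpoint whose snapshot
      // range is dictated by the active (snapStart:12 in the scenario above, even
      // though no mutation has arrived yet). Without the flag, the existing open
      // Checkpoint is merely extended: snapEnd grows, snapStart is left untouched.
      void processSnapshotMarker(OpenCheckpoint& open,
                                 uint64_t start,
                                 uint64_t end,
                                 uint32_t flags) {
          if (flags & MARKER_FLAG_CHK) {
              open = OpenCheckpoint{start, end};
          } else {
              open.snapEnd = end;
          }
      }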

      When this sequence occurs, the seqno range (11,11) in the open Checkpoint is less than the snapshot range (12,12). This is problematic as we have essentially broken an invariant on Checkpoints - that all items within them are between the snapshot start and end.

      This doesn't immediately cause a problem, but if this vBucket is promoted to active and starts accepting mutations itself, it will start generating seqnos from the last seqno received - 10 in this case. This results in the next mutation being assigned seqno 11, which, when the Flusher is woken and attempts to flush, throws an exception on the BG thread and crashes KV-Engine:

      CheckpointManager::queueDirty: lastBySeqno not in snapshot range. vb:0 state:active snapshotStart:12 lastBySeqno:11 snapshotEnd:11 genSeqno:Yes checkpointList.size():2
      
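      In outline, the check that fires is equivalent to the following (a sketch with illustrative names, not the exact CheckpointManager::queueDirty code):

      #include <cstdint>
      #include <stdexcept>
      #include <string>

      // The seqno being queued must lie within the open Checkpoint's snapshot range.
      void checkQueueDirtyInvariant(uint64_t snapStart,
                                    uint64_t snapEnd,
                                    uint64_t lastBySeqno) {
          if (lastBySeqno < snapStart || lastBySeqno > snapEnd) {
              throw std::logic_error(
                      "queueDirty: lastBySeqno not in snapshot range."
                      " snapshotStart:" + std::to_string(snapStart) +
                      " lastBySeqno:" + std::to_string(lastBySeqno) +
                      " snapshotEnd:" + std::to_string(snapEnd));
          }
      }

      // In the scenario above the newly-promoted active generates seqno 11, but its
      // open Checkpoint inherited snapshotStart 12 from the last SnapshotMarker, so
      // checkQueueDirtyInvariant(12, 11, 11) throws - matching the message above.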

      Impact Assessment

      In theory this scenario seems reasonably easy to hit - one just needs at least one seqno to be de-duplicated (i.e. two mutations to the same key within the same Checkpoint), and then a failover where just the SnapshotMarker is received without any of the mutations it covers. However, in practice I have not been able to trigger it on a full running cluster with default config values (yet).

      This is most likely due to the requirement that the SnapshotMarker received last (before the stream is closed) must have the "CHK" flag set - i.e. it must represent a newly-created Checkpoint. In 6.6.1 a new Checkpoint is only created under certain criteria (see CheckpointManager::isCheckpointCreationForHighMemUsage_UNLOCKED and CheckpointManager::checkOpenCheckpoint_UNLOCKED), the main ones being one of the following (sketched in code after the list):

      • The current Checkpoint has 10,000 items in it, or
      • The current Checkpoint has at least 1 item and 5 seconds have passed since it was created.
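      As a simplified sketch of these two criteria (illustrative only - the real logic lives in CheckpointManager::checkOpenCheckpoint_UNLOCKED and also considers memory usage; the 10,000 item and 5 second values are the defaults described above):

      #include <chrono>
      #include <cstddef>

      bool shouldCreateNewCheckpoint(size_t itemsInOpenCheckpoint,
                                     std::chrono::seconds openCheckpointAge) {
          const size_t maxItems = 10000;        // default max items per Checkpoint
          const std::chrono::seconds maxAge{5}; // default Checkpoint period
          if (itemsInOpenCheckpoint >= maxItems) {
              return true; // criterion 1: Checkpoint is "full"
          }
          if (itemsInOpenCheckpoint >= 1 && openCheckpointAge >= maxAge) {
              return true; // criterion 2: non-empty and older than the period
          }
          return false;
      }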

      These criteria somewhat work against the needed scenario to trigger the bug:

      • With a large (10,000 item) Checkpoint, it becomes increasingly unlikely that de-duplication would remove exactly the first mutation(s) in the Checkpoint.
      • Creating a new Checkpoint after 5 seconds while also getting de-duplication requires the same key to be modified within a small window, yet with little-to-no other traffic overall.

      However the problem has been observed on a live running cluster; where it affected 4 out of ~140 vBuckets when a node was failed over, so clearly it is possible to hit, given the "right" environment.

      Attachments

        Issue Links


          Activity

            drigby Dave Rigby added a comment - - edited

            Finally managed to hit this in a "real" cluster_run setup. It required the following "tweaks" to the timing of when we process things to trigger it:

            1. Reduce the default number of requests per libevent notification to 1 (to create more gaps in the TCP/IP stream for ns_server to inject a CloseStream message).

            diff --git a/daemon/memcached.cc b/daemon/memcached.cc
            index f36ca5541..23c1a2766 100644
            --- a/daemon/memcached.cc
            +++ b/daemon/memcached.cc
            @@ -552,13 +552,13 @@ static void settings_init() {
                 Settings::instance().setNumWorkerThreads(get_number_of_worker_threads());
                 Settings::instance().setDatatypeJsonEnabled(true);
                 Settings::instance().setDatatypeSnappyEnabled(true);
            -    Settings::instance().setRequestsPerEventNotification(50,
            +    Settings::instance().setRequestsPerEventNotification(1,
                                                                      EventPriority::High);
            -    Settings::instance().setRequestsPerEventNotification(5,
            +    Settings::instance().setRequestsPerEventNotification(1,
                                                                      EventPriority::Medium);
                 Settings::instance().setRequestsPerEventNotification(1, EventPriority::Low);
                 Settings::instance().setRequestsPerEventNotification(
            -            20, EventPriority::Default);
            +            1, EventPriority::Default);
             
                 /*
                  * The max object size is 20MB. Let's allow packets up to 30MB to
            

            2. Pause for 500ms after the Producer sends a SnapshotMarker:

            diff --git a/engines/ep/src/dcp/producer.cc b/engines/ep/src/dcp/producer.cc
            index d4414952e..c121fb12d 100644
            --- a/engines/ep/src/dcp/producer.cc
            +++ b/engines/ep/src/dcp/producer.cc
            @@ -804,6 +804,7 @@ ENGINE_ERROR_CODE DcpProducer::step(struct dcp_message_producers* producers) {
                                                 s->getHighCompletedSeqno(),
                                                 s->getMaxVisibleSeqno(),
                                                 resp->getStreamId());
            +            std::this_thread::sleep_for(std::chrono::milliseconds(500));
                         break;
                     }
                     case DcpResponse::Event::SetVbucket:
            

            (Both might not be necessary, but I added them in that order.)

            I then ran the attached cb_subdoc_dict_add_loop.py test program against the cluster - this performs 2 mutations in quick succession to the same key, for 1000 different keys (to attempt to hit a large number of vBuckets), then sleeps for just over the automatic checkpoint creation interval (5s), then repeats. The intent is to try to trigger the creation of lots of Snapshots where the first seqno has been de-duplicated.

            Roughly when the test program has just written a bunch of docs (after "done"), trigger a hard failover of one of the nodes. This took multiple attempts, but eventually I observed a number of CRITICAL log messages:

            2022-02-09T13:45:15.793885+00:00 ERROR 41: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":53049} - {\"ip\":\"127.0.0.1\",\"port\":12002} (<ud>Administrator</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":63,"cas":0,"datatype":"raw","extlen":1,"key":"<ud>key_1</ud>","keylen":5,"magic":"ClientRequest","opaque":94048,"opcode":"SUBDOC_MULTI_MUTATION","vbucket":997},"refcount":0}] - closing connection ([ {"ip":"127.0.0.1","port":53049} - {"ip":"127.0.0.1","port":12002} (<ud>Administrator</ud>) ]): CheckpointManager::queueDirty: lastBySeqno not in snapshot range. vb:997 state:active snapshotStart:596 lastBySeqno:595 snapshotEnd:595 genSeqno:Yes checkpointList.size():1
            2022-02-09T13:45:15.816442+00:00 CRITICAL *** Fatal error encountered during exception handling ***
            2022-02-09T13:45:15.816527+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): snapshot_range_t(596,595) requires start <= end
            2022-02-09T13:45:15.816619+00:00 ERROR 41: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":53097} - {\"ip\":\"127.0.0.1\",\"port\":12002} (<ud>Administrator</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":64,"cas":0,"datatype":"raw","extlen":1,"key":"<ud>key_10</ud>","keylen":6,"magic":"ClientRequest","opaque":94066,"opcode":"SUBDOC_MULTI_MUTATION","vbucket":880},"refcount":0}] - closing connection ([ {"ip":"127.0.0.1","port":53097} - {"ip":"127.0.0.1","port":12002} (<ud>Administrator</ud>) ]): CheckpointManager::queueDirty: lastBySeqno not in snapshot range. vb:880 state:active snapshotStart:2380 lastBySeqno:2251 snapshotEnd:2251 genSeqno:Yes checkpointList.size():1
            2022-02-09T13:45:15.817402+00:00 CRITICAL Call stack:     /Users/dave/repos/couchbase/server/source/install/lib/libplatform_so.0.1.0.dylib(print_backtrace_to_buffer+0x30) [0x109e30000+0x8390]
                /Users/dave/repos/couchbase/server/source/install/bin/memcached(_ZL27backtrace_terminate_handlerv+0x10e) [0x100d9a000+0x14b41e]
                /usr/lib/libc++abi.dylib(_ZSt11__terminatePFvvE+0x8) [0x7ff80b53b000+0xf4d7]
                /usr/lib/libc++abi.dylib(__cxa_get_exception_ptr+0) [0x7ff80b53b000+0x11d55]
                /usr/lib/libc++abi.dylib(_ZN10__cxxabiv1L22exception_cleanup_funcE19_Unwind_Reason_CodeP17_Unwind_Exception+0) [0x7ff80b53b000+0x11d1c]
                /Users/dave/repos/couchbase/server/source/install/lib/ep.so(_ZNK16snapshot_range_t14checkInvariantEv+0x156) [0x10c0c8000+0x4d216]
                /Users/dave/repos/couchbase/server/source/install/lib/ep.so(_ZN17CheckpointManager17getItemsForCursorEP16CheckpointCursorRNSt3__16vectorI19SingleThreadedRCPtrI4ItemPS5_NS2_14default_deleteIS5_EEENS2_9allocatorIS9_EEEEm+0x240) [0x10c0c8000+0x49e30]
                /Users/dave/repos/couchbase/server/source/install/lib/ep.so(_ZN7VBucket17getItemsToPersistEm+0x6a) [0x10c0c8000+0x18bb1a]
                /Users/dave/repos/couchbase/server/source/install/lib/ep.so(_ZN8EPBucket12flushVBucketE4Vbid+0xba) [0x10c0c8000+0xcfd1a]
                /Users/dave/repos/couchbase/server/source/install/lib/ep.so(_ZN7Flusher7flushVBEv+0x1a0) [0x10c0c8000+0x130870]
                /Users/dave/repos/couchbase/server/source/install/lib/ep.so(_ZN7Flusher4stepEP10GlobalTask+0x13b) [0x10c0c8000+0x13039b]
                /Users/dave/repos/couchbase/server/source/install/lib/ep.so(_ZN14ExecutorThread3runEv+0x348) [0x10c0c8000+0x12b5f8]
                /Users/dave/repos/couchbase/server/source/install/lib/libplatform_so.0.1.0.dylib(_ZL20platform_thread_wrapPv+0x4e) [0x109e30000+0x59ee]
                /usr/lib/system/libsystem_pthread.dylib(_pthread_start+0x7d) [0x7ff80b588000+0x64f4]
                /usr/lib/system/libsystem_pthread.dylib(thread_start+0xf) [0x7ff80b588000+0x200f]
            

            drigby Dave Rigby added a comment -

            Re-running the same test against the "tweaked" kv_engine above, but this time with the proposed fix (https://review.couchbase.org/c/kv_engine/+/170268) shows the snapshot start being fixed up correctly on failover:

            2022-02-10T16:07:13.325040+00:00 INFO (default) CheckpointManager::createNewCheckpoint(): vb:997 Found lastBySeqno:2 less than snapStart:4, adjusting snapStart to lastBySeqno + 1
            2022-02-10T16:07:13.328996+00:00 INFO (default) CheckpointManager::createNewCheckpoint(): vb:911 Found lastBySeqno:2 less than snapStart:4, adjusting snapStart to lastBySeqno + 1
            2022-02-10T16:07:13.331859+00:00 INFO (default) CheckpointManager::createNewCheckpoint(): vb:858 Found lastBySeqno:1 less than snapStart:2, adjusting snapStart to lastBySeqno + 1
            
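            In outline, the adjustment the fix makes when creating a new Checkpoint looks like the following (a sketch based on the log messages above, with illustrative names rather than the exact CheckpointManager::createNewCheckpoint code):

            #include <cstdint>
            #include <cstdio>

            // If the snapshot start inherited from the last SnapshotMarker is ahead of
            // the last seqno the vBucket actually has, pull it back to lastBySeqno + 1
            // so that seqnos generated after promotion land inside the snapshot range.
            uint64_t adjustSnapStart(uint64_t snapStart, uint64_t lastBySeqno) {
                if (lastBySeqno < snapStart) {
                    std::printf("Found lastBySeqno:%llu less than snapStart:%llu, "
                                "adjusting snapStart to lastBySeqno + 1\n",
                                (unsigned long long)lastBySeqno,
                                (unsigned long long)snapStart);
                    return lastBySeqno + 1;
                }
                return snapStart;
            }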


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.5-10084 contains kv_engine commit bfa0dd8 with commit message:
            MB-50874: Reset snap start if less than lastSeqno on new checkpoint creation
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited

            Not able to reproduce this issue on 6.6.5-10084.

            Steps:

            - 2 node cluster, couchbase bucket with replica=1
            - Loop over de-dup mutations using the same key and wait for 10 seconds before the next iteration
            - Introduce n/w jitter on the active node and wait until a snapshot marker request packet (opcode 86) arrives on the replica node:
                tc qdisc add dev enp0s8 root handle 1: prio ; tc qdisc add dev enp0s8 parent 1:3 handle 30: netem delay 3s 5s ; tc filter add dev enp0s8 protocol ip parent 1:0 u32 match ip sport 11209 0xffff flowid 1:3
            - Once it has arrived, add an iptables rule to drop all packets from the active node (provided the replica has not yet received the mutation keys - opcode 87):
                iptables -I INPUT -s 10.11.220.102 -j DROP
            - Perform a hard failover of the active node
            - Clean up the jitter using:
                tc qdisc del dev enp0s8 root


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.4-7215 contains kv_engine commit db53ff0 with commit message:
            MB-50874: Merge branch 'mad-hatter' into cheshire-cat

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.4-7215 contains kv_engine commit bfa0dd8 with commit message:
            MB-50874: Reset snap start if less than lastSeqno on new checkpoint creation

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2375 contains kv_engine commit bfa0dd8 with commit message:
            MB-50874: Reset snap start if less than lastSeqno on new checkpoint creation
            drigby Dave Rigby added a comment -

            Note to self: still open as needs merging into 6.6.6.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2504 contains kv_engine commit db53ff0 with commit message:
            MB-50874: Merge branch 'mad-hatter' into cheshire-cat

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.2.0-1024 contains kv_engine commit db53ff0 with commit message:
            MB-50874: Merge branch 'mad-hatter' into cheshire-cat
            drigby Dave Rigby added a comment -

            Fixed in 7.0.4. Subtask created to track potential backport to 6.6.6.


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Validated on 7.0.4-7237.

            Closing the ticket.

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              drigby Dave Rigby
              Votes: 0
              Watchers: 9
