Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-34318

Replication topology change causes SyncWrite ackCount reset

    XMLWordPrintable

Details

    Description

      This is subtly different from MB-34315 where the issue is that we reset the ReplicationChain object data. Currently when we set a new replication topology we "reset" all in flight SyncWrites which causes them to lose their ackCounts. This causes our syncWrite.isSatisfied() checks to return false even if we have received an ack from the majority of nodes.

      Consider the following case given by the below test:

      1. Replication topology with 3 replicas (replica 1, 2, 3)
      2. Majority SyncWrite (active implicitly acks)
      3. Replica 1 acks (majority is 3 so no action)
      4. Topology change to introduce second chain (active, replica1, replica2, undefined) - this resets the SyncWrite ack counts to 0 for both chains then our call to updateHighPreparedSeqno sets ack count to 1
      5. Replica 2 acks (majority is 3 which we now should meet) however ackCount is only 2

      Replica 1 does not ack again (no other SyncWrites) and the SyncWrite is aborted.

      TEST_P(ActiveDurabilityMonitorTest, LoseSyncWriteAckCount) {
          getActiveDM().setReplicationTopology(
                  nlohmann::json::array({{active, replica1, replica2, replica3}}));
          addSyncWrite(1);
          {
              SCOPED_TRACE("");
              assertNumTrackedAndHPS(1, 1);
          }
       
          testSeqnoAckReceived(replica1,
                               1 /*ackSeqno*/,
                               1 /*expectedNumTracked*/,
                               1 /*expectedLastWriteSeqno*/,
                               1 /*expectedLastAckSeqno*/);
       
          {
              SCOPED_TRACE("");
              assertNumTrackedAndHPS(1, 1);
          }
       
          // Add the secondChain with the new node
          getActiveDM().setReplicationTopology(
                  nlohmann::json::array({{active, replica1, replica2, replica3}, {active, replica1, replica2, nullptr}}));
       
          {
              SCOPED_TRACE("");
              assertNumTrackedAndHPS(1, 1);
          }
       
          // Should commit on replica2 ack as this should satisfy both chains
          // ************************
          // Test fails because we lost the replica1 ack and expectedNumTracked = 1
          // ************************
          testSeqnoAckReceived(replica2,
                               1 /*ackSeqno*/,
                               0 /*expectedNumTracked*/,
                               1 /*expectedLastWriteSeqno*/,
                               1 /*expectedLastAckSeqno*/);
       
          {
              SCOPED_TRACE("");
              // ************************
              // Test fails because we lost the replica1 ack
              // ************************
              assertNumTrackedAndHPS(0, 1);
          }
       
          // what about replica 3 ack?
          testSeqnoAckReceived(replica3,
                               1 /*ackSeqno*/,
                               0 /*expectedNumTracked*/,
                               1 /*expectedLastWriteSeqno*/,
                               1 /*expectedLastAckSeqno*/);
      }
      

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            ben.huddleston Ben Huddleston
            ben.huddleston Ben Huddleston
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty