Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-50169

[Jepsen] Precondition failure: 'highestPrepareSeqno == s->highPreparedSeqno.it->getBySeqno()'

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • Unknown
    • KV 2021-Dec

    Description

      The GSL Precondition failure: 'highestPrepareSeqno == s->highPreparedSeqno.it->getBySeqno()' has been observed during a Jepsen post-commit. Due to the highest prepare seqno not being equal to state's highPreparedSeqno seqno.

      PassiveDurabilityMonitor::PassiveDurabilityMonitor(
              VBucket& vb,
              int64_t highPreparedSeqno,
              int64_t highCompletedSeqno,
              std::vector<queued_item>&& outstandingPrepares)
          : PassiveDurabilityMonitor(vb, highPreparedSeqno, highCompletedSeqno) {
          auto s = state.wlock();
          int64_t highestPrepareSeqno = -1;
          for (auto& prepare : outstandingPrepares) {
              Expects(highestPrepareSeqno < prepare->getBySeqno());
              highestPrepareSeqno = prepare->getBySeqno();
              s->trackedWrites.emplace_back(std::move(prepare));
       
              // Advance the highPreparedSeqno iterator to point to the highest
              // SyncWrite which has been prepared.
              auto lastIt = std::prev(s->trackedWrites.end());
              if (lastIt->getBySeqno() <= highPreparedSeqno) {
                  s->highPreparedSeqno.it = lastIt;
              }
       
              // Advance the highCompletedSeqno iterator to point to the highest
              // SyncWrite which has been completed.
              //
              // Note: One might assume that this would always point to
              // trackedWrites.begin(), given that we are a newly minted PassiveDM and
              // hence would only be tracking incomplete SyncWrites. However, we
              // _could_ have been converted from an ActiveDM with null topology which
              // itself was converted from a previous PassiveDM which _did_ have
              // completed SyncWrites still in trackedWrites (because they haven't
              // been persisted locally yet).
              if (lastIt->getBySeqno() <= highCompletedSeqno) {
                  s->highCompletedSeqno.it = lastIt;
              }
          }
          if (!outstandingPrepares.empty()) {
              if (s->highPreparedSeqno.it != s->trackedWrites.end()) {
                  Expects(highestPrepareSeqno ==
                          s->highPreparedSeqno.it->getBySeqno()); // <---  FAILING LINE
              }
              if (s->highCompletedSeqno.it != s->trackedWrites.end()) {
                  Expects(highestPrepareSeqno >
                          s->highCompletedSeqno.it->getBySeqno());
              }
          }
      }
      

      The Jepsen results are attached here: 20211217T170205.000-0800.zip or can be viewed at http://172.23.120.13:8081/files/Couchbase-jenkins-kv-engine-jepsen-post-commit-5445/20211217T170205.000-0800

      It's possible this expects is missing some valid state of on disk, but it's hard to tell without investigating further, hence it not being a critical. Also the expects where added to help debug MB-49767

      Attachments

        Issue Links

          For Gerrit Dashboard: MB-50169
          # Subject Branch Project Status CR V

          Activity

            Probably worth making a code change to improve the expects to give us logging before the exception crashes memcached or do a catch and re-throw.
            Also need to get a backtrace of the first crash so that we can see which vbucket is causing the issue.

            richard.demellow Richard deMellow added a comment - Probably worth making a code change to improve the expects to give us logging before the exception crashes memcached or do a catch and re-throw. Also need to get a backtrace of the first crash so that we can see which vbucket is causing the issue.
            richard.demellow Richard deMellow added a comment - - edited

            Full backtrace of first crash added here: d4d90ca1-e5dc-4ac1-3c21f9be-db8a7417.dmp.txt, which shows its for vbucket 15 vbid = 15

            richard.demellow Richard deMellow added a comment - - edited Full backtrace of first crash added here: d4d90ca1-e5dc-4ac1-3c21f9be-db8a7417.dmp.txt , which shows its for vbucket 15 vbid = 15

            Don't think this is a bug with our functionality and insead an issue with the expects. From the logs bellow we see that the highest prepare seqno was seqno:84 but the value of highPreparedSeqno = 77. Hence the miss match, however this is allowed as the highPreparedSeqno can lag behind the highPersistedPreparedSeqno. So we need to change this expects or remove it.

            2021-12-18T01:03:51.570926+00:00 DEBUG (default) CheckpointManager::addOpenCheckpoint: Create a new open checkpoint: [vb:15, id:1, snapStart:77, snapEnd:97, visibleSnapEnd:85, hcs:none, type:Memory]
            2021-12-18T01:03:51.570979+00:00 INFO (default) VBucket: created vb:15 with state:replica initialState:replica lastSeqno:85 persistedRange:{77,97} purge_seqno:0 max_cas:1639789426931073024 uuid:161549770983431 topology:null
            2021-12-18T01:03:51.635468+00:00 DEBUG (default) EPBucket::loadPreparedSyncWrites: Identified 9 outstanding prepared SyncWrites for vb:15 in 232 us
            2021-12-18T01:03:51.635515+00:00 INFO (default) EPVBucket::loadOutstandingPrepares: (vb:15) created DM with PCS:72, PPS:84, HPS:77, number of prepares loaded:9, outstandingPrepares seqnoRange:[73 -> 84]
            2021-12-18T01:03:51.704875+00:00 CRITICAL *** Fatal error encountered during exception handling ***
            2021-12-18T01:03:51.705057+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure: 'highestPrepareSeqno == s->highPreparedSeqno.it->getBySeqno()' at /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/kv_engine/engines/ep/src/durability/passive_durability_monitor.cc:80
            

            richard.demellow Richard deMellow added a comment - Don't think this is a bug with our functionality and insead an issue with the expects. From the logs bellow we see that the highest prepare seqno was seqno:84 but the value of highPreparedSeqno = 77 . Hence the miss match, however this is allowed as the highPreparedSeqno can lag behind the highPersistedPreparedSeqno . So we need to change this expects or remove it. 2021-12-18T01:03:51.570926+00:00 DEBUG (default) CheckpointManager::addOpenCheckpoint: Create a new open checkpoint: [vb:15, id:1, snapStart:77, snapEnd:97, visibleSnapEnd:85, hcs:none, type:Memory] 2021-12-18T01:03:51.570979+00:00 INFO (default) VBucket: created vb:15 with state:replica initialState:replica lastSeqno:85 persistedRange:{77,97} purge_seqno:0 max_cas:1639789426931073024 uuid:161549770983431 topology:null 2021-12-18T01:03:51.635468+00:00 DEBUG (default) EPBucket::loadPreparedSyncWrites: Identified 9 outstanding prepared SyncWrites for vb:15 in 232 us 2021-12-18T01:03:51.635515+00:00 INFO (default) EPVBucket::loadOutstandingPrepares: (vb:15) created DM with PCS:72, PPS:84, HPS:77, number of prepares loaded:9, outstandingPrepares seqnoRange:[73 -> 84] 2021-12-18T01:03:51.704875+00:00 CRITICAL *** Fatal error encountered during exception handling *** 2021-12-18T01:03:51.705057+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure: 'highestPrepareSeqno == s->highPreparedSeqno.it->getBySeqno()' at /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/kv_engine/engines/ep/src/durability/passive_durability_monitor.cc:80

            Build couchbase-server-7.1.0-1963 contains kv_engine commit cda4535 with commit message:
            MB-50169: Remove invalid Expects in PDM

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1963 contains kv_engine commit cda4535 with commit message: MB-50169 : Remove invalid Expects in PDM

            No longer seeing this crash on Jepsen test runs so marking ticket closed

            richard.demellow Richard deMellow added a comment - No longer seeing this crash on Jepsen test runs so marking ticket closed

            People

              richard.demellow Richard deMellow
              richard.demellow Richard deMellow
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty