Couchbase Server
MB-35053
  2. MB-35053

GSL: Precondition failure being triggered by trying to write a prepare on top of a prepare [2019/7/19]


Details

    • Untriaged
    • No
    • KV-Engine Mad-Hatter Beta

    Description

      During the kv-engine-jepsen-post-commit-124 run, we see the following error messages in memcached.log:

      2019-07-12T05:11:03.695563-07:00 ERROR 52: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:38391 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fed62857010","ewouldblock":false,"packet":{"bodylen":46,"cas":1562933463452286976,"datatype":["JSON"],"extlen":33,"key":"<ud>.jepsen0003</ud>","keylen":11,"magic":"ClientRequest","opaque":52,"opcode":"DCP_PREPARE","vbucket":134},"refcount":1}] - closing connection ([ 127.0.0.1:38391 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): CheckpointManager::queueDirty(vb:134) - got Ckpt::queueDirty() status:failure:duplicate item when vbstate is non-active:2
      

      2019-07-12T05:11:06.288306-07:00 ERROR 54: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:41366 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fed628a2810","ewouldblock":false,"packet":{"bodylen":48,"cas":1562933465434030080,"datatype":["JSON","Snappy"],"extlen":33,"key":"<ud>.jepsen0003</ud>","keylen":11,"magic":"ClientRequest","opaque":52,"opcode":"DCP_PREPARE","vbucket":134},"refcount":1}] - closing connection ([ 127.0.0.1:41366 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): GSL: Precondition failure at /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/kv_engine/engines/ep/src/vbucket.cc: 3151
       
      ... Same error message repeats after connect / reconnect attempts ...
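      The "GSL: Precondition failure at <file>: <line>" text in the second error is the message produced by the Guidelines Support Library's Expects() contract macro when contract violations are configured to throw (GSL_THROW_ON_CONTRACT_VIOLATION); memcached catches the resulting exception in the packet-execution runloop and closes the connection, which is why the error repeats on every reconnect. A minimal standalone sketch of that mechanism (assuming a pre-3.0 Microsoft GSL on the include path; the wrapper function is illustrative, not kv_engine code):

      // Minimal sketch, assuming a pre-3.0 Microsoft GSL on the include path.
      #define GSL_THROW_ON_CONTRACT_VIOLATION // make Expects() throw rather than terminate
      #include <gsl/gsl>

      #include <iostream>

      // Illustrative stand-in for the contract at vbucket.cc:3151: the item
      // replacing an in-flight prepare must be a committed item.
      void overwritePrepare(bool itemIsCommitted) {
          Expects(itemIsCommitted);
          // ... complete the SyncWrite ...
      }

      int main() {
          try {
              overwritePrepare(false); // a second prepare, not a commit
          } catch (const gsl::fail_fast& e) {
              // Prints e.g. "GSL: Precondition failure at sketch.cc: 11",
              // matching the message memcached logs before closing the connection.
              std::cerr << e.what() << '\n';
          }
      }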
      

      This test performs a Jepsen register workload, during which it performs a hard failover of one of the nodes. Jepsen doesn't observe anything wrong until the rebalance of the cluster fails, which causes a crash in the Jepsen framework.
      The following test parameters were used for this test:
      workload=failover,failover-type=hard,recovery-type=delta,replicas=2,no-autofailover,disrupt-count=1,rate=0,durability=0:100:0:0,eviction-policy=value
      To run the test, use this command:
      lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --package {path to package} --workload=failover --failover-type=hard --recovery-type=delta --replicas=2 --no-autofailover --disrupt-count=1 --rate=0 --durability=0:100:0:0 --eviction-policy=value

      Code from vbucket.cc:3151:

      std::pair<MutationStatus, boost::optional<VBNotifyCtx>> VBucket::processSet(
              HashTable::FindCommitResult& htRes,
              StoredValue*& v,
              Item& itm,
              uint64_t cas,
              bool allowExisting,
              bool hasMetaData,
              const VBQueueItemCtx& queueItmCtx,
              cb::StoreIfStatus storeIfStatus,
              bool maybeKeyExists) {
          if (v) {
              if (v->isPending()) {
                  // It is not valid for an active vBucket to attempt to overwrite an
                  // in flight SyncWrite. If this vBucket is not active, we are
                  // allowed to overwrite an in flight SyncWrite iff we are receiving
                  // a disk snapshot. This is due to disk based de-dupe that allows
                  // only 1 value per key. In this case, the active node may send a
                  // mutation instead of a commit if it knows that this replica may be
                  // missing a prepare. This code allows this mutation to be accepted
                  // and overwrites the existing prepare.
                  if (getState() == vbucket_state_active ||
                      !isReceivingDiskSnapshot()) {
                      return {MutationStatus::IsPendingSyncWrite, {}};
                  }
       
      3151:            Expects(itm.isCommitted());
                  getPassiveDM().completeSyncWrite(
                          itm.getKey(), PassiveDurabilityMonitor::Resolution::Commit);
       
                  // Deal with the already existing prepare
                  processImplicitlyCompletedPrepare(htRes.pending);
       
                  // Add a new or overwrite the existing mutation
                  return processSetInner(htRes.pending.getHBL(),
                                         htRes.committed,
                                         itm,
                                         cas,
                                         allowExisting,
                                         hasMetaData,
                                         queueItmCtx,
                                         storeIfStatus,
                                         maybeKeyExists);
              }
       
              // This is a new SyncWrite, we just want to add a new prepare unless we
              // still have a completed prepare (Ephemeral) which we should replace
              // instead.
              if (!v->isCompleted() && itm.isPending()) {
                  VBNotifyCtx notifyCtx;
                  // We have to modify the StoredValue pointer passed in or we do not
                  // return the correct cas to the client.
                  std::tie(v, notifyCtx) = addNewStoredValue(
                          htRes.getHBL(), itm, queueItmCtx, GenerateRevSeqno::No);
                  return {MutationStatus::WasClean, notifyCtx};
              }
          }
      

Looking at the logs and the function VBucket::processSet, it looks like we're streaming to a replica node and trying to perform a prepared set on top of an existing prepared set.
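To make that path concrete, below is a hypothetical, heavily simplified model of the branch shown above; the types and the single-key "hash table" slot are illustrative stand-ins, not the actual kv_engine implementation. On a replica receiving a disk snapshot a committed item may overwrite an in-flight prepare, but a second prepare trips the equivalent of the Expects() at vbucket.cc:3151; outside a disk snapshot the write is rejected as IsPendingSyncWrite instead:

    // Hypothetical, heavily simplified model; all names are illustrative,
    // not kv_engine code.
    #include <iostream>
    #include <optional>
    #include <stdexcept>

    enum class VBState { Active, Replica };
    enum class CommitState { Pending, Committed };

    struct Item {
        CommitState state;
        bool isCommitted() const { return state == CommitState::Committed; }
        bool isPending() const { return state == CommitState::Pending; }
    };

    struct VBucket {
        VBState state = VBState::Replica;
        bool receivingDiskSnapshot = false;
        std::optional<Item> stored; // stand-in for the HashTable slot of one key

        void processSet(const Item& itm) {
            if (stored && stored->isPending()) {
                // An active vBucket may never overwrite an in-flight SyncWrite;
                // a replica may only do so while receiving a disk snapshot.
                if (state == VBState::Active || !receivingDiskSnapshot) {
                    throw std::runtime_error("IsPendingSyncWrite");
                }
                // Equivalent of Expects(itm.isCommitted()) at vbucket.cc:3151:
                // only a committed item may replace the existing prepare.
                if (!itm.isCommitted()) {
                    throw std::logic_error("GSL: Precondition failure");
                }
            }
            stored = itm;
        }
    };

    int main() {
        VBucket vb;
        vb.receivingDiskSnapshot = true;
        vb.processSet({CommitState::Pending}); // first prepare: slot empty, accepted
        try {
            // A second prepare for the same key while the first is still in
            // flight: the prepare-on-top-of-a-prepare case from the logs.
            vb.processSet({CommitState::Pending});
        } catch (const std::exception& e) {
            std::cerr << e.what() << '\n'; // "GSL: Precondition failure"
        }
    }

Under this model the DCP_PREPARE for <ud>.jepsen0003</ud>, arriving while an earlier prepare for the same key is still in flight, lands exactly on that precondition, consistent with the exceptions logged above.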
