Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-35053

GSL: Precondition failure being triggered by trying to write a prepare on top of a prepare [2019/7/19]

    XMLWordPrintable

    Details

    • Triage:
      Untriaged
    • Is this a Regression?:
      No
    • Sprint:
      KV-Engine Mad-Hatter Beta

      Description

      During kv-engine-jepsen-post-commit-124 we see the following error messages in memcached.log:

      2019-07-12T05:11:03.695563-07:00 ERROR 52: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:38391 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fed62857010","ewouldblock":false,"packet":{"bodylen":46,"cas":1562933463452286976,"datatype":["JSON"],"extlen":33,"key":"<ud>.jepsen0003</ud>","keylen":11,"magic":"ClientRequest","opaque":52,"opcode":"DCP_PREPARE","vbucket":134},"refcount":1}] - closing connection ([ 127.0.0.1:38391 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): CheckpointManager::queueDirty(vb:134) - got Ckpt::queueDirty() status:failure:duplicate item when vbstate is non-active:2
      

      2019-07-12T05:11:06.288306-07:00 ERROR 54: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:41366 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fed628a2810","ewouldblock":false,"packet":{"bodylen":48,"cas":1562933465434030080,"datatype":["JSON","Snappy"],"extlen":33,"key":"<ud>.jepsen0003</ud>","keylen":11,"magic":"ClientRequest","opaque":52,"opcode":"DCP_PREPARE","vbucket":134},"refcount":1}] - closing connection ([ 127.0.0.1:41366 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): GSL: Precondition failure at /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/kv_engine/engines/ep/src/vbucket.cc: 3151
       
      ... Same error message repeats after connect / reconnect attempts ...
      

      This rest is performing a Jepsen register workload, during this it performs a hard fail failover of one of the nodes. Jepsen doesn't see anything wrong till the rebalance of the cluster fails which causes a crash in the Jepsen framework.
      The following test parameters were used for this test:
      workload=failover,failover-type=hard,recovery-type=delta,replicas=2,no-autofailover,disrupt-count=1,rate=0,durability=0:100:0:0,eviction-policy=value
      To run the test use this command:
      lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --package {path to package} --workload=failover --failover-type=hard --recovery-type=delta --replicas=2 --no-autofailover --disrupt-count=1 --rate=0 --durability=0:100:0:0 --eviction-policy=value

      Code from vbucket.cc:3151:

      std::pair<MutationStatus, boost::optional<VBNotifyCtx>> VBucket::processSet(
              HashTable::FindCommitResult& htRes,
              StoredValue*& v,
              Item& itm,
              uint64_t cas,
              bool allowExisting,
              bool hasMetaData,
              const VBQueueItemCtx& queueItmCtx,
              cb::StoreIfStatus storeIfStatus,
              bool maybeKeyExists) {
          if (v) {
              if (v->isPending()) {
                  // It is not valid for an active vBucket to attempt to overwrite an
                  // in flight SyncWrite. If this vBucket is not active, we are
                  // allowed to overwrite an in flight SyncWrite iff we are receiving
                  // a disk snapshot. This is due to disk based de-dupe that allows
                  // only 1 value per key. In this case, the active node may send a
                  // mutation instead of a commit if it knows that this replica may be
                  // missing a prepare. This code allows this mutation to be accepted
                  // and overwrites the existing prepare.
                  if (getState() == vbucket_state_active ||
                      !isReceivingDiskSnapshot()) {
                      return {MutationStatus::IsPendingSyncWrite, {}};
                  }
       
      3151:            Expects(itm.isCommitted());
                  getPassiveDM().completeSyncWrite(
                          itm.getKey(), PassiveDurabilityMonitor::Resolution::Commit);
       
                  // Deal with the already existing prepare
                  processImplicitlyCompletedPrepare(htRes.pending);
       
                  // Add a new or overwrite the existing mutation
                  return processSetInner(htRes.pending.getHBL(),
                                         htRes.committed,
                                         itm,
                                         cas,
                                         allowExisting,
                                         hasMetaData,
                                         queueItmCtx,
                                         storeIfStatus,
                                         maybeKeyExists);
              }
       
              // This is a new SyncWrite, we just want to add a new prepare unless we
              // still have a completed prepare (Ephemeral) which we should replace
              // instead.
              if (!v->isCompleted() && itm.isPending()) {
                  VBNotifyCtx notifyCtx;
                  // We have to modify the StoredValue pointer passed in or we do not
                  // return the correct cas to the client.
                  std::tie(v, notifyCtx) = addNewStoredValue(
                          htRes.getHBL(), itm, queueItmCtx, GenerateRevSeqno::No);
                  return {MutationStatus::WasClean, notifyCtx};
              }
          }
      

      Looking at the logs and the function VBucket::processSet it looks like we're steaming to a replica node and trying to perform a prepared set on top of a prepared set.

        Attachments

          Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

            Activity

            Hide
            james.harrison James Harrison added a comment - - edited

            Not looked too at this too thoroughly BUT it looks like 

             

            void VBucket::setUpAllowedDuplicatePrepareWindow() {
                ...
                std::unordered_set<int64_t> newDuplicates{hcs + 1, hps};
                ...
            }

            create the set

            {hcs+1, hps}

             not the set

            {hcs+1, hcs+2, hcs+3, ..., hps}

             as was intended

            A quick test seems to agree with this assessment, but I may very well have missed something.

            If this is the case, we will allow precisely two duplicate prepares, at hcs+1 and hps, rather than the full range. I suspect there is more/something else going on, because I'd expect tests to fail otherwise.

            Will debug to confirm/deny.

            Show
            james.harrison James Harrison added a comment - - edited Not looked too at this too thoroughly BUT it looks like    void VBucket::setUpAllowedDuplicatePrepareWindow() { ... std::unordered_set<int64_t> newDuplicates{hcs + 1 , hps}; ... } create the set {hcs+1, hps}  not the set {hcs+1, hcs+2, hcs+3, ..., hps}  as was intended A quick test seems to agree with this assessment, but I may very well have missed something. If this is the case, we will allow precisely two duplicate prepares, at hcs+1 and hps, rather than the full range. I suspect there is more/something else going on, because I'd expect tests to fail otherwise. Will debug to confirm/deny.
            Hide
            richard.demellow Richard deMellow added a comment - - edited

            Attached Jepsen's log files as a zip, should contain all four node's logs too.

            Show
            richard.demellow Richard deMellow added a comment - - edited Attached Jepsen's log files as a zip, should contain all four node's logs too.
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-3763 contains kv_engine commit a9f305b with commit message:
            MB-35053: Set correct allowedDuplicatePrepareSeqnos

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-3763 contains kv_engine commit a9f305b with commit message: MB-35053 : Set correct allowedDuplicatePrepareSeqnos
            Hide
            richard.demellow Richard deMellow added a comment -

            Reopening as commit a9f305bdfc44d543615a05765f7a19335e156824 doesn't fully fix this bug. Which means we observe this error message:

            2019-07-16T03:40:21.203620-07:00 ERROR 46: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:34194 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f1c39ea8010","ewouldblock":false,"packet":{"bodylen":48,"cas":1563273608419016704,"datatype":["JSON","Snappy"],"extlen":33,"key":"<ud>.jepsen0010</ud>","keylen":11,"magic":"ClientRequest","opaque":103,"opcode":"DCP_PREPARE","vbucket":148},"refcount":1}] - closing connection ([ 127.0.0.1:34194 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): GSL: Precondition failure at /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/kv_engine/engines/ep/src/vbucket.cc: 3151
            

            Which was seen in kv-engine-jepsen-post-commit-134.

            Show
            richard.demellow Richard deMellow added a comment - Reopening as commit a9f305bdfc44d543615a05765f7a19335e156824 doesn't fully fix this bug. Which means we observe this error message: 2019-07-16T03:40:21.203620-07:00 ERROR 46: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:34194 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f1c39ea8010","ewouldblock":false,"packet":{"bodylen":48,"cas":1563273608419016704,"datatype":["JSON","Snappy"],"extlen":33,"key":"<ud>.jepsen0010</ud>","keylen":11,"magic":"ClientRequest","opaque":103,"opcode":"DCP_PREPARE","vbucket":148},"refcount":1}] - closing connection ([ 127.0.0.1:34194 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): GSL: Precondition failure at /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/kv_engine/engines/ep/src/vbucket.cc: 3151 Which was seen in kv-engine-jepsen-post-commit-134 .
            Hide
            richard.demellow Richard deMellow added a comment -

            Daniel Owen I've assigned it back to you but I believe James Harrison might be fixing this already.

            Show
            richard.demellow Richard deMellow added a comment - Daniel Owen I've assigned it back to you but I believe James Harrison might be fixing this already.
            Hide
            richard.demellow Richard deMellow added a comment - - edited

            Added second set log logs 20190716T033630.000-0700.zip for this failure

            Show
            richard.demellow Richard deMellow added a comment - - edited Added second set log logs 20190716T033630.000-0700.zip for this failure
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-3790 contains kv_engine commit 9bfe8e2 with commit message:
            MB-35053: Extend allowedDuplicatePrepareSeqnos to highSeqno

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-3790 contains kv_engine commit 9bfe8e2 with commit message: MB-35053 : Extend allowedDuplicatePrepareSeqnos to highSeqno
            Hide
            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Closing the defect based on the unit_test run.

            Show
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Closing the defect based on the unit_test run.

              People

              • Assignee:
                richard.demellow Richard deMellow
                Reporter:
                richard.demellow Richard deMellow
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Gerrit Reviews

                  There are no open Gerrit changes

                    PagerDuty

                    Error rendering 'com.pagerduty.jira-server-plugin:PagerDuty'. Please contact your Jira administrators.