Details
- Bug
- Resolution: Fixed
- Critical
- master, 6.5.0
- Untriaged
- No
- KV-Engine Mad-Hatter Beta
Description
During kv-engine-jepsen-post-commit-124 we see the following error messages in memcached.log:
2019-07-12T05:11:03.695563-07:00 ERROR 52: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:38391 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fed62857010","ewouldblock":false,"packet":{"bodylen":46,"cas":1562933463452286976,"datatype":["JSON"],"extlen":33,"key":"<ud>.jepsen0003</ud>","keylen":11,"magic":"ClientRequest","opaque":52,"opcode":"DCP_PREPARE","vbucket":134},"refcount":1}] - closing connection ([ 127.0.0.1:38391 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): CheckpointManager::queueDirty(vb:134) - got Ckpt::queueDirty() status:failure:duplicate item when vbstate is non-active:2
2019-07-12T05:11:06.288306-07:00 ERROR 54: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:41366 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fed628a2810","ewouldblock":false,"packet":{"bodylen":48,"cas":1562933465434030080,"datatype":["JSON","Snappy"],"extlen":33,"key":"<ud>.jepsen0003</ud>","keylen":11,"magic":"ClientRequest","opaque":52,"opcode":"DCP_PREPARE","vbucket":134},"refcount":1}] - closing connection ([ 127.0.0.1:41366 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): GSL: Precondition failure at /home/couchbase/jenkins/workspace/kv-engine-jepsen-post-commit/kv_engine/engines/ep/src/vbucket.cc: 3151
... Same error message repeats after connect / reconnect attempts ...
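The "GSL: Precondition failure" text in the second error comes from a contract check (an Expects() call, shown further down at vbucket.cc:3151) rather than a normal engine error path: when the check fails an exception carrying the file and line of the violated precondition is thrown, the connection runloop catches it, logs the cookie info, and closes the connection. A minimal standalone sketch of that behaviour, using a simplified stand-in macro rather than the real GSL library and assuming a throw-on-violation build (as the log output suggests):

    #include <iostream>
    #include <sstream>
    #include <stdexcept>

    // Simplified stand-in for GSL's Expects() under a throw-on-violation
    // build. Illustrative only; not the kv_engine/GSL implementation.
    #define EXPECTS(cond)                                                   \
        do {                                                                \
            if (!(cond)) {                                                  \
                std::ostringstream oss;                                     \
                oss << "GSL: Precondition failure at " << __FILE__ << ": "  \
                    << __LINE__;                                            \
                throw std::logic_error(oss.str());                          \
            }                                                               \
        } while (0)

    int main() {
        bool itemIsCommitted = false; // incoming item is a DCP_PREPARE, not a commit
        try {
            EXPECTS(itemIsCommitted); // mirrors Expects(itm.isCommitted())
        } catch (const std::exception& e) {
            // The runloop catches the exception, logs it and closes the
            // connection, producing the memcached.log entries above.
            std::cout << e.what() << "\n";
        }
    }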
This test performs a Jepsen register workload and, during the run, hard-fails over one of the nodes. Jepsen does not see anything wrong until the subsequent cluster rebalance fails, which causes a crash in the Jepsen framework.
The following parameters were used for this test:
workload=failover,failover-type=hard,recovery-type=delta,replicas=2,no-autofailover,disrupt-count=1,rate=0,durability=0:100:0:0,eviction-policy=value
To run the test, use this command:
lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --package {path to package} --workload=failover --failover-type=hard --recovery-type=delta --replicas=2 --no-autofailover --disrupt-count=1 --rate=0 --durability=0:100:0:0 --eviction-policy=value
Code around vbucket.cc:3151 (VBucket::processSet):
std::pair<MutationStatus, boost::optional<VBNotifyCtx>> VBucket::processSet(
        HashTable::FindCommitResult& htRes,
        StoredValue*& v,
        Item& itm,
        uint64_t cas,
        bool allowExisting,
        bool hasMetaData,
        const VBQueueItemCtx& queueItmCtx,
        cb::StoreIfStatus storeIfStatus,
        bool maybeKeyExists) {
    if (v) {
        if (v->isPending()) {
            // It is not valid for an active vBucket to attempt to overwrite an
            // in flight SyncWrite. If this vBucket is not active, we are
            // allowed to overwrite an in flight SyncWrite iff we are receiving
            // a disk snapshot. This is due to disk based de-dupe that allows
            // only 1 value per key. In this case, the active node may send a
            // mutation instead of a commit if it knows that this replica may be
            // missing a prepare. This code allows this mutation to be accepted
            // and overwrites the existing prepare.
            if (getState() == vbucket_state_active ||
                !isReceivingDiskSnapshot()) {
                return {MutationStatus::IsPendingSyncWrite, {}};
            }

            Expects(itm.isCommitted()); // vbucket.cc:3151
            getPassiveDM().completeSyncWrite(
                    itm.getKey(), PassiveDurabilityMonitor::Resolution::Commit);

            // Deal with the already existing prepare
            processImplicitlyCompletedPrepare(htRes.pending);

            // Add a new or overwrite the existing mutation
            return processSetInner(htRes.pending.getHBL(),
                                   htRes.committed,
                                   itm,
                                   cas,
                                   allowExisting,
                                   hasMetaData,
                                   queueItmCtx,
                                   storeIfStatus,
                                   maybeKeyExists);
        }

        // This is a new SyncWrite, we just want to add a new prepare unless we
        // still have a completed prepare (Ephemeral) which we should replace
        // instead.
        if (!v->isCompleted() && itm.isPending()) {
            VBNotifyCtx notifyCtx;
            // We have to modify the StoredValue pointer passed in or we do not
            // return the correct cas to the client.
            std::tie(v, notifyCtx) = addNewStoredValue(
                    htRes.getHBL(), itm, queueItmCtx, GenerateRevSeqno::No);
            return {MutationStatus::WasClean, notifyCtx};
        }
    }
Looking at the logs and at VBucket::processSet, it appears we are streaming to a replica node and attempting to apply a prepared set on top of an already existing prepared set.
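As a rough illustration of why that combination trips the check: the Expects at 3151 is only reachable when the vBucket is non-active and receiving a disk snapshot, and in that branch only a committed item may overwrite the in-flight prepare, so a second prepare for the same key fails the precondition. Below is a minimal standalone model of that branch logic; the enum types and the processSetModel function are hypothetical simplifications for illustration, not kv_engine code.

    #include <iostream>
    #include <stdexcept>

    // Hypothetical simplifications of the states involved; the real
    // kv_engine types differ.
    enum class ItemState { Pending, Committed };
    enum class VBState { Active, Replica };

    // Models the branch in VBucket::processSet around vbucket.cc:3151.
    const char* processSetModel(bool storedValueIsPending,
                                VBState state,
                                bool receivingDiskSnapshot,
                                ItemState incoming) {
        if (storedValueIsPending) {
            if (state == VBState::Active || !receivingDiskSnapshot) {
                // An active vBucket (or a replica outside a disk snapshot)
                // must never overwrite an in-flight SyncWrite.
                return "rejected: IsPendingSyncWrite";
            }
            // Equivalent of Expects(itm.isCommitted()): during a disk
            // snapshot only a committed mutation may overwrite the prepare.
            if (incoming != ItemState::Committed) {
                throw std::logic_error(
                        "GSL: Precondition failure (itm.isCommitted())");
            }
            return "accepted: existing prepare implicitly completed";
        }
        return "accepted: normal set";
    }

    int main() {
        // The combination suggested by the Jepsen logs: a replica receiving
        // a disk snapshot gets a second DCP_PREPARE for a key that already
        // has a pending prepare in the HashTable.
        try {
            std::cout << processSetModel(
                                 true, VBState::Replica, true, ItemState::Pending)
                      << "\n";
        } catch (const std::logic_error& e) {
            std::cout << e.what() << "\n";
        }
    }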
Issue Links
- blocks: MB-34643 DCP replication does not recover after SIGKILL/restart to memcached (dupe PREPARE?) [ETA 2019/7/19] (Closed)
- is duplicated by: MB-35078 [Rebalance failure] GSL: Precondition failure being triggered when HCS > HPS [2019/7/19] (Closed)
- relates to: MB-35078 [Rebalance failure] GSL: Precondition failure being triggered when HCS > HPS [2019/7/19] (Closed)