Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-37205

[Jepsen] Couchbase crash during swap-rebalance nemesis due to an item being uncommitted in processSet()

    XMLWordPrintable

Details

    • Triaged
    • No
    • KV Sprint 2019-12

    Description

      During kv-engine-jepsen-nightly-351 we observed a and crash during the following test:
      lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=rebalance --scenario=swap-rebalance --no-autofailover --replicas=2 --disrupt-count=1 --kv-timeout=30 --durability=0:0:0:100 --use-json-docs --doc-padding-size=4096 --enable-tcp-capture --hashdump --enable-memcached-debug-log-level
      Crash Message:

      2019-12-07T06:07:09.270558+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/vbucket.cc: 3180
      

      vbucket.cc:3180

              Expects(itm.isCommitted());
              getPassiveDM().completeSyncWrite(
                      itm.getKey(),
                      PassiveDurabilityMonitor::Resolution::Commit,
                      v->getBySeqno() /* prepareSeqno */);
      

      Couchbase version 6.5.1-6000

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Still uploading logs will add them to the MB when complete

            richard.demellow Richard deMellow added a comment - Still uploading logs will add them to the MB when complete
            richard.demellow Richard deMellow added a comment - - edited

            Added logs and crash dumps, please note that the server is running ubuntu1604.

            richard.demellow Richard deMellow added a comment - - edited Added logs and crash dumps, please note that the server is running ubuntu1604.

            Seems simple enough. Got lucky with the minidump and got a vBucket ID out of it. 254.

            Replica side of logs

            2019-12-07T06:07:15.133675+00:00 INFO 4: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Attempting to add stream: opaque_:171, start_seqno_:4, end_seqno_:18446744073709551615, vb_uuid:24555074761989, snap_start_seqno_:4, snap_end_seqno_:5, last_seqno:4, stream_req_value:{"uid":"0"}
            <=========== Crash
            

            Active side

            2019-12-07T06:07:15.299006+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Creating stream with start seqno 4 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
            2019-12-07T06:07:15.299016+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default" backfill:true, seqno:333
            2019-12-07T06:07:15.299020+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 5 to 332, reschedule flag : False
            2019-12-07T06:07:15.537505+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 4, end 333, and high completed  331, max visible 332
            2019-12-07T06:07:15.537566+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:334
            2019-12-07T06:07:17.020034+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 2 items read from disk, 0 from memory, last seqno read: 333, pendingBackfill : False
            2019-12-07T06:07:20.853995+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Handling slow stream; state_ : backfilling, lastReadSeqno : 333, lastSentSeqno : 332, vBucketHighSeqno : 336, isBackfillTaskRunning : False
            2019-12-07T06:07:23.098829+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 334 to 336, reschedule flag : True
            2019-12-07T06:07:23.099966+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 334, end 336, and high completed  335, max visible 336
            2019-12-07T06:07:23.099975+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:337
            2019-12-07T06:07:23.181236+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 4 items read from disk, 0 from memory, last seqno read: 336, pendingBackfill : False
            2019-12-07T06:07:23.274450+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::transitionState: Transitioning from backfilling to in-memory
            2019-12-07T06:07:25.543593+00:00 WARNING 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Stream closing, sent until seqno 336 remaining items 0, reason: The stream closed early because the conn was disconnected
            

            Issue is that we end up in the implicit completion of a prepare path (should be executed only for mutations) for a prepare in a disk snapshot. This is because we don't set the allowedDuplicatePrepareSeqno correctly. Currently we set it on the first snapshot marker we see when we create our stream. We should instead set it for every disk snapshot (i.e. before a snapshot with potential de-dupe).

            ben.huddleston Ben Huddleston added a comment - Seems simple enough. Got lucky with the minidump and got a vBucket ID out of it. 254. Replica side of logs 2019-12-07T06:07:15.133675+00:00 INFO 4: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Attempting to add stream: opaque_:171, start_seqno_:4, end_seqno_:18446744073709551615, vb_uuid:24555074761989, snap_start_seqno_:4, snap_end_seqno_:5, last_seqno:4, stream_req_value:{"uid":"0"} <=========== Crash Active side 2019-12-07T06:07:15.299006+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Creating stream with start seqno 4 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none 2019-12-07T06:07:15.299016+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default" backfill:true, seqno:333 2019-12-07T06:07:15.299020+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 5 to 332, reschedule flag : False 2019-12-07T06:07:15.537505+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 4, end 333, and high completed 331, max visible 332 2019-12-07T06:07:15.537566+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:334 2019-12-07T06:07:17.020034+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 2 items read from disk, 0 from memory, last seqno read: 333, pendingBackfill : False 2019-12-07T06:07:20.853995+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Handling slow stream; state_ : backfilling, lastReadSeqno : 333, lastSentSeqno : 332, vBucketHighSeqno : 336, isBackfillTaskRunning : False 2019-12-07T06:07:23.098829+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 334 to 336, reschedule flag : True 2019-12-07T06:07:23.099966+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 334, end 336, and high completed 335, max visible 336 2019-12-07T06:07:23.099975+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:337 2019-12-07T06:07:23.181236+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 4 items read from disk, 0 from memory, last seqno read: 336, pendingBackfill : False 2019-12-07T06:07:23.274450+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::transitionState: Transitioning from backfilling to in-memory 2019-12-07T06:07:25.543593+00:00 WARNING 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Stream closing, sent until seqno 336 remaining items 0, reason: The stream closed early because the conn was disconnected Issue is that we end up in the implicit completion of a prepare path (should be executed only for mutations) for a prepare in a disk snapshot. This is because we don't set the allowedDuplicatePrepareSeqno correctly. Currently we set it on the first snapshot marker we see when we create our stream. We should instead set it for every disk snapshot (i.e. before a snapshot with potential de-dupe).

            Got unit test to repro, code looks relatively simple to fix.

            ben.huddleston Ben Huddleston added a comment - Got unit test to repro, code looks relatively simple to fix.
            drigby Dave Rigby added a comment -

            Expect fix to be merged today (just waiting on CV).

            drigby Dave Rigby added a comment - Expect fix to be merged today (just waiting on CV).

            Build couchbase-server-6.5.0-4940 contains kv_engine commit ec90c05 with commit message:
            MB-37205: Allow duplicate prepares on Disk snapshots

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4940 contains kv_engine commit ec90c05 with commit message: MB-37205 : Allow duplicate prepares on Disk snapshots

            Build couchbase-server-6.5.1-6005 contains kv_engine commit ec90c05 with commit message:
            MB-37205: Allow duplicate prepares on Disk snapshots

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.1-6005 contains kv_engine commit ec90c05 with commit message: MB-37205 : Allow duplicate prepares on Disk snapshots

            Build couchbase-server-7.0.0-1133 contains kv_engine commit ec90c05 with commit message:
            MB-37205: Allow duplicate prepares on Disk snapshots

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1133 contains kv_engine commit ec90c05 with commit message: MB-37205 : Allow duplicate prepares on Disk snapshots

            Closing a we no longer see this crash on the Jepsen CV jobs after the fixes were committed.

            richard.demellow Richard deMellow added a comment - Closing a we no longer see this crash on the Jepsen CV jobs after the fixes were committed.

            People

              richard.demellow Richard deMellow
              richard.demellow Richard deMellow
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty