Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-37205

[Jepsen] Couchbase crash during swap-rebalance nemesis due to an item being uncommitted in processSet()

    XMLWordPrintable

Details

    • Triaged
    • No
    • KV Sprint 2019-12

    Description

      During kv-engine-jepsen-nightly-351 we observed a and crash during the following test:
      lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=rebalance --scenario=swap-rebalance --no-autofailover --replicas=2 --disrupt-count=1 --kv-timeout=30 --durability=0:0:0:100 --use-json-docs --doc-padding-size=4096 --enable-tcp-capture --hashdump --enable-memcached-debug-log-level
      Crash Message:

      2019-12-07T06:07:09.270558+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/vbucket.cc: 3180
      

      vbucket.cc:3180

              Expects(itm.isCommitted());
              getPassiveDM().completeSyncWrite(
                      itm.getKey(),
                      PassiveDurabilityMonitor::Resolution::Commit,
                      v->getBySeqno() /* prepareSeqno */);
      

      Couchbase version 6.5.1-6000

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            richard.demellow Richard deMellow created issue -
            owend Daniel Owen made changes -
            Field Original Value New Value
            Assignee Daniel Owen [ owend ] Ben Huddleston [ ben.huddleston ]
            owend Daniel Owen made changes -
            Fix Version/s Mad-Hatter [ 15037 ]
            owend Daniel Owen made changes -
            Priority Major [ 3 ] Critical [ 2 ]

            Still uploading logs will add them to the MB when complete

            richard.demellow Richard deMellow added a comment - Still uploading logs will add them to the MB when complete
            richard.demellow Richard deMellow made changes -
            Summary [Jepsen] Couchbase crash during [Jepsen] Couchbase crash during swap-rebalance nemesis due to an item being uncommitted in processSet()
            owend Daniel Owen made changes -
            Sprint KV Sprint 2019-12 [ 939 ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            richard.demellow Richard deMellow made changes -
            Remote Link This issue links to "20191206T215739.000-0800.zip (Web Link)" [ 18945 ]
            richard.demellow Richard deMellow added a comment - - edited

            Added logs and crash dumps, please note that the server is running ubuntu1604.

            richard.demellow Richard deMellow added a comment - - edited Added logs and crash dumps, please note that the server is running ubuntu1604.
            richard.demellow Richard deMellow made changes -
            Description During [kv-engine-jepsen-nightly-351|http://cv.jenkins.couchbase.com/view/Commit%20Validation/view/Jepsen/job/kv-engine-jepsen-nightly/351/] we observed a and crash during the following test:
            {{}}
            Crash Message:
            {noformat}
            2019-12-07T06:07:09.270558+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/vbucket.cc: 3180
            {noformat}

            {code:cpp|title=vbucket.cc:3180}
                    Expects(itm.isCommitted());
                    getPassiveDM().completeSyncWrite(
                            itm.getKey(),
                            PassiveDurabilityMonitor::Resolution::Commit,
                            v->getBySeqno() /* prepareSeqno */);
            {code}
            Couchbase version 6.5.1-6000
            During [kv-engine-jepsen-nightly-351|http://cv.jenkins.couchbase.com/view/Commit%20Validation/view/Jepsen/job/kv-engine-jepsen-nightly/351/] we observed a and crash during the following test:
            {{lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=rebalance --scenario=swap-rebalance --no-autofailover --replicas=2 --disrupt-count=1 --kv-timeout=30 --durability=0:0:0:100 --use-json-docs --doc-padding-size=4096 --enable-tcp-capture --hashdump --enable-memcached-debug-log-level}}
            Crash Message:
            {noformat}
            2019-12-07T06:07:09.270558+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/vbucket.cc: 3180
            {noformat}

            {code:cpp|title=vbucket.cc:3180}
                    Expects(itm.isCommitted());
                    getPassiveDM().completeSyncWrite(
                            itm.getKey(),
                            PassiveDurabilityMonitor::Resolution::Commit,
                            v->getBySeqno() /* prepareSeqno */);
            {code}
            Couchbase version 6.5.1-6000

            Seems simple enough. Got lucky with the minidump and got a vBucket ID out of it. 254.

            Replica side of logs

            2019-12-07T06:07:15.133675+00:00 INFO 4: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Attempting to add stream: opaque_:171, start_seqno_:4, end_seqno_:18446744073709551615, vb_uuid:24555074761989, snap_start_seqno_:4, snap_end_seqno_:5, last_seqno:4, stream_req_value:{"uid":"0"}
            <=========== Crash
            

            Active side

            2019-12-07T06:07:15.299006+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Creating stream with start seqno 4 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
            2019-12-07T06:07:15.299016+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default" backfill:true, seqno:333
            2019-12-07T06:07:15.299020+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 5 to 332, reschedule flag : False
            2019-12-07T06:07:15.537505+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 4, end 333, and high completed  331, max visible 332
            2019-12-07T06:07:15.537566+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:334
            2019-12-07T06:07:17.020034+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 2 items read from disk, 0 from memory, last seqno read: 333, pendingBackfill : False
            2019-12-07T06:07:20.853995+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Handling slow stream; state_ : backfilling, lastReadSeqno : 333, lastSentSeqno : 332, vBucketHighSeqno : 336, isBackfillTaskRunning : False
            2019-12-07T06:07:23.098829+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 334 to 336, reschedule flag : True
            2019-12-07T06:07:23.099966+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 334, end 336, and high completed  335, max visible 336
            2019-12-07T06:07:23.099975+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:337
            2019-12-07T06:07:23.181236+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 4 items read from disk, 0 from memory, last seqno read: 336, pendingBackfill : False
            2019-12-07T06:07:23.274450+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::transitionState: Transitioning from backfilling to in-memory
            2019-12-07T06:07:25.543593+00:00 WARNING 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Stream closing, sent until seqno 336 remaining items 0, reason: The stream closed early because the conn was disconnected
            

            Issue is that we end up in the implicit completion of a prepare path (should be executed only for mutations) for a prepare in a disk snapshot. This is because we don't set the allowedDuplicatePrepareSeqno correctly. Currently we set it on the first snapshot marker we see when we create our stream. We should instead set it for every disk snapshot (i.e. before a snapshot with potential de-dupe).

            ben.huddleston Ben Huddleston added a comment - Seems simple enough. Got lucky with the minidump and got a vBucket ID out of it. 254. Replica side of logs 2019-12-07T06:07:15.133675+00:00 INFO 4: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Attempting to add stream: opaque_:171, start_seqno_:4, end_seqno_:18446744073709551615, vb_uuid:24555074761989, snap_start_seqno_:4, snap_end_seqno_:5, last_seqno:4, stream_req_value:{"uid":"0"} <=========== Crash Active side 2019-12-07T06:07:15.299006+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Creating stream with start seqno 4 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none 2019-12-07T06:07:15.299016+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default" backfill:true, seqno:333 2019-12-07T06:07:15.299020+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 5 to 332, reschedule flag : False 2019-12-07T06:07:15.537505+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 4, end 333, and high completed 331, max visible 332 2019-12-07T06:07:15.537566+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:334 2019-12-07T06:07:17.020034+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 2 items read from disk, 0 from memory, last seqno read: 333, pendingBackfill : False 2019-12-07T06:07:20.853995+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Handling slow stream; state_ : backfilling, lastReadSeqno : 333, lastSentSeqno : 332, vBucketHighSeqno : 336, isBackfillTaskRunning : False 2019-12-07T06:07:23.098829+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 334 to 336, reschedule flag : True 2019-12-07T06:07:23.099966+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 334, end 336, and high completed 335, max visible 336 2019-12-07T06:07:23.099975+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:337 2019-12-07T06:07:23.181236+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 4 items read from disk, 0 from memory, last seqno read: 336, pendingBackfill : False 2019-12-07T06:07:23.274450+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::transitionState: Transitioning from backfilling to in-memory 2019-12-07T06:07:25.543593+00:00 WARNING 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Stream closing, sent until seqno 336 remaining items 0, reason: The stream closed early because the conn was disconnected Issue is that we end up in the implicit completion of a prepare path (should be executed only for mutations) for a prepare in a disk snapshot. This is because we don't set the allowedDuplicatePrepareSeqno correctly. Currently we set it on the first snapshot marker we see when we create our stream. We should instead set it for every disk snapshot (i.e. before a snapshot with potential de-dupe).
            owend Daniel Owen made changes -
            Due Date 10/Dec/19

            Got unit test to repro, code looks relatively simple to fix.

            ben.huddleston Ben Huddleston added a comment - Got unit test to repro, code looks relatively simple to fix.
            lynn.straus Lynn Straus made changes -
            Labels approved-for-mad-hatter
            lynn.straus Lynn Straus made changes -
            Link This issue blocks MB-36676 [ MB-36676 ]
            owend Daniel Owen made changes -
            Triage Untriaged [ 10351 ] Triaged [ 10350 ]
            owend Daniel Owen made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Daniel Owen [ owend ]
            owend Daniel Owen made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Ben Huddleston [ ben.huddleston ]
            owend Daniel Owen made changes -
            Is this a Regression? Unknown [ 10452 ] No [ 10451 ]
            lynn.straus Lynn Straus made changes -
            Due Date 10/Dec/19 11/Dec/19
            drigby Dave Rigby added a comment -

            Expect fix to be merged today (just waiting on CV).

            drigby Dave Rigby added a comment - Expect fix to be merged today (just waiting on CV).
            drigby Dave Rigby made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Richard deMellow [ richard.demellow ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]

            Build couchbase-server-6.5.0-4940 contains kv_engine commit ec90c05 with commit message:
            MB-37205: Allow duplicate prepares on Disk snapshots

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4940 contains kv_engine commit ec90c05 with commit message: MB-37205 : Allow duplicate prepares on Disk snapshots

            Build couchbase-server-6.5.1-6005 contains kv_engine commit ec90c05 with commit message:
            MB-37205: Allow duplicate prepares on Disk snapshots

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.1-6005 contains kv_engine commit ec90c05 with commit message: MB-37205 : Allow duplicate prepares on Disk snapshots
            ritam.sharma Ritam Sharma made changes -
            Labels approved-for-mad-hatter approved-for-mad-hatter request-dev-verify

            Build couchbase-server-7.0.0-1133 contains kv_engine commit ec90c05 with commit message:
            MB-37205: Allow duplicate prepares on Disk snapshots

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1133 contains kv_engine commit ec90c05 with commit message: MB-37205 : Allow duplicate prepares on Disk snapshots

            Closing a we no longer see this crash on the Jepsen CV jobs after the fixes were committed.

            richard.demellow Richard deMellow added a comment - Closing a we no longer see this crash on the Jepsen CV jobs after the fixes were committed.
            richard.demellow Richard deMellow made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              richard.demellow Richard deMellow
              richard.demellow Richard deMellow
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty