During kv-engine-jepsen-nightly-351 we observed a and crash during the following test: lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=rebalance --scenario=swap-rebalance --no-autofailover --replicas=2 --disrupt-count=1 --kv-timeout=30 --durability=0:0:0:100 --use-json-docs --doc-padding-size=4096 --enable-tcp-capture --hashdump --enable-memcached-debug-log-level
Crash Message:
2019-12-07T06:07:15.299006+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Creating stream with start seqno 4 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
2019-12-07T06:07:15.299016+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default" backfill:true, seqno:333
2019-12-07T06:07:15.299020+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 5 to 332, reschedule flag : False
2019-12-07T06:07:15.537505+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 4, end 333, and high completed 331, max visible 332
2019-12-07T06:07:15.537566+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:334
2019-12-07T06:07:17.020034+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 2 items read from disk, 0 from memory, last seqno read: 333, pendingBackfill : False
2019-12-07T06:07:23.098829+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 334 to 336, reschedule flag : True
2019-12-07T06:07:23.099966+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 334, end 336, and high completed 335, max visible 336
2019-12-07T06:07:23.099975+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:337
2019-12-07T06:07:23.181236+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 4 items read from disk, 0 from memory, last seqno read: 336, pendingBackfill : False
2019-12-07T06:07:23.274450+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::transitionState: Transitioning from backfilling to in-memory
2019-12-07T06:07:25.543593+00:00 WARNING 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Stream closing, sent until seqno 336 remaining items 0, reason: The stream closed early because the conn was disconnected
Issue is that we end up in the implicit completion of a prepare path (should be executed only for mutations) for a prepare in a disk snapshot. This is because we don't set the allowedDuplicatePrepareSeqno correctly. Currently we set it on the first snapshot marker we see when we create our stream. We should instead set it for every disk snapshot (i.e. before a snapshot with potential de-dupe).
Ben Huddleston
added a comment - Seems simple enough. Got lucky with the minidump and got a vBucket ID out of it. 254.
Replica side of logs
2019-12-07T06:07:15.133675+00:00 INFO 4: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Attempting to add stream: opaque_:171, start_seqno_:4, end_seqno_:18446744073709551615, vb_uuid:24555074761989, snap_start_seqno_:4, snap_end_seqno_:5, last_seqno:4, stream_req_value:{"uid":"0"}
<=========== Crash
Active side
2019-12-07T06:07:15.299006+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Creating stream with start seqno 4 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
2019-12-07T06:07:15.299016+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default" backfill:true, seqno:333
2019-12-07T06:07:15.299020+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 5 to 332, reschedule flag : False
2019-12-07T06:07:15.537505+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 4, end 333, and high completed 331, max visible 332
2019-12-07T06:07:15.537566+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:334
2019-12-07T06:07:17.020034+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 2 items read from disk, 0 from memory, last seqno read: 333, pendingBackfill : False
2019-12-07T06:07:20.853995+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Handling slow stream; state_ : backfilling, lastReadSeqno : 333, lastSentSeqno : 332, vBucketHighSeqno : 336, isBackfillTaskRunning : False
2019-12-07T06:07:23.098829+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Scheduling backfill from 334 to 336, reschedule flag : True
2019-12-07T06:07:23.099966+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 334, end 336, and high completed 335, max visible 336
2019-12-07T06:07:23.099975+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default", backfill:false, seqno:337
2019-12-07T06:07:23.181236+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Backfill complete, 4 items read from disk, 0 from memory, last seqno read: 336, pendingBackfill : False
2019-12-07T06:07:23.274450+00:00 INFO 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) ActiveStream::transitionState: Transitioning from backfilling to in-memory
2019-12-07T06:07:25.543593+00:00 WARNING 53: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.42:default - (vb:254) Stream closing, sent until seqno 336 remaining items 0, reason: The stream closed early because the conn was disconnected
Issue is that we end up in the implicit completion of a prepare path (should be executed only for mutations) for a prepare in a disk snapshot. This is because we don't set the allowedDuplicatePrepareSeqno correctly. Currently we set it on the first snapshot marker we see when we create our stream. We should instead set it for every disk snapshot (i.e. before a snapshot with potential de-dupe).
Build couchbase-server-6.5.0-4940 contains kv_engine commit ec90c05 with commit message: MB-37205: Allow duplicate prepares on Disk snapshots
Couchbase Build Team
added a comment - Build couchbase-server-6.5.0-4940 contains kv_engine commit ec90c05 with commit message:
MB-37205 : Allow duplicate prepares on Disk snapshots
Build couchbase-server-6.5.1-6005 contains kv_engine commit ec90c05 with commit message: MB-37205: Allow duplicate prepares on Disk snapshots
Couchbase Build Team
added a comment - Build couchbase-server-6.5.1-6005 contains kv_engine commit ec90c05 with commit message:
MB-37205 : Allow duplicate prepares on Disk snapshots
Build couchbase-server-7.0.0-1133 contains kv_engine commit ec90c05 with commit message: MB-37205: Allow duplicate prepares on Disk snapshots
Couchbase Build Team
added a comment - Build couchbase-server-7.0.0-1133 contains kv_engine commit ec90c05 with commit message:
MB-37205 : Allow duplicate prepares on Disk snapshots
Still uploading logs will add them to the MB when complete