Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.1.0
Affects Version/s: 7.1.0
Component/s: couchbase-bucket
Labels:
- magma

Triage:
Untriaged
Story Points:
1
Is this a Regression?:
Unknown
Sprint:
KV 2022-Feb, KV March-22

Description

Summary

This MB has been spun out of MB-51223 to track the specific issue where the vBucket purge seqno differs in-memory vs on-disk.

This results in a DCP StreamRequest initially returning success - as FailoverTable::needsRollback returns false (based on in-memory state), but when the actual DCPBackfill starts, Magma returns that the purge seqno is greater than the start seqno, which results in the backfill being aborted and StreamEnd being sent.

Here is the main sequence of events (.107.126, acting as DcpProducer):

Node is started and bucket "GleamBookUsers0" created using Magma:

2022-03-01T14:26:55.854753-08:00 INFO Couchbase version 7.1.0-2400 starting.

2022-03-01T14:26:56.550939-08:00 INFO 59: Create Couchbase bucket [GleamBookUsers0]

2022-03-01T14:26:56.552018-08:00 INFO 59: Initialize Couchbase bucket [GleamBookUsers0] using configuration: "max_size=4294967296;dbname=/data/GleamBookUsers0;backend=magma;couch_bucket=GleamBookUsers0;max_vbuckets=1024;alog_path=/data/GleamBookUsers0/access.log;data_traffic_enabled=false;max_num_workers=3;uuid=d252f7d44b2c55bbf9813f601f721c89;conflict_resolution_type=seqno;bucket_type=persistent;durability_min_level=none;pitr_enabled=false;pitr_granularity=600;pitr_max_history_age=86400;magma_fragmentation_percentage=50;magma_mem_quota_ratio=0.5;item_eviction_policy=full_eviction;persistent_metadata_purge_age=3456;max_ttl=0;ht_locks=47;compression_mode=passive;failpartialwarmup=false"

Warmup starts, and vBucket 365 is created by loading previous state from disk. Of note is the purge_seqno of zero in the loaded vBucket state:

2022-03-01T14:27:19.414592-08:00 INFO (GleamBookUsers0) VBucket: created vb:365 with state:active initialState:active lastSeqno:2626102 persistedRange:{2626102,2626102} purge_seqno:0 max_cas:1646173598022893568 uuid:199120553667561 topology:[["ns_1@172.23.107.126","ns_1@172.23.107.240"]]

2022-03-01T14:27:19.414638-08:00 INFO (GleamBookUsers0) Warmup::createVBuckets: vb:365 created new failover entry with uuid:68781135086884 and seqno:2626102 due to unclean shutdown

Warmup continues, and loads perpared SyncWrites from disk. As part of this we create a Magma seqno scan context - where the purgeSeqno is non-zero - it is 2622682:

2022-03-01T14:27:20.248849-08:00 INFO (GleamBookUsers0) MagmaKVStore::initBySeqnoScanContext vb:365 seqno:2626049 endSeqno:2626102 purgeSeqno:2622682 nDocsToRead:54 docFilter:ALL_ITEMS valFilter:VALUES_COMPRESSED SeqIterator:0x7f858bf3ca80

2022-03-01T14:27:20.250491-08:00 INFO (GleamBookUsers0) EPVBucket::loadOutstandingPrepares: (vb:365) created DM with PCS:2626048, PPS:2626102, HPS:2626102, number of prepares loaded:46, outstandingPrepares seqnoRange:[2626052 -> 2626102]

Warmup then loads key/value pairs; this again creates a by-seqno scan context, which also shows the same non-zero purge seqno as the previous scan, Warmup completes by 14:27:32:

2022-03-01T14:27:32.288428-08:00 INFO (GleamBookUsers0) MagmaKVStore::initBySeqnoScanContext vb:365 seqno:0 endSeqno:2626102 purgeSeqno:2622682 nDocsToRead:2626103 docFilter:NO_DELETES valFilter:VALUES_COMPRESSED SeqIterator:0x7f858f18a580

2022-03-01T14:27:32.461075-08:00 INFO (GleamBookUsers0) LoadStorageKVPairCallback::callback(): vb:365 Engine warmup is complete, request to stop loading remaining database

2022-03-01T14:27:32.461088-08:00 INFO (GleamBookUsers0) WarmupVbucketVisitor::visit(): vb:365 shardId:5 lastReadSeqno:50 needToScanAgain:false vbucket memory limit has been reached703487:2022-03-01T14:27:32.461088-08:00 INFO (GleamBookUsers0) WarmupVbucketVisitor::visit(): vb:365 shardId:5 lastReadSeqno:50 needToScanAgain:false vbucket memory limit has been reached

~18 seconds later, a StreamRequest is received and processed. Note that this request succeeds (i.e. no rollback), by virtue of us printing the "Creating stream" message, and also the attempt to schedule a backfill:

2022-03-01T14:27:50.773675-08:00 INFO 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - (vb:365) Creating stream with start seqno 1810640 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, snapshot:{0,2626064} collections-filter-size:0 sid:none

2022-03-01T14:27:50.773681-08:00 INFO 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - (vb:365) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0" backfill:true, seqno:2626103

2022-03-01T14:27:50.773768-08:00 INFO 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - (vb:365) Scheduling backfill from 1810641 to 2626102, reschedule flag : False

Again, when the scan context is initialised we see the same non-zero purge seqno. This results in the backfill failing as we find that the purgeSeqno is less than the start seqno:

2022-03-01T14:27:50.774278-08:00 INFO (GleamBookUsers0) MagmaKVStore::initBySeqnoScanContext vb:365 seqno:1810641 endSeqno:2626102 purgeSeqno:2622682 nDocsToRead:815462 docFilter:ALL_ITEMS valFilter:VALUES_COMPRESSED SeqIterator:0x7f858ba03480

2022-03-01T14:27:50.774307-08:00 WARNING 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - DCPBackfillBySeqnoDisk::create(): (vb:365) cannot be scanned. Associated stream is set to dead state. startSeqno:1810641 < purgeSeqno:2622682. The vbucket state:active, collHigh-valid:false, collHigh:18446744073709551615

2022-03-01T14:27:50.774328-08:00 WARNING 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - (vb:365) Stream closing, sent until seqno 1810640 remaining items 0, reason: The stream closed early because the vbucket rollback'ed

(See MB-51223 for full logs.)

Q: Why is the purgeSeqno initially zero when the vBucket is created from loading on-disk state, but when we initialise a scan context at 2022-03-01T14:27:20.248849 (less than 1 second later) it is now non-zero?
Q2: Even if the purgeSeqno does change on-disk; why is the in-memory version (VBucket::purge_seqno) not updated?