Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-51328

MagmaKVStore::listPersistedVBuckets returns purgeSeqno from wrong vbid

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 7.1.0
    • 7.1.0
    • couchbase-bucket
    • Untriaged
    • 1
    • Unknown
    • KV 2022-Feb, KV March-22

    Description

      Summary

      This MB has been spun out of MB-51223 to track the specific issue where the vBucket purge seqno differs in-memory vs on-disk.

      This results in a DCP StreamRequest initially returning success - as FailoverTable::needsRollback returns false (based on in-memory state), but when the actual DCPBackfill starts, Magma returns that the purge seqno is greater than the start seqno, which results in the backfill being aborted and StreamEnd being sent.

      Here is the main sequence of events (.107.126, acting as DcpProducer):

      1. Node is started and bucket "GleamBookUsers0" created using Magma:

        2022-03-01T14:26:55.854753-08:00 INFO Couchbase version 7.1.0-2400 starting.
        2022-03-01T14:26:56.550939-08:00 INFO 59: Create Couchbase bucket [GleamBookUsers0]
        2022-03-01T14:26:56.552018-08:00 INFO 59: Initialize Couchbase bucket [GleamBookUsers0] using configuration: "max_size=4294967296;dbname=/data/GleamBookUsers0;backend=magma;couch_bucket=GleamBookUsers0;max_vbuckets=1024;alog_path=/data/GleamBookUsers0/access.log;data_traffic_enabled=false;max_num_workers=3;uuid=d252f7d44b2c55bbf9813f601f721c89;conflict_resolution_type=seqno;bucket_type=persistent;durability_min_level=none;pitr_enabled=false;pitr_granularity=600;pitr_max_history_age=86400;magma_fragmentation_percentage=50;magma_mem_quota_ratio=0.5;item_eviction_policy=full_eviction;persistent_metadata_purge_age=3456;max_ttl=0;ht_locks=47;compression_mode=passive;failpartialwarmup=false"
        

      2. Warmup starts, and vBucket 365 is created by loading previous state from disk. Of note is the purge_seqno of zero in the loaded vBucket state:

        2022-03-01T14:27:19.414592-08:00 INFO (GleamBookUsers0) VBucket: created vb:365 with state:active initialState:active lastSeqno:2626102 persistedRange:{2626102,2626102} purge_seqno:0 max_cas:1646173598022893568 uuid:199120553667561 topology:[["ns_1@172.23.107.126","ns_1@172.23.107.240"]]
        2022-03-01T14:27:19.414638-08:00 INFO (GleamBookUsers0) Warmup::createVBuckets: vb:365 created new failover entry with uuid:68781135086884 and seqno:2626102 due to unclean shutdown
        

      3. Warmup continues, and loads perpared SyncWrites from disk. As part of this we create a Magma seqno scan context - where the purgeSeqno is non-zero - it is 2622682:

        2022-03-01T14:27:20.248849-08:00 INFO (GleamBookUsers0) MagmaKVStore::initBySeqnoScanContext vb:365 seqno:2626049 endSeqno:2626102 purgeSeqno:2622682 nDocsToRead:54 docFilter:ALL_ITEMS valFilter:VALUES_COMPRESSED SeqIterator:0x7f858bf3ca80
        2022-03-01T14:27:20.250491-08:00 INFO (GleamBookUsers0) EPVBucket::loadOutstandingPrepares: (vb:365) created DM with PCS:2626048, PPS:2626102, HPS:2626102, number of prepares loaded:46, outstandingPrepares seqnoRange:[2626052 -> 2626102]
        

      4. Warmup then loads key/value pairs; this again creates a by-seqno scan context, which also shows the same non-zero purge seqno as the previous scan, Warmup completes by 14:27:32:

        2022-03-01T14:27:32.288428-08:00 INFO (GleamBookUsers0) MagmaKVStore::initBySeqnoScanContext vb:365 seqno:0 endSeqno:2626102 purgeSeqno:2622682 nDocsToRead:2626103 docFilter:NO_DELETES valFilter:VALUES_COMPRESSED SeqIterator:0x7f858f18a580
        2022-03-01T14:27:32.461075-08:00 INFO (GleamBookUsers0) LoadStorageKVPairCallback::callback(): vb:365 Engine warmup is complete, request to stop loading remaining database
        2022-03-01T14:27:32.461088-08:00 INFO (GleamBookUsers0) WarmupVbucketVisitor::visit(): vb:365 shardId:5 lastReadSeqno:50 needToScanAgain:false vbucket memory limit has been reached703487:2022-03-01T14:27:32.461088-08:00 INFO (GleamBookUsers0) WarmupVbucketVisitor::visit(): vb:365 shardId:5 lastReadSeqno:50 needToScanAgain:false vbucket memory limit has been reached
        

      5. ~18 seconds later, a StreamRequest is received and processed. Note that this request succeeds (i.e. no rollback), by virtue of us printing the "Creating stream" message, and also the attempt to schedule a backfill:

        2022-03-01T14:27:50.773675-08:00 INFO 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - (vb:365) Creating stream with start seqno 1810640 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, snapshot:{0,2626064} collections-filter-size:0 sid:none
        2022-03-01T14:27:50.773681-08:00 INFO 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - (vb:365) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0" backfill:true, seqno:2626103
        2022-03-01T14:27:50.773768-08:00 INFO 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - (vb:365) Scheduling backfill from 1810641 to 2626102, reschedule flag : False
        

      6. Again, when the scan context is initialised we see the same non-zero purge seqno. This results in the backfill failing as we find that the purgeSeqno is less than the start seqno:

        2022-03-01T14:27:50.774278-08:00 INFO (GleamBookUsers0) MagmaKVStore::initBySeqnoScanContext vb:365 seqno:1810641 endSeqno:2626102 purgeSeqno:2622682 nDocsToRead:815462 docFilter:ALL_ITEMS valFilter:VALUES_COMPRESSED SeqIterator:0x7f858ba03480
        2022-03-01T14:27:50.774307-08:00 WARNING 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - DCPBackfillBySeqnoDisk::create(): (vb:365) cannot be scanned. Associated stream is set to dead state. startSeqno:1810641 < purgeSeqno:2622682. The vbucket state:active, collHigh-valid:false, collHigh:18446744073709551615
        2022-03-01T14:27:50.774328-08:00 WARNING 44134: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.126->ns_1@172.23.107.240:GleamBookUsers0 - (vb:365) Stream closing, sent until seqno 1810640 remaining items 0, reason: The stream closed early because the vbucket rollback'ed
        

      (See MB-51223 for full logs.)

      Q: Why is the purgeSeqno initially zero when the vBucket is created from loading on-disk state, but when we initialise a scan context at 2022-03-01T14:27:20.248849 (less than 1 second later) it is now non-zero?
      Q2: Even if the purgeSeqno does change on-disk; why is the in-memory version (VBucket::purge_seqno) not updated?

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Sounds good Dave Rigby. Although, prior to the last passed run on 2444 it was fairly reproducible(3/3 times). Running on multiple clusters at the moment on 2466. Let's see what shows up. If test passed then i will close this issue.

            ritesh.agarwal Ritesh Agarwal added a comment - Sounds good Dave Rigby . Although, prior to the last passed run on 2444 it was fairly reproducible(3/3 times). Running on multiple clusters at the moment on 2466. Let's see what shows up. If test passed then i will close this issue.
            drigby Dave Rigby added a comment -

            Ritesh Agarwal  To confirm, were the previous runs (when it failed 3/3) against 7.1.0-2363 ?

            drigby Dave Rigby added a comment - Ritesh Agarwal   To confirm, were the previous runs (when it failed 3/3) against 7.1.0-2363 ?

            Dave Rigby, yes it was failing before on 2363 and 2400 when tried the same test.

            Verified in 2466 and it passed 2/2 times. This same test will be running many a times before final release so that if there is any chance of hitting it again we should be able to get it. For now running on RC1(2475) build on 2 different clusters. Lets see if we hit MB-51223 again and see this one too.

            ritesh.agarwal Ritesh Agarwal added a comment - Dave Rigby , yes it was failing before on 2363 and 2400 when tried the same test. Verified in 2466 and it passed 2/2 times. This same test will be running many a times before final release so that if there is any chance of hitting it again we should be able to get it. For now running on RC1(2475) build on 2 different clusters. Lets see if we hit MB-51223 again and see this one too.

            Build couchbase-server-7.2.0-1021 contains kv_engine commit dba382a with commit message:
            MB-51328: Return correct purgeSeqno from MagmaKVStore::listPersistedVbuckets

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-1021 contains kv_engine commit dba382a with commit message: MB-51328 : Return correct purgeSeqno from MagmaKVStore::listPersistedVbuckets

            Build couchbase-server-7.2.0-1021 contains kv_engine commit 682fa05 with commit message:
            MB-51328: Reduce shard / vbucket count for KVStoreBackend tests

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-1021 contains kv_engine commit 682fa05 with commit message: MB-51328 : Reduce shard / vbucket count for KVStoreBackend tests

            People

              ritesh.agarwal Ritesh Agarwal
              drigby Dave Rigby
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                PagerDuty