Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
7.1.0
-
7.1.0-2475
-
Untriaged
-
Centos 64-bit
-
1
-
Yes
-
KV March-22
Description
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.54049.ini GROUP=rebalance_in_out_P0_set1,rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,get-cbcollect-info=True,infra_log_level=info,log_level=info,upgrade_version=7.1.0-2475 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_rebalance_in_out,nodes_init=5,nodes_in=2,nodes_out=1,update_replica=True,updated_num_replicas=3,bucket_spec=magma_dgm.5_percent_dgm.5_node_2_replica_magma_512,doc_size=512,randomize_value=True,data_load_stage=during,skip_validations=False,GROUP=rebalance_in_out_P0_set1'
|
Steps to Repro
1. Create a 5 node cluster
2022-03-12 06:05:05,697 | test | INFO | MainThread | [table_view:display:72] Cluster statistics
|
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
|
| Node | Services | CPU_utilization | Mem_total | Mem_free | Swap_mem_used | Active / Replica | Version |
|
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
|
| 172.23.106.163 | kv | 0.313008639038 | 11.45 GiB | 10.67 GiB | 0.0 Byte / 3.50 GiB | 0 / 0 | 7.1.0-2475-enterprise |
|
| 172.23.105.36 | kv | 1.01669386218 | 11.45 GiB | 10.72 GiB | 0.0 Byte / 3.50 GiB | 0 / 0 | 7.1.0-2475-enterprise |
|
| 172.23.105.33 | kv | 1.83624701295 | 11.45 GiB | 10.61 GiB | 0.0 Byte / 3.50 GiB | 0 / 0 | 7.1.0-2475-enterprise |
|
| 172.23.107.164 | kv | 0 | 0.0 Byte | 0.0 Byte | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.0-2475-enterprise |
|
| 172.23.105.37 | kv | 0.162886856284 | 11.45 GiB | 10.72 GiB | 0.0 Byte / 3.50 GiB | 0 / 0 | 7.1.0-2475-enterprise |
|
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
|
2. Create Bucket/scopes/collections/data
2022-03-12 06:13:31,703 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
|
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
|
| Bucket | Type | Storage Backend | Replicas | Durability | TTL | Items | RAM Quota | RAM Used | Disk Used | ARR |
|
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
|
| bucket1 | couchbase | couchstore | 2 | none | 0 | 50000 | 9.77 GiB | 237.54 MiB | 171.67 MiB | 100 |
|
| bucket2 | couchbase | magma | 2 | none | 0 | 50000 | 4.88 GiB | 510.53 MiB | 317.55 MiB | 100 |
|
| default | couchbase | magma | 2 | none | 0 | 32575000 | 2.50 GiB | 1.87 GiB | 35.97 GiB | 5.32952264006 |
|
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
|
3. Add 2 nodes(172.23.106.156,172.23.106.159), Remove 1 node(172.23.105.37), Update all bucket replicas to 3 and do a rebalance. Rebalance completes fine.
2022-03-12 06:13:43,536 | test | INFO | pool-7-thread-14 | [table_view:display:72] Rebalance Overview
|
+----------------+----------+-----------------------+---------------+--------------+-----------------------+
|
| Nodes | Services | Version | CPU | Status | Membership / Recovery |
|
+----------------+----------+-----------------------+---------------+--------------+-----------------------+
|
| 172.23.106.163 | kv | 7.1.0-2475-enterprise | 6.43863179074 | Cluster node | active / none |
|
| 172.23.105.36 | kv | 7.1.0-2475-enterprise | 7.10230856566 | Cluster node | active / none |
|
| 172.23.105.33 | kv | 7.1.0-2475-enterprise | 6.43982356648 | Cluster node | active / none |
|
| 172.23.107.164 | kv | 7.1.0-2475-enterprise | 5.58842039018 | Cluster node | active / none |
|
| 172.23.106.156 | kv | 7.1.0-2475-enterprise | 0 | Cluster node | inactiveAdded / none |
|
| 172.23.106.159 | kv | 7.1.0-2475-enterprise | 0 | Cluster node | inactiveAdded / none |
|
| 172.23.105.37 | kv | 7.1.0-2475-enterprise | 6.41590137124 | --- OUT ---> | active / none |
|
+----------------+----------+-----------------------+---------------+--------------+-----------------------+
|
Once rebalance completes we noticed following error message on 172.23.106.163.
On 172.23.106.163
2022-03-12 06:25:10,500 | test | CRITICAL | MainThread | [basetestcase:check_coredump_exist:933] 172.23.106.163: Found ' ERROR ' logs - ['2022-03-12T06:17:18.604112-08:00 ERROR 5726: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.163->ns_1@172.23.107.164:default - ActiveStream::processItems checkpoint_end:37081 should not be in the current snapshot range s:37022->e:37090\n', '2022-03-12T06:17:19.381800-08:00 ERROR 5726: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.163->ns_1@172.23.107.164:default - ActiveStream::processItems checkpoint_end:42608 should not be in the current snapshot range s:42539->e:42612\n']
|
cbcollect_info attached. This issue was not seen on 7.1.0-2434.
From the latest set of logs we can see the issue affected vb:34, we can see the checkpoint_end was at seqno:34361 but we thought the snapshot range was s:34343->e:34390, which indicates we didn't receive the full snapshot before set the stream to dead. Before creating the takeover stream, given we always set when nextSnapshotIsCheckpoint moving to the StreamState::TakeoverSend state we will have sent the first snapshot marker of the takeover stream with MARKER_FLAG_CHK.
2022-03-14T04:44:56.659418-07:00 INFO 5638: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) Setting stream to dead state, last_seqno is 34360, unAckedBytes is 45689, status is The stream closed early due to a close stream message
2022-03-14T04:44:56.663914-07:00 INFO 5638: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) Attempting to add takeover stream: opaque_:1225, start_seqno_:34360, end_seqno_:18446744073709551615, vb_uuid:162649864506754, snap_start_seqno_:34343, snap_end_seqno_:34390, last_seqno:34360, stream_req_value:{"uid":
"1"}
2022-03-14T04:44:56.880219-07:00 INFO (default) VBucket::setState: transitioning vb:34 with high seqno:34493 from:replica to:pending
2022-03-14T04:44:56.904158-07:00 INFO (default) VBucket::setState: transitioning vb:34 with high seqno:34494 from:pending to:active
2022-03-14T04:44:56.904438-07:00 INFO (default) ActiveDurabilityMonitor::ctor(PDM&&): vb:34 Transitioning from PDM: HPS:33694, HCS:0, numTracked:0, highestTracked:0
2022-03-14T04:44:56.912413-07:00 INFO 5638: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) End stream received with reason The stream ended due to all items being streamed
2022-03-14T04:44:57.008734-07:00 INFO (default) VBucket::setState: transitioning vb:34 with high seqno:34503 from:active to:active meta:{"topology":[["ns_1@172.23.105.164","ns_1@172.23.106.177","ns_1@172.23.100.35","ns_1@172.23.100.36"]]}
2022-03-14T04:44:57.024425-07:00 INFO 5703: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default - (vb:34) Creating stream with start seqno 34341 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, snapshot:{34296,34342} collections-filter-size:0 sid:none
2022-03-14T04:44:57.024436-07:00 INFO 5703: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default - (vb:34) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default" backfill:true, seqno:34360
2022-03-14T04:44:57.026520-07:00 INFO 5703: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default - (vb:34) Scheduling backfill from 34342 to 34359, reschedule flag : False
2022-03-14T04:44:57.041241-07:00 INFO (default) MagmaKVStore::initBySeqnoScanContext vb:34 seqno:34342 endSeqno:34360 purgeSeqno:0 nDocsToRead:19 docFilter:ALL_ITEMS valFilter:VALUES_COMPRESSED SeqIterator:0x7f8340fcd180
2022-03-14T04:44:57.041260-07:00 INFO 5703: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default - (vb:34) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 34296, end 34360, and high completed 0, max visible 34360
2022-03-14T04:44:57.041306-07:00 INFO 5703: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default - (vb:34) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default", backfill:false, seqno:34361
2022-03-14T04:44:57.045227-07:00 INFO [(default) magma_2/kvstore-34/rev-000000001]0x7f8340fcd180 Starting seqno iterator (34342=>34360) with seqno 34342
2022-03-14T04:44:57.045849-07:00 INFO 5703: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default - (vb:34) Backfill complete. 18 items consisting of 8696 bytes read from disk, 1 items from memory, lastReadSeqno:34360 lastSentSeqnoAdvance:0, pendingBackfill:False. Total runtime 15 ms (1217 item/s, 0 MB/s)
2022-03-14T04:44:57.045867-07:00 INFO [(default) magma_2/kvstore-34/rev-000000001]0x7f8340fcd180 Ending seqno iterator (34342=>34360) with seqno 34360 restarted 0 times
2022-03-14T04:44:57.048235-07:00 INFO 5703: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default - (vb:34) ActiveStream::transitionState: Transitioning from backfilling to in-memory
2022-03-14T04:44:57.054268-07:00 ERROR 5703: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:default - ActiveStream::processItems checkpoint_end:34361 should not be in the current snapshot range s:34343->e:34390
2022-03-14T04:44:56.653651-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) Stream closing, sent until seqno 34468 remaining items 0, reason: The stream closed early due to a close stream message
2022-03-14T04:44:56.667854-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) Creating takeover stream with start seqno 34360 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, snapshot:{34343,34390} collections-filter-size:0 sid:none
2022-03-14T04:44:56.667899-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default" backfill:false, seqno:34361
2022-03-14T04:44:56.667911-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) ActiveStream::transitionState: Transitioning from backfilling to takeover-send
2022-03-14T04:44:56.879718-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) ActiveStream::transitionState: Transitioning from takeover-send to takeover-wait
2022-03-14T04:44:56.881617-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) ActiveStream::transitionState: Transitioning from takeover-wait to takeover-send
2022-03-14T04:44:56.881645-07:00 INFO (default) VBucket::setState: transitioning vb:34 with high seqno:34494 from:active to:dead
2022-03-14T04:44:56.881704-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) Vbucket marked as dead, last sent seqno: 34493, high seqno: 34494
2022-03-14T04:44:56.894819-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) ActiveStream::transitionState: Transitioning from takeover-send to takeover-wait
2022-03-14T04:44:56.906455-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) Receive ack for set vbucket state to active message
2022-03-14T04:44:56.906493-07:00 INFO 5670: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.34->ns_1@172.23.105.164:default - (vb:34) Stream closing, sent until seqno 34494 remaining items 0, reason: The stream ended due to all items being streamed
Fix
Based on the above and further investigation of CheckpointManager::getItemsForCursor, I think its fine to just erase the head of outstandingItemsResult.ranges. As it adds a new snapshot range to ItemsForCursor::ranges after each queue_op::checkpoint_end that's seen if there are more items after it be added to ItemsForCursor::items. Which means we can also remove the error message too.