Update
The failure is observed on Neo 1361 and it's consistently reproducible.
Live-debugging (http://perf.jenkins.couchbase.com/job/Cloud-Tester/796/console) reveals that we enter a state where the ItemPager finds only non-resident items in the HashTable, so nothing seems eligible for ejection at that point.
That suggests that some HashTable stats might be broken or inconsistent (e.g., ht_item_memory), and that there is simply no resident item left to eject in the state observed.
Sean Corrigan has repeated the test on build 1695 and the test passes (http://perf.jenkins.couchbase.com/job/Cloud-Tester/790).
It is not clear whether some collateral HT/Pager change in (1361, 1695] has fixed the issue, or whether something is just hiding it in recent builds.
Details on live-debugging
At every ItemPager run, the PagingVisitor touches all StoredValues in the HTs and hits this:
bool HashTable::unlocked_ejectItem(const HashTable::HashBucketLock&,
                                   StoredValue*& vptr,
                                   EvictionPolicy policy) {
    if (vptr == nullptr) {
        throw std::invalid_argument("HashTable::unlocked_ejectItem: "
                                    "Unable to delete NULL StoredValue");
    }

    if (!vptr->eligibleForEviction(policy)) {
        ++stats.numFailedEjects;
        return false; // <-- !!
    }
    ..
}

bool eligibleForEviction(EvictionPolicy policy) const {
    // Pending SyncWrite are always resident
    if (isPending()) {
        return false;
    }

    if (policy == EvictionPolicy::Value) {
        return isResident() && !isDirty(); // <-- !!
    } else {
        return !isDirty();
    }
}
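To illustrate why a pager pass over such a HashTable ejects nothing, here is a minimal sketch. ModelStoredValue, PassResult and simulatePagerPass are hypothetical stand-ins, not the real kv_engine types; only the eligibility predicate is copied from the code quoted above, and value eviction is assumed because that is the branch marked above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the real types. Only the fields
// read by the eligibility check are modelled.
enum class EvictionPolicy { Value, Full };

struct ModelStoredValue {
    bool pending = false;
    bool resident = false;
    bool dirty = false;

    // Same predicate as the eligibleForEviction() quoted above.
    bool eligibleForEviction(EvictionPolicy policy) const {
        if (pending) {
            return false;
        }
        if (policy == EvictionPolicy::Value) {
            return resident && !dirty;
        }
        return !dirty;
    }
};

struct PassResult {
    std::size_t ejected = 0;
    std::size_t failedEjects = 0;
};

// Visit every item once, as a pager pass would, and count the outcomes.
PassResult simulatePagerPass(const std::vector<ModelStoredValue>& items,
                             EvictionPolicy policy) {
    PassResult result;
    for (const auto& item : items) {
        if (item.eligibleForEviction(policy)) {
            ++result.ejected;
        } else {
            ++result.failedEjects;
        }
    }
    return result;
}
```

With every item default-constructed (non-resident, non-dirty), a pass under EvictionPolicy::Value ejects nothing and counts one failed eject per item, which matches the behaviour seen in the debugger.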
Example on vbid_891:
(gdb) p *vptr
$4 = {.., bySeqno = {
value = {<std::__atomic_base<long>> = {static _S_alignment = 8, _M_i = 957}, static is_always_lock_free = true}}, lock_expiry_or_delete_or_complete_time = {
lock_expiry = 0, delete_or_complete_time = 0}, exptime = 0, flags = 2, revSeqno = {counter = {_M_elems = "\001\000\000\000\000"}}, datatype = 3 '\003',
static dirtyIndex = 0, static deletedIndex = 1, static residentIndex = 2, static staleIndex = 3, bits = {static kBitsPerBlock = <optimized out>,
static kOne = <optimized out>, data_ = {_M_elems = {{<std::__atomic_base<unsigned char>> = {static _S_alignment = 1, _M_i = 0 '\000'},
static is_always_lock_free = true}}}}, ordered = 0 '\000', deletionSource = 0 '\000', committed = 0 '\000'}
'bits = 000' indicates non-resident/non-dirty, i.e. the item has been persisted and already ejected from the HashTable.
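As a rough aid for reading the dump, the sketch below decodes the raw byte using the index constants printed by gdb (dirtyIndex = 0, deletedIndex = 1, residentIndex = 2, staleIndex = 3). DecodedBits and decodeBits are hypothetical helpers, and treating each index as a bit position within the single data_ byte is an assumption based on the dump, not on the StoredValue source.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Bit positions as printed in the gdb dump above.
constexpr std::size_t dirtyIndex = 0;
constexpr std::size_t deletedIndex = 1;
constexpr std::size_t residentIndex = 2;
constexpr std::size_t staleIndex = 3;

// Hypothetical helper type: one flag per index.
struct DecodedBits {
    bool dirty;
    bool deleted;
    bool resident;
    bool stale;
};

// Decode the raw byte shown by gdb as `_M_i` (assumed layout: bit N of the
// byte corresponds to index N).
DecodedBits decodeBits(std::uint8_t raw) {
    const std::bitset<8> bits(raw);
    return {bits[dirtyIndex],
            bits[deletedIndex],
            bits[residentIndex],
            bits[staleIndex]};
}
```

Under that assumption, decodeBits(0) yields all flags false, in particular resident == false and dirty == false, which is the state described above.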
Vbucket stats are consistent with the fact that all items in the vbucket are non-resident/non-dirty:
vb_891:eq_dcpq:replication:ns_1@ec2-34-205-37-111.compute-1.amazonaws.com->ns_1@ec2-3-92-87-232.compute-1.amazonaws.com:bucket-1:cursor_checkpoint_id: 54
vb_891:eq_dcpq:replication:ns_1@ec2-34-205-37-111.compute-1.amazonaws.com->ns_1@ec2-3-92-87-232.compute-1.amazonaws.com:bucket-1:cursor_seqno: 3937
vb_891:eq_dcpq:replication:ns_1@ec2-34-205-37-111.compute-1.amazonaws.com->ns_1@ec2-3-92-87-232.compute-1.amazonaws.com:bucket-1:num_items_for_cursor: 0
vb_891:eq_dcpq:replication:ns_1@ec2-34-205-37-111.compute-1.amazonaws.com->ns_1@ec2-3-92-87-232.compute-1.amazonaws.com:bucket-1:num_visits: 0
vb_891:id_54:key_index_allocator_bytes: 0
vb_891:id_54:queued_items_mem_usage: 263
vb_891:id_54:snap_end: 3936
vb_891:id_54:snap_start: 3936
vb_891:id_54:state: CHECKPOINT_OPEN
vb_891:id_54:to_write_allocator_bytes: 48
vb_891:id_54:type: Memory
vb_891:id_54:visible_snap_end: 3936
vb_891:last_closed_checkpoint_id: 53
vb_891:mem_usage: 751
vb_891:num_checkpoint_items: 1
vb_891:num_checkpoints: 1
vb_891:num_conn_cursors: 2
vb_891:num_items_for_persistence: 0 <-- !!
vb_891:num_open_checkpoint_items: 0
vb_891:open_checkpoint_id: 54
vb_891:persistence:cursor_checkpoint_id: 54
vb_891:persistence:cursor_seqno: 3937
vb_891:persistence:num_visits: 47
vb_891:state: active

vb_891:high_seqno: 3936
vb_891:ht_cache_size: 391880
vb_891:ht_item_memory: 391880
vb_891:ht_item_memory_uncompressed: 391880
vb_891:ht_memory: 26752
vb_891:ht_size: 3079
vb_891:logical_clock_ticks: 39
vb_891:max_cas: 1637073329561272320
vb_891:max_cas_str: 2021-11-16T14:35:29.561272320
vb_891:max_deleted_revid: 0
vb_891:max_visible_seqno: 3936
vb_891:might_contain_xattrs: false
vb_891:num_ejects: 4112
vb_891:num_items: 3880
vb_891:num_non_resident: 3880
The one exception is 'ht_item_memory: 391880', which suggests that we still have memory allocated for resident items.
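One quick sanity check on that number is to divide ht_item_memory by num_items, using the values from the stats above. The sketch below does only that arithmetic; PerItemMemory and perItemMemory are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: quotient and remainder of ht_item_memory / num_items.
struct PerItemMemory {
    std::uint64_t bytesPerItem;
    std::uint64_t remainder;
};

PerItemMemory perItemMemory(std::uint64_t htItemMemory,
                            std::uint64_t numItems) {
    if (numItems == 0) {
        // Avoid division by zero for an empty vbucket.
        return {0, 0};
    }
    return {htItemMemory / numItems, htItemMemory % numItems};
}
```

For vb_891, perItemMemory(391880, 3880) gives exactly 101 bytes per item with no remainder. One possible reading, which is an assumption and not verified against the stat's implementation, is that ht_item_memory also counts per-item key and metadata that stay in the HashTable after the value is ejected. If so, a non-zero value would not by itself imply that any resident values remain.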
Sean Corrigan, do we know whether this is specific to ARM-based AWS instances, or do you see the same on equivalently sized x86 instances?