Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-44139

[Ephemeral] ThrowExceptionUnderflowPolicy on BasicLinkedList::numDeletedItems in bgthread

    XMLWordPrintable

Details

    • Triaged
    • 1
    • Unknown

    Description

      Seen on local cluster_run during the investigation for MB-44079.

      ./gradlew --refresh-dependencies testrunner -P jython=<..> -P 'args=-i <../4-node.ini> GROUP=durability_majority,rerun=False,skip_log_scan=False,get-cbcollect-info=False,bucket_storage=couchstore,upgrade_version=7.0.0-4362 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_rebalance_in,nodes_init=2,nodes_in=2,override_spec_params=durability;replicas,durability=MAJORITY,replicas=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests,data_load_stage=during,skip_validations=False,GROUP=durability_majority'
      

      Running the test on top of http://review.couchbase.org/c/TAF/+/145275 increases the probability of hitting the issue.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            paolo.cocchi Paolo Cocchi added a comment - - edited

            Another underflow on BasicLinkedList::numDeletedItems.
            For reference, we have recently fixed MB-42568 (underflow on frontend thread).

            Note: As usual, we wouldn't see any failure on PROD but this issue seems preventing us from reproducing MB-44079 and possibly other issues.

            * thread #38, name = 'NonIoPool2', stop reason = breakpoint 1.2
              * frame #0: 0x0000000106c8dbe5 libep.dylib` cb::ThrowExceptionUnderflowPolicy<unsigned long long>::underflow(this=0x000000011235f358, desired=0x0000700004af60d0, current=0, arg=1)  + 53 at non_negative_counter.h:47
                frame #1: 0x00000001071174ae libep.dylib` cb::NonNegativeCounter<unsigned long long, cb::DefaultUnderflowPolicy>::fetch_sub(this=0x000000011235f358, arg=1)  + 142 at non_negative_counter.h:159
                frame #2: 0x0000000107109b80 libep.dylib` cb::NonNegativeCounter<unsigned long long, cb::DefaultUnderflowPolicy>::operator--(this=0x000000011235f358)  + 32 at non_negative_counter.h:209
                frame #3: 0x0000000107109957 libep.dylib` BasicLinkedList::purgeListElem(this=0x000000011235f200, it=boost::intrusive::list<OrderedStoredValue, boost::intrusive::member_hook<OrderedStoredValue, boost::intrusive::list_member_hook<>, 64> >::iterator @ 0x0000700004af6420, isStale=true)>, false>, bool)  + 663 at linked_list.cc:404
                frame #4: 0x00000001071091a4 libep.dylib` BasicLinkedList::purgeTombstones(this=0x000000011235f200, purgeUpToSeqno=1434, isDroppedKeyCb= Lambda in File ephemeral_vb.cc at Line 374, shouldPause=function<bool ()> @ 0x0000700004af65d0)>, std::__1::function<bool ()>)  + 2324 at linked_list.cc:277
                frame #5: 0x0000000106f8d8a0 libep.dylib` EphemeralVBucket::purgeStaleItems(this=0x00000001106b5e00, shouldPauseCbk=function<bool ()> @ 0x0000700004af66d0)>)  + 304 at ephemeral_vb.cc:382
                frame #6: 0x0000000106f88927 libep.dylib` EphemeralVBucket::StaleItemDeleter::visit(this=0x00000001105d2910, vb=0x00000001106b5e00)  + 311 at ephemeral_tombstone_purger.cc:211
                frame #7: 0x00000001070812a0 libep.dylib` KVBucket::pauseResumeVisit(this=0x000000010fad6000, visitor=0x00000001105d2910, start_pos=0x000000010fadeb18)  + 176 at kv_bucket.cc:2325
                frame #8: 0x0000000106f806a5 libep.dylib` EphTombstoneStaleItemDeleter::run(this=0x000000010fadeab8)  + 581 at ephemeral_tombstone_purger.cc:280
                frame #9: 0x0000000107001f1c libep.dylib` GlobalTask::execute(this=0x000000010fadeab8)  + 60 at globaltask.cc:76
            

            OrderedLL::iterator BasicLinkedList::purgeListElem(OrderedLL::iterator it,
                                                               bool isStale) {
                std::unique_ptr<OrderedStoredValue> purged;
                auto next = it;
                {
                    std::lock_guard<std::mutex> lckGd(getListWriteLock());
                    next = seqList.erase(it);
                    purged.reset(&*it);
                }
             
                if (isStale) {
                    /* Update the stats tracking the memory owned by the list */
                    staleSize.fetch_sub(purged->size());
                    staleMetaDataSize.fetch_sub(purged->metaDataSize());
                    --numStaleItems;
                }
             
                st.coreLocal.get()->currentSize.fetch_sub(purged->metaDataSize());
             
                if (purged->isDeleted()) {
                    --numDeletedItems;             <--    !! HERE !!
                }
             
                if (purged->isDeleted() &&
                    purged->getBySeqno() > highestPurgedDeletedSeqno.load()) {
                    highestPurgedDeletedSeqno = purged->getBySeqno();
                }
                return next;
            }
            

            paolo.cocchi Paolo Cocchi added a comment - - edited Another underflow on BasicLinkedList::numDeletedItems. For reference, we have recently fixed MB-42568 (underflow on frontend thread). Note: As usual, we wouldn't see any failure on PROD but this issue seems preventing us from reproducing MB-44079 and possibly other issues. * thread #38, name = 'NonIoPool2', stop reason = breakpoint 1.2 * frame #0: 0x0000000106c8dbe5 libep.dylib` cb::ThrowExceptionUnderflowPolicy<unsigned long long>::underflow(this=0x000000011235f358, desired=0x0000700004af60d0, current=0, arg=1) + 53 at non_negative_counter.h:47 frame #1: 0x00000001071174ae libep.dylib` cb::NonNegativeCounter<unsigned long long, cb::DefaultUnderflowPolicy>::fetch_sub(this=0x000000011235f358, arg=1) + 142 at non_negative_counter.h:159 frame #2: 0x0000000107109b80 libep.dylib` cb::NonNegativeCounter<unsigned long long, cb::DefaultUnderflowPolicy>::operator--(this=0x000000011235f358) + 32 at non_negative_counter.h:209 frame #3: 0x0000000107109957 libep.dylib` BasicLinkedList::purgeListElem(this=0x000000011235f200, it=boost::intrusive::list<OrderedStoredValue, boost::intrusive::member_hook<OrderedStoredValue, boost::intrusive::list_member_hook<>, 64> >::iterator @ 0x0000700004af6420, isStale=true)>, false>, bool) + 663 at linked_list.cc:404 frame #4: 0x00000001071091a4 libep.dylib` BasicLinkedList::purgeTombstones(this=0x000000011235f200, purgeUpToSeqno=1434, isDroppedKeyCb= Lambda in File ephemeral_vb.cc at Line 374, shouldPause=function<bool ()> @ 0x0000700004af65d0)>, std::__1::function<bool ()>) + 2324 at linked_list.cc:277 frame #5: 0x0000000106f8d8a0 libep.dylib` EphemeralVBucket::purgeStaleItems(this=0x00000001106b5e00, shouldPauseCbk=function<bool ()> @ 0x0000700004af66d0)>) + 304 at ephemeral_vb.cc:382 frame #6: 0x0000000106f88927 libep.dylib` EphemeralVBucket::StaleItemDeleter::visit(this=0x00000001105d2910, vb=0x00000001106b5e00) + 311 at ephemeral_tombstone_purger.cc:211 frame #7: 0x00000001070812a0 libep.dylib` KVBucket::pauseResumeVisit(this=0x000000010fad6000, visitor=0x00000001105d2910, start_pos=0x000000010fadeb18) + 176 at kv_bucket.cc:2325 frame #8: 0x0000000106f806a5 libep.dylib` EphTombstoneStaleItemDeleter::run(this=0x000000010fadeab8) + 581 at ephemeral_tombstone_purger.cc:280 frame #9: 0x0000000107001f1c libep.dylib` GlobalTask::execute(this=0x000000010fadeab8) + 60 at globaltask.cc:76 OrderedLL::iterator BasicLinkedList::purgeListElem(OrderedLL::iterator it, bool isStale) { std::unique_ptr<OrderedStoredValue> purged; auto next = it; { std::lock_guard<std::mutex> lckGd(getListWriteLock()); next = seqList.erase(it); purged.reset(&*it); }   if (isStale) { /* Update the stats tracking the memory owned by the list */ staleSize.fetch_sub(purged->size()); staleMetaDataSize.fetch_sub(purged->metaDataSize()); --numStaleItems; }   st.coreLocal.get()->currentSize.fetch_sub(purged->metaDataSize());   if (purged->isDeleted()) { --numDeletedItems; <-- !! HERE !! }   if (purged->isDeleted() && purged->getBySeqno() > highestPurgedDeletedSeqno.load()) { highestPurgedDeletedSeqno = purged->getBySeqno(); } return next; }

            Root cause of the issue is that in some scenarios a Replica vbucket may miss to increment the num-deleted-items counter when when it receives a DCP_DELETION.

            Eg:

            1. SnapMarker {1, 2}
            2. DEL:2
            3. range-read (eg, HTCleaner runs but doesn't touch DEL:2)
            4. SnapMarker {3, 4}
            5. range-read released (HTCleaner shuts down)
            6. DEL:4

            In the example above we may miss to increment num-deleted-items at step (6). Which means that we end up we 2 tombstones and num-deleted-items=1.
            In a following run, the HTCleaner will purge 2 tombstones and try to decrement num-deleted-items to -1.

            Fix/tests up and open for review.

            paolo.cocchi Paolo Cocchi added a comment - Root cause of the issue is that in some scenarios a Replica vbucket may miss to increment the num-deleted-items counter when when it receives a DCP_DELETION. Eg: SnapMarker {1, 2} DEL:2 range-read (eg, HTCleaner runs but doesn't touch DEL:2) SnapMarker {3, 4} range-read released (HTCleaner shuts down) DEL:4 In the example above we may miss to increment num-deleted-items at step (6). Which means that we end up we 2 tombstones and num-deleted-items=1. In a following run, the HTCleaner will purge 2 tombstones and try to decrement num-deleted-items to -1. Fix/tests up and open for review.
            paolo.cocchi Paolo Cocchi added a comment - - edited

            Hi Sumedh Basarkod,
            The TAF test in description (running on top of http://review.couchbase.org/c/TAF/+/145275) is the same test of MB-44079. It fails 100% of the times on the issue tracked here before the fix on my local env, and it succeeds 100% of the times after the fix.

            Ritam Sharma Just for reference, this is another case where setting "ephemeral_metadata_purge_age=0;ephemeral_metadata_purge_interval=1" makes the failure much more likely.

            Thanks

            paolo.cocchi Paolo Cocchi added a comment - - edited Hi Sumedh Basarkod , The TAF test in description (running on top of http://review.couchbase.org/c/TAF/+/145275 ) is the same test of MB-44079 . It fails 100% of the times on the issue tracked here before the fix on my local env, and it succeeds 100% of the times after the fix. Ritam Sharma Just for reference, this is another case where setting "ephemeral_metadata_purge_age=0;ephemeral_metadata_purge_interval=1" makes the failure much more likely. Thanks

            Build couchbase-server-7.0.0-4443 contains kv_engine commit f5bdf82 with commit message:
            MB-44139: Rename SV::isCompleted into SV::isPrepareCompleted

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4443 contains kv_engine commit f5bdf82 with commit message: MB-44139 : Rename SV::isCompleted into SV::isPrepareCompleted

            Build couchbase-server-7.0.0-4451 contains kv_engine commit 7f6d843 with commit message:
            MB-44139: [Ephe] Correctly update num-deleted-item in deleteStoredValue

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4451 contains kv_engine commit 7f6d843 with commit message: MB-44139 : [Ephe] Correctly update num-deleted-item in deleteStoredValue
            paolo.cocchi Paolo Cocchi added a comment -

            Hi Daniel Owen, I'm assigning to you as the fix is covered by unit test but it is also easily reproducible in a functional test, so not sure if we want QE coverage here. Thanks

            paolo.cocchi Paolo Cocchi added a comment - Hi Daniel Owen , I'm assigning to you as the fix is covered by unit test but it is also easily reproducible in a functional test, so not sure if we want QE coverage here. Thanks

            People

              paolo.cocchi Paolo Cocchi
              paolo.cocchi Paolo Cocchi
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty