  1. Couchbase Server
  2. MB-48233

cbbackupmgr restore failed in analytics performance runs on build 7.1.0-1211


Details

    • Untriaged
    • 1
    • Yes
    • KV-Engine Sprint 2021 August

    Description

      In our analytics performance runs, "cbbackupmgr restore" failed consistently with build 7.1.0-1211. 

      Build: 7.1.0-1211

      Job: http://perf.jenkins.couchbase.com/job/oceanus/6774/

      Log:

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6774/172.23.96.205.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6774/172.23.96.57.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6774/172.23.96.5.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6774/172.23.96.7.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6774/172.23.96.8.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6774/172.23.96.9.zip 

      Running: ./opt/couchbase/bin/cbbackupmgr restore --force-updates --archive /backups --repo bigfun20M --threads 8 --host http://172.23.96.5 --username Administrator --password password

      Fatal error: local() encountered an error (return code 1) while executing './opt/couchbase/bin/cbbackupmgr restore --force-updates --archive /backups --repo bigfun20M --threads 8 --host http://172.23.96.5 --username Administrator --password password'

      Aborting.

       


          Activity

            paolo.cocchi Paolo Cocchi added a comment -

            Re-opening, under investigation.

            paolo.cocchi Paolo Cocchi added a comment -

            Here we seem to enter a state where:

            • We hit the CM quota so we enter a TempOOM phase (on both frontend and replication)
            • Also the overall mem-usage hits the HWM, so the ItemPager runs and releases some memory from checkpoints and mostly from HTs
            • Mem-usage now drops below the LWM, so the ItemPager doesn't run again
            • At this point checkpoint memory is expected to be released by the CheckpointRemoverTask, which doesn't seem to happen: there is no sign of any effective activity from the CheckpointRemoverTask (no item expelled, no cursor dropped)
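
The state described in the list above can be pictured with a small model. This is only a sketch under stated assumptions: the struct, its field names, the `>=` comparisons and the numbers in the usage note are hypothetical and are not taken from kv_engine. It shows why, once overall usage is back under the HWM while checkpoint memory is still over its quota, recovery depends entirely on the CheckpointRemoverTask.

```cpp
#include <cstddef>

// Hypothetical, simplified model of the thresholds named in the list above.
// None of these names or comparisons come from kv_engine.
struct MemoryModel {
    std::size_t totalMemUsage;      // overall mem-usage
    std::size_t checkpointMemUsage; // memory held by checkpoints
    std::size_t highWatermark;      // HWM: the ItemPager triggers at or above this
    std::size_t checkpointQuota;    // CM quota: TempOOM at or above this

    // Frontend and replication see TempOOM while checkpoints exceed their quota.
    bool isTempOom() const {
        return checkpointMemUsage >= checkpointQuota;
    }

    // The ItemPager only starts a run once overall usage reaches the HWM.
    bool itemPagerTriggers() const {
        return totalMemUsage >= highWatermark;
    }

    // TempOOM persists but the ItemPager will not run, so only the
    // CheckpointRemoverTask can release checkpoint memory.
    bool dependsOnCheckpointRemover() const {
        return isTempOom() && !itemPagerTriggers();
    }
};
```

For example, with the made-up values `MemoryModel m{700, 400, 850, 300};`, `m.dependsOnCheckpointRemover()` is true: checkpoint memory is over its quota while overall usage sits below the HWM.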

            The issue is under live-debugging.

            paolo.cocchi Paolo Cocchi added a comment -

            The 'available' flag is always false in ClosedUnrefCheckpointRemoverTask::run():

            bool ClosedUnrefCheckpointRemoverTask::run() {
                TRACE_EVENT0("ep-engine/task", "ClosedUnrefCheckpointRemoverTask");
             
                bool inverse = true;
                if (!available.compare_exchange_strong(inverse, false)) {
                    snooze(sleepTime);
                    return true;
                }
             
                bool shouldReduceMemory{false};
                size_t memToClear{0};
                size_t memRecovered{0};
             
                std::tie(shouldReduceMemory, memToClear) =
                        isReductionInCheckpointMemoryNeeded();
             
                if (!shouldReduceMemory) {
                    snooze(sleepTime);
                    return true; // <-- 'available' is not reset to true before returning to the caller
                }
                ..
            }
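
The effect of the missed reset can be reproduced in isolation. The following is a minimal sketch, not the kv_engine code and not necessarily the change made in the linked review: the class `RemoverTaskSketch`, its counters and the scope-guard approach in `runFixed` are assumptions made for illustration. Only the `compare_exchange_strong` pattern on the `available` flag mirrors the snippet above.

```cpp
#include <atomic>

// Hypothetical stand-in for the task; only the handling of 'available'
// follows the snippet above.
class RemoverTaskSketch {
public:
    // Buggy variant: the early return leaves 'available' set to false,
    // so every later call bails out at the compare-exchange.
    bool runBuggy(bool shouldReduceMemory) {
        bool expected = true;
        if (!available.compare_exchange_strong(expected, false)) {
            ++skippedRuns;
            return true;
        }
        if (!shouldReduceMemory) {
            return true; // 'available' stays false from here on
        }
        ++effectiveRuns;
        available = true;
        return true;
    }

    // One possible fix (an assumption): a scope guard sets the flag back
    // to true on every exit path.
    bool runFixed(bool shouldReduceMemory) {
        bool expected = true;
        if (!available.compare_exchange_strong(expected, false)) {
            ++skippedRuns;
            return true;
        }
        struct Release {
            explicit Release(std::atomic<bool>& f) : flag(f) {}
            ~Release() { flag = true; }
            std::atomic<bool>& flag;
        } release(available);
        if (!shouldReduceMemory) {
            return true;
        }
        ++effectiveRuns;
        return true;
    }

    int effectiveRuns{0}; // calls that got past both checks
    int skippedRuns{0};   // calls rejected because 'available' was false

private:
    std::atomic<bool> available{true};
};
```

Calling `runBuggy(false)` once and then `runBuggy(true)` never reaches the memory-recovery path, which matches the lack of item expel and cursor drop activity reported in this ticket. The same sequence with `runFixed` does reach it.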
            

            paolo.cocchi Paolo Cocchi added a comment -

            The reverted patch has been fixed and re-pushed at http://review.couchbase.org/c/kv_engine/+/160885.

            I've reproduced the issue on mancouch.

            Before the fix there's no ItemExpel/CursorDrop. The ingestion manages to proceed only because the overall mem-usage hits the HWM, so the ItemPager triggers and releases from checkpoints too:

            With the fix, the CheckpointRemover expels items and drops cursors:

            bo-chun.wang Bo-Chun Wang added a comment -

            I had a good run on build 7.1.0-1288, so I am closing this issue.

            http://perf.jenkins.couchbase.com/job/oceanus/6948/

             


            People

              bo-chun.wang Bo-Chun Wang
              bo-chun.wang Bo-Chun Wang