Couchbase Server / MB-39440

Magma cbc-pillowfight tests stuck during the load phase



        Activity

          bo-chun.wang Bo-Chun Wang added a comment -

          Ben Huddleston, I re-ran the test with "cursor_dropping_upper_mark=90", and the run finished without hitting the issue.

          Job: http://perf.jenkins.couchbase.com/job/rhea-5node1/39/
          Build: 7.0.0-2147
          Logs:
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-rhea-5node1-39/172.23.97.21.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-rhea-5node1-39/172.23.97.22.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-rhea-5node1-39/172.23.97.23.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-rhea-5node1-39/172.23.97.24.zip
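          A minimal sketch of how such a rerun can be sanity-checked from bucket stats with cbstats follows. The host, bucket, and credentials are placeholders, and the stat key names (mem_used, ep_max_size, ep_tmp_oom_errors, ep_cursors_dropped) are assumed to be present in this build; names can vary between server versions.

# Sketch: pull memory-related bucket stats after a pillowfight run and check
# how close mem_used got to the bucket quota. Host/bucket/credentials are
# hypothetical placeholders; stat key names are assumed and may vary by version.
import subprocess

HOST = "172.23.97.21:11210"                    # placeholder data node
BUCKET = "bucket-1"                            # placeholder bucket name
USER, PASSWORD = "Administrator", "password"   # placeholder credentials

def all_stats():
    out = subprocess.check_output(
        ["cbstats", HOST, "-u", USER, "-p", PASSWORD, "-b", BUCKET, "all"],
        text=True)
    stats = {}
    for line in out.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

stats = all_stats()
mem_used, quota = int(stats["mem_used"]), int(stats["ep_max_size"])
print(f"mem_used          : {mem_used} ({mem_used / quota:.1%} of quota)")
print(f"ep_tmp_oom_errors : {stats.get('ep_tmp_oom_errors', 'n/a')}")
print(f"ep_cursors_dropped: {stats.get('ep_cursors_dropped', 'n/a')}")
# A run that completed cleanly should show mem_used well below the quota and
# no temporary OOM errors; cursor drops indicate the lowered 90% mark kicked in.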

          ben.huddleston Ben Huddleston added a comment -

          Thanks Bo! I think this makes it reasonably likely that my theory explains the issue we are encountering here.

          One question worth asking is where the memory is being used. In this case we have nothing resident, so there is next to no memory usage in the HashTables. The other main memory hog is the CheckpointManager, which is using roughly 50% of the quota. We'll have some transient memory usage for flushes, but we never grow a large disk write queue, so this won't be high. I don't think we would have a substantial amount of memory allocated elsewhere in KV. Magma reported 390325151 bytes (~390 MB) of memory usage at the end of this test.
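          To make that breakdown concrete, here is a back-of-the-envelope sketch. Only the ~50% checkpoint share and the 390 MB Magma figure come from the comment above; the per-node quota is a hypothetical placeholder (the real value is not stated in this ticket), and the 93% figure is the mutation-rejection threshold mentioned later in this comment.

# Back-of-the-envelope version of "where is memory being used?".
# The ~50% checkpoint share and the 390 MB Magma figure are from this ticket;
# the 1 GiB quota is a hypothetical placeholder for illustration only.
quota = 1 * 1024**3            # hypothetical per-node bucket quota (bytes)
checkpoint_mem = 0.50 * quota  # CheckpointManager at roughly half the quota
magma_mem = 390_325_151        # Magma memory reported at the end of the test
hashtable_mem = 0              # nothing resident, so HashTables are negligible

accounted = checkpoint_mem + magma_mem + hashtable_mem
print(f"accounted for           : {accounted / quota:.1%} of quota")
print(f"mutations refused above : 93% of quota")
print(f"remaining headroom      : {0.93 - accounted / quota:.1%} of quota")
# With a quota of this size the cited components alone sit in the high 80s,
# so a few percent of transient flush and other KV overhead is enough to
# reach the point where new mutations are refused.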

          Dropping "cursor_dropping_upper_mark" to 90 solves this issue because it allows us to free memory (by dropping cursors) before we hit the threshold at which we stop taking new mutations on the active vBuckets. In a way, it's desirable that normally we stop mutations before dropping cursors as this allow replicas to try to catch up. If the streams that these cursors belong to are in memory or are very near the end of a disk backfill then this holds up. If the cursors are at the start of a disk backfill though then for the sake of availability it's less desirable. I'm not sure we want to change the default of this config value permanently.

          I think the real solution for this is going to be hard limits on the CheckpointManager memory usage to keep us below 93% memory usage (provided the pager can run fast enough when we're above the HWM). This will be done as part of MB-38441. In the meantime, Bo-Chun Wang, I'd recommend adding this config parameter to Magma tests if they fail in the same way.
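          As a minimal illustration of the ordering argument above: the 90% value is the override used in the rerun and 93% is the mutation-rejection figure mentioned in the previous paragraph; the default cursor-dropping mark below is an assumed placeholder, since the ticket does not state it.

# Which relief mechanism fires first as mem_used climbs towards the quota?
# 90 and 93 come from this ticket; DEFAULT_UPPER_MARK is an assumed placeholder.
MUTATION_STOP_PCT = 93    # active vBuckets stop accepting new mutations here
DEFAULT_UPPER_MARK = 95   # assumed default cursor_dropping_upper_mark (placeholder)
OVERRIDE_UPPER_MARK = 90  # value used in the rerun above

def first_relief(cursor_mark_pct: int) -> str:
    if cursor_mark_pct < MUTATION_STOP_PCT:
        return "cursors dropped first -> checkpoint memory freed, load keeps going"
    return "mutations refused first -> load phase can wedge while cursors lag behind"

print("default :", first_relief(DEFAULT_UPPER_MARK))
print("override:", first_relief(OVERRIDE_UPPER_MARK))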



          mihir.kamdar Mihir Kamdar (Inactive) added a comment - Bo-Chun Wang, is this still a blocker?


          bo-chun.wang Bo-Chun Wang added a comment - Mihir Kamdar, no, I changed the priority given that there is a workaround.

          The "workaround" given is actually the default config which we would recommend as we have fixed the checkpoint memory overhead issues. Please open a new ticket if this is still an issues.

          ben.huddleston Ben Huddleston added a comment - The "workaround" given is actually the default config which we would recommend as we have fixed the checkpoint memory overhead issues. Please open a new ticket if this is still an issues.

          People

            Assignee: ben.huddleston Ben Huddleston
            Reporter: bo-chun.wang Bo-Chun Wang
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
