Couchbase Server / MB-50201

Magma 30-bucket rebalance tests don't report cpu_utilization_rate correctly

Details

    • Untriaged
    • 1
    • Unknown

    Description

      In the Magma 30-bucket rebalance runs, cpu_utilization_rate isn't reported correctly. We don't see this issue in the Couchstore 30-bucket test or the Magma single-bucket test. All runs below used build 7.1.0-1934 and read cpu_utilization_rate from {}:8091/pools/default, where {} stands for the node address.
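For reference, a minimal sketch of how a client might read this stat. It assumes the /pools/default response carries a "nodes" list whose entries hold a "systemStats" object with a cpu_utilization_rate field; the sample payload is hypothetical and trimmed, and authentication is left out of the fetch helper.

```python
import json
from urllib.request import urlopen  # only used by fetch_pools_default


def fetch_pools_default(host: str, port: int = 8091) -> dict:
    """Fetch /pools/default from one node (authentication omitted in this sketch)."""
    with urlopen(f"http://{host}:{port}/pools/default", timeout=10) as resp:
        return json.load(resp)


def cpu_rates(pools_default: dict) -> dict:
    """Map hostname -> cpu_utilization_rate, or None when the stat is absent."""
    rates = {}
    for node in pools_default.get("nodes", []):
        stats = node.get("systemStats") or {}
        rates[node.get("hostname")] = stats.get("cpu_utilization_rate")
    return rates


# Hypothetical payload, trimmed to the fields used above.
sample = {
    "nodes": [
        {"hostname": "172.23.99.157:8091",
         "systemStats": {"cpu_utilization_rate": 37.5}},
        {"hostname": "172.23.99.158:8091", "systemStats": {}},
    ]
}
print(cpu_rates(sample))
```

A node that maps to None here corresponds to the gap seen in the graphs: the endpoint answers, but the stat itself is missing.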

      Magma 30-bucket rebalance

      Job: http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/24/

      cbmonitor graphs: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-1934_rebalance_fbe2

      Couchstore 30-bucket rebalance

      Job: http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/22/

      cbmonitor graphs: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-1934_rebalance_015e

      Magma single-bucket rebalance

      Job: http://perf.jenkins.couchbase.com/job/rhea-5node1/1839/

      cbmonitor graphs: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=rhea_710-1934_rebalance_14ce

        Activity

          meni.hillel Meni Hillel (Inactive) added a comment -
          Bo-Chun Wang, a few questions:

          1. Can you clarify what "correctly" means?
          2. What did this test report in earlier builds?
          3. Are you running these tests in a container by any chance?
          4. Can you also please attach all logs from this run to the ticket?
          bo-chun.wang Bo-Chun Wang added a comment -
          1. The runs collect cpu_utilization_rate before, during (the green area in the graphs), and after the rebalance. The Couchstore 30-bucket and Magma single-bucket runs collected cpu_utilization_rate in all 3 phases. However, the Magma 30-bucket run could not collect cpu_utilization_rate once the rebalance started. The graph shows no cpu_utilization_rate reported during the rebalance (the green area) or after the rebalance was done.
          2. We didn't run the Magma 30-bucket test before. The issue is reproducible with build 7.1.0-1934.
          3. No, these tests ran on physical clusters.

          Magma 30-bucket rebalance

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-multibucket-kv-rebalance-24/172.23.99.157.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-multibucket-kv-rebalance-24/172.23.99.158.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-multibucket-kv-rebalance-24/172.23.99.159.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-multibucket-kv-rebalance-24/172.23.99.160.zip

          Couchstore 30-bucket rebalance

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-multibucket-kv-rebalance-22/172.23.99.157.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-multibucket-kv-rebalance-22/172.23.99.158.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-multibucket-kv-rebalance-22/172.23.99.159.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-multibucket-kv-rebalance-22/172.23.99.160.zip

          Magma single-bucket rebalance

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-rhea-5node1-1839/172.23.97.21.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-rhea-5node1-1839/172.23.97.22.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-rhea-5node1-1839/172.23.97.23.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-rhea-5node1-1839/172.23.97.24.zip

          timofey.barmin Timofey Barmin added a comment - - edited

          Dave Finlay
          The reason for missing stats is the following:
          Magma buckets have more files on disk -> we need more time to calculate bucket data dir size -> total metrics collection time increases -> metrics collection time exceeds 10 sec -> metrics collection times out -> no stats are reported by ns_server

          On my local machine I see that 1 bucket dir size calculation takes 300ms - 1s (and this is for empty buckets with no load).
          This basically means that starting from 10 buckets we might start calculating dir sizes non-stop. Not sure if it is good for performance.
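The arithmetic behind this can be sketched as follows. It takes the figures quoted above (300ms to 1s per bucket on an idle machine, a 10 second collection timeout) and assumes, for illustration only, that directory sizes are measured one bucket after another within a single collection pass; the function names are hypothetical.

```python
COLLECTION_TIMEOUT_S = 10.0  # timeout quoted in the comment above


def dir_size_time(buckets: int, per_bucket_s: float) -> float:
    """Total time to size every bucket directory, one after another."""
    return buckets * per_bucket_s


def sample_reported(buckets: int, per_bucket_s: float,
                    other_work_s: float = 0.0) -> bool:
    """True if the whole collection pass stays within the timeout."""
    total = dir_size_time(buckets, per_bucket_s) + other_work_s
    return total <= COLLECTION_TIMEOUT_S


for buckets in (1, 10, 30):
    fast = dir_size_time(buckets, 0.3)
    slow = dir_size_time(buckets, 1.0)
    print(f"{buckets:2d} buckets: {fast:.1f}s to {slow:.1f}s, "
          f"reported at 1s/bucket: {sample_reported(buckets, 1.0)}")
```

Under these assumptions 10 buckets at 1s each already use the full budget, and 30 buckets exceed it unless every directory is at the fast end of the range. Loaded buckets would presumably be slower than the empty ones measured here.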

          I also see that the godu utility is #1 CPU consumer with only 7 magma buckets created (50-90% CPU).
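To illustrate why the cost grows with the number of files, here is a generic du-style walk. It is not godu's implementation, which the ticket does not show; it sums apparent file sizes (a real du typically reports disk blocks), and the demo directory layout is hypothetical.

```python
import os
import tempfile


def dir_size(path: str) -> tuple[int, int]:
    """Return (total_bytes, file_count) for everything under path."""
    total, count = 0, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                # One stat call per file: the work scales with file count.
                total += os.lstat(os.path.join(root, name)).st_size
                count += 1
            except FileNotFoundError:
                # Files can vanish mid-walk (for example during compaction).
                continue
    return total, count


# Hypothetical layout: one directory with five 100-byte files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        with open(os.path.join(tmp, f"file{i}.dat"), "wb") as f:
            f.write(b"x" * 100)
    demo = dir_size(tmp)
    print(demo)
```

Every file costs at least one stat call, so a storage engine that keeps many files per bucket makes each pass proportionally more expensive.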

          It seems like we need to make a decision here. Maybe we should not collect dir sizes for magma buckets at all.

          UPDATE:
          godu CPU consumption on node 99.157:

          build-team Couchbase Build Team added a comment -
          Build couchbase-server-7.1.0-2068 contains ns_server commit 68e1587 with commit message:
          MB-50201: Don't let one slow metric collector...

          build-team Couchbase Build Team added a comment -
          Build couchbase-server-7.1.0-2068 contains ns_server commit 70cee92 with commit message:
          MB-50201: Don't update magma bucket dir size too often
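The ticket gives only the commit title, so the following is a generic sketch of the idea of not refreshing an expensive measurement too often, not the ns_server change itself. The class name, the 60 second interval, and the fake clock are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ThrottledDirSize:
    """Cache a directory size and re-measure at most once per interval."""

    def __init__(self, measure: Callable[[str], int], min_interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._measure = measure
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, int]] = {}

    def get(self, path: str) -> int:
        now = self._clock()
        cached = self._cache.get(path)
        if cached is not None and now - cached[0] < self._min_interval_s:
            return cached[1]  # still fresh enough: skip the expensive walk
        size = self._measure(path)
        self._cache[path] = (now, size)
        return size


# Demo with a fake clock and a measure function that counts its calls.
calls = []
fake_now = [0.0]


def fake_measure(path: str) -> int:
    calls.append(path)
    return 1000 + len(calls)


sizes = ThrottledDirSize(fake_measure, min_interval_s=60.0,
                         clock=lambda: fake_now[0])
first = sizes.get("/data/bucket-1")
fake_now[0] = 10.0
second = sizes.get("/data/bucket-1")   # served from the cache
fake_now[0] = 75.0
third = sizes.get("/data/bucket-1")    # interval elapsed: measured again
print(first, second, third, len(calls))
```

The trade-off is staleness: the reported size can lag by up to the chosen interval, in exchange for keeping the collection pass short.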
          bo-chun.wang Bo-Chun Wang added a comment -
          I did a run with the build and saw CPU utilization reported across the whole run.

          http://perf.jenkins.couchbase.com/job/rhea-dev2/119/

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=rhea_710-2068_rebalance_1830#e38af7afdfc3ad9460e964647e812603

          People

            timofey.barmin Timofey Barmin
            bo-chun.wang Bo-Chun Wang