Couchbase Server / MB-20136

Perf daily: rebalance in 10 buckets regression from 4.7.0-835 to 4.7.0-857


Details

    • Untriaged
    • CentOS 64-bit
    • Unknown

    Description

      As part of the daily sanity run, the time for a rebalance-in with 10 empty buckets increased from 4.5 minutes to 5.3 minutes between builds 4.7.0-835 and 4.7.0-857. This is an increase of 48 seconds over the 4.5-minute baseline, or about 18%. It is readily reproducible.

      The node 10.5.3.44 is the one being rebalanced in.
      Logs from both runs are attached; please let me know if there is more information I can provide.
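
      For reference, the value tracked here is the wall-clock time for the rebalance to complete after the new node is added. Below is a minimal sketch of how such a measurement can be taken against the cluster REST API; it is not the perfSanity harness itself, and the orchestrator host, credentials, and polling interval are illustrative assumptions (the test's 10 empty buckets are assumed to exist already).

          import time
          import requests

          CLUSTER = "http://10.5.3.41:8091"      # assumed orchestrator node (illustrative)
          AUTH = ("Administrator", "password")   # illustrative credentials
          NEW_NODE = "10.5.3.44"                 # node being rebalanced in (from this issue)

          # Add the new node to the cluster via the Couchbase REST API.
          requests.post(f"{CLUSTER}/controller/addNode", auth=AUTH,
                        data={"hostname": NEW_NODE, "user": AUTH[0],
                              "password": AUTH[1]}).raise_for_status()

          # Collect the otpNode names of all nodes now known to the cluster.
          nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
          known = ",".join(n["otpNode"] for n in nodes)

          # Kick off the rebalance and time it until the cluster reports it is
          # no longer running.
          start = time.time()
          requests.post(f"{CLUSTER}/controller/rebalance", auth=AUTH,
                        data={"knownNodes": known, "ejectedNodes": ""}).raise_for_status()
          while True:
              time.sleep(5)
              progress = requests.get(f"{CLUSTER}/pools/default/rebalanceProgress",
                                      auth=AUTH).json()
              if progress.get("status") != "running":
                  break
          print(f"rebalance-in took {(time.time() - start) / 60:.1f} minutes")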

      Attachments


        Activity

          ericcooper Eric Cooper (Inactive) created issue -
          raju Raju Suravarjjala made changes -
          Attachment: changelog.txt [ 32114 ] (added)

          raju Raju Suravarjjala added a comment -
          Eric: Can you talk to Aliaksey or DaveR to take a quick look? This does not look like an issue on the ns-server side.

          ericcooper Eric Cooper (Inactive) added a comment -
          I see the regression in 840 but not in 837. I will narrow it down to one build tomorrow.

          ericcooper Eric Cooper (Inactive) added a comment -
          I have isolated this as a regression between builds 837 and 838, and using the tool at http://172.23.122.95:8000 the only change between them is the removal of moxi. While it is true that rebalance is sensitive to DCP and memcached issues, I don't see how removing moxi could cause this.

          Assigning to Trond Norbye for an assessment.
          ericcooper Eric Cooper (Inactive) made changes -
          Assignee: Raju Suravarjjala [ raju ] → Trond Norbye [ trond ]

          ericcooper Eric Cooper (Inactive) added a comment -
          The same regression appears in 4.5.1 but not in 4.1.2.
          ericcooper Eric Cooper (Inactive) made changes -
          Affects Version/s: 4.5.1 [ 13411 ] (added)

          ericcooper Eric Cooper (Inactive) added a comment -
          I isolated this to 4.5.1-2748 (it is not in build 2747), which contains this change: https://github.com/couchbase/ep-engine/commit/e22c9ebeda1aac2fc8f4325cc39a93c3bcefffab, so I am assigning to Jim.

          Note that:

          • this does not appear in the weekly crank, so it may be something about the (smaller) daily sanity system that exposes it
          • MB-19465 is another example of a bug that appears in the daily sanity but not in the weekly crank
          • I double-checked that this bug is introduced in 4.7.0-838, and it is; this is puzzling because the only change there is the moxi removal and the above commit does not appear to be in that build
          ericcooper Eric Cooper (Inactive) made changes -
          Assignee: Trond Norbye [ trond ] → Jim Walker [ jwalker ]

          ericcooper Eric Cooper (Inactive) added a comment -
          I just noticed the regression is more pronounced in 4.5.1:

          Pre-regression builds: rebalance takes 4.5 minutes in both 4.5.1 and 4.7
          4.5.1-2748: 6.5 minutes
          4.7.0-838: 5.3 minutes
          jwalker Jim Walker added a comment -
          https://github.com/couchbase/ep-engine/commit/e22c9ebeda1aac2fc8f4325cc39a93c3bcefffab was reverted from watson (it was fixed and pushed to sherlock instead). The spock issue was resolved here: https://github.com/couchbase/ep-engine/commit/50838e8aede895cac523190676e70528ab57017b
          jwalker Jim Walker made changes -
          Resolution: Fixed [ 1 ]
          Status: Open [ 1 ] → Closed [ 6 ]

          ericcooper Eric Cooper (Inactive) added a comment -
          Still seeing this issue in current watson builds, e.g. 949. Let me know what info I can provide.
          ericcooper Eric Cooper (Inactive) made changes -
          Resolution: Fixed [ 1 ] (cleared)
          Status: Closed [ 6 ] → Reopened [ 4 ]
          jwalker Jim Walker added a comment -

          Eric Cooper, steps to reproduce are needed.

          jwalker Jim Walker made changes -
          Assignee: Jim Walker [ jwalker ] → Eric Cooper [ ericcooper ]

          ericcooper Eric Cooper (Inactive) added a comment -
          Jim, the procedure is similar to what you did for MB-20482, though the command is:

          python -u perfSanity/scripts/perf_regression_runner_alpha.py -e -v 4.7.0-837 -r 2016-07-14:13:18 -q "testName='reb_in_10_buckets'" -n -e

          Please let me know if you need more information on this.
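
          To compare the last good and first bad builds directly, the same runner can be driven once per build. The sketch below assumes the runner accepts identical arguments for any build; the run identifier passed to -r is copied from the command above and is illustrative, and the build numbers are the last-good/first-bad pair identified earlier in this issue.

              import subprocess

              # Last-good and first-bad builds per the bisection in this issue.
              BUILDS = ["4.7.0-837", "4.7.0-838"]

              for build in BUILDS:
                  # Same invocation as the command quoted above, with only the
                  # build version swapped in.
                  subprocess.run(
                      ["python", "-u", "perfSanity/scripts/perf_regression_runner_alpha.py",
                       "-e", "-v", build,
                       "-r", "2016-07-14:13:18",
                       "-q", "testName='reb_in_10_buckets'",
                       "-n", "-e"],
                      check=True)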
          ericcooper Eric Cooper (Inactive) made changes -
          Assignee: Eric Cooper [ ericcooper ] → Jim Walker [ jwalker ]
          jwalker Jim Walker added a comment -

          Eric Cooper, what spec of system is used for this test? Is it still the 4-core system?

          ericcooper Eric Cooper (Inactive) added a comment - edited

          Yes, the same 4-core system as in previous Jiras.
          jwalker Jim Walker added a comment -

          Eric Cooper, so I can reproduce the test (or something similar).

          But is a smaller 'value' better? The unit reported by the test is, I presume, the wall-clock time for the rebalance to complete?

          So I'm just trying to get a feel for what the test really does and the load/operations it places on the cluster. I'm limited to running VMs on my MacBook and found that, for some reason, the test hung if I tried the default (it just seemed to be doing nothing). However, the test works with 5 buckets, so I've stuck with that for now.

          With 4.5.1 I don't see a pronounced regression. Comparing 4.5.1-2801 vs 2802, doing 2 runs of each, I got the following values:

          • 4.5.1-2801: 3.96, 3.93
          • 4.5.1-2802: 3.93, 4.00

          Not a strong regression; maybe that 4.00 is a trend towards 2802 being slower.

          On 4.7 I see an improvement, and as you observed only moxi went away?

          • 4.7-837: 3.97, 3.93
          • 4.7-838: 3.01, 3.01

          That is, 4.7-838 is faster? However, you've seen that it is slower? If it really is faster, my hypothesis is that the removal of moxi may have freed some resources on these "small" systems, which are overloaded by the many-bucket config.

          Overall though, what is this defect tracking? The value change triggered by the moxi removal (smaller is better???) or the 4.5.1 regression? The comments are really leading to two different issues and should perhaps become two different MBs.

          So to summarise my questions for now:

          1. What is the value reported by this test?
          2. Is a smaller value better? (larger_is_better = false is set in the test spec)
          3. What are all the pairs of builds where a change is seen? I.e. 4.7-837 to 4.7-838, 4.5.1-x to 4.5.1-y; is there another pair of 4.7 builds where a regression appears?
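
          For reference, the relative change implied by the run times quoted in the comment above can be worked out directly. A minimal sketch (values in minutes, taken verbatim from that comment):

              from statistics import mean

              # Two runs per build, in minutes, as quoted above.
              runs = {
                  "4.5.1-2801": [3.96, 3.93],
                  "4.5.1-2802": [3.93, 4.00],
                  "4.7-837": [3.97, 3.93],
                  "4.7-838": [3.01, 3.01],
              }

              def pct_change(before, after):
                  """Relative change of the mean run time from 'before' to 'after', in percent."""
                  return (mean(runs[after]) - mean(runs[before])) / mean(runs[before]) * 100

              # Roughly +0.5% for 4.5.1 (no real regression) and about -24% for 4.7
              # (an improvement), matching the observations in the comment above.
              print(f"4.5.1-2801 -> 4.5.1-2802: {pct_change('4.5.1-2801', '4.5.1-2802'):+.1f}%")
              print(f"4.7-837 -> 4.7-838: {pct_change('4.7-837', '4.7-838'):+.1f}%")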
          jwalker Jim Walker made changes -
          Resolution: Cannot Reproduce [ 5 ]
          Status: Reopened [ 4 ] → Resolved [ 5 ]
          ericcooper Eric Cooper (Inactive) made changes -
          Assignee: Jim Walker [ jwalker ] → Eric Cooper [ ericcooper ]

          raju Raju Suravarjjala added a comment -
          Bulk closing all invalid, duplicate, user error and won't fix issues.
          raju Raju Suravarjjala made changes -
          Status: Resolved [ 5 ] → Closed [ 6 ]

          People

            Assignee: ericcooper Eric Cooper (Inactive)
            Reporter: ericcooper Eric Cooper (Inactive)
            Votes: 0
            Watchers: 4

