Couchbase Server

MB-50351: Slight replication regression in one of the XDCR tests (5 -> 5), 1G, DGM test

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: 6.6.5
    • Fix Version: 6.6.6
    • performance, XDCR
    • Triage: Untriaged
    • 1
    • Unknown

    Description

      Build: 6.6.5-10068

      Test: Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 1G x 1KB, DGM

      http://showfast.sc.couchbase.com/#/timeline/Linux/xdcr/init_multi/all

      Build         items/sec
      6.6.3-9808    502544
      6.6.5-10068   445342
                    451213
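
      For reference, a rough calculation of the relative drop (a minimal sketch in Python; the numbers are simply the items/sec values from the table above):

          # items/sec values from the table above.
          baseline = 502544               # 6.6.3-9808
          regressed = [445342, 451213]    # remaining rows of the table

          for value in regressed:
              print(f"{value}: {(baseline - value) / baseline * 100:.1f}% below 6.6.3-9808")
          # Prints roughly 11.4% and 10.2%.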

       

      html report : http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_c1_665-10068_init_xdcr_908d&snapshot=titan_c1_663-9808_init_xdcr_fdcd

       

      All other tests look good.

      Looks like 6.6.5 may be using a bit more memory.

       

      Attachments


        Activity

          jliang John Liang added a comment - Changes between the 2 builds for XDCR http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=6.6.3&fromBuild=9808&toVersion=6.6.5&toBuild=10068&f_analytics-dcp-client=off&f_asterixdb=off&f_backup=off&f_cbas-core=off&f_cbft=off&f_cbgt=off&f_couchbase-cli=off&f_couchdb=off&f_eventing=off&f_go-couchbase=off&f_go_json=off&f_goutils=off&f_goxdcr=on&f_indexing=off&f_kv_engine=off&f_n1fty=off&f_ns_server=off&f_query=off&f_testrunner=off&f_tlm=off&f_voltron=off
          jliang John Liang added a comment - - edited

          This test may have regressed between 6.6.4-9961 and 6.6.5-10068.

          lilei.chen Lilei Chen added a comment -

          The changes between the builds include

          1. Remove staged transaction XATTRs and replicate the data, with a switch to turn this behavior off
          2. Enforce TLS
          3. A bug fix for compressed binary data
          4. Coordinate calls to the pools/default endpoint
          5. A file descriptor leak fix

          I don't see any change that could cause this performance drop.

          Looking at each node:

          • 105 reached 0 for changes_left at 2022-01-08T16:03:10.405
          • 106 reached 0 for changes_left at 2022-01-08T15:59:26.433
          • 107 reached 0 for changes_left at 2022-01-08T15:58:32.535
          • 108 reached 0 for changes_left at 2022-01-08T15:59:52.491
          • 109 reached 0 for changes_left at 2022-01-08T15:59:51.557

          Node 105 took about 4 minutes longer than the other nodes to finish replication. That would account for the performance regression.
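
          As a sanity check, the gap can be read straight off the changes_left timestamps (a minimal sketch in Python; the times are copied from the list above):

              from datetime import datetime

              # Times at which changes_left reached 0, per node (from the list above).
              finish = {
                  "105": "2022-01-08T16:03:10.405",
                  "106": "2022-01-08T15:59:26.433",
                  "107": "2022-01-08T15:58:32.535",
                  "108": "2022-01-08T15:59:52.491",
                  "109": "2022-01-08T15:59:51.557",
              }

              times = {node: datetime.fromisoformat(ts) for node, ts in finish.items()}
              for node in sorted(times):
                  print(node, times["105"] - times[node])
              # Node 105 finished roughly 3 to 4.5 minutes after each of the other nodes.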

          There are no errors or warnings in the logs on any node, and replication started quickly on all nodes.

          On node 105, there is a big drop in "rate received from DCP" during roughly the last 5 minutes of the test, after which it recovered (see attachment). That does not happen on the other nodes in the same test, nor on the same node in the previous run. However, I have not been able to find the reason for this drop yet.

          lilei.chen Lilei Chen added a comment -

          On node 105, resp_wait_time was very high at one point (about 3x the other nodes). However, I don't see anything unusual on the target cluster in terms of CPU, memory, or disk. While I will continue to look into this, I don't see how it could be caused by the code changes; the changes included in this build would not affect only one node.

           

          jliang John Liang added a comment - - edited

          Lilei Chen Is compaction running at the target?

          In any case, Wayne Siu, the regression is within 10% and it may not be due to a code defect. Moving it out.

          lilei.chen Lilei Chen added a comment -

          Bo-Chun Wang 

          The performance numbers are much lower in the recent runs of build 6.6.3-9808:

          Run  Date              items/sec
          1    2021-08-19 02:40  545,546
          2    2022-01-07 23:47  502,544
          3    2022-01-16 08:02  469,365
          4    2022-01-19 15:26  485,269

          Can you take a look at whether there is any change in the test itself or the test environment?
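
          For context, the spread across these 6.6.3-9808 runs works out as follows (a minimal sketch in Python; the values are copied from the table above):

              # items/sec from the four 6.6.3-9808 runs listed above, oldest first.
              runs = [545546, 502544, 469365, 485269]

              first = runs[0]
              for value in runs[1:]:
                  print(f"{value}: {(first - value) / first * 100:.1f}% below the 2021-08-19 run")
              # Prints roughly 7.9%, 14.0%, and 11.0%.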


          bo-chun.wang Bo-Chun Wang added a comment -

          We didn't make any changes to the test. For the test environment, there were two changes:

          1. We upgraded the Python version in perfrunner from 3.6 to 3.9 in December.
          2. We upgraded the OS from CentOS 7.3 to CentOS 7.9 in October.

          Those are all the changes I remember.

          jliang John Liang added a comment -

          Bo-Chun Wang If the test varies even on the same build, it may not be due to code changes. We have seen a similar issue with a GSI test, where there was unexplained performance degradation on the same build after the OS version change.


          People

            Assignee: Wayne Siu
            Reporter: Wayne Siu
            Votes: 0
            Watchers: 4


              Gerrit Reviews

                There are no open Gerrit changes
