Couchbase Server / MB-28457

Replication is less efficient on 5.5.0-1970


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 5.5.0
    • Fix Version/s: 5.5.0
    • Component/s: couchbase-bucket
    • Environment:
      Cluster: hebe_kv
      OS: CentOS 7
      CPU: E5-2680 v3 (48 vCPU)
      Memory: 64GB
      Disk: Samsung Pro 850
    • Triage: Untriaged
    • Is this a Regression?: Yes

    Description

      Test env and scenario:
      3 nodes, 1 replica
      20M items in the bucket, 1M ops/sec (50/50 R/W) ongoing
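
      For reference, a 50/50 read/write key-value load of this shape can be approximated with a short
      script. The sketch below is illustrative only (it is not the perfrunner workload used in this test)
      and assumes the Python SDK 2.x API of the 5.5 era, plus placeholder credentials, node address and
      bucket name.

      # Illustrative 50/50 R/W load generator (assumptions: Python SDK 2.x,
      # placeholder node address, credentials and bucket name).
      import random

      from couchbase.cluster import Cluster, PasswordAuthenticator

      cluster = Cluster('couchbase://172.23.100.204')
      cluster.authenticate(PasswordAuthenticator('Administrator', 'password'))
      bucket = cluster.open_bucket('bucket-1')          # placeholder bucket name

      NUM_ITEMS = 20 * 1000 * 1000                      # 20M-item key space, as above

      while True:
          key = 'doc-%012d' % random.randrange(NUM_ITEMS)
          if random.random() < 0.5:                     # ~50% writes
              bucket.upsert(key, {'payload': 'x' * 1024})
          else:                                         # ~50% reads
              bucket.get(key, quiet=True)               # quiet: no exception on a miss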

       

      Despite a similar replication rate, the replication queue on 5.5.0-1970 grows much faster,
      causing overall performance degradation in low-memory scenarios such as DGM (data greater than memory).
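
      The queue growth can also be tracked outside cbmonitor by polling ep-engine stats directly. A minimal
      polling sketch, assuming cbstats is on the PATH; the bucket name is a placeholder and the stat key
      being watched (ep_diskqueue_items below) should be swapped for whichever queue metric is being
      compared:

      # Polls a single ep-engine stat over time so queue growth can be compared
      # between builds. Assumptions: cbstats on PATH, placeholder bucket name,
      # authentication flags omitted for brevity.
      import subprocess
      import time

      HOST = "172.23.100.204:11210"    # one of the KV nodes above
      BUCKET = "bucket-1"              # placeholder bucket name
      STAT_KEY = "ep_diskqueue_items"  # assumed stat key; adjust to the metric of interest

      def read_stat():
          out = subprocess.run(["cbstats", HOST, "all", "-b", BUCKET],
                               capture_output=True, text=True, check=True).stdout
          for line in out.splitlines():
              name, _, value = line.partition(":")
              if name.strip() == STAT_KEY:
                  return int(value.strip())
          return None

      if __name__ == "__main__":
          while True:
              print(time.strftime("%H:%M:%S"), read_stat())
              time.sleep(10)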

       

      Changes in 5.5.0-1970:

      [+] 4fa4905 MB-26021 [6/6]: Limit #checkpoint items flushed in a single batch
      https://github.com/couchbase/kv_engine/commit/4fa490526120424e82227b431ec0bb84b487ed37

      [+] 90c76d4 MB-26021 [5/6]: Set max_checkpoints=100 & chk_max_items=10000
      https://github.com/couchbase/kv_engine/commit/90c76d4f0d99ef68ff5adb2fb667a4e20383a728
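
      For A/B experiments, the two checkpoint parameters touched by these commits can also be changed at
      runtime with cbepctl ("set checkpoint_param"), without rebuilding the server. A sketch, using a
      placeholder bucket name; note that cbepctl changes are not persisted across a restart:

      # Runtime override of checkpoint parameters via cbepctl (assumptions:
      # cbepctl available on the node, placeholder bucket name, auth omitted).
      import subprocess

      HOST = "localhost:11210"
      BUCKET = "bucket-1"   # placeholder bucket name

      def set_checkpoint_param(name, value):
          subprocess.run(["cbepctl", HOST, "-b", BUCKET,
                          "set", "checkpoint_param", name, str(value)], check=True)

      # New values introduced in build 5.5.0-1970 by the commits above:
      set_checkpoint_param("max_checkpoints", 100)
      set_checkpoint_param("chk_max_items", 10000)

      # To go back to the pre-1970 checkpoint count (the value the later revert restores):
      # set_checkpoint_param("max_checkpoints", 2)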

       

      Server logs:
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-tmp-32/172.23.100.204.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-tmp-32/172.23.100.205.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-tmp-32/172.23.100.206.zip

       

          5.5.0-1969 versus 5.5.0-1970, replication queue:

      All stats:
      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_550-1979_access_e9ad&snapshot=hebe_550-1911_access_f673

       

       

      Also, a similar comparison, but using pillowfight test results:

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_550-1911_access_8d15&snapshot=ares_550-1979_access_6601&label=5.5.0-1911&label=5.5.0-1979

      Logs from the 2-node pillowfight test:

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-7547/172.23.133.13.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-7547/172.23.133.14.zip
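
      For reference, a comparable standalone mixed load can be driven with cbc-pillowfight; the invocation
      below is a sketch only (placeholder connection string, and the item count / mutation ratio are
      illustrative rather than the exact settings used by the perf job):

      # Sketch: drives a mixed read/write load with cbc-pillowfight (requires the
      # libcouchbase tools; credentials omitted, connection string is a placeholder).
      import subprocess

      subprocess.run([
          "cbc-pillowfight",
          "-U", "couchbase://172.23.133.13/bucket-1",   # placeholder connection string
          "--num-items", "20000000",                    # data set size
          "--set-pct", "50",                            # ~50% mutations / 50% reads
          "--num-threads", "16",
      ], check=True)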

       


        Activity

          drigby Dave Rigby added a comment - edited

          Scheduled http://perf.jenkins.couchbase.com/job/hebe/888 with: chk_max_items=1,000,000; flusher_batch_split_trigger=1,000,000

          Edit: results are essentially the same as the 100K limit:

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_550-2054_access_2561&snapshot=hebe_550-1911_access_f673&snapshot=hebe_550-2054_access_70b0&snapshot=hebe_550-2054_access_0bb9&snapshot=hebe_550-2054_access_a619&label=2054&label=1911&label=2054-old_cfg&label=2054-limit:100k&label=2054-limit:1M
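
          The overrides in that job are passed as a semicolon-separated "key=value" string
          (chk_max_items=1,000,000; flusher_batch_split_trigger=1,000,000). A small, purely illustrative
          helper for turning such a string into engine parameters (the parameter names come straight from
          the comment; the helper itself is not part of perfrunner):

          # Illustrative parser for a ";"-separated engine-override string.
          def parse_engine_overrides(spec):
              overrides = {}
              for pair in spec.split(";"):
                  if not pair.strip():
                      continue
                  key, _, value = pair.partition("=")
                  # Commas are only thousands separators in the job description.
                  overrides[key.strip()] = int(value.strip().replace(",", ""))
              return overrides

          print(parse_engine_overrides(
              "chk_max_items=1,000,000;flusher_batch_split_trigger=1,000,000"))
          # -> {'chk_max_items': 1000000, 'flusher_batch_split_trigger': 1000000}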
          drigby Dave Rigby added a comment - edited

          Scheduled http://perf.jenkins.couchbase.com/job/hebe/915/ with max_checkpoints=2.

          Edit: results with 2 checkpoints look much better:

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_550-1911_access_f673&snapshot=hebe_550-2054_access_70b0&snapshot=hebe_550-2054_access_0bb9&snapshot=hebe_550-2054_access_a619&snapshot=hebe_550-2054_access_f324&label=1911&label=2054-old_cfg&label=2054-limit:100k&label=2054-limit:1M&label=2065-limit:1M-ckpts:2

          These are in line with both builds 1911 and 2054 running the complete old config. Need to investigate exactly why the change in checkpoint count causes such a problem...
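
          One way to dig into that would be to compare per-vBucket checkpoint stats between the configs. A
          minimal aggregation sketch, assuming cbstats is on the PATH; the bucket name is a placeholder and
          the per-vBucket key names (e.g. num_checkpoints, mem_usage) may vary slightly between releases:

          # Sums per-vBucket checkpoint stats so checkpoint counts/memory can be
          # compared across builds (assumptions: cbstats on PATH, placeholder
          # bucket name, auth flags omitted).
          import subprocess
          from collections import defaultdict

          HOST = "172.23.100.204:11210"
          BUCKET = "bucket-1"   # placeholder bucket name

          def checkpoint_totals():
              out = subprocess.run(["cbstats", HOST, "checkpoint", "-b", BUCKET],
                                   capture_output=True, text=True, check=True).stdout
              totals = defaultdict(int)
              for line in out.splitlines():
                  key, _, value = line.rpartition(":")
                  key, value = key.strip(), value.strip()
                  if ":" not in key or not value.isdigit():
                      continue                     # skip non-numeric / non-vBucket lines
                  stat = key.split(":", 1)[1]      # e.g. "num_checkpoints"
                  totals[stat] += int(value)
              return totals

          if __name__ == "__main__":
              for stat, total in sorted(checkpoint_totals().items()):
                  print("%s: %d" % (stat, total))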


          build-team Couchbase Build Team added a comment -

          Build couchbase-server-5.5.0-2178 contains kv_engine commit ee81374801596cb4b4f0b79a48a55c0773fa644b with commit message:
          MB-28457: Revert max_checkpoints to 2
          https://github.com/couchbase/kv_engine/commit/ee81374801596cb4b4f0b79a48a55c0773fa644b
          oleksandr.gyryk Alex Gyryk (Inactive) added a comment -

          Looks good on 5.5.0-2211.
          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_550-1911_access_bc36&snapshot=hebe_550-2126_access_645a&snapshot=hebe_550-2211_access_fb3e
          http://showfast.sc.couchbase.com/#/timeline/Linux/kv/ycsb/all

          Dave Rigby, is "Revert max_checkpoints to 2" the final fix for this issue or a temporary workaround? Are you still investigating?
          drigby Dave Rigby added a comment -

          Regression has been resolved as of 5.5.0-2211 (see Alex's last comment).


          People

            Assignee: oleksandr.gyryk Alex Gyryk (Inactive)
            Reporter: oleksandr.gyryk Alex Gyryk (Inactive)
            Votes: 0
            Watchers: 3
