Couchbase Server · MB-48986

cbbackupmgr restore is taking a long time on build 7.1.0-1515


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: Neo
    • Fix Version: Neo
    • Labels: performance
    • Triage: Untriaged
    • 1
    • Yes

    Description

      In our daily run with build 7.1.0-1515, we saw that the restore process was taking a long time to finish. The run hit the timeout threshold and was aborted.

      http://perf.jenkins.couchbase.com/job/triton_analytics/1806/ 

      08:08:18 Running: ./opt/couchbase/bin/cbbackupmgr restore

      11:01:14 Transferri... at 7.76MiB/s (about 1h41m41s remaining) 202626905 items / 78.47GiB

      11:01:14 [============================================ ] 62.95%

      11:01:14 Build timed out (after 180 minutes). Marking the build as aborted.
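
      (For reference, the reported ETA is consistent with the progress line above: at 62.95% complete after 78.47 GiB, the total comes to roughly 124.6 GiB, leaving about 46 GiB, which at 7.76 MiB/s is roughly 1h41m. The snippet below is only a sketch of that arithmetic using the values from the log; it is not part of cbbackupmgr.)

      package main

      import (
          "fmt"
          "time"
      )

      func main() {
          // Values taken from the cbbackupmgr progress line in the failing run.
          const (
              transferredGiB = 78.47 // data restored so far
              percentDone    = 62.95 // reported completion percentage
              rateMiBps      = 7.76  // reported transfer rate in MiB/s
          )

          totalGiB := transferredGiB / (percentDone / 100) // roughly 124.6 GiB in total
          remainingMiB := (totalGiB - transferredGiB) * 1024
          eta := time.Duration(remainingMiB/rateMiBps) * time.Second

          fmt.Printf("total ~%.1f GiB, remaining ~%.1f GiB, ETA ~%s\n",
              totalGiB, totalGiB-transferredGiB, eta)
          // Prints an ETA of roughly 1h41m, matching "about 1h41m41s remaining" in the log.
      }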

      http://perf.jenkins.couchbase.com/job/triton_analytics/1810/
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-18T202807-ns_1%40triton-srv-01-ip6.perf.couchbase.com.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-18T202807-ns_1%40triton-srv-02-ip6.perf.couchbase.com.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-18T202807-ns_1%40triton-srv-03-ip6.perf.couchbase.com.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-18T202807-ns_1%40triton-srv-04-ip6.perf.couchbase.com.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-18T202807-ns_1%40triton-srv-05-ip6.perf.couchbase.com.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-18T202807-ns_1%40triton-srv-06-ip6.perf.couchbase.com.zip
       

      The last good run we have was on build 7.1.0-1504. In that run, the restore process took about 40 minutes to finish.

      http://perf.jenkins.couchbase.com/job/triton_analytics/1804/

      08:09:19 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr restore

      08:47:31 Restore completed successfully


        Activity

          james.lee James Lee added a comment - - edited

          Hi Bo-Chun Wang,

          Looking at the changes between the affected builds, I don't see anything for 'backup', so I don't think this regression is directly due to any recent changes we've made.

          To highlight a few things we know:
          1) This affects both backup and restore
          2) It's not localized to 'Leto'; we've got performance tests on AWS that are also showing regressions
          3) There are no changes to KV Engine between the affected builds

          Looking at the logs from some of our test runs (for restore), we see that our inactivity monitor is being triggered, in some cases before any data has even been transferred (a sketch of how this kind of monitor behaves follows the log excerpt below).

          Inactivity

          2021-10-18T12:28:13.965+00:00 WARN: (Pool) (bucket-1) Memcached has been inactive for 1m0s, last item count 0 -- couchbase.(*MemcachedWorker).monitorFunc.func1() at pool_worker.go:347
          2021-10-18T12:28:13.965+00:00 WARN: (Pool) (bucket-1) Memcached has been inactive for 1m0s, last item count 0 -- couchbase.(*MemcachedWorker).monitorFunc.func1() at pool_worker.go:347
          2021-10-18T12:28:13.965+00:00 WARN: (Pool) (bucket-1) Memcached has been inactive for 1m0s, last item count 0 -- couchbase.(*MemcachedWorker).monitorFunc.func1() at pool_worker.go:347
          2021-10-18T12:28:13.965+00:00 WARN: (Pool) (bucket-1) Memcached has been inactive for 1m0s, last item count 0 -- couchbase.(*MemcachedWorker).monitorFunc.func1() at pool_worker.go:347
          2021-10-18T12:28:13.965+00:00 WARN: (Pool) (bucket-1) Memcached has been inactive for 1m0s, last item count 0 -- couchbase.(*MemcachedWorker).monitorFunc.func1() at pool_worker.go:347
          2021-10-18T12:28:13.965+00:00 WARN: (Pool) (bucket-1) Memcached has been inactive for 1m0s, last item count 0 -- couchbase.(*MemcachedWorker).monitorFunc.func1() at pool_worker.go:347
          2021-10-18T12:28:13.965+00:00 WARN: (Pool) (bucket-1) Memcached has been inactive for 1m0s, last item count 0 -- couchbase.(*MemcachedWorker).monitorFunc.func1() at pool_worker.go:347
          2021-10-18T12:28:13.965+00:00 WARN: (Pool) (bucket-1) Memcached has been inactive for 1m0s, last item count 0 -- couchbase.(*MemcachedWorker).monitorFunc.func1() at pool_worker.go:347
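
          (For context only: a minimal sketch, in Go, of how an inactivity monitor like this typically works. It is not the actual cbbackupmgr/pool_worker.go code, and the names monitorInactivity and itemCount are illustrative. A goroutine wakes on a timer and logs a warning whenever the item count has not advanced during the whole interval.)

          package main

          import (
              "context"
              "log"
              "sync/atomic"
              "time"
          )

          // monitorInactivity periodically compares the current item count with the
          // last observed value and warns when no items were transferred during a
          // full interval. Illustrative sketch only; not the real backup client code.
          func monitorInactivity(ctx context.Context, interval time.Duration, itemCount *uint64) {
              ticker := time.NewTicker(interval)
              defer ticker.Stop()

              last := atomic.LoadUint64(itemCount)

              for {
                  select {
                  case <-ctx.Done():
                      return
                  case <-ticker.C:
                      current := atomic.LoadUint64(itemCount)
                      if current == last {
                          // No progress since the previous tick: emit the warning.
                          log.Printf("WARN: Memcached has been inactive for %s, last item count %d", interval, current)
                      }
                      last = current
                  }
              }
          }

          func main() {
              var items uint64 // would normally be incremented by the transfer workers
              ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
              defer cancel()

              // With the counter never advancing, a warning fires on every tick.
              go monitorInactivity(ctx, time.Second, &items)
              <-ctx.Done()
          }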
          

          It's also worth noting that for this test, we see that the backup takes significantly longer than expected (2m42s vs ~1m).

          I'm assigning this back to the performance team; I think we should rule out changes to machines or perfrunner setups, then assign to the next team to investigate.

          Thanks,
          James

          bo-chun.wang Bo-Chun Wang added a comment -

          The latest run with 7.1.0-1528 is good. I'm closing the issue as the performance is back to normal.

          http://perf.jenkins.couchbase.com/job/triton_analytics/1811/

           


          People

            Assignee: Bo-Chun Wang
            Reporter: Bo-Chun Wang
            Votes: 0
            Watchers: 4
