Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-47132

cbbackupmgr restore in analytics tests failed on build 7.1.0-1027

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Test Blocker
    • Resolution: Fixed
    • Neo
    • Neo
    • tools
    • Untriaged
    • 1
    • Unknown

    Description

      Analytics tests on build 7.1.0-1027 failed during the cbbackupmgr restore process. The issue happened intermittently. 

      2021-06-25T10:40:21 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr restore --force-updates --archive /data2/backup/3collections/ --repo default --threads 8 --host http://172.23.96.5 --username Administrator --password password --map-data bucket-1.scope-1.GleambookUsers=bucket-1.scope-1.GleambookUsers-1,bucket-1.scope-1.GleambookMessages=bucket-1.scope-1.GleambookMessages-1,bucket-1.scope-1.ChirpMessages=bucket-1.scope-1.ChirpMessages-1

      Fatal error: local() encountered an error (return code 1) while executing './opt/couchbase/bin/cbbackupmgr restore --force-updates --archive /data2/backup/3collections/ --repo default --threads 8 --host http://172.23.96.5 --username Administrator --password password --map-data bucket-1.scope-1.GleambookUsers=bucket-1.scope-1.GleambookUsers-1,bucket-1.scope-1.GleambookMessages=bucket-1.scope-1.GleambookMessages-1,bucket-1.scope-1.ChirpMessages=bucket-1.scope-1.ChirpMessages-1'

      Aborting.

       

      Build: 7.1.0-1027

      Job: http://perf.jenkins.couchbase.com/job/oceanus/6120/

      Log:

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6120/172.23.96.205.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6120/172.23.96.57.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6120/172.23.96.5.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6120/172.23.96.7.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6120/172.23.96.8.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-6120/172.23.96.9.zip

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            owend Daniel Owen added a comment -

            Many thanks Bo-Chun Wang Very useful to know
            We believe the issue is due to a change in cbbackupmgr and are looking to resolve.

            owend Daniel Owen added a comment - Many thanks Bo-Chun Wang Very useful to know We believe the issue is due to a change in cbbackupmgr and are looking to resolve.

            I had a proper dive at the code and can say with certainly what it is happening now. Currently how cbbackupmgr handles tmp ooms is that it will retry with an exponential back off that has the following formula

            wait_time = MIN(120s, 50ms * 2^retry_num)
            

            It will retry using this back-off at till it succeeds or in total a minute has elapsed, if after a minute we cannot succeed we fail and here is where we see the ambiguous timeout error:

            2021-06-27T18:42:29.810-07:00 WARN: (Pool) (bucket-1) Failed to send document with key '<ud>1395136</ud>' due to a timeout: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"SetMeta","Opaque":"0xd5144a","TimeObserved":60000061472,"RetryReasons":["KV_TEMPORARY_FAILURE"],"RetryAttempts":11,"LastDispatchedTo":"172.23.96.5:11210","LastDispatchedFrom":"172.23.96.22:56504","LastConnectionID":"4941ecd473977caf/840499586ff99025"} -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:275
            

            1 minute with the formula above gives roughly enough time for 11 retries. See table below for more details:

            retry time (s) cumulative wait time (s)
            1 0.05 0.05
            2 100 0.1
            3 0.2 0.35
            4 0.4 0.75
            5 0.8 1.55
            6 1.6 3.15
            7 3.2 6.35
            8 6.4 12.75
            9 12.8 25.55
            10 25.6 51.15
            11 51.2 102.35
            12 102.4 204.75
            13 120 324.75
            14 120 444.75

            So to get an extra retry we will increase the default operation timeout to 4 minutes. Note this can be achieved today by using the environmental variable CB_MEMCACHED_OPERATION_TIMEOUT=4m. Note this does change does not guarantee that the backup will work but it gives KV an extra 3 minutes to do eviction.

            carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment - - edited I had a proper dive at the code and can say with certainly what it is happening now. Currently how cbbackupmgr handles tmp ooms is that it will retry with an exponential back off that has the following formula wait_time = MIN(120s, 50ms * 2^retry_num) It will retry using this back-off at till it succeeds or in total a minute has elapsed, if after a minute we cannot succeed we fail and here is where we see the ambiguous timeout error: 2021-06-27T18:42:29.810-07:00 WARN: (Pool) (bucket-1) Failed to send document with key '<ud>1395136</ud>' due to a timeout: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"SetMeta","Opaque":"0xd5144a","TimeObserved":60000061472,"RetryReasons":["KV_TEMPORARY_FAILURE"],"RetryAttempts":11,"LastDispatchedTo":"172.23.96.5:11210","LastDispatchedFrom":"172.23.96.22:56504","LastConnectionID":"4941ecd473977caf/840499586ff99025"} -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:275 1 minute with the formula above gives roughly enough time for 11 retries. See table below for more details: retry time (s) cumulative wait time (s) 1 0.05 0.05 2 100 0.1 3 0.2 0.35 4 0.4 0.75 5 0.8 1.55 6 1.6 3.15 7 3.2 6.35 8 6.4 12.75 9 12.8 25.55 10 25.6 51.15 11 51.2 102.35 12 102.4 204.75 13 120 324.75 14 120 444.75 So to get an extra retry we will increase the default operation timeout to 4 minutes. Note this can be achieved today by using the environmental variable CB_MEMCACHED_OPERATION_TIMEOUT=4m . Note this does change does not guarantee that the backup will work but it gives KV an extra 3 minutes to do eviction.

            I have increased the timeout to give KV time to evict data hopefully this will work

            carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment - I have increased the timeout to give KV time to evict data hopefully this will work

            Build couchbase-server-7.1.0-1049 contains backup commit 2d49266 with commit message:
            MB-47132 Increase memcached operation timeout

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1049 contains backup commit 2d49266 with commit message: MB-47132 Increase memcached operation timeout
            bo-chun.wang Bo-Chun Wang added a comment -

            I close this issue as we have a good run with build 7.1.0-1049.

            http://perf.jenkins.couchbase.com/job/oceanus/6264/ 

            bo-chun.wang Bo-Chun Wang added a comment - I close this issue as we have a good run with build 7.1.0-1049. http://perf.jenkins.couchbase.com/job/oceanus/6264/  

            People

              bo-chun.wang Bo-Chun Wang
              bo-chun.wang Bo-Chun Wang
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty