Couchbase Server / MB-45322

[Magma] Backup performance runs hit 'operation has timed out' on 7.0.0-4797


Details

    • Bug
    • Status: Closed
    • Test Blocker
    • Resolution: Duplicate
    • Cheshire-Cat
    • 7.0.0
    • tools
    • Untriaged
    • 1
    • Unknown

    Description

      In our performance runs, we saw the backup fail because an operation timed out.

      Build: 7.0.0-4797

      root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host http://172.23.105.72 --username Administrator --password password --threads 16 --storage sqlite

      Backing up to '2021-03-29T11_57_52.872118914-07_00'

      Transferring key value data for bucket 'bucket-1' at 0B/s (about 1h59m42s remaining)                0 items / 0B

      [==                  ] 1.06%

      Error backing up cluster: operation has timed out
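
      For reference, a minimal sketch of how a harness might wrap this invocation and surface the timeout as a failure. The binary path and flags are copied from the run above; the wrapper itself (and its error-string check) is hypothetical and not part of cbbackupmgr or the perf framework:

      // runbackup.go: run cbbackupmgr and fail fast if the backup reports a timeout.
      // Illustrative wrapper only, assuming the exact command line shown above.
      package main

      import (
          "fmt"
          "os"
          "os/exec"
          "strings"
      )

      func main() {
          // Flags mirror the failing run from the description.
          cmd := exec.Command("./opt/couchbase/bin/cbbackupmgr", "backup",
              "--archive", "/tmp/backup",
              "--repo", "default",
              "--host", "http://172.23.105.72",
              "--username", "Administrator",
              "--password", "password",
              "--threads", "16",
              "--storage", "sqlite")

          out, err := cmd.CombinedOutput()
          fmt.Print(string(out))

          // Treat a non-zero exit or the timeout message as a failed run.
          if err != nil || strings.Contains(string(out), "operation has timed out") {
              fmt.Fprintf(os.Stderr, "backup failed: %v\n", err)
              os.Exit(1)
          }
      }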


       

      Logs
      Server logs:
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-03-29T214013-ns_1%40172.23.105.72.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-03-29T214013-ns_1%40172.23.105.73.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-03-29T214013-ns_1%40172.23.105.78.zip

      Attachments

        Issue Links


          Activity

            pvarley Patrick Varley added a comment -

            The error in the backup log is

            2021-03-29T11:57:58.029-07:00 WARN: (Couchbase) Unexpected error 'operation timed out after 5s' while trying to get sequence numbers, will retry in 5s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:40
            2021-03-29T11:58:09.070-07:00 (Plan) (Data) Successfully decided which key value data to transfer for bucket 'bucket-1' | {"number":11,"duration":"16.106882723s"}
            2021-03-29T11:58:09.070-07:00 (Plan) (Data) Transferring new key value data for bucket 'bucket-1'
            2021-03-29T11:58:09.070-07:00 (REST) (Attempt 1) (GET) Dispatching request to 'http://172.23.105.72:8091/pools/default/buckets/bucket-1'
            2021-03-29T11:58:09.079-07:00 (REST) (Attempt 1) (GET) (200) Received response from 'http://172.23.105.72:8091/pools/default/buckets/bucket-1'
            2021-03-29T11:58:09.079-07:00 (REST) (Attempt 1) (GET) Dispatching request to 'http://172.23.105.72:8091/pools/default/buckets/bucket-1/scopes'
            2021-03-29T11:58:09.083-07:00 (REST) (Attempt 1) (GET) (200) Received response from 'http://172.23.105.72:8091/pools/default/buckets/bucket-1/scopes'
            2021-03-29T11:58:09.084-07:00 (REST) (Attempt 1) (POST) Dispatching request to 'http://172.23.105.72:8091/pools/default/buckets/bucket-1/scopes/%40ensureManifest/0'
            2021-03-29T11:58:09.090-07:00 (REST) (Attempt 1) (POST) (200) Received response from 'http://172.23.105.72:8091/pools/default/buckets/bucket-1/scopes/%40ensureManifest/0'
            2021-03-29T11:59:10.716-07:00 (Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operation for bucket 'bucket-1': failed to transfer bucket data for bucket 'bucket-1': failed to transfer key value data: failed to transfer key value data: failed to initialise worker 11: failed to get gocbcore DCP agent: agent failed to connect to the cluster: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"WaitUntilReady","Opaque":"","TimeObserved":60000110502,"RetryReasons":["NOT_READY"],"RetryAttempts":65,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
            

            This suggests that cbbackupmgr got a timeout when trying to fetch the seqnos from Couchbase Server. I believe the sequence numbers come directly from the underlying storage engine, so this might be a Magma issue. memcached.log records slow operations, so it would be worth checking them; we will need server logs for that case too.
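
            For illustration, the behaviour implied by that log (a 5s per-attempt timeout, retried until an overall deadline of roughly 60s, after which the command gives up) can be sketched as below. getSequenceNumbers here is a hypothetical stand-in for the client call, not cbbackupmgr's actual code:

            // retrytimeout.go: retry an operation with a per-attempt timeout until an
            // overall deadline expires, mirroring the "will retry in 5s" / ~60s timeout
            // pattern in the backup log. Illustrative only.
            package main

            import (
                "context"
                "errors"
                "fmt"
                "time"
            )

            // getSequenceNumbers is a hypothetical stand-in for fetching vbucket high
            // seqnos from the cluster. Here it never answers within the attempt timeout,
            // the way the failing run behaved.
            func getSequenceNumbers(ctx context.Context) error {
                select {
                case <-ctx.Done():
                    return ctx.Err()
                case <-time.After(10 * time.Second):
                    return nil
                }
            }

            func main() {
                const attemptTimeout = 5 * time.Second // matches "operation timed out after 5s"
                deadline := time.Now().Add(60 * time.Second)

                for attempt := 1; time.Now().Before(deadline); attempt++ {
                    ctx, cancel := context.WithTimeout(context.Background(), attemptTimeout)
                    err := getSequenceNumbers(ctx)
                    cancel()

                    if err == nil {
                        fmt.Printf("got sequence numbers after %d attempt(s)\n", attempt)
                        return
                    }
                    if errors.Is(err, context.DeadlineExceeded) {
                        fmt.Printf("WARN: operation timed out after %s, will retry in 5s (attempt %d)\n", attemptTimeout, attempt)
                        time.Sleep(5 * time.Second)
                        continue
                    }
                    fmt.Println("unexpected error:", err)
                    return
                }
                fmt.Println("Error backing up cluster: operation has timed out") // overall deadline exhausted
            }

            In the failing run the per-attempt calls never succeeded, so the loop exhausted its deadline; that matches the final WaitUntilReady error above (65 retries on NOT_READY over roughly 60s).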


            pvarley Patrick Varley added a comment -

            Bo-Chun Wang, have there been failures reported on couchstore? I wonder if both this issue and MB-45320 are related to Magma.

            Bo-Chun Wang reply

            No, the second defect (this one, MB-45322) doesn't happen on Magma only. Several performance runs on couchstore failed, too. The runs failed right after the backup command was issued. I don't have logs for these runs.

            http://perf.jenkins.couchbase.com/job/leto/17806/

            http://perf.jenkins.couchbase.com/job/leto/17809/

            http://perf.jenkins.couchbase.com/job/leto/17815/

            2021-03-29T02:22:17 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /data/workspace/backup --repo default --host http://leto-srv-01.perf.couchbase.com --username Administrator --password password --threads 16 --storage forestdb

            19:22:17 Fatal error: local() encountered an error (return code 1) while executing './opt/couchbase/bin/cbbackupmgr backup --archive /data/workspace/backup --repo default --host http://leto-srv-01.perf.couchbase.com --username Administrator --password password --threads 16 --storage forestdb'

            19:22:17 Aborting.

            bo-chun.wang Bo-Chun Wang added a comment -

            Patrick Varley

            Sorry, I messed up. Those runs above that I called couchstore runs are on a 6.6.2 build and use forestdb. They have nothing to do with this issue.

            bo-chun.wang Bo-Chun Wang added a comment -

            I re-ran the same test on couchstore, and it finished successfully. The couchstore run didn't hit MB-45320 or MB-45322.

            owend Daniel Owen added a comment -

            Hi Bo-Chun Wang,
            Is this a new test, or did it recently start to fail?

            Changing the component to Couchbase-bucket / storage-engine as the issue appears to be with Magma.
            Given that it appears to be a Magma-only issue, moving to CC.next.

            bo-chun.wang Bo-Chun Wang added a comment -

            Daniel Owen

            No, it's not a new test. It was working fine until the latest weekly build (4797). Patrick commented in MB-45320 that the issue may be caused by GOCBC-1073. I have kicked off a run with build 4814, which has the fix for GOCBC-1073, and will update when I have results.

            bo-chun.wang Bo-Chun Wang added a comment -

            I have a good run with build 4814.

            http://perf.jenkins.couchbase.com/job/rhea-5node1/896/

             


            jyotsna.nayak Jyotsna Nayak added a comment -

            Hi,

            I am seeing a similar issue with backup in the FTS tests.

            Some of the tests are as follows:
            1. http://perf.jenkins.couchbase.com/view/FTS/job/atlas_collections/461/console
            2. http://perf.jenkins.couchbase.com/view/FTS/job/atlas_collections/461/console

            The first build on which I observed this was 7.0.0-4726; for that build the backup also failed when run locally.

            For build 7.0.0-4797, I am able to run the test locally with rebalance without failure, but I am not able to run the same test via Jenkins.


            jyotsna.nayak Jyotsna Nayak added a comment -

            I also have a good run with build 7.0.0-4814.
            Link to the job: http://perf.jenkins.couchbase.com/view/FTS/job/atlas_collections/476/consoleFull

             

            owend Daniel Owen added a comment -

            OK, thanks Bo-Chun Wang and Jyotsna Nayak - so it looks like this was a duplicate of GOCBC-1073.

            mihir.kamdar Mihir Kamdar added a comment -

            Bulk closing non-fixed defects. Feel free to reopen if necessary.


            People

              Assignee: bo-chun.wang Bo-Chun Wang
              Reporter: bo-chun.wang Bo-Chun Wang
              Votes: 0
              Watchers: 6
