Couchbase Server / MB-45320

[Magma] Backup performance tests failed on 7.0.0-4797


Details

    • Type: Bug
    • Status: Closed
    • Priority: Test Blocker
    • Resolution: Duplicate
    • Fix Version: Cheshire-Cat
    • Affects Version: 7.0.0
    • Component: tools
    • Triage: Triaged
    • 1
    • Unknown

    Description

      Problem
      Magma backup run failed on build 7.0.0-4797. The backup process had been running for 14 hours when I killed it. http://perf.jenkins.couchbase.com/job/rhea-5node2/952/

      07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host http://172.23.97.26 --username Administrator --password password --threads 16 --storage sqlite
       
      21:39:19 Build was aborted
      

      I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. 

      • Backup was stuck: backup progress reached 99% and then got stuck. I waited for several minutes and didn't see any progress. The run above should have hit this issue; that's why it was running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip.

      root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host http://172.23.105.72 --username Administrator --password password --threads 16 --storage sqlite
       
      Backing up to '2021-03-29T11_48_51.331337778-07_00'
       
      Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining)                                                                                                                                   99655 items / 31.87MiB
       
      [===================================================================================================================================================================================================================================  ] 99.13%
       
      ^Cinterrupt
      

      Steps to reproduce
      1. Install a cluster with build 7.0.0-4797.
      2. Create a bucket with magma as the backend storage engine.
      3. Load 100,000 docs.
      4. Start a backup.

      Expectation
      For the backup to be successful.


          Activity

            pvarley Patrick Varley added a comment -

            Bo-Chun Wang As there are two different issues here, can you please open a second defect, as we do not want to confuse things.

            Focusing on the logs for 2021-03-29T115703, the last message before the program was terminated was:

            2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 915) Stream has been inactive for 1m0s, last seqno 46 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
            2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 647) Stream has been inactive for 1m0s, last seqno 68 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
            2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 835) Stream has been inactive for 1m0s, last seqno 44 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
            2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 739) Stream has been inactive for 1m0s, last seqno 84 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
            2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 1010) Stream has been inactive for 1m0s, last seqno 52 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
            2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 76) Stream has been inactive for 1m0s, last seqno 56 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
            2021-03-29T11:53:20.987-07:00 (Signal Handler) Signal `interrupt` received, exiting
            goroutine 35 [running]:
            

            This suggests that cbbackupmgr has not hung and is waiting on data for these vBuckets.

            Following vb 76 we can see that the stream was opened as follows:

            2021-03-29T11:49:00.388-07:00 (DCP) (bucket-1) (vb 76) Creating DCP stream | {"uuid":26490595435443,"start_seqno":0,"end_seqno":102,"snap_start":0,"snap_end":0,"retries":0}
            

            At that time cbbackupmgr detected that no traffic had been sent for vb 76 for one minute, and the last seqno it had seen was 56.

            There can be a number of causes for this, including network issues and server-side problems. Can you please provide the server-side logs so we can investigate this further?
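
            For context, warnings like "Stream has been inactive for 1m0s" come from a per-stream activity watchdog inside cbbackupmgr. The Go sketch below illustrates the general mechanism only; the names here (streamActivity, observe, monitor) are hypothetical and not cbbackupmgr's actual implementation.

            package main

            import (
                "fmt"
                "sync"
                "time"
            )

            // streamActivity tracks the last seqno received on one vBucket stream.
            type streamActivity struct {
                mu        sync.Mutex
                vbID      uint16
                lastSeqno uint64
                lastSeen  time.Time
            }

            // observe is called for every mutation received on the stream.
            func (s *streamActivity) observe(seqno uint64) {
                s.mu.Lock()
                defer s.mu.Unlock()
                s.lastSeqno = seqno
                s.lastSeen = time.Now()
            }

            // monitor warns whenever nothing has arrived for 'idle', mirroring the
            // "Stream has been inactive for 1m0s, last seqno N" messages above.
            func (s *streamActivity) monitor(idle time.Duration, done <-chan struct{}) {
                ticker := time.NewTicker(idle)
                defer ticker.Stop()
                for {
                    select {
                    case <-done:
                        return
                    case <-ticker.C:
                        s.mu.Lock()
                        since, seqno := time.Since(s.lastSeen), s.lastSeqno
                        s.mu.Unlock()
                        if since >= idle {
                            fmt.Printf("WARN: (vb %d) Stream has been inactive for %v, last seqno %d\n",
                                s.vbID, since.Round(time.Second), seqno)
                        }
                    }
                }
            }

            func main() {
                s := &streamActivity{vbID: 76, lastSeen: time.Now()}
                done := make(chan struct{})
                go s.monitor(time.Minute, done)
                s.observe(56) // the last mutation this stream ever receives
                time.Sleep(65 * time.Second)
                close(done)
            }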

            pvarley Patrick Varley added a comment - - edited

            Moved the comment to MB-45322 as it's related to that issue (two issues were raised in the one defect).

            pvarley Patrick Varley added a comment - - edited

            Bo-Chun Wang Have any failures been reported on couchstore? I wonder if both are related to magma.

            bo-chun.wang Bo-Chun Wang added a comment -

            I'm opening another issue for the second defect. Server logs:

            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-03-29T214013-ns_1%40172.23.105.72.zip
            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-03-29T214013-ns_1%40172.23.105.73.zip
            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-03-29T214013-ns_1%40172.23.105.78.zip
            bo-chun.wang Bo-Chun Wang added a comment - - edited

            Patrick Varley I have opened MB-45322 for the second defect. 

            pvarley Patrick Varley added a comment - - edited

            I'm going to clean up this defect and move the comments over to the correct defect, as it has already started to get confusing. This is why it should be one issue per defect.


            pvarley Patrick Varley added a comment -

            Keeping on vBucket 76, from the server-side logs we see the following:

            2021-03-29T11:49:00.391282-07:00 INFO 2821: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2021-03-29T11:48:57-07:00_57120_5 - (vb:76) Creating stream with start seqno 0 and end seqno 102; requested end seqno was 102, collections-manifest uid:0, filter:1 sid:none
            2021-03-29T11:49:00.391340-07:00 INFO 2810: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2021-03-29T11:48:57-07:00_57120_0 - (vb:109) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 0, end 101, and high completed none, max visible none
            2021-03-29T11:49:00.391367-07:00 INFO 2821: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2021-03-29T11:48:57-07:00_57120_5 - (vb:76) Scheduling backfill from 1 to 102, reschedule flag : False
            2021-03-29T11:49:01.155110-07:00 INFO (bucket-1) (bucket-1) magma_0 MagmaKVStore::initBySeqnoScanContext vb:76 seqno:1 endSeqno:102 purgeSeqno:0 nDocsToRead:102 docFilter:ALL_ITEMS valFilter:VALUES_COMPRESSED
            2021-03-29T11:49:01.155202-07:00 INFO 2821: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2021-03-29T11:48:57-07:00_57120_5 - (vb:76) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 0, end 102, and high completed none, max visible none
            2021-03-29T11:49:05.077923-07:00 INFO 2821: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2021-03-29T11:48:57-07:00_57120_5 - (vb:76) Backfill complete, 0 items read from disk, 102 from memory, last seqno read: 102, pendingBackfill : False
            2021-03-29T11:49:05.077947-07:00 INFO [(bucket-1) magma_0/kvstore-76/rev-000000001]0x7f7cbe16dd80 Ending seqno iterator (1=>102) with seqno 102
            2021-03-29T11:49:05.078079-07:00 INFO 2821: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2021-03-29T11:48:57-07:00_57120_5 - (vb:76) Stream closing, sent until seqno 102 remaining items 0, reason: The stream ended due to all items being streamed
            

            It looks like the server side sent the information as required and in a timely manner. This suggests the problem is the network or something internal to cbbackupmgr.
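
            To make the comparison explicit: the server's snapshot for vb 76 ends at seqno 102 and the server logged that it sent everything, while the client's watchdog last saw seqno 56. A toy Go check like the one below (hypothetical, for illustration; not part of cbbackupmgr) captures that reasoning:

            package main

            import "fmt"

            // diagnoseStall compares the snapshot end the server advertised with the
            // highest seqno the client actually received for one vBucket.
            func diagnoseStall(vbID uint16, snapEnd, lastReceived uint64) string {
                switch {
                case lastReceived >= snapEnd:
                    return fmt.Sprintf("vb %d: snapshot fully received; stream should have ended", vbID)
                case lastReceived == 0:
                    return fmt.Sprintf("vb %d: nothing received; the stream may never have started", vbID)
                default:
                    return fmt.Sprintf("vb %d: server sent up to %d but client stopped at %d; "+
                        "data was lost between server and client (network or client-side bug)",
                        vbID, snapEnd, lastReceived)
                }
            }

            func main() {
                // Values taken from the logs above for vb 76.
                fmt.Println(diagnoseStall(76, 102, 56))
            }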


            pvarley Patrick Varley added a comment -

            Bo-Chun Wang Thanks for the server logs. It sounds like you're able to reproduce this easily; would it be possible to provide the steps, please?

            bo-chun.wang Bo-Chun Wang added a comment -
            1. install cluster with build 7.0.0-4797
            2. create a bucket and use magma as the backend storage engine
            3. load 100,000 docs (a load sketch follows these steps)
            4. start backup
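
            A minimal Go sketch of the load step (step 3) using the gocb v2 SDK; this assumes the magma bucket 'bucket-1' from step 2 already exists, and the host, credentials, key format, and document shape are placeholders rather than what the perf test actually loads:

            package main

            import (
                "fmt"
                "log"
                "time"

                "github.com/couchbase/gocb/v2"
            )

            func main() {
                cluster, err := gocb.Connect("couchbase://172.23.105.72", gocb.ClusterOptions{
                    Username: "Administrator",
                    Password: "password",
                })
                if err != nil {
                    log.Fatal(err)
                }
                bucket := cluster.Bucket("bucket-1")
                if err := bucket.WaitUntilReady(10*time.Second, nil); err != nil {
                    log.Fatal(err)
                }
                collection := bucket.DefaultCollection()

                // Step 3: load 100,000 documents.
                for i := 0; i < 100000; i++ {
                    key := fmt.Sprintf("doc-%06d", i)
                    if _, err := collection.Upsert(key, map[string]interface{}{"id": i}, nil); err != nil {
                        log.Fatalf("upsert %s: %v", key, err)
                    }
                }
            }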
            owend Daniel Owen added a comment -

            In MB-45322 Bo-Chun Wang noted that

            I re-ran the same test on couchstore, and it finished successfully. The couchstore run didn't hit MB-45320 and MB-45322.

            Therefore changing component to couchbase-bucket / storage-engine as it looks to be a magma issue.
            Given that it affects only magma, moving to CC.next.


            There is a possibility that this is being caused by GOCBC-1073. We are going to bump the gocbcore version; it would be worth knowing if you still hit this issue after that.
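
            For anyone tracking this locally, picking up a newer gocbcore in a Go checkout is a dependency bump. The commands below assume the v9 module series that gocb-based tools used around this release, and are a sketch of the workflow rather than an official build step:

            go get github.com/couchbase/gocbcore/v9@latest
            go mod tidy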

            bo-chun.wang Bo-Chun Wang added a comment -

            I have tried the test with build 4814 on VMs, and it works fine. I will do a full-scale run to verify it.

            bo-chun.wang Bo-Chun Wang added a comment -

            I have a good run with build 4814.

            http://perf.jenkins.couchbase.com/job/rhea-5node1/896/

             

            owend Daniel Owen added a comment -

            Resolving as a duplicate of GOCBC-1073

            mihir.kamdar Mihir Kamdar added a comment -

            Bulk closing non-fixed defects. Feel free to reopen if necessary.


            People

              Assignee: Bo-Chun Wang
              Reporter: Bo-Chun Wang
              Votes: 0
              Watchers: 5
