Details
- Bug
- Status: Closed
- Test Blocker
- Resolution: Duplicate
- Cheshire-Cat
- Triaged
- 1
- Unknown
Description
Problem
A Magma backup run failed on build 7.0.0-4797. The backup process ran for 14 hours until I killed it. http://perf.jenkins.couchbase.com/job/rhea-5node2/952/
07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host http://172.23.97.26 --username Administrator --password password --threads 16 --storage sqlite
21:39:19 Build was aborted
I can reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected and uploaded the logs.
- Backup was stuck: Backup progress reached 99% and then stalled. I waited several minutes and saw no further progress. The Jenkins run above most likely hit this issue, which would explain why it ran for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip.
root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host http://172.23.105.72 --username Administrator --password password --threads 16 --storage sqlite
Backing up to '2021-03-29T11_48_51.331337778-07_00'
Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB
[=================================================================================================================================================================================================================================== ] 99.13%
^Cinterrupt
Steps to reproduce
Expectation
The backup should complete successfully.
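Since the first failure mode is a hang rather than an error, a test harness could bound the wall-clock time of the backup so that a stuck run fails fast instead of holding the job for hours. The following is a minimal sketch only: `run_with_deadline` is a hypothetical helper, not part of cbbackupmgr or the perf framework, and the child process is a stand-in for the real backup command.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising TimeoutExpired,
        # so no stuck process is left behind.
        return None


# Stand-in for a hung backup: a child process that sleeps past the deadline.
hung = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_deadline(hung, timeout_s=1))  # None
```

A `None` result can then be reported as a timeout, separately from a non-zero exit code. A suitable deadline for a real backup would depend on the data set size and is not something this ticket specifies.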
Attachments
Issue Links
- duplicates: GOCBC-1073 DCP can get stuck during closing (Resolved)
- relates to: MB-45322 [Magma] Backup performance runs hit operation timedout on 7.0.0-4797 (Closed)
Activity
Field | Original Value | New Value |
---|---|---|
Attachment | Screen Shot 2021-03-29 at 2.04.14 PM.png [ 133132 ] |
Description |
Magma backup run failed on build 7.0.0-4797. The backup process was running for 14 hours until I killed it.
[http://perf.jenkins.couchbase.com/job/rhea-5node2/952/] {quote}07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host http://172.23.97.26 --username Administrator --password password --threads 16 --storage sqlite 21:39:19 Build was aborted {quote} I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. *1. backup was stuck:* Backup progress reached 99% and then stuck. I waited for several minutes and didn't see any progress. The run above should hit this issue. That's why it's running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip. root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host http://172.23.105.72 --username Administrator --password password --threads 16 --storage sqlite Warning: --host is deprecated, use -c/–cluster Backing up to '2021-03-29T11_48_51.331337778-07_00' Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB [=================================================================================================================================================================================================================================== ] 99.13% ^Cinterrupt 2. *Backup failed at the beginning*: During my investigation, my runs hit another issue. I check other backup performance runs and saw they failed because of the same issue. Please find output below. The log file of this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115941.zip. 
root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host http://172.23.105.72 --username Administrator --password password --threads 16 --storage sqlite Warning: --host is deprecated, use -c/--cluster Backing up to '2021-03-29T11_57_52.872118914-07_00' Transferring key value data for bucket 'bucket-1' at 0B/s (about 1h59m42s remaining) 0 items / 0B [== ] 1.06% Error backing up cluster: operation has timed out | Transfer | -------- | Status | Avg Transfer Rate | Started At | Finished At | Duration | | Failed | 0B | Mon, 29 Mar 2021 11:57:52 -0700/s | Mon, 29 Mar 2021 11:59:10 -0700 | 1m17s | | Bucket | ------ | Name | Status | Transferred | Avg Transfer Rate | Started At | Finished At | Duration | | bucket-1 | Failed | 0B | 0B/s | Mon, 29 Mar 2021 11:58:09 -0700 | Mon, 29 Mar 2021 11:59:10 -0700 | 1m1s | | | Mutations | Deletions | Expirations | | --------- | --------- | ----------- | | Received | Errored | Skipped | Received | Errored | Skipped | Received | Errored | Skipped | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Example: performance run hits the second issue. [http://perf.jenkins.couchbase.com/job/leto/17806/] {quote}2021-03-29T02:22:17 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /data/workspace/backup --repo default --host http://leto-srv-01.perf.couchbase.com --username Administrator --password password --threads 16 --storage forestdb Fatal error: local() encountered an error (return code 1) while executing './opt/couchbase/bin/cbbackupmgr backup --archive /data/workspace/backup --repo default --host http://leto-srv-01.perf.couchbase.com --username Administrator --password password --threads 16 --storage forestdb' Aborting. {quote} |
Magma backup run failed on build 7.0.0-4797. The backup process was running for 14 hours until I killed it.
[http://perf.jenkins.couchbase.com/job/rhea-5node2/952/] {quote}07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host [http://172.23.97.26|http://172.23.97.26/] --username Administrator --password password --threads 16 --storage sqlite 21:39:19 Build was aborted {quote} I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. *1. backup was stuck:* Backup progress reached 99% and then stuck. I waited for several minutes and didn't see any progress. The run above should hit this issue. That's why it's running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip. root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_48_51.331337778-07_00' Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB [=================================================================================================================================================================================================================================== ] 99.13% ^Cinterrupt 2. *Backup failed at the beginning*: During my investigation, my runs hit another issue. I check other backup performance runs and saw they failed because of the same issue. Please find output below. The log file of this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115941.zip. 
root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_57_52.872118914-07_00' Transferring key value data for bucket 'bucket-1' at 0B/s (about 1h59m42s remaining) 0 items / 0B [== ] 1.06% Error backing up cluster: operation has timed out !Screen Shot 2021-03-29 at 2.04.14 PM.png|width=800,height=250! Example: performance run hits the second issue. [http://perf.jenkins.couchbase.com/job/leto/17806/] {quote}2021-03-29T02:22:17 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /data/workspace/backup --repo default --host [http://leto-srv-01.perf.couchbase.com|http://leto-srv-01.perf.couchbase.com/] --username Administrator --password password --threads 16 --storage forestdb Fatal error: local() encountered an error (return code 1) while executing './opt/couchbase/bin/cbbackupmgr backup --archive /data/workspace/backup --repo default --host [http://leto-srv-01.perf.couchbase.com|http://leto-srv-01.perf.couchbase.com/] --username Administrator --password password --threads 16 --storage forestdb' Aborting. {quote} |
Assignee | Daniel Owen [ owend ] | Bo-Chun Wang [ bo-chun.wang ] |
Triage | Untriaged [ 10351 ] | Triaged [ 10350 ] |
Assignee | Bo-Chun Wang [ bo-chun.wang ] | Patrick Varley [ pvarley ] |
Description |
Magma backup run failed on build 7.0.0-4797. The backup process was running for 14 hours until I killed it.
[http://perf.jenkins.couchbase.com/job/rhea-5node2/952/] {quote}07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host [http://172.23.97.26|http://172.23.97.26/] --username Administrator --password password --threads 16 --storage sqlite 21:39:19 Build was aborted {quote} I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. *1. backup was stuck:* Backup progress reached 99% and then stuck. I waited for several minutes and didn't see any progress. The run above should hit this issue. That's why it's running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip. root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_48_51.331337778-07_00' Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB [=================================================================================================================================================================================================================================== ] 99.13% ^Cinterrupt 2. *Backup failed at the beginning*: During my investigation, my runs hit another issue. I check other backup performance runs and saw they failed because of the same issue. Please find output below. The log file of this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115941.zip. 
root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_57_52.872118914-07_00' Transferring key value data for bucket 'bucket-1' at 0B/s (about 1h59m42s remaining) 0 items / 0B [== ] 1.06% Error backing up cluster: operation has timed out !Screen Shot 2021-03-29 at 2.04.14 PM.png|width=800,height=250! Example: performance run hits the second issue. [http://perf.jenkins.couchbase.com/job/leto/17806/] {quote}2021-03-29T02:22:17 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /data/workspace/backup --repo default --host [http://leto-srv-01.perf.couchbase.com|http://leto-srv-01.perf.couchbase.com/] --username Administrator --password password --threads 16 --storage forestdb Fatal error: local() encountered an error (return code 1) while executing './opt/couchbase/bin/cbbackupmgr backup --archive /data/workspace/backup --repo default --host [http://leto-srv-01.perf.couchbase.com|http://leto-srv-01.perf.couchbase.com/] --username Administrator --password password --threads 16 --storage forestdb' Aborting. {quote} |
Magma backup run failed on build 7.0.0-4797. The backup process was running for 14 hours until I killed it.
[http://perf.jenkins.couchbase.com/job/rhea-5node2/952/] {noformat}07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host [http://172.23.97.26|http://172.23.97.26/] --username Administrator --password password --threads 16 --storage sqlite 21:39:19 Build was aborted {noformat} I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. *1. backup was stuck:* Backup progress reached 99% and then stuck. I waited for several minutes and didn't see any progress. The run above should hit this issue. That's why it's running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip. {noformat} root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_48_51.331337778-07_00' Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB [=================================================================================================================================================================================================================================== ] 99.13% ^Cinterrupt {noformat} |
Description |
Magma backup run failed on build 7.0.0-4797. The backup process was running for 14 hours until I killed it.
[http://perf.jenkins.couchbase.com/job/rhea-5node2/952/] {noformat}07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host [http://172.23.97.26|http://172.23.97.26/] --username Administrator --password password --threads 16 --storage sqlite 21:39:19 Build was aborted {noformat} I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. *1. backup was stuck:* Backup progress reached 99% and then stuck. I waited for several minutes and didn't see any progress. The run above should hit this issue. That's why it's running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip. {noformat} root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_48_51.331337778-07_00' Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB [=================================================================================================================================================================================================================================== ] 99.13% ^Cinterrupt {noformat} |
Magma backup run failed on build 7.0.0-4797. The backup process was running for 14 hours until I killed it.
[http://perf.jenkins.couchbase.com/job/rhea-5node2/952/] {noformat}07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host [http://172.23.97.26|http://172.23.97.26/] --username Administrator --password password --threads 16 --storage sqlite 21:39:19 Build was aborted {noformat} I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. * backup was stuck:* Backup progress reached 99% and then stuck. I waited for several minutes and didn't see any progress. The run above should hit this issue. That's why it's running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip. {noformat} root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_48_51.331337778-07_00' Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB [=================================================================================================================================================================================================================================== ] 99.13% ^Cinterrupt {noformat} |
Attachment | cbbackupmgr-collectinfo-backup-2021-03-29T115941.zip [ 133130 ] |
Attachment | Screen Shot 2021-03-29 at 2.04.14 PM.png [ 133132 ] |
Description |
Magma backup run failed on build 7.0.0-4797. The backup process was running for 14 hours until I killed it.
[http://perf.jenkins.couchbase.com/job/rhea-5node2/952/] {noformat}07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host [http://172.23.97.26|http://172.23.97.26/] --username Administrator --password password --threads 16 --storage sqlite 21:39:19 Build was aborted {noformat} I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. * backup was stuck:* Backup progress reached 99% and then stuck. I waited for several minutes and didn't see any progress. The run above should hit this issue. That's why it's running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip. {noformat} root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_48_51.331337778-07_00' Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB [=================================================================================================================================================================================================================================== ] 99.13% ^Cinterrupt {noformat} |
+Probem+
Magma backup run failed on build 7.0.0-4797. The backup process was running for 14 hours until I killed it. [http://perf.jenkins.couchbase.com/job/rhea-5node2/952/] {noformat} 07:01:02 2021-03-27T07:01:02 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr backup --archive /workspace/backup --repo default --host [http://172.23.97.26|http://172.23.97.26/] --username Administrator --password password --threads 16 --storage sqlite 21:39:19 Build was aborted {noformat} I am able to reproduce this issue on VMs. During my investigation, my runs hit two different issues. I have collected logs and uploaded. * backup was stuck:* Backup progress reached 99% and then stuck. I waited for several minutes and didn't see any progress. The run above should hit this issue. That's why it's running for 14 hours. The log file for this issue is cbbackupmgr-collectinfo-backup-2021-03-29t115703.zip. {noformat} root@ubuntu:/tmp/magma_backup# ./opt/couchbase/bin/cbbackupmgr backup --archive /tmp/backup --repo default --host [http://172.23.105.72|http://172.23.105.72/] --username Administrator --password password --threads 16 --storage sqlite Backing up to '2021-03-29T11_48_51.331337778-07_00' Transferring key value data for bucket 'bucket-1' at 123.39KiB/s (about 2s remaining) 99655 items / 31.87MiB [=================================================================================================================================================================================================================================== ] 99.13% ^Cinterrupt {noformat} +Steps to reproduce+ +Expectation+ For backup to be successful |
Fix Version/s | CheshireCat.Next [ 16908 ] | |
Fix Version/s | Cheshire-Cat [ 15915 ] |
Assignee | Patrick Varley [ pvarley ] | Daniel Owen [ owend ] |
Component/s | couchbase-bucket [ 10173 ] | |
Component/s | storage-engine [ 10175 ] | |
Component/s | tools [ 10223 ] |
Epic Link |
|
Summary | Backup performance tests failed on 7.0.0-4797 | [Magma] Backup performance tests failed on 7.0.0-4797 |
Component/s | tools [ 10223 ] | |
Component/s | couchbase-bucket [ 10173 ] | |
Component/s | storage-engine [ 10175 ] |
Epic Link |
|
Fix Version/s | Cheshire-Cat [ 15915 ] | |
Fix Version/s | CheshireCat.Next [ 16908 ] |
Link |
This issue duplicates |
Resolution | Duplicate [ 3 ] | |
Status | Open [ 1 ] | Resolved [ 5 ] |
Assignee | Daniel Owen [ owend ] | Bo-Chun Wang [ bo-chun.wang ] |
Status | Resolved [ 5 ] | Closed [ 6 ] |
Fix Version/s | 7.0.0 [ 17233 ] |
Fix Version/s | Cheshire-Cat [ 15915 ] |
Bo-Chun Wang, as there are two different issues here, could you please open a second defect so that we do not confuse things?
Focusing on the logs for 2021-03-29T115703, the last messages before the program was terminated were:
2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 915) Stream has been inactive for 1m0s, last seqno 46 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 647) Stream has been inactive for 1m0s, last seqno 68 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 835) Stream has been inactive for 1m0s, last seqno 44 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 739) Stream has been inactive for 1m0s, last seqno 84 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 1010) Stream has been inactive for 1m0s, last seqno 52 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 76) Stream has been inactive for 1m0s, last seqno 56 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252
2021-03-29T11:53:20.987-07:00 (Signal Handler) Signal `interrupt` received, exiting
goroutine 35 [running]:
This suggests that cbbackupmgr has not hung and is waiting on data for these vBuckets.
Following vb 76 we can see that the stream was opened as follows:
2021-03-29T11:49:00.388-07:00 (DCP) (bucket-1) (vb 76) Creating DCP stream | {"uuid":26490595435443,"start_seqno":0,"end_seqno":102,"snap_start":0,"snap_end":0,"retries":0}
At the time cbbackupmgr detected that no traffic had been sent for vb 76 for one minute, the last seqno it had seen was 56, short of the stream's end_seqno of 102.
There can be a number of causes for this, including network issues and server-side problems. Can you please provide the server-side logs so we can investigate this further?
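The vb 76 check above can be repeated for every vBucket by pairing each "Creating DCP stream" line with the latest inactivity warning for the same vBucket. This is a rough sketch, assuming the log lines keep the format quoted in this comment; `find_stalled_streams` and the regular expressions are illustrative and not part of cbbackupmgr.

```python
import json
import re

# Patterns derived from the two log line shapes quoted above.
CREATE_RE = re.compile(r"\(vb (?P<vb>\d+)\) Creating DCP stream \| (?P<meta>\{.*\})")
INACTIVE_RE = re.compile(
    r"\(vb (?P<vb>\d+)\) Stream has been inactive for [^,]+, last seqno (?P<seqno>\d+)"
)


def find_stalled_streams(lines):
    """Map vb -> (last_seqno, end_seqno) for inactive streams short of end_seqno."""
    end_seqnos = {}
    last_seen = {}
    for line in lines:
        m = CREATE_RE.search(line)
        if m:
            end_seqnos[int(m["vb"])] = json.loads(m["meta"])["end_seqno"]
            continue
        m = INACTIVE_RE.search(line)
        if m:
            # Later warnings for the same vb overwrite earlier ones.
            last_seen[int(m["vb"])] = int(m["seqno"])
    return {
        vb: (seqno, end_seqnos[vb])
        for vb, seqno in last_seen.items()
        if vb in end_seqnos and seqno < end_seqnos[vb]
    }


sample = [
    '2021-03-29T11:49:00.388-07:00 (DCP) (bucket-1) (vb 76) Creating DCP stream | {"uuid":26490595435443,"start_seqno":0,"end_seqno":102,"snap_start":0,"snap_end":0,"retries":0}',
    "2021-03-29T11:51:00.369-07:00 WARN: (DCP) (bucket-1) (vb 76) Stream has been inactive for 1m0s, last seqno 56 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:252",
]
print(find_stalled_streams(sample))  # {76: (56, 102)}
```

Running this over the full backup log would list each vBucket still waiting on data and how far it is from its end seqno, which may help narrow down which nodes to look at in the server-side logs.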