Couchbase Server / MB-25140

cbbackupmgr does not resume once hung due to low memory

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 4.6.3
    • 4.6.3
    • tools
    • Untriaged
    • Unknown

    Description

      Build : 4.6.3-4047

      Was running the incremental backup test with heavy KV ops in progress. The current step in the test was to take an incremental backup while the KV ops were ongoing. At some point the system ran low on memory, and cbbackupmgr hung for 4-5 hours. I then manually released some memory by freeing up some buckets and adding some swap space on the VM. Even several minutes later, cbbackupmgr had not resumed processing and remained stuck.

      The attachment contains the strace output, the backup directory, and backup.log.

          Activity

            Mihir,

            I don't have enough information to know what's going on here. Whenever you suspect a hang, you should run "kill -SIGQUIT <pid>" on cbbackupmgr. This will cause cbbackupmgr to log a stack trace showing what all of the goroutines are doing, which lets me see what cbbackupmgr is doing and whether there is a deadlock.

            Also, it's possible that Couchbase is not streaming anything to cbbackupmgr and that could cause the appearance of a hang. To see if this is the case I would also need to look at the Couchbase logs.

            Is it possible to re-run this test to get that information?

            mikew Mike Wiederhold [X] (Inactive) added a comment
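
The SIGQUIT advice above relies on a Go process being able to report what every goroutine is doing. By default the Go runtime reacts to SIGQUIT by printing all goroutine stacks and exiting. The sketch below shows one way a Go tool could instead log the stacks and keep running. It is an illustration only, not cbbackupmgr's actual code: `allGoroutineStacks` and `dumpOnSIGQUIT` are hypothetical names, and the self-signal in `main` stands in for an operator running `kill -SIGQUIT <pid>` (Unix only).

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

// allGoroutineStacks returns the stack traces of every goroutine in the
// process, growing the buffer until the whole dump fits.
func allGoroutineStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true = include all goroutines
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

// dumpOnSIGQUIT (hypothetical helper) writes a goroutine dump to w each time
// the process receives SIGQUIT. The returned channel gets a value after a
// dump has been written, so callers can wait for it.
func dumpOnSIGQUIT(w io.Writer) <-chan struct{} {
	sigs := make(chan os.Signal, 1)
	dumped := make(chan struct{}, 1)
	signal.Notify(sigs, syscall.SIGQUIT)
	go func() {
		for range sigs {
			fmt.Fprintf(w, "=== goroutine dump ===\n%s\n", allGoroutineStacks())
			select {
			case dumped <- struct{}{}:
			default: // nobody is waiting; do not block the handler
			}
		}
	}()
	return dumped
}

func main() {
	dumped := dumpOnSIGQUIT(os.Stderr)
	// Stand-in for an operator running `kill -SIGQUIT <pid>` (Unix only).
	if err := syscall.Kill(os.Getpid(), syscall.SIGQUIT); err != nil {
		fmt.Fprintln(os.Stderr, "could not send SIGQUIT:", err)
		return
	}
	<-dumped
}
```

In a dump like this, a deadlock typically shows up as goroutines parked on a lock or channel operation with nothing left that could wake them.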

            Mike Wiederhold [X]

            I could reproduce this issue. The environment is live and the backup process is "hung" right now, if you want to take a look.

            http://52.34.96.58:8091
            Backup location : /backupdata on 52.34.96.58

            mihir.kamdar Mihir Kamdar (Inactive) added a comment

            Mihir,

            I think I know how I want to fix this. You can take the cluster back.

            mikew Mike Wiederhold [X] (Inactive) added a comment

            The reason this failed is that the stats calls timed out multiple times and we were unable to get the latest sequence numbers. I'll add a change that makes sure we exit and print an appropriate error message.

            mikew Mike Wiederhold [X] (Inactive) added a comment
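
The behaviour described in the comment above (give up with a clear error once the stats calls keep timing out, rather than waiting forever) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the actual change: `seqnoFetcher`, `stuckFetcher`, and `latestSeqnos` are hypothetical names, and the attempt count and timeout are arbitrary. Only the error text is taken from this ticket.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// seqnoFetcher is a hypothetical stand-in for the stats call that returns
// the latest sequence number per vBucket.
type seqnoFetcher func(ctx context.Context) (map[uint16]uint64, error)

// errNoSeqnos carries the message quoted later in this ticket.
var errNoSeqnos = errors.New("Unable to find the latest vbucket sequence numbers. " +
	"This might be due to a node in the cluster being unreachable.")

// stuckFetcher never answers; it only returns once its context expires,
// like a stats call to a node that is starved of memory.
func stuckFetcher(ctx context.Context) (map[uint16]uint64, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// latestSeqnos tries the stats call a bounded number of times, each with its
// own timeout, and returns errNoSeqnos instead of blocking indefinitely.
func latestSeqnos(fetch seqnoFetcher, attempts int, timeout time.Duration) (map[uint16]uint64, error) {
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		seqnos, err := fetch(ctx)
		cancel()
		if err == nil {
			return seqnos, nil
		}
	}
	return nil, errNoSeqnos
}

func main() {
	if _, err := latestSeqnos(stuckFetcher, 3, 50*time.Millisecond); err != nil {
		// A real tool would also exit with a non-zero status here.
		fmt.Println("Error backing up cluster:", err)
	}
}
```

Bounding both the per-call timeout and the number of attempts is what turns an apparent hang into a failure the caller can see and act on.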

            Mihir,

            I think the fix from MB-25159 should fix this issue. Can you rerun this test?

            mikew Mike Wiederhold [X] (Inactive) added a comment
            wayne Wayne Siu added a comment -

            Mihir Kamdar

            Have you had a chance to rerun the test to confirm? Thanks.

            I have started a test using 4.6.3-4084. Will update this bug once the run is complete.

            mihir.kamdar Mihir Kamdar (Inactive) added a comment

            Verified with 4.6.3-4084. Ran into MB-25238. The backup taken after the rebalance failure failed with the correct error message: "Error backing up cluster: Unable to find the latest vbucket sequence numbers. This might be due to a node in the cluster being unreachable." Merging backups also failed with the correct error message, since the last backup was a partial one. I then took one more backup using the --purge option, and it went well. After this, the merge ran fine, and so did the restore. The test will be updated to handle the error "Error backing up cluster: Partial backup error" by retrying with the --purge or --resume option.

            mihir.kamdar Mihir Kamdar (Inactive) added a comment
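
The planned test change (retry with --purge or --resume when a backup reports a partial-backup error) could look roughly like the sketch below. It is an assumption-laden illustration, not the actual test code: `retryArgs` and its parameters are hypothetical, and the argument list is reduced to the `backup` subcommand, with the remaining arguments left out. The error text and the two flags are the ones quoted in this ticket.

```go
package main

import (
	"fmt"
	"strings"
)

// partialBackupErr is the error text quoted in this ticket.
const partialBackupErr = "Error backing up cluster: Partial backup error"

// retryArgs (hypothetical helper) inspects the output of a failed backup.
// If it reports a partial backup, it returns the original arguments plus
// --resume or --purge; otherwise it reports that no retry applies.
func retryArgs(args []string, output string, resume bool) ([]string, bool) {
	if !strings.Contains(output, partialBackupErr) {
		return nil, false
	}
	flag := "--purge"
	if resume {
		flag = "--resume"
	}
	// Copy first so the caller's slice is never modified by append.
	retry := append(append([]string{}, args...), flag)
	return retry, true
}

func main() {
	args := []string{"backup"} // remaining arguments omitted in this sketch
	output := "Error backing up cluster: Partial backup error"
	if retry, ok := retryArgs(args, output, false); ok {
		fmt.Println("retrying with:", strings.Join(retry, " "))
	}
}
```

Matching on the exact error text keeps the retry from masking unrelated failures, such as the unreachable-node error reported earlier in this ticket.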

            People

              mihir.kamdar Mihir Kamdar (Inactive)
              mihir.kamdar Mihir Kamdar (Inactive)