  1. Couchbase Server
  2. MB-43845

cbbackupmgr: lots of failed attempts


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.6.0
    • Fix Version/s: 7.0.0
    • Component/s: tools
    • Triage: Untriaged
    • 1
    • Unknown

    Attachments

      Issue Links


        Activity

          perry Perry Krug created issue -
          perry Perry Krug made changes -
          Field Original Value New Value
          Link This issue Clones MB-43844 [ MB-43844 ]
          perry Perry Krug made changes -
          Link This issue is cloned by MB-43846 [ MB-43846 ]
          owend Daniel Owen made changes -
          Fix Version/s Cheshire-Cat [ 15915 ]
          james.lee James Lee added a comment -

          Hi Perry Krug,

          Could you expand on what the issue is here (and possibly what the use case/situation is)? I've had a quick look at the logs (assuming the logs from MB-43844 are the correct ones), and I can see that 'cbbackupmgr' is having to handle lots of timeouts; in this case, the cluster appears to be taking a very long time to respond to a simple "vbucket-details" stats call:

          Timeouts Calculating Data Range

          2020-11-09T07:08:38.154-08:00 WARN: (Couchbase) Unexpected error 'operation timed out after 5s' while trying to get sequence numbers, will retry in 5s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:38
          ...
          2020-11-09T07:11:03.535-08:00 WARN: (Couchbase) Unexpected error 'operation timed out after 5s' while trying to get sequence numbers, will retry in 5s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:38
          2020-11-09T07:11:18.541-08:00 WARN: (Couchbase) Unexpected error 'operation timed out after 10s' while trying to get sequence numbers, will retry in 10s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:38
          ...
          2021-01-22T08:17:19.998-08:00 WARN: (Couchbase) Unexpected error 'operation timed out after 5s' while trying to get sequence numbers, will retry in 5s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:38
          2021-01-22T08:17:35.001-08:00 WARN: (Couchbase) Unexpected error 'operation timed out after 10s' while trying to get sequence numbers, will retry in 10s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:38
          2021-01-22T08:18:00.012-08:00 WARN: (Couchbase) Unexpected error 'operation timed out after 15s' while trying to get sequence numbers, will retry in 15s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:38
          2021-01-22T08:18:35.014-08:00 WARN: (Couchbase) Unexpected error 'operation timed out after 20s' while trying to get sequence numbers, will retry in 20s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:38
          2021-01-22T08:19:20.018-08:00 WARN: (Couchbase) Unexpected error 'operation timed out after 25s' while trying to get sequence numbers, will retry in 25s -- couchbase.GetSequenceNumbers() at sequence_numbers.go:38
          

          We see these log lines scattered throughout the logs; however, it looks like 'cbbackupmgr' is behaving as expected. Please note that failing fast upon receiving a rollback is the intended behavior in 'cbbackupmgr' (whether we're hitting it for valid reasons is another matter).

          I'll have a look through the cluster logs (after looking at MB-43846).

          Thanks,
          James

          perry Perry Krug added a comment -

          Thanks James, yes the logs are the same for those recent tickets I filed.

          The cluster itself is fairly undersized and so timeouts could be somewhat expected...but I was hoping that cbbackupmgr would just continue to progress as best it could (albeit slowly).

          However, I kept retrying and the error messages eventually switched to:

          Backing up to '2021-01-22T08_17_03.579930413-08_00'
          Transferring key value data for 'incoming' at 0B/s (about 0s remaining)                                                                                         0 items / 0B
          [==================================================================================================================================================================] 100.00%
          Error backing up cluster: client received rollback
          Backed up bucket "incoming" failed
          Mutations backed up: 0, Mutations failed to backup: 0
          Deletions backed up: 0, Deletions failed to backup: 0
          Skipped due to purge number or conflict resolution: Mutations: 0 Deletions: 0
          

          And it wouldn't proceed at all from there. When I spoke with Patrick, he mentioned that I should be using --purge at this point (I had been using --resume on all prior attempts). If that is in fact the recommended solution, it would be nice if the error message told the user that they should (at least try) using --purge.

          owend Daniel Owen made changes -
          Assignee Patrick Varley [ pvarley ] James Lee [ james.lee ]
          james.lee James Lee added a comment -

          Thanks for the information, I've also caught up with Patrick, since he mentioned that you'd be talking to him prior to raising these issues. It definitely looks like the cluster is undersized and 'cbbackupmgr' is having to compensate for that by retrying where possible (which is where MB-43846 comes in, i.e. we should be retrying).

          I agree, the rollback message should prompt the user on the steps forward to avoid them ending up in a resume-rollback scenario; I'll get a patch up to enhance the message returned by 'cbbackupmgr' in this scenario.

          james.lee James Lee made changes -
          Status Open [ 1 ] In Progress [ 3 ]

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.0.0-4309 contains backup commit c8fd444 with commit message:
          MB-43845 Improve the DCP rollback error message
          james.lee James Lee made changes -
          Link This issue relates to MB-37681 [ MB-37681 ]
          james.lee James Lee added a comment -

          Hi Perry Krug,

          I'm going to mark this as resolved; I've added some documentation about the steps the user can take once they receive a rollback from the cluster. The updated error message will direct the user to this documentation.

          james.lee James Lee made changes -
          Assignee James Lee [ james.lee ] Perry Krug [ perry ]
          Resolution Fixed [ 1 ]
          Status In Progress [ 3 ] Resolved [ 5 ]

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.0.0-4312 contains backup commit 460fc62 with commit message:
          MB-43845 Document the possible solutions to receiving a rollback
          arunkumar Arunkumar Senthilnathan made changes -
          Labels request-dev-verify
          james.lee James Lee added a comment -

          Closing as there's no new functionality to test; this was just a case of improving an error message and adding some new documentation, which can be found here.

          james.lee James Lee made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s 7.0.0 [ 17233 ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s Cheshire-Cat [ 15915 ]

          People

            Assignee: Perry Krug
            Reporter: Perry Krug
            Votes: 0
            Watchers: 3

