Couchbase Go SDK / GOCBC-905

'WaitUntilReady' is not correctly returning errors


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.1.3
    • Component/s: library
    • Labels: None
    • 1

    Description

      What's the problem?
      When using 'WaitUntilReady' to wait for the gocbcore agent to connect to the cluster, we see an 'unambiguous timeout' error when in reality the server disconnected us because we were using the 'backfill_order' control flag (not yet merged in CC).

      What do we expect to see?
      When we get disconnected from the server, the error should be bubbled up to cbbackupmgr so that it can be handled correctly and returned to the user. I imagine that this isn't the only case in which a timeout will be masking an error that has occurred behind the scenes.

      Steps to reproduce
      Patrick Varley has posted a concise set of steps to reproduce this issue with cbbackupmgr in MB-39653; to recap briefly:
      1) Install CC build 2208 onto a CentOS 7 vagrant
      2) Configure a one node cluster with only the data service
      3) Create a bucket
      4) Load some data in the bucket using cbworkloadgen
      5) Run a backup

      If we look in the memcached logs we will see:

      2020-05-29T18:17:45.505412+00:00 INFO 44: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS [ [::1]:57896 - [::1]:11210 (<ud>Administrator</ud>) ]
      2020-05-29T18:17:45.505588+00:00 WARNING 44: (default) DCP (Producer) eq_dcpq:cbbackupmgr_2020-05-29T18:17:20Z_19653_0 - Invalid ctrl parameter 'sequential' for backfill_order
      2020-05-29T18:17:45.505734+00:00 INFO 44: (No Engine) DCP (Producer) eq_dcpq:cbbackupmgr_2020-05-29T18:17:20Z_19653_0 - Removing connection [ [::1]:57896 - [::1]:11210 (<ud>Administrator</ud>) ]
      

      However cbbackupmgr will display:

       /opt/couchbase/bin/cbbackupmgr backup -u Administrator -p password -c localhost -a backup -r MB-39653
      Backing up to '2020-05-29T18_17_20.039976728Z'
      Copying at 0B/s (about 0s remaining) - Transferring key value data for 'default'                                                                                                                                                                                             0 items / 0B
      [===============================================================================================================================================================================================================================================================================] 100.00%
      Error backing up cluster: operation has timed out
      Backed up bucket "default" failed
      Mutations backed up: 0, Mutations failed to backup: 0
      Deletions backed up: 0, Deletions failed to backup: 0
      Skipped due to purge number or conflict resolution: Mutations: 0 Deletions: 0
      


          Activity

            james.lee James Lee created issue -
            james.lee James Lee made changes -
            Field Original Value New Value
            Link This issue relates to MB-39653 [ MB-39653 ]
            james.lee James Lee added a comment -

            I have since recalled a conversation with Charles Dixon in which he said that "The 1 thing that you guys might care about is we don’t return an error on connect anymore. We’ll keep trying to connect under the hood until the agent is closed.". So I imagine that the "issue" I've described above is actually the expected behavior. However, I don't think this is the correct way to handle connecting to the node: there will always be some errors that should be bubbled up to the user, and a blanket timeout error is not particularly helpful to them. I might be wrong, though, because the 'WaitUntilReady' callback does accept an error.

            brett19 Brett Lawson made changes -
            Status New [ 10003 ] Open [ 1 ]
            charles.dixon Charles Dixon made changes -
            Fix Version/s 2.1.2 [ 16797 ]
            Fix Version/s .next [ 12427 ]
            charles.dixon Charles Dixon made changes -
            Link This issue relates to GOCBC-868 [ GOCBC-868 ]
            charles.dixon Charles Dixon made changes -
            Fix Version/s 2.1.3 [ 16905 ]
            Fix Version/s 2.1.2 [ 16797 ]
            james.lee James Lee made changes -
            Link This issue causes MB-40140 [ MB-40140 ]
            james.lee James Lee made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            charles.dixon Charles Dixon made changes -
            Fix Version/s 2.1.3 [ 17009 ]
            Fix Version/s 2.1.4 [ 16905 ]
            charles.dixon Charles Dixon made changes -
            Assignee Brett Lawson [ brett19 ] Charles Dixon [ charles.dixon ]
            charles.dixon Charles Dixon made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]

            People

              Assignee: Charles Dixon
              Reporter: James Lee