Couchbase Server / MB-45333

cbbackupmgr restore: fails with --enable-bucket-config


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 5.0.0, 5.0.1, 6.5.0, 6.0.0, 5.5.0, 5.5.1, 5.5.2, 5.5.3, 6.0.1, 6.0.2, 5.5.4, 6.0.3, 5.5.5, 5.5.6, 6.0.4, 6.5.1, 6.0.5, 6.6.0, 6.6.1
    • Fix Version/s: Morpheus
    • Component/s: tools
    • Environment: 1 node cluster running Server Version: 6.6.1-9213
    • Triage: Untriaged

    Description

      Couchbase Version: 6.6.1-9213

      Scenario:

      1. Took a backup of the sample bucket beer-sample (item count: 7303).
      2. Edited the bucket settings, enabling flush and changing the evictionPolicy from valueOnly to full.
      3. Flushed the bucket.
      4. Restored the backup with the --enable-bucket-config flag.
      5. Flush was disabled and the evictionPolicy changed back to valueOnly. Item count of the bucket after the restore: 2390.

      Expected Behaviour:

      1. All items are restored successfully.
      2. The failed mutations count displays the actual number of failures, if any.

      Observed Behaviour:

      1. Not all items were restored, leading to data loss.
      2. The failed mutations count is not correct.

      Error:  Error restoring cluster: EOF

      The cbbackupmgr log file contained the following messages:

      2021-03-30T03:20:18.146-07:00 WARN: (Pool) (beer-sample) Failed to send document with key '<ud>harmon_brewing_company-brown_s_point_esb</ud>' because of an unexpected EOF -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:350
      2021-03-30T03:20:18.146-07:00 WARN: (Pool) (beer-sample) Failed to send document with key '<ud>southend_brewery_and_smokehouse_charleston-frostbite</ud>' because of an unexpected EOF -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:350
      2021-03-30T03:20:18.150-07:00 ERRO: (Scanworker) (vb 875) Failed to perform 'FullFileScan': failed to perform full scan: failed to perform full scan on chunk 0 for vBucket 875: mutation callback failed: EOF | {"bucket":"beer-sample","last_dispatched_to":"172.23.104.106:11210","last_dispatched_from":"172.23.104.106:50527","last_connection_id":"f314e562541286a2/38522d6c9b17eda2"} -- archive.scanWorker.func1() at source.go:326
      2021-03-30T03:20:18.150-07:00 (Stats) Stopping stat collection
      2021-03-30T03:20:18.151-07:00 WARN: (Pool) (beer-sample) Failed to send document with key '<ud>titletown_brewing-boxcar_brown_ale</ud>' because of an unexpected EOF -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:350
      2021-03-30T03:20:18.162-07:00 (Cmd) Error restoring cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket 'beer-sample': failed to transfer key value data: failed to transfer key value data: failed to perform full scan: failed to perform full scan on chunk 0 for vBucket 875: mutation callback failed: EOF | {"bucket":"beer-sample","last_dispatched_to":"172.23.104.106:11210","last_dispatched_from":"172.23.104.106:50527","last_connection_id":"f314e562541286a2/38522d6c9b17eda2"} 

      Logs attached, plus a screenshot of the cbbackupmgr commands executed.

      Attachments

        Issue Links


          Activity

            james.lee James Lee added a comment -

            For the benefit of others, Prateek Kumar and I have spoken briefly about this issue over Slack. It's not something that I've managed to reproduce yet; however, I believe I know what might be happening here.

            1) We begin the restore
            2) The bucket settings are passed to the Couchbase sink
            3) It sees that the settings aren't equal
            4) It modifies the settings so that they are
            4a) It disables flush
            4b) It changes the eviction policy
            5) We connect to the bucket
            6) Begin loading data
            7) The eviction policy change takes place
            8) The node disconnects us, we get a socket closed in flight/EOF error

            I believe this may be a race condition when editing the bucket settings, resulting in us hitting an expected behavior when modifying the eviction policy:

            Bucket Eviction Warning WebUI

            Changing eviction policy will restart the bucket. This will lead to closing all open connections and some downtime
            

            It's worth noting that once we've edited the bucket settings via the REST API, we will wait for the bucket to report as healthy; however, this may not be working in this scenario. This is what I'd like to investigate further before making any code changes.
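            The numbered hypothesis above can be sketched as a small simulation. This is illustrative only; the function and the status sequence are assumptions, not cbbackupmgr's actual code:

```go
// Illustrative simulation of the suspected race: the naive poller
// accepts the first "healthy" reading, but ns_server applies the
// eviction-policy change asynchronously, so that reading can predate
// the bucket restart entirely.
package main

import "fmt"

// statuses simulates successive REST polls of the bucket state: two
// stale "healthy" readings before the edit takes effect, then the
// warmup caused by the eviction-policy change, then "healthy" again.
var statuses = []string{"healthy", "healthy", "warmup", "healthy"}

// naiveWaitForHealthy returns the index of the first "healthy" poll,
// i.e. it decides to connect before the restart (index 2) has started.
func naiveWaitForHealthy() int {
	for i, s := range statuses {
		if s == "healthy" {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Printf("connected at poll %d; restart only begins at poll 2\n",
		naiveWaitForHealthy())
}
```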


            pvarley Patrick Varley added a comment -

            Observed Behaviour:

            All items were not restored leading to data loss.
            The failed Mutations count is not correct.

            There was no data loss; the restore process failed, as can be seen in the screenshot.

            The mutation stats reflect what it did before it failed.

            I have updated the title to reflect that it's not data loss.

            James Lee We should review the output, as I think this is the second time someone has missed that the restore failed.

            prateek.kumar Prateek Kumar added a comment -

            Cluster Logs: https://cb-engineering.s3.amazonaws.com/cbbackupmgr_restore/45333/collectinfo-2021-03-30T130117-ns_1%40127.0.0.1.zip

            The cluster has been in use for some time now, hence it might contain other log files as well.
            james.lee James Lee added a comment -

            To confirm, I believe my previous hypothesis is correct.

            Looking at the logs we see:

            cbbackupmgr logs

            2021-03-30T03:20:18.146-07:00 WARN: (Pool) (beer-sample) Failed to send document with key '<ud>harmon_brewing_company-brown_s_point_esb</ud>' because of an unexpected EOF -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:350
            2021-03-30T03:20:18.146-07:00 WARN: (Pool) (beer-sample) Failed to send document with key '<ud>southend_brewery_and_smokehouse_charleston-frostbite</ud>' because of an unexpected EOF -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:350
            2021-03-30T03:20:18.151-07:00 WARN: (Pool) (beer-sample) Failed to send document with key '<ud>titletown_brewing-boxcar_brown_ale</ud>' because of an unexpected EOF -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:350
            

            These EOF errors are an indication of the socket being closed. Looking at the memcached logs, we see:

            memcached logs

            2021-03-30T03:20:14.299538-07:00 INFO 84: HELO [{"a":"gocbcore/v9.0.7 cbbackupmgr-6.6.1-9213-4335b9e","i":"f314e562541286a2/38522d6c9b17eda2"}] XATTR, XERROR, Select bucket, Snappy, JSON, AltRequestSupport, SyncReplication, SubdocCreateAsDeleted [ {"ip":"172.23.104.106","port":50527} - {"ip":"172.23.104.106","port":11210} (not authenticated) ]
            2021-03-30T03:20:18.134514-07:00 INFO 84: The connected bucket is being deleted.. closing connection [ {"ip":"172.23.104.106","port":50527} - {"ip":"172.23.104.106","port":11210} (<ud>Administrator</ud>) ]
            

            This is a direct result of editing the bucket settings:

            eviction policy

            2021-03-30T03:20:12.988926-07:00 INFO Pushing new cluster config for bucket:[beer-sample] revision:[365]
            

            This indicates that our method of polling/waiting for the bucket to report as healthy is not sufficient in this case, i.e. we should not be connecting until the bucket has been recreated to change the eviction policy.

            james.lee James Lee added a comment -

            Looking at the behavior of the REST endpoint we use to determine whether the bucket is ready to take traffic, it does appear to correctly change state from healthy -> warmup -> healthy when changing the eviction policy; however, this is an asynchronous process. I suspect we're racing with ns_server: at the time we execute the REST request, the nodes are still reporting as healthy, meaning we continue on before the eviction policy change has even started.
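            One conceivable way to close this race (a sketch under assumptions; the revision-gated wait and all names here are hypothetical, not the actual fix) is to refuse a "healthy" reading until the bucket's config revision has advanced past the pre-edit value. The memcached logs show revision 365 being pushed for the new config; the pre-edit revision of 364 is assumed:

```go
// Hypothetical sketch: gate the health check on the bucket's config
// revision advancing past the pre-edit value, so a stale "healthy"
// reading taken before the eviction-policy change starts is ignored.
package main

import "fmt"

type bucketState struct {
	status   string
	revision int
}

// observations simulates successive REST polls: a stale "healthy" at
// the old revision, warmup once the new config is pushed, then
// "healthy" again at the new revision.
var observations = []bucketState{
	{status: "healthy", revision: 364}, // edit not yet applied
	{status: "warmup", revision: 365},  // bucket restarting
	{status: "healthy", revision: 365}, // safe to connect
}

// waitForHealthyAfter only accepts "healthy" once the revision has
// moved past the pre-edit value, skipping the stale first reading.
func waitForHealthyAfter(oldRev int) int {
	for i, o := range observations {
		if o.revision > oldRev && o.status == "healthy" {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println("safe to connect at poll", waitForHealthyAfter(364))
}
```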

            james.lee James Lee added a comment -

            To clarify a few things about the affected versions:
            1) 'cbbackupmgr' has been using this polling/wait behavior since the ability to create/edit buckets was initially introduced (around 2015)
            2) I've also tested this behavior against a 5.5.2 cluster (using a 7.0.0 version of 'cbbackupmgr') and I'm able to reproduce it


            pvarley Patrick Varley added a comment -

            This is the problem with the --enable-bucket-config option on a restore to a bucket that already exists: it will overwrite settings in ways that affect the restore, like setting the bucket quota too small when the existing bucket has data in it. In this case it changed the eviction policy, which causes a bucket restart. I think the options here are: remove --enable-bucket-config (is it a useful option when there is --auto-create-buckets?), or do not change settings that cause a bucket restart.

            Dropping this down to minor as it's an edge case and moving it out to Morpheus.

            People

              Assignee: james.lee James Lee
              Reporter: prateek.kumar Prateek Kumar