  Couchbase Server / MB-43298

Cluster takes over 60 seconds to create 1000 collections


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: Cheshire-Cat
    • Fix Version/s: 7.0.0
    • Component/s: test-execution
    • Environment:
      CentOS 7.6
    • Triage:
      Untriaged
    • Operating System:
      CentOS 64-bit
    • Story Points:
      1
    • Is this a Regression?:
      Yes

      Description

      Install Couchbase Server 7.0.0-4027 on 4 CentOS 7 servers.

      Create the default bucket on node 224.

      Create 2 scopes with 500 collections in each scope (see the sketch after the error output below).

      Run a backup (no data loaded).

      Create the default bucket on node 226.

      Run the restore; it failed with the following error:

      Error restoring cluster: failed to execute cluster operations: failed to execute bucket operation for bucket 'default': failed to automatically create collections: failed to update scopes/collections: failed to update cluster collection manifest: failed to wait for manifest change to propagate: timeout reached whilst waiting for manifest id 1 to have propagated throughout the cluster, waited 5m0s 
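
      For reference, a minimal sketch of the reproduction steps above, assuming default credentials; the node IPs, bucket, archive path, and collection naming scheme are taken from the logs in this ticket:

      # Sketch: create 2 scopes with 500 collections each on node 224
      # (not the exact test script; couchbase-cli flags per the 7.0 docs).
      for s in 1 2; do
        couchbase-cli collection-manage -c http://172.23.121.224:8091 \
          -u Administrator -p password --bucket default --create-scope "scope$s"
        for i in $(seq 0 499); do
          couchbase-cli collection-manage -c http://172.23.121.224:8091 \
            -u Administrator -p password --bucket default \
            --create-collection "scope$s.mycollection_scope${s}_$i"
        done
      done

      # Back up (no data loaded), then restore against node 226:
      cbbackupmgr config --archive /tmp/entbackup_172.23.121.224 --repo backup
      cbbackupmgr backup --archive /tmp/entbackup_172.23.121.224 --repo backup \
        --cluster http://172.23.121.224:8091 --username Administrator --password password
      cbbackupmgr restore --archive /tmp/entbackup_172.23.121.224 --repo backup \
        --cluster http://172.23.121.226:8091 --username Administrator --password password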

       

       


            Activity

            Patrick Varley added a comment - edited

            From the backup.log, it looks like the cluster manager responded with a 504 error:

            backup.log

            2020-12-14T21:45:25.002-08:00 (REST) (Attempt 1) (POST) Dispatching request to 'http://172.23.121.226:8091/pools/default/buckets/default/scopes/%40ensureManifest/1?timeout=300000'
            2020-12-14T21:45:55.066-08:00 (REST) (Attempt 1) (POST) (504) Received response from 'http://172.23.121.226:8091/pools/default/buckets/default/scopes/%40ensureManifest/1?timeout=300000'
            2020-12-14T21:45:55.067-08:00 WARN: (REST) (Attempt 1) (POST) Request to endpoint '/pools/default/buckets/default/scopes/%40ensureManifest/1' failed with status code 504 -- rest.(*Request).Execute() at request.go:142
            2020-12-14T21:45:55.067-08:00 (Stats) Stopping stat collection
            2020-12-14T21:45:55.093-08:00 (Cmd) Error restoring cluster: failed to execute cluster operations: failed to execute bucket operation for bucket 'default': failed to automatically create collections: failed to update scopes/collections: failed to update cluster collection manifest: failed to wait for manifest change to propagate: timeout reached whilst waiting for manifest id 1 to have propagated throughout the cluster, waited 5m0s
            

            Interestingly, the response came back within 30 seconds, not the 5 minutes reported by cbbackupmgr. That needs fixing.
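
            The same ensureManifest request can be replayed by hand (credentials assumed) to see which side enforces the 30-second cutoff; note the timeout query parameter is in milliseconds:

            # Replay the manifest-propagation check by hand and report the status code:
            time curl -s -o /dev/null -w '%{http_code}\n' -u Administrator:password -X POST \
              'http://172.23.121.226:8091/pools/default/buckets/default/scopes/@ensureManifest/1?timeout=300000'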

            The HTTP access log confirms the same error:

            172.23.121.226-http_access.log

            172.23.121.227 - Administrator [14/Dec/2020:11:48:41 -0800] "POST /pools/default/buckets/default/scopes/%40ensureManifest/1?timeout=300000 HTTP/1.1" 504 7 - "cbbackupmgr-7.0.0-4027-716e6b8" 30055
            

            The 504 feels like a slightly strange response, as that is "504 Gateway Timeout".

            Looking at the ns_server error log for that node, it does not cover the time in question:

            172.23.121.226-error.log

            [ns_server:error,2020-12-14T21:44:59.663-08:00,ns_1@172.23.121.226:service_agent-index<0.20776.58>:service_agent:handle_info:291]Lost json rpc connection for service index, reason shutdown. Terminating.
            [ns_server:error,2020-12-14T21:44:59.664-08:00,ns_1@172.23.121.226:service_agent-index<0.20776.58>:service_agent:terminate:312]Terminating abnormally
            [ns_server:error,2020-12-14T21:45:00.924-08:00,ns_1@172.23.121.226:service_rebalancer-fts<0.21105.58>:service_rebalancer:run_rebalance_worker:128]Got exit message from parent: {'EXIT',<0.21138.58>,shutdown}
            [ns_server:error,2020-12-14T21:45:00.925-08:00,ns_1@172.23.121.226:service_agent-fts<0.20766.58>:service_agent:handle_info:287]Rebalancer <0.21105.58> died unexpectedly: shutdown
            

            Was the node shut down during the restore, given the index message? Looking at the other logs, that does not seem to be the case; maybe something else was going on with the index service.
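
            A crude check is to scan the node's logs for shutdown or termination markers around that window, for example:

            # Look for shutdown/termination messages in the node's error log (crude check):
            grep -E 'shutdown|Terminat' 172.23.121.226-error.log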

            Looking further at memcached.log, it looks like the collections were created, but the operation took almost a minute:

            2020-12-14T21:46:27.688408-08:00 INFO (default) collections: vb:1023 adding collection:id:0x8, name:mycollection_scope1_48 to scope:0x8, maxTTL:false 0, replica:false, seqno:1002, manifest:0x1, force:false
            2020-12-14T21:46:27.688802-08:00 WARNING 45: Slow operation. {"cid":"{"ip":"127.0.0.1","port":37850}/0","duration":"55 s","trace":"request=5133760768522933:55994758","command":"COLLECTIONS_SET_MANIFEST","peer":"{"ip":"127.0.0.1","port":37850}","bucket":"default","packet":{"bodylen":46710,"cas":0,"datatype":"raw","extlen":0,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COLLECTIONS_SET_MANIFEST","vbucket":0}}
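
            The trailing JSON in these slow-operation lines contains nested quotes, so it is not strictly valid JSON; a plain grep is the more reliable way to pull them out of a collected memcached.log:

            # Surface slow operations and their reported durations:
            grep 'Slow operation' memcached.log
            grep 'Slow operation' memcached.log | grep -o '"duration":"[^"]*"'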
            

            It looks like the value of the timeout is not what cbbackupmgr expected.

            Is the use of a 504 error correct here? I thought maybe 408, but the usage of HTTP error codes is not well defined.

            Patrick Varley added a comment - edited

            Speaking to the ns_server team, I got answers to my questions:

            Is the use of a 504 error correct here? I thought maybe 408, but the usage of HTTP error codes is not well defined.

            5XX codes are for server-side issues, whereas 4XX codes are for client-side issues. As a result, 504 is the closest match.

            It looks like the value of the timeout is not what cbbackupmgr expected.

            It turns out cbbackupmgr is not setting the timeout correctly - I have created MB-43351 for that issue.

            Even with the change above (MB-43351) and the timeout set to 60 seconds, this issue would still be hit. From testing on a laptop, 1200 collections can be created in 10 seconds (see the timing sketch below). As a result, there looks to be an environmental issue causing collection creation to be slow.
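
            For reference, a minimal sketch of that kind of timing test, assuming the 7.0 REST collection-management endpoints and default credentials; the node address, scope, and collection names are placeholders:

            # Time bulk creation of a scope plus 1200 collections against a single node:
            time (
              curl -s -u Administrator:password -X POST \
                http://127.0.0.1:8091/pools/default/buckets/default/scopes -d name=scope1 > /dev/null
              for i in $(seq 0 1199); do
                curl -s -u Administrator:password -X POST \
                  http://127.0.0.1:8091/pools/default/buckets/default/scopes/scope1/collections \
                  -d "name=collection_$i" > /dev/null
              done
            )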

            Removing the Tools component and assigning back to Thuan Nguyen. It might be worth looking at what the load on the system was at the time.
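
            On CentOS, if the sysstat service was collecting at the time, the historical load for that window can usually be recovered, e.g.:

            # Load averages and CPU utilisation for the 14th, if sysstat data exists:
            sar -q -f /var/log/sa/sa14
            sar -u -f /var/log/sa/sa14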

            Thuan Nguyen added a comment

            At the time the collections were created, the test only created collections; no data was loaded and nothing else was running. I will run the test again to double-check.

            Thuan Nguyen added a comment

            Just ran the restore under the time command. cbbackupmgr times out at 30 seconds, not the 5 minutes shown in its output:

            [root@s44018 ~]# time /opt/couchbase/bin//cbbackupmgr restore --archive /tmp/entbackup_172.23.121.224 --repo backup --cluster http://172.23.121.226:8091 --username Administrator  --password password 
            Restoring backup '2020-12-22T18_02_46.063799175-08_00'
            Automatically creating scopes/collections for 'default'                                                                                                                                                                                             0 items / 0B
            [                                                                                                                                                                                                                                                        ] 0.10%
            Error restoring cluster: timeout reached whilst waiting for manifest id 7 to have propagated throughout the cluster, waited 5m0s
             
             
            real	0m30.459s
            user	0m0.254s
            sys	0m0.131s
            [root@s44018 ~]#  

             

            Patrick Varley added a comment - edited

            Just ran the restore under the time command. cbbackupmgr times out at 30 seconds, not the 5 minutes shown in its output.

            I explained that in a previous comment and opened a defect for that issue:

            It turns out cbbackupmgr is not setting the timeout correctly - I have created MB-43351 for that issue.


            That does not detract from the root issue, which is why it takes so long in this environment to create the collections. As I mentioned previously:

            Even with the change above (MB-43351) and the timeout set to 60 seconds, this issue would still be hit. From testing on a laptop, 1200 collections can be created in 10 seconds. As a result, there looks to be an environmental issue causing collection creation to be slow.

            Assigning back to Thuan Nguyen; the timeout issue will be fixed, but that will not resolve this issue.

            Thuan Nguyen added a comment

            I will try again on the latest build and update here soon. If I cannot reproduce this issue, I will close this ticket.

            Arunkumar Senthilnathan added a comment

            Thuan Nguyen, please reopen if you are able to reproduce it.


              People

              Assignee:
              Thuan Nguyen
              Reporter:
              Thuan Nguyen
              Votes:
              0
              Watchers:
              5
