Couchbase Server / MB-62911

Abolish existing lease after node rename

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Morpheus
    • 7.6.1
    • ns_server
    • None
    • Untriaged
    • 0
    • Unknown

    Description

      A customer is concerned about the time it takes to bring up a single-node cluster with a bucket, which they do many times in their test environment. ns_server contributes to the problem, and Dave Finlay made the following observation.

      ===================

      I took a very quick initial look at the ns-server side based on the logs you provided (https://supportal.couchbase.com/snapshot/e4270e456372a32aea5fd315ec551ca0::0).

      My first observation is there may be an issue with the way that the leases are being managed. Here's what I see:

      Babysitter starts at 03:39:

      [ns_server:info,2024-07-14T14:03:39.333Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_babysitter:init_logging:113]Brought up babysitter logging
      

      NS-server starts at 03:41:

      [ns_server:info,2024-07-14T14:03:41.910Z,nonode@nohost:<0.155.0>:ns_server:init_logging:180]Started & configured logging
      

      The node is initially named cb.local - and it grants itself a 15s lease at 03:45:

      [ns_server:debug,2024-07-14T14:03:45.351Z,ns_1@cb.local:leader_lease_agent<0.752.0>:leader_lease_agent:do_handle_acquire_lease:140]Granting lease to {lease_holder,<<"f0b54c85ee7b40431dbbda7f91959fd1">>,
                                      'ns_1@cb.local'} for 15000ms
      

      Right around the same time, the web server is up:

      [user:info,2024-07-14T14:03:45.259Z,ns_1@cb.local:menelaus_sup<0.609.0>:menelaus_web_sup:start_link:38]Couchbase Server has started on web port 8091 on node 'ns_1@cb.local'. Version: "7.6.1-3200-enterprise".
      

      You immediately post a clusterInit against this endpoint. It's pretty quick (183 ms):

      192.168.65.1 - - [14/Jul/2024:14:03:46 +0000] "POST /clusterInit HTTP/1.1" 200 39 - "Java-http-client/17.0.11" 183 ### cluster init
      

      NS-server stops and starts net_kernel (due to the node name change). This happens pretty quickly.

      [ns_server:info,2024-07-14T14:03:45.854Z,nonode@nohost:<0.953.0>:dist_manager:bringup:246]Attempting to bring up net_kernel with name 'ns_1@127.0.0.1'
      

      And the bucket is created at 03:47:

      [menelaus:info,2024-07-14T14:03:47.083Z,ns_1@127.0.0.1:<0.928.0>:menelaus_web_buckets:do_bucket_create:877]Created bucket "test" of type: couchbase
      

      However, before we can begin creating an initial vbucket map for the bucket, we have to wait for the lease that was granted against the previous name of the node to expire. That happens at 04:00, 15s after it was granted at 03:45:

      [ns_server:debug,2024-07-14T14:04:00.355Z,ns_1@127.0.0.1:leader_lease_agent<0.1020.0>:leader_lease_agent:handle_lease_expired:277]Lease held by {lease_holder,<<"f0b54c85ee7b40431dbbda7f91959fd1">>,
                                  'ns_1@cb.local'} expired. Starting expirer.
      ...
      [ns_server:debug,2024-07-14T14:04:00.357Z,ns_1@127.0.0.1:leader_lease_agent<0.1020.0>:leader_lease_agent:do_handle_acquire_lease:140]Granting lease to {lease_holder,<<"80104698095b6bedefe263ea663d4df4">>,
                                      'ns_1@127.0.0.1'} for 15000ms
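
      The lease arithmetic can be checked directly from the three leader_lease_agent timestamps quoted above. The following is a minimal sketch, not part of ns_server; the variable names are illustrative and the only inputs are the log timestamps and the 15000ms duration from the grant message.

```python
from datetime import datetime, timedelta

# Timestamps copied from the leader_lease_agent log lines (UTC, "Z" dropped).
granted = datetime.fromisoformat("2024-07-14T14:03:45.351")    # lease to ns_1@cb.local
expired = datetime.fromisoformat("2024-07-14T14:04:00.355")    # "expired. Starting expirer."
regranted = datetime.fromisoformat("2024-07-14T14:04:00.357")  # lease to ns_1@127.0.0.1

LEASE = timedelta(milliseconds=15000)  # "for 15000ms" in the grant message
ONE_MS = timedelta(milliseconds=1)

held_ms = (expired - granted) // ONE_MS                 # how long the old lease blocked
overshoot_ms = (expired - (granted + LEASE)) // ONE_MS  # expiry vs. nominal deadline
regrant_delay_ms = (regranted - expired) // ONE_MS      # expiry to new grant

print(f"old lease held for {held_ms} ms (overshoot {overshoot_ms} ms)")
print(f"new lease granted {regrant_delay_ms} ms after expiry")
```

      In other words, the old lease ran for essentially its full duration, and the new lease followed within a few milliseconds of the expiry.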
      

      Now at this point the janitor can create the initial vbucket map:

      [ns_server:info,2024-07-14T14:04:00.383Z,ns_1@127.0.0.1:<0.1596.0>:ns_janitor:cleanup_with_membase_bucket_check_map:198]janitor decided to generate initial vbucket map
      

      And traffic is enabled to the new bucket at 04:00:

      [ns_server:info,2024-07-14T14:04:00.687Z,ns_1@127.0.0.1:ns_memcached-test<0.1634.0>:ns_memcached:handle_call:370]Enabling traffic to bucket "test"
      [ns_server:info,2024-07-14T14:04:00.688Z,ns_1@127.0.0.1:ns_memcached-test<0.1634.0>:ns_memcached:handle_call:374]Bucket "test" marked as warmed in 0 seconds
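
      To see where the time goes, the gaps between consecutive log events can be computed. This is a rough sketch that uses only the timestamps quoted in this ticket; the event labels are shorthand chosen here, and the clusterInit access-log line is left out because it only has one-second resolution.

```python
from datetime import datetime, timedelta

# (label, timestamp) pairs copied from the log excerpts in this ticket.
events = [
    ("babysitter logging up", "2024-07-14T14:03:39.333"),
    ("ns_server logging configured", "2024-07-14T14:03:41.910"),
    ("web server up on 8091", "2024-07-14T14:03:45.259"),
    ("lease granted to ns_1@cb.local", "2024-07-14T14:03:45.351"),
    ("net_kernel bring-up as ns_1@127.0.0.1", "2024-07-14T14:03:45.854"),
    ("bucket created", "2024-07-14T14:03:47.083"),
    ("old lease expired", "2024-07-14T14:04:00.355"),
    ("initial vbucket map decision", "2024-07-14T14:04:00.383"),
    ("traffic enabled", "2024-07-14T14:04:00.687"),
]

ONE_MS = timedelta(milliseconds=1)


def gaps_ms(evts):
    """Return (from_label, to_label, gap_in_ms) for each consecutive pair."""
    parsed = [(label, datetime.fromisoformat(ts)) for label, ts in evts]
    return [
        (a_label, b_label, (b_time - a_time) // ONE_MS)
        for (a_label, a_time), (b_label, b_time) in zip(parsed, parsed[1:])
    ]


gaps = gaps_ms(events)
total_ms = sum(gap for _, _, gap in gaps)
largest = max(gaps, key=lambda g: g[2])

for a, b, gap in gaps:
    print(f"{gap:>6} ms  {a} -> {b}")
print(f"total {total_ms} ms; largest gap: {largest[0]} -> {largest[1]} ({largest[2]} ms)")
```

      On these timestamps the single largest gap is the wait between bucket creation and the expiry of the old lease.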
      

      Right now it looks to me like the lease management is inserting roughly 13s of waiting (between the bucket creation at 03:47 and the lease expiry at 04:00) that we could maybe optimize out.

      Abhijeeth Nuthan: could you have someone look at this and see if this is possibly optimizable? E.g. can we somehow abolish the lease after the rename?
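
      The effect of abolishing the lease on rename can be illustrated with a toy model. This is a hypothetical sketch only: ToyLeaseAgent, acquire, abolish and the 500 ms rename offset are made up for illustration and do not reflect the actual leader_lease_agent code or the Gerrit change. The holder is also simplified to just a node name, whereas the real lease_holder carries a UUID as well.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_MS = 15000  # lease duration seen in the logs


@dataclass
class Lease:
    holder: str      # node name the lease was granted against (simplified)
    expires_at: int  # milliseconds on a simulated clock


class ToyLeaseAgent:
    """Hypothetical model of a lease agent; NOT the ns_server implementation."""

    def __init__(self) -> None:
        self.lease: Optional[Lease] = None

    def acquire(self, holder: str, now: int) -> int:
        """Grant a lease to `holder`; return the simulated time of the grant."""
        start = now
        blocked = (
            self.lease is not None
            and self.lease.holder != holder
            and self.lease.expires_at > now
        )
        if blocked:
            # A different holder must wait out the existing lease.
            start = self.lease.expires_at
        self.lease = Lease(holder, start + LEASE_MS)
        return start

    def abolish(self) -> None:
        """Drop the current lease immediately (the idea proposed above)."""
        self.lease = None


def time_of_new_lease(abolish_on_rename: bool, rename_at: int = 500) -> int:
    """Simulated time at which the renamed node obtains its lease."""
    agent = ToyLeaseAgent()
    agent.acquire("ns_1@cb.local", now=0)
    if abolish_on_rename:
        agent.abolish()
    return agent.acquire("ns_1@127.0.0.1", now=rename_at)


print("without abolish:", time_of_new_lease(False), "ms")
print("with abolish:   ", time_of_new_lease(True), "ms")
```

      In this model the renamed node waits out the full lease unless the stale lease is dropped at rename time. Whether dropping it is safe in the real system is something the proper change would need to establish.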

      ===================

      I made the following change (https://review.couchbase.org/c/ns_server/+/213340) to verify Dave's proposal, and it does decrease the ns_server time on my machine from 23 to 10 seconds. This ticket tracks turning that hack into a proper change.

          People

            Abhijeeth.Nuthan Abhijeeth Nuthan
            steve.watanabe Steve Watanabe