  Couchbase Server
  MB-11484

Implement a generic mechanism to observe completion of any changes requested by REST API requests. Must work for all port 8091 APIs


Details

    • Type: Epic
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.4, 1.7.0, 1.8.0, 2.0, 2.2.0, 2.5.1, 3.0, 4.0.0, 4.1.0, 4.1.1, 4.5.0, 4.6.0, 4.6.1, 4.6.2, 5.0.0, 5.1.0, 5.5.0, Spock.Next
    • Fix Version/s: Morpheus
    • Component/s: None
    • Security Level: Public

    Description

      See the issue summary above.

      We have a number of places and a number of "done-ness" conditions that are not possible to observe with today's API.

      The reasons for that are varied. The biggest is that changes often have to be applied on all nodes of the cluster, and trying to wait for all the nodes to apply a change might cause unbounded delays.

      Most widely known problems are:

      • bucket deletion. We return success immediately, and the actual bucket is deleted asynchronously. It's impossible to create a new bucket with the same name until all reachable nodes (note the emphasis; that's a pointer to another problem) have completed deleting the old bucket instance.
      • bucket flush (via REST API, if that matters). It is actually implemented in a synchronous, cluster-wide fashion. But if any of the nodes is slow to complete failover, then clients may receive an "in progress" response without any convenient means to observe its completion (short of polling for some key and observing temp errors).
      • when a node is ejected from the cluster, it restarts its REST API service, causing temporary unavailability. I'm not sure it's worth fixing, but at least we need an official, documented answer for this (e.g. "poll it, Luke!").
      • there's a known issue with cluster join requests, which might time out from the client's perspective (especially if there are lots of concurrent addNode requests to the same node; we return 500, I think) but then "silently" complete (because internally the requests are queued to the ns_cluster service).
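      Until such a mechanism exists, clients typically work around problems like the bucket delete/recreate race with a client-side retry loop. A minimal Python sketch of that pattern follows; the helper name and the retryable-error test are hypothetical, not part of any Couchbase SDK or API:

```python
import time

def retry_until_success(op, is_retryable, timeout_s=60.0, interval_s=0.5):
    """Call op() until it succeeds or timeout_s elapses.

    op() raises an exception on failure; is_retryable(exc) decides
    whether the error is transient (e.g. "old bucket is still being
    deleted") and worth polling through, or should be re-raised now.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return op()
        except Exception as exc:
            if not is_retryable(exc) or time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)

# Hypothetical usage: a create-bucket call fails while the old bucket
# instance is still being torn down on some nodes, then succeeds.
# retry_until_success(lambda: create_bucket("default"),
#                     lambda e: "still exists" in str(e))
```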

      This is considered done when there's a request flag that enables the following behavior:

      *) if a REST API request completes with 200, then its effect is "done" on all cluster nodes. "Done" for bucket ops (create/flush) should mean "available for ops". We'll decide separately whether it will include readiness of moxi.

      Various deletions should ideally also be monitored for completion, so that folks/scripts can be sure that whatever resources the deleted thing consumed are now freed. But let's leave that out of scope for now. We can add a separate request flag for it later. Note, however, that for testing we'll likely have some way of monitoring deletions anyway.

      Note that any node being unavailable will prevent a 200, unless the request only applies to specific node(s) that are all available. Yes, this will affect even trivial requests like a change of compaction settings.

      But also note that this doesn't mean we'll enforce strict (aka linear) consistency of all REST API requests (we might, but probably won't). For example, trying to change compaction settings to different values on two different nodes at the same time will result in all cluster nodes eventually converging to the same settings. But in that case the API requests are not required to wait for full convergence.

      We might, as a bonus, provide something like: "the request was applied to all nodes, but some of the nodes actually decided to accept a different version". Another possibility is to have another request option for full linear consistency.

      Durability of config settings is another area without a strict promise.

      So in effect, 200 means "we've applied it on all the nodes, but what happens after that is unknown".

      *) if a REST API request returns a non-200 response, then it's not done and will not be done.

      • if a REST API request returns 202, then there's a standard way to monitor completion or failure of the request. It will likely take the form of a URL path to use when polling for completion. NOTE: I need to think a bit more about requests that might get lost due to node unavailability and then suddenly be "found". The limits of what we're going to support here are still to be finalized.
      • exceptions to this list (like the above-mentioned node ejection) should be few, each with a documented and good reason, and each with a documented way to observe completion of the change.
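      On the client side, the 202-plus-polling-URL flow would reduce to a simple poll loop. A Python sketch, with the caveat that the states "running", "completed" and "failed" are placeholders — the ticket does not yet specify the shape of the status document returned by the polling URL:

```python
import time

def poll_task(get_status, timeout_s=30.0, interval_s=1.0):
    """Poll get_status() until it reports a terminal state or we time out.

    get_status() stands in for fetching the (hypothetical) status URL
    returned alongside a 202 response and extracting its state field.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_status()
        if state in ("completed", "failed"):
            return state
        if time.monotonic() >= deadline:
            return "timed-out"
        time.sleep(interval_s)
```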

      I will also likely support a client-specified timeout or deadline specifying exactly when waiting for a 200 should stop and return a 202. I'll also likely support a form of long polling for completion on the URLs returned from a 202. Something like: "wait for completion of this task, but don't wait longer than 10 seconds; if it succeeds before the deadline, give me 200, and if it's still incomplete when 10 seconds have passed, give me 202".
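      The proposed deadline semantics can be sketched as a tiny server-side decision function — block until the change is known to be complete on all nodes, or until the client-supplied deadline expires, then map the outcome to a status code. This is an illustration of the semantics only, not an implementation:

```python
import threading

def respond_with_deadline(done_event, max_wait_s):
    """Block until done_event is set (the change completed on all nodes,
    in this sketch) or max_wait_s elapses. Return 200 when complete in
    time, otherwise 202 — which in the real proposal would also carry a
    URL to poll for completion.
    """
    completed = done_event.wait(timeout=max_wait_s)
    return 200 if completed else 202
```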

      Attachments

        Issue Links


          Activity

            ingenthr Matt Ingenthron added a comment - Lack of this interface continues to make it hard to write automation or tests against Couchbase.  It came up in the forums again today:  https://forums.couchbase.com/t/ibucket-fails-to-upsert-immediately-after-bucket-creation/13289

            btburnett3 Brant Burnett added a comment - We're also having difficulties at CenterEdge for development machines prepopulating new servers with test data.  Forum post:  https://forums.couchbase.com/t/testing-for-couchbase-server-startup-via-rest-api/13452

            ingenthr Matt Ingenthron added a comment - It looks like this is also implicated in this user's attempt to script a Couchbase setup: https://stackoverflow.com/questions/60241347/couchbase-community-edition-6-0-could-not-create-index

            meni.hillel Meni Hillel (Inactive) added a comment -

            Currently, users use various methods to create objects in Couchbase. Some go through service REST APIs (not limited to ns_server) and some through DDL. Some creations take time to propagate throughout the cluster and cannot be consumed immediately. However, the creation returns control to the user with perceived success and no way to track the readiness of the objects.

            There are design challenges in addressing this:

            1. The creation methods are not centralized. Would the expectation be to use the same method to also track readiness?
            2. If we are contemplating a centralized approach, should the implementation be in the SDK or in the server? It may be easier to answer this question given that there are use cases where users do not use SDKs and interact with the server directly.
            3. If we choose a centralized server-side approach, it may be natural to think ns_server is a good choice, but ns_server is not aware of some objects at all, not to mention their lifecycle. So this option will require a framework that allows querying, tracking, and notifications for object lifecycle.
            4. How would the API look? Do users prefer to provide a "readiness state condition" on creation, such that the API does not return until that condition is met? Or a separate API to track object readiness? Maybe we need both?

            In terms of priority, this mostly hits testing/dev/automation. But we know of cases where customers are also impacted and have most probably added some flavor of retry logic.

            meni.hillel Meni Hillel (Inactive) added a comment - This will probably be an epic, as all service components will be involved. There has been an earlier discussion on this topic, which Dave Finlay is driving.

            People

              dfinlay Dave Finlay
              alkondratenko Aleksey Kondratenko (Inactive)
              Votes: 7
              Watchers: 19

