Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Major
- Fix Version/s: None
- Affects Version/s: 4.0.0, 4.1.0, 4.1.1, 4.1.2, 4.5.0, 4.5.1, 4.6.0, 5.0.0, 5.1.0
- Component/s: None
Description
To understand cluster topology when a client library needs to execute something on a service that is decoupled from buckets, it would be useful to have a well-identified place to retrieve that configuration, usable by possibly thousands of clients across possibly hundreds of nodes.
Current public interfaces for retrieving cluster configuration are, to my knowledge:
- Carrier Publication (CCCP) operations and not-my-vbucket (NMV) replies over port 11210
- The ns_server buckets/<bucketname> URI and its terse equivalent at b/<bucketname> (see the sketch below)
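As a concrete illustration of the second interface, here is a minimal sketch of fetching the terse config over the REST port with HTTP basic auth. The host, bucket name, and credentials are placeholders, and error handling and connection reuse are omitted:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TerseConfigFetch {
    // Fetches the terse bucket config (the b/<bucketname> form) from ns_server.
    public static String fetch(String host, String bucket, String user, String pass)
            throws Exception {
        URL url = new URL("http://" + host + ":8091/pools/default/b/" + bucket);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8); // JSON config
        }
    }
}
```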
There is also pool-level streaming at other ns_server streaming URIs. It's unclear whether this is an intended public interface.
With changes in the compartmentalization of the system, we now need a way to be aware of topology changes independently of buckets. This interface…
- Must be capable of simultaneously handling 30,000 connections or more.
- Must be able to handle an arrival rate of 10,000 clients per second or more, servicing all of these configuration requests in under 100ms.
- Should be available over a service or services the client is going to use anyway. This is why cbmcd was selected back when Carrier Publication was designed.
- Must have a method of requesting the configuration on demand, request/response style (see the sketch after this list).
- Could have a method of subscribing to topology changes so they are pushed as they are received.
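For reference, below is a minimal sketch of how the request/response piece works for Carrier Publication clients today, via the GET_CLUSTER_CONFIG (0xb5) command on port 11210. It assumes the connection is already authenticated (or that auth isn't required); real clients would SASL-authenticate first and check the response status:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class CccpConfigFetch {
    private static final byte CMD_GET_CLUSTER_CONFIG = (byte) 0xb5;

    public static String fetchConfig(String host) throws Exception {
        try (Socket s = new Socket(host, 11210)) {
            DataOutputStream out = new DataOutputStream(s.getOutputStream());
            DataInputStream in = new DataInputStream(s.getInputStream());

            byte[] req = new byte[24];          // header-only request: no key/extras/body
            req[0] = (byte) 0x80;               // magic: request
            req[1] = CMD_GET_CLUSTER_CONFIG;    // opcode
            out.write(req);
            out.flush();

            byte[] hdr = new byte[24];
            in.readFully(hdr);                  // 24-byte response header
            int bodyLen = ((hdr[8] & 0xff) << 24) | ((hdr[9] & 0xff) << 16)
                    | ((hdr[10] & 0xff) << 8) | (hdr[11] & 0xff);
            byte[] body = new byte[bodyLen];
            in.readFully(body);                 // body is the JSON cluster config
            return new String(body, StandardCharsets.UTF_8);
        }
    }
}
```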
The "changes in compartmentalization" I speak of is that there are now situations where either FTS searches or N1QL statements to be executed can be in an application entirely independent of a bucket. In order to actually get that request to the right service on the cluster, however, we need a way to know where to locate the services.
As a use case example, a given user's architecture can include application servers dedicated solely to FTS searches, and they expect to be able to use an SDK for only those searches. The current workaround is to require the application to open a connection to a bucket involved in an FTS search.
This is unreasonably expensive in a large deployment, since the underlying SDK maintains persistent connections to all nodes. With a large number of nodes and a large number of clients, this would be costly and could push us up against memcached's limits on the number of connections.
It's also confusing from a user perspective: users just want to run a search or query and don't understand why they have to connect to a bucket.
This enhancement request also aligns with RBAC, which will separate the principal from the resource being accessed.
Possibly Relevant Background
Owing to the scale of clients we need to update and problems with the second public interface above (covered in MB-8211), contemporary clients use only the Carrier Publication interface. This has virtually eliminated the problems we had previously and has stood up well to tests of 5,000+ clients connecting and requesting configuration nearly simultaneously. Users have tested us on this (a core-switch failure test at a large deployment) and been happy with the result.
Existing Possibilities
Current Pools Streaming Interface
The pools-level streaming interface works functionally, but it ties up a connection that won't be used for anything else and has always suffered from the MB-8211 problems. It's also not clear whether it's a public interface for this purpose.
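For reference, a minimal sketch of consuming that streaming interface, assuming the /poolsStreaming/default endpoint on port 8091 (observed behavior; as noted, it's unclear whether this is supported as public). ns_server holds the connection open and writes a fresh JSON config, delimited by blank lines, on each topology change; authentication and reconnect handling are omitted:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PoolsStreamWatcher {
    public static void watch(String host) throws Exception {
        URL url = new URL("http://" + host + ":8091/poolsStreaming/default");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setReadTimeout(0); // block indefinitely; updates arrive on change
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder chunk = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {           // blank lines delimit config documents
                    if (chunk.length() > 0) {
                        System.out.println("new config: " + chunk);
                        chunk.setLength(0);
                    }
                } else {
                    chunk.append(line);
                }
            }
        }
    }
}
```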
Poll another URI
I checked how jdbc-cb does it, thinking the query service might expose something, and it appears (though I didn't read deeply) that it polls /admin/clusters/default/nodes once a second when it has no configuration:
https://github.com/jdbc-json/jdbc-cb/blob/master/src/main/java/com/couchbase/jdbc/core/ProtocolImpl.java#L970-L988
https://github.com/jdbc-json/jdbc-cb/blob/master/src/main/java/com/couchbase/jdbc/CBDriver.java#L276-L297
The query specification doesn't indicate that the service has a way of handing out this information. Perhaps polling at 1s resolution, as needed, has already been identified as the solution here?
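For concreteness, a sketch of that polling pattern, assuming the endpoint above is served by the query service on port 8093 and using a hypothetical query-host; authentication, error handling, and backoff are omitted:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NodePoller {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // query-host is a placeholder for an actual query service node
                URL url = new URL("http://query-host:8093/admin/clusters/default/nodes");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                try (InputStream in = conn.getInputStream()) {
                    String nodes = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    System.out.println("nodes: " + nodes); // refresh topology from this
                }
            } catch (Exception e) {
                // no config yet or node unreachable; keep polling
            }
        }, 0, 1, TimeUnit.SECONDS);
    }
}
```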