Couchbase C client library libcouchbase
CCBC-627

Poll regularly for config updates


Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.5.1, 2.7.5
    • Component/s: None
    • Security Level: Public
    • Environment: build 3508 running cbc-n1qlback

    Description

      start a cluster with 2 query nodes
      start cbc-n1qlback with some query (a minimal sketch of the equivalent client-side query loop follows the description)
      add a new query node to the cluster and rebalance
      observe the requests/sec per node

      Expected: topology changes should be automatically picked up by the clients; after the rebalance the new query node should become part of the round-robin requests sent to the cluster. However, the new node does not start taking traffic even after a long wait.

      If the load is stopped and restarted, the requests do go to the newly added node as well. However, this means topology changes would require a restart of the app servers, which causes admin overhead, possibly failed requests for the app, and downtime.
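      For reference, the behaviour under test boils down to a client-side query loop like the sketch below, written against the 2.x C API. The connection string, bucket, and statement are placeholders and error handling is minimal; cbc-n1qlback itself adds threads, query files, and statistics on top of this.

```c
#include <stdio.h>
#include <string.h>
#include <libcouchbase/couchbase.h>
#include <libcouchbase/n1ql.h>

/* Invoked once per result row, then a final time with LCB_RESP_F_FINAL set. */
static void query_callback(lcb_t instance, int cbtype, const lcb_RESPN1QL *resp)
{
    (void)cbtype;
    if (resp->rflags & LCB_RESP_F_FINAL) {
        if (resp->rc != LCB_SUCCESS) {
            fprintf(stderr, "query failed: %s\n", lcb_strerror(instance, resp->rc));
        }
        return;
    }
    /* resp->row / resp->nrow hold a single JSON row of the result set. */
}

int main(void)
{
    struct lcb_create_st crst;
    lcb_t instance;
    int i;

    memset(&crst, 0, sizeof crst);
    crst.version = 3;
    crst.v.v3.connstr = "couchbase://127.0.0.1/default"; /* placeholder host/bucket */

    if (lcb_create(&instance, &crst) != LCB_SUCCESS) {
        return 1;
    }
    lcb_connect(instance);
    lcb_wait(instance);
    if (lcb_get_bootstrap_status(instance) != LCB_SUCCESS) {
        return 1;
    }

    /* Issue the same statement repeatedly. The library distributes the
     * requests across the query nodes it knows about from its cached
     * cluster configuration, which is what this ticket is about. */
    for (i = 0; i < 1000; i++) {
        lcb_CMDN1QL cmd = { 0 };
        lcb_N1QLPARAMS *params = lcb_n1p_new();
        lcb_n1p_setstmtz(params, "SELECT 1"); /* placeholder query */
        lcb_n1p_mkcmd(params, &cmd);
        cmd.callback = query_callback;
        if (lcb_n1ql_query(instance, NULL, &cmd) == LCB_SUCCESS) {
            lcb_wait(instance);
        }
        lcb_n1p_free(params);
    }

    lcb_destroy(instance);
    return 0;
}
```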

      Attachments

        Issue Links


          Activity

            cihan Cihan Biyikoglu (Inactive) created issue -
            cihan Cihan Biyikoglu (Inactive) made changes -
            Description edited
            cihan Cihan Biyikoglu (Inactive) made changes -
            Description edited
            mnunberg Mark Nunberg (Inactive) added a comment - edited

            Preemptively 'auto-detecting' new nodes is not supported via the CCCP (memcached) bootstrap; it is only supported via the HTTP 'streaming' config, which is disabled by default in the library.

            You can specify `bootstrap_on=http` in the connection string to use the old, 'streaming-style' config updates, which notify the client whenever there is a configuration change. Otherwise the client only checks for a new configuration when an operation fails.

            Considering that most deployments of the library use a combination of KV and query operations, this is not considered an issue at the moment. Those running 'query-only' workloads can specify the extra option in the connection string.

            Note that the same problem appears in 'view-only' connections, but in practice that does not seem to be a common deployment configuration for the client.

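            A minimal sketch of the workaround described above, assuming a 2.x client; the host and bucket are placeholders, and the only relevant detail is the bootstrap_on=http option appended to the connection string:

```c
#include <string.h>
#include <libcouchbase/couchbase.h>

int main(void)
{
    struct lcb_create_st crst;
    lcb_t instance;

    memset(&crst, 0, sizeof crst);
    crst.version = 3;
    /* bootstrap_on=http forces the HTTP streaming configuration connection,
     * so topology changes are pushed to the client as they happen instead of
     * only being noticed after a failed operation. */
    crst.v.v3.connstr = "couchbase://127.0.0.1/default?bootstrap_on=http";

    if (lcb_create(&instance, &crst) != LCB_SUCCESS) {
        return 1;
    }
    lcb_connect(instance);
    lcb_wait(instance);

    /* ... schedule N1QL/view operations as usual ... */

    lcb_destroy(instance);
    return 0;
}
```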
            mnunberg Mark Nunberg (Inactive) made changes -
            Status: New [ 10003 ] → Resolved [ 5 ]

            cihan Cihan Biyikoglu (Inactive) added a comment -

            Thanks Mark. I assume this is an issue across all SDKs. Does the option exist in all SDKs, or should I open a separate bug for each SDK?

            Also, can't we do this topology refresh for the default connection mode? I think this is basic for online elasticity. We'd need the topology updates to be triggered without an error occurring first.

            ingenthr Matt Ingenthron added a comment -

            We update for both views and N1QL queries via a regular polling interval. I remember we covered this once before for PHP and solved it there with a similar backstop, I thought. If not, I believe we need to. This is one of those 'behavioral' things we need to document better and verify across SDKs.

            Java definitely has this 'backstop' and I believe .NET does as well. I'll ask Brett to comment here.

            mnunberg Mark Nunberg (Inactive) added a comment -

            The Java client can use the streaming config (though it's off by default); in addition it might do some intermittent polling for new configurations (this is an implementation detail; I'm not sure if it still does so). I'm unsure what .NET does, but it might use polling as well.

            Regarding 'polling': it's technically possible to do in the C library as well, but it has its own issues. Consider the impact of several hundred client instances (as would be the case with PHP) sending and receiving config information every few seconds.

            We've had serious scalability issues in the past with many open connections to port 8091, and push-based config updates are inherently unreliable (what if the node sending you the updates becomes unresponsive?). The issue is somewhat specific to common deployments of the C client (many smaller application instances, each with its own library instance; typically several hundred or more) compared to the Java or .NET clients (one large application instance, one large client instance, multiple threads).
            mnunberg Mark Nunberg (Inactive) added a comment - edited

            Polling shouldn't be difficult to add to the C library. Please file a bug if you think this is the correct solution (rather than just using the streaming config). It would add roughly 8 KB of traffic every 10 seconds or so per client instance.

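            For illustration, once a periodic poll exists it would presumably be tuned through the lcb_cntl_string() tunables. The setting name config_poll_interval and the enable_fast_config_poll() helper below are assumptions for this sketch, not confirmed API:

```c
#include <string.h>
#include <libcouchbase/couchbase.h>

/* Hypothetical helper: ask the library to refresh the cluster configuration
 * roughly every 2.5 seconds. The setting name "config_poll_interval" is an
 * assumption for this sketch; check the release notes of the client version
 * in use before relying on it. */
static lcb_error_t enable_fast_config_poll(lcb_t instance)
{
    return lcb_cntl_string(instance, "config_poll_interval", "2.5");
}

int main(void)
{
    struct lcb_create_st crst;
    lcb_t instance;

    memset(&crst, 0, sizeof crst);
    crst.version = 3;
    crst.v.v3.connstr = "couchbase://127.0.0.1/default"; /* placeholder */

    if (lcb_create(&instance, &crst) != LCB_SUCCESS) {
        return 1;
    }
    if (enable_fast_config_poll(instance) != LCB_SUCCESS) {
        /* Older library versions that lack the setting will report an error. */
    }
    lcb_connect(instance);
    lcb_wait(instance);

    lcb_destroy(instance);
    return 0;
}
```

            The same value could presumably also be passed in the connection string (e.g. ?config_poll_interval=2.5), again assuming that is how the final implementation exposes it.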
            mnunberg Mark Nunberg (Inactive) made changes -
            Status: Resolved [ 5 ] → Reopened [ 4 ]

            ingenthr Matt Ingenthron added a comment -

            The Java and .NET polling is via Carrier Publication, not HTTP, so it does not bother port 8091 at all. The backstop is 10s on Java, IIRC. I remember this first coming up on PHP where a user had a views-only workload back in the 2.0 days.
            mnunberg Mark Nunberg (Inactive) added a comment - edited

            IIRC the issue with PHP was that it wasn't detecting when a node was removed, and would return errors (non-200 HTTP status codes) when contacting that node. This was fixed by having LCB treat any view API request with a non-200 status code as a cue to refresh the config.

            The issue in this ticket, however, is that lcb is failing to take advantage of a new node added to the cluster: no errors are being returned, but existing instances aren't picking up the newly added node.

            ingenthr Matt Ingenthron added a comment -

            I seem to remember solving for both cases though, Mark. Let's see what Brett's comments are on expected behavior.
            mnunberg Mark Nunberg (Inactive) made changes -
            Fix Version/s 2.5.1 [ 12808 ]
            mnunberg Mark Nunberg (Inactive) made changes -
            Summary: clients don't utilize a newly added node to the cluster → Poll regularly for config updates
            mnunberg Mark Nunberg (Inactive) made changes -
            Fix Version/s .future [ 11337 ]
            mnunberg Mark Nunberg (Inactive) made changes -
            Fix Version/s: 2.7.5 [ 14404 ] (added)
            Fix Version/s: .future [ 11337 ] (removed)
            mnunberg Mark Nunberg (Inactive) made changes -
            Labels: n1ql → fast_failover n1ql

            mnunberg Mark Nunberg (Inactive) added a comment -

            Resurrecting this for fast-failover.
            mnunberg Mark Nunberg (Inactive) made changes -
            Status: Reopened [ 4 ] → In Progress [ 3 ]
            mnunberg Mark Nunberg (Inactive) made changes -
            Link This issue relates to CCBC-760 [ CCBC-760 ]
            mnunberg Mark Nunberg (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status: In Progress [ 3 ] → Resolved [ 5 ]

            People

              Assignee: mnunberg Mark Nunberg (Inactive)
              Reporter: cihan Cihan Biyikoglu (Inactive)
              Votes: 0
              Watchers: 4

