Cluster disconnect is intermittently timing out
Description
Environment
Gerrit Reviews
Release Notes Description
Activity

David Nault December 21, 2022 at 5:03 AM
David Nault am I correct in believing that you are in the process of redoing the config processing already?
I'm not working on it at the moment, but yes, it's something I want to tackle eventually.

Graham Pople December 13, 2022 at 11:40 AMEdited
I've been adding a bunch of fixes, but there's several races here, with at least one remaining that I'm not sure how to resolve without a very large change to how configuration is done.
What happens in all the races it that shutdown is waiting for all nodes to be removed, and times out as they're not. It invariably comes down to a race between the 'final' empty config that is pushed out as part of shutdown (which should remove all nodes), and another config that still contains nodes. Because only one empty config is emitted, if the other config wins the race, we get stuck here and the nodes will never be removed.
Shutdown used to emit lots of configs (as closing a bucket would emit one) and I've removed all of those to just leave the empty config. However, there's still a challenging race where if a real config comes in just before shutdown, it ends up racing.
I think the main issue is that configuration is completely non-atomic. There is a check in places like proposeBucketConfig to see if we're in shutdown mode (e.g. the code is trying to not emit configs during shutdown), but it's not thread-safe. E.g. shutdown mode can start immediately after that check. There seem to be further thread safety issues. E.g. I've changed it to not emit the config in proposeBucketConfig and other places, when in shutdown mode (which won't eliminate the race, just reduce the window). However, DefaultConfigurationProvider's currentConfig still gets updated regardless - and when the 'empty config' is emitted, it's actually emits that currentConfig, because that's what pushConfig sends.
I think this code needs a significant rewrite to be thread-safe and race-free, but it's challenging due to the use of reactor. And the risk vs reward seems poor since I think these problems only cause real issues during shutdown. (Since during normal operation a) the regular config polling will overwrite any oddness from config races pretty quickly, b) configs will usually match anyway, c) there aren't many times when configs could race).
am I correct in believing that you are in the process of redoing the config processing already?
Update: a quick fix/workaround just occurred. We could repeatedly emit an empty config throughout shutdown, which is basically how ConfigRefresher is already working around these thread-safety issues. And actually maybe there's a simpler-still approach: shutdown the ConfigurationProvider, and then have Core reconfigure from an empty config. That should be after any races can occur.

Graham Pople November 17, 2022 at 4:52 PM
Assigning to the new SDK owner. Enjoy!
I've been adding a bunch of debug around this issue, and one cause is in the logs in this gist, from an intermittent failure on 'Cluster testing (Linux, cbdyncluster 6.6-stable, Oracle JDK 8) / com.couchbase.client.java.QueryIntegrationTest'. At the end of the 10 second timeout, disconnect is still waiting for 2 nodes to disappear. At least one of these nodes has an endpoint that's repeatedly trying to open a bucket it doesn't have access to:
So perhaps this is blocking the node from shutting down? Maybe there needs to be some check if we're in disconnecting state before retrying the select bucket? But I'm not confident enough with the SDK internals yet to know exactly where that should go or if there may be unintended knock-on consequences.
In most of these disconnect timeouts I don't see this SelectBucketFailedEvent failure (in fact the log usually shows nothing at all apart from my added debug), so this unfortunately is just one uncommon cause.
Details
Details
Assignee

Reporter

Story Points
Sprint
Fix versions
Priority
Instabug
PagerDuty
PagerDuty Incident
PagerDuty

Sentry
Linked Issues
Sentry
Zendesk Support
Linked Tickets
Zendesk Support

We intermittently see errors on CI:
stdout for this test was: