gocbcore v10.2.3+ cannot perform CCCP polling
Charles Dixon, April 12, 2023 at 8:20 AM (edited)
The issue here is that once a bucket is opened, the server responds with an invalid bucket config: the config is missing the vbucket map.
2023-04-11T02:32:16.031Z [TRC] gocb+: Routing data is not valid, skipping update:
Revision ID: 96
Revision Epoch: 1
Bucket: sg_int_0_1681180335489927701
Capi Eps:
TLS:
- http://172.18.0.2:8092 seed: true
- http://172.18.0.3:8092 seed: false
- http://172.18.0.4:8092 seed: false
...
VBMap:
&{entries:[] numReplicas:1}
KetamaMap: not-used
This happens against both versions of gocbcore, and it leads the SDK to not apply the config. I guess this could be a server setup timing issue? Maybe someone from KV could comment on why that might be happening, @Daniel Owen?
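To illustrate why a config like the one logged above is skipped, here is a minimal sketch of a validity check. The types and the `usable` method are hypothetical, simplified stand-ins and not gocbcore's actual internals; the only assumption is that a bucket config whose vbucket map has no entries cannot be used for routing.

```go
package main

import "fmt"

// vbucketMap is a hypothetical, simplified stand-in for the SDK's vbucket map.
type vbucketMap struct {
	entries     [][]int // entries[vbID] lists node indexes: active first, then replicas
	numReplicas int
}

// bucketConfig is a hypothetical, simplified stand-in for a parsed bucket config.
type bucketConfig struct {
	revID    int64
	revEpoch int64
	bucket   string
	vbMap    *vbucketMap
}

// usable reports whether the config carries a non-empty vbucket map.
// Assumption: a config without vbucket entries cannot route key-value operations.
func (c bucketConfig) usable() bool {
	return c.vbMap != nil && len(c.vbMap.entries) > 0 && len(c.vbMap.entries[0]) > 0
}

func main() {
	// Mirrors the logged config: revision 96, epoch 1, empty entries, one replica.
	rejected := bucketConfig{
		revID:    96,
		revEpoch: 1,
		bucket:   "sg_int_0_1681180335489927701",
		vbMap:    &vbucketMap{numReplicas: 1},
	}
	// A config with at least one populated vbucket entry, for comparison.
	accepted := bucketConfig{
		revID:    97,
		revEpoch: 1,
		bucket:   "sg_int_0_1681180335489927701",
		vbMap:    &vbucketMap{entries: [][]int{{0, 1}}, numReplicas: 1},
	}
	fmt.Println("rev 96 usable:", rejected.usable())
	fmt.Println("rev 97 usable:", accepted.usable())
}
```

In this sketch the revision 96 config is reported as unusable because `entries` is empty, which matches the "Routing data is not valid, skipping update" trace.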
Regardless of that, the reason the older version of gocbcore worked and the newer one doesn't is a change in when the poller is started. In the newer version we have pipelined a config fetch into connection bootstrap, and the CCCP poller waits until a connection has fetched a config as part of bootstrap and the SDK has applied that config. Here the config is rejected, but the connection is already established and bootstrapped. This means that the SDK only knows about the single endpoint and isn't retrying the config fetch because a) the connection to that endpoint is already bootstrapped and b) CCCP is still waiting for a connection to fetch a config. In the older version of gocbcore the CCCP poller would just fetch another config after the poll interval (2.5 seconds by default), at which point the returned config seems to be OK (which is why I think this is probably a server/bucket setup timing issue).
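The difference between the two behaviours can be sketched with a toy model. This is not gocbcore code: `node`, `pollUngated`, and `pollGated` are hypothetical names, polling is modelled as a loop rather than a timer, and the server is assumed to return a config without vbuckets for its first `readyAfter` fetches.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoConfig = errors.New("no valid config applied")

// config is a hypothetical, stripped-down cluster config.
type config struct {
	rev      int
	vbuckets int
}

func (c config) valid() bool { return c.vbuckets > 0 }

// node simulates a server whose bucket is not ready for the first readyAfter fetches.
type node struct {
	fetches    int
	readyAfter int
}

func (n *node) fetch() config {
	n.fetches++
	if n.fetches <= n.readyAfter {
		return config{rev: n.fetches} // vbucket map missing
	}
	return config{rev: n.fetches, vbuckets: 1024}
}

// pollUngated models the older behaviour: keep polling regardless of
// whether any config has been applied yet.
func pollUngated(n *node, maxPolls int) (config, error) {
	for i := 0; i < maxPolls; i++ {
		if cfg := n.fetch(); cfg.valid() {
			return cfg, nil
		}
	}
	return config{}, errNoConfig
}

// pollGated models the newer behaviour: the poller only starts once the
// config fetched during bootstrap has been applied.
func pollGated(n *node, maxPolls int) (config, error) {
	current := n.fetch() // config fetch pipelined into bootstrap
	if !current.valid() {
		// The connection is bootstrapped, so nothing refetches, and the
		// poller is still waiting for an applied config.
		return config{}, errNoConfig
	}
	for i := 0; i < maxPolls; i++ {
		if cfg := n.fetch(); cfg.valid() {
			current = cfg
		}
	}
	return current, nil
}

func main() {
	_, errGated := pollGated(&node{readyAfter: 1}, 3)
	cfg, errUngated := pollUngated(&node{readyAfter: 1}, 3)
	fmt.Println("gated:", errGated)
	fmt.Println("ungated:", cfg.rev, errUngated)
}
```

With a server that is not ready on the first fetch, the gated model never obtains a config, while the ungated model picks up the valid config on its second poll.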
To fix this we probably just need to stop CCCP from waiting until a config has been applied (I think that @Tor Colvin already confirmed that this does fix the issue). There is a reason why I added that logic, though, so I need to investigate it, which may lead to more in-depth changes. For informational purposes, this change was introduced in v10.2.0: https://github.com/couchbase/gocbcore/commit/7a53c9ff53da680dba5ca1cf954dfc23b8942e6a

Sync Gateway recently upgraded from
github.com/couchbase/gocbcore/v10 v10.1.6
to
github.com/couchbase/gocbcore/v10 v10.2.3-0.20230404070112-cab6da1895ae
to fix https://couchbasecloud.atlassian.net/browse/GOCBC-1401.
In the basic case in our test harness, we are no longer able to make a CCCP connection.
Our test case is:
1. Start up CBS in Docker.
2. Run go test in Sync Gateway.
3. go test creates a bucket; this is where it fails with CCCP polling.
4. If successful, go test runs a test (in this case, a simple DCP test).
The interesting logs are from verbose_int.out.raw.
Here's a failing example from enterprise-7.0.5: https://jenkins.sgwdev.com/job/SyncGateway-Integration/1681/artifact/verbose_int.out.raw/view/
Here's a passing example:
https://jenkins.sgwdev.com/job/SyncGateway-Integration/1683/
The difference between these two builds is https://github.com/couchbase/sync_gateway/commit/b4dab6117732ba793bb83b9eb1406b7e18e990b1. I've also fixed this so that the Sync Gateway go.mod uses gocb v2.6.2, which we probably should have done originally, but I get the same failure: https://jenkins.sgwdev.com/job/SyncGateway-Integration/1684/
The automation code I use for starting CBS is integration-test/start_server.sh in https://github.com/couchbase/sync_gateway/pull/6176/files. This code will probably only work on Linux right now, where Jenkins is running, but I expect to modify it to work on macOS soon.