Description
When the Couchbase bucket isn't available, CreateAgent should fail right away with either ErrBucketNotFound (or, for some reason, ErrAuthenticationFailure as before) instead of requiring clients to call WaitUntilReady(..) and receive an "unambiguous timeout" error.
Attachments
Issue Links
Activity
Hey Charles Dixon, here's why FTS needs this ..
On bucket delete, FTS needs to drop all indexes associated with it. When a bucket is actually dropped, all streams are first closed with a socket closure message. At that point, on the FTS end, we close the feed and try setting up an agent to see if the bucket is still available and, if so, whether its UUID is still the same as before (to cover the quick bucket-recreation case). If the bucket doesn't exist or the UUID doesn't match, we go ahead and drop the index.
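The drop/keep decision described above can be sketched as follows — a minimal sketch with hypothetical names (`bucketInfo`, `shouldDropIndex`), not actual FTS/cbgt code:

```go
package main

import "fmt"

// bucketInfo is a hypothetical summary of a re-check after a stream
// closure: whether the bucket still exists, and its current UUID.
type bucketInfo struct {
	exists bool
	uuid   string
}

// shouldDropIndex returns true when the bucket no longer exists or has
// been recreated with a different UUID (the quick delete+recreate case).
func shouldDropIndex(prevUUID string, current bucketInfo) bool {
	if !current.exists {
		return true
	}
	return current.uuid != prevUUID
}

func main() {
	fmt.Println(shouldDropIndex("uuid-1", bucketInfo{exists: false}))                // bucket deleted: drop
	fmt.Println(shouldDropIndex("uuid-1", bucketInfo{exists: true, uuid: "uuid-2"})) // recreated: drop
	fmt.Println(shouldDropIndex("uuid-1", bucketInfo{exists: true, uuid: "uuid-1"})) // same bucket: keep
}
```

The hard part, discussed in this ticket, is obtaining `bucketInfo` quickly and unambiguously in the first place.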
Now, if CreateAgent doesn't return immediately and we only get "unambiguous timeout" after WaitUntilReady() times out, we'd have to wait for a bit unnecessarily and also handle the "unambiguous timeout" error.
I'm open to any alternative suggestions on how to determine if a bucket has been deleted too.
Hi Abhi Dangeti, that makes sense. I think that immediately returning that error from CreateAgent won't be possible due to how gocbcore is now set up to work. We might be able to add additional context to the error returned by WaitUntilReady, but that doesn't solve the unnecessary-wait issue.
An alternative to creating a new agent (assuming that you still have a standard agent available for other operations) could be to use the REST API - https://github.com/couchbase/gocbcore/blob/master/agent_ops.go#L233. You could make a request similar to what we expose via gocb - https://github.com/couchbase/gocb/blob/master/cluster_bucketmgr.go#L200
Yea, using DoHTTPRequest with a pendingOp is something I've tried already, but I'm not convinced it's a good solution: it still relies on the timeout I set for it, and would return ErrTimeout rather than a proper ErrBucketNotFound error.
Hey Mihir Kamdar,
We are currently working on a solution to this problem. We should have some more information for you by the end of the week.
Cheers, Brett
Hi Brett Lawson, any updates on this? This is causing a lot of FTS tests to fail.
Hey Mihir Kamdar, I've pushed up a fix to FTS:
http://review.couchbase.org/c/cbgt/+/128021
This essentially lets FTS get the required data directly from ns_server endpoints to handle bucket and collections deletions.
Mihir Kamdar I've merged the above change. That should serve as an interim fix until SDK adds support for this.
Reducing the priority of this bug.
Build couchbase-server-7.0.0-2070 contains cbgt commit fcc6aaa with commit message:
GOCBC-868: [feed_dcp_gocbcore] Handling bucket deletions
Hey Brett Lawson, been a while since we checked in on this - d'you have any updates or an ETA for this yet?
We have some issues within FTS that would get resolved once we reach a resolution here.
I'm moving this to 2.1.4 but if the latest behaviour change we made (that I discussed with Abhi Dangeti) solves this then we can resolve and move it back to 2.1.3.
Build couchbase-server-7.0.0-2505 contains gocbcore commit 2d1ed35 with commit message:
GOCBC-868: Expose a way to fast fail WaitUntilReady
Build couchbase-server-6.6.0-7858 contains gocbcore commit 2d1ed35 with commit message:
GOCBC-868: Expose a way to fast fail WaitUntilReady
gocbcore.Agent's WaitUntilReady doesn't fail quickly when the bucket is not available and the client attempts to connect over HTTP ..
[09:43:44] AD: ~/Documents/go/src/github.com/abhinavdangeti/tmp $ go run simple_agent.go
--> FromConnStr, err: <nil>
2020/07/08 09:43:45 (GOCBCORE) SDK Version: gocbcore/v9.0.3
2020/07/08 09:43:45 (GOCBCORE) Creating new agent: &{MemdAddrs:[] HTTPAddrs:[127.0.0.1:9000] BucketName:default UserAgent: UseTLS:false NetworkType: Auth:0x18a7688 TLSRootCAProvider:<nil> UseMutationTokens:false UseCompression:false UseDurations:false DisableDecompression:false UseOutOfOrderResponses:false UseCollections:false CompressionMinSize:0 CompressionMinRatio:0 HTTPRedialPeriod:0s HTTPRetryDelay:0s CccpMaxWait:0s CccpPollPeriod:0s ConnectTimeout:6s KVConnectTimeout:0s KvPoolSize:0 MaxQueueSize:0 HTTPMaxIdleConns:0 HTTPMaxIdleConnsPerHost:0 HTTPIdleConnectionTimeout:0s Tracer:<nil> NoRootTraceSpans:false DefaultRetryStrategy:<nil> CircuitBreakerConfig:{Enabled:false VolumeThreshold:0 ErrorThresholdPercentage:0 SleepWindow:0s RollingWindow:0s CompletionCallback:<nil> CanaryTimeout:0s} UseZombieLogger:false ZombieLoggerInterval:0s ZombieLoggerSampleSize:0}
--> CreateAgent, err: <nil>
--> WaitUntilReady, err: <nil>
2020/07/08 09:43:45 (GOCBCORE) Will retry request. Backoff=1ms, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:45 (GOCBCORE) CCCP Looper starting.
2020/07/08 09:43:45 (GOCBCORE) CCCPPOLL: No nodes available to poll, return upstream
2020/07/08 09:43:45 (GOCBCORE) HTTP Looper starting.
2020/07/08 09:43:45 (GOCBCORE) Http Picked: http://127.0.0.1:9000.
2020/07/08 09:43:45 (GOCBCORE) HTTP Hostname: 127.0.0.1.
2020/07/08 09:43:45 (GOCBCORE) Requesting config from: http://127.0.0.1:9000//pools/default/bs/default.
2020/07/08 09:43:45 (GOCBCORE) Will retry request. Backoff=10ms, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:45 (GOCBCORE) Writing HTTP request to http://127.0.0.1:9000/pools/default/bs/default ID=
2020/07/08 09:43:45 (GOCBCORE) Requesting config from: http://127.0.0.1:9000//pools/default/bucketsStreaming/default.
2020/07/08 09:43:45 (GOCBCORE) Writing HTTP request to http://127.0.0.1:9000/pools/default/bucketsStreaming/default ID=
2020/07/08 09:43:45 (GOCBCORE) Failed to connect to host, bad bucket.
2020/07/08 09:43:45 (GOCBCORE) Pick Failed.
2020/07/08 09:43:45 (GOCBCORE) Looper waiting...
2020/07/08 09:43:45 (GOCBCORE) Will retry request. Backoff=50ms, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:45 (GOCBCORE) Will retry request. Backoff=100ms, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:45 (GOCBCORE) Will retry request. Backoff=500ms, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:46 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:47 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:48 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:49 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:50 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:51 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:52 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:53 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:54 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
2020/07/08 09:43:55 (GOCBCORE) Will retry request. Backoff=1s, OperationID=waituntilready. Reason=NOT_READY
--> WaitUntilReady Callback, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"WaitUntilReady","Opaque":"","TimeObserved":10001420431,"RetryReasons":["NOT_READY"],"RetryAttempts":15,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
Hey Charles Dixon, noticed a regression with the new change. I'm unable to set up an agent on restarting couchbase-server ..
Agent setup fails with this error message - "document not found".
This is the config I'm using to set up the agent ..
&gocbcore.AgentConfig{
    MemdAddrs:[]string(nil),
    HTTPAddrs:[]string{"127.0.0.1:9000"},
    BucketName:"beer-sample",
    UserAgent:"beer_1b2fb08b470d27e2_4c1c5584",
    UseTLS:false,
    NetworkType:"",
    Auth:(*cbgt.CBAuthenticator)(0x2887ae0),
    TLSRootCAProvider:(func() *x509.CertPool)(nil),
    UseMutationTokens:false,
    UseCompression:false,
    UseDurations:false,
    DisableDecompression:false,
    UseOutOfOrderResponses:false,
    UseCollections:true,
    CompressionMinSize:0,
    CompressionMinRatio:0,
    HTTPRedialPeriod:0,
    HTTPRetryDelay:0,
    CccpMaxWait:0,
    CccpPollPeriod:0,
    ConnectTimeout:60000000000,
    KVConnectTimeout:7000000000,
    KvPoolSize:0,
    MaxQueueSize:0,
    HTTPMaxIdleConns:0,
    HTTPMaxIdleConnsPerHost:0,
    HTTPIdleConnectionTimeout:0,
    Tracer:gocbcore.RequestTracer(nil),
    NoRootTraceSpans:false,
    DefaultRetryStrategy:gocbcore.RetryStrategy(nil),
    CircuitBreakerConfig:gocbcore.CircuitBreakerConfig{
        Enabled:false,
        VolumeThreshold:0,
        ErrorThresholdPercentage:0,
        SleepWindow:0,
        RollingWindow:0,
        CompletionCallback:(gocbcore.CircuitBreakerCallback)(nil),
        CanaryTimeout:0,
    },
    UseZombieLogger:false,
    ZombieLoggerInterval:0,
    ZombieLoggerSampleSize:0,
}
This error is new, so I'm reverting the go.mod update I've made to cbgt to point back to our last clean build:
The regression introduced here causes MB-40505.
Additional logging from within GOCBCORE ..
2020/07/16 12:07:35 (GOCBCORE) Failed to perform select bucket against server (document not found | {"status_code":1,"bucket":"beer-sample","error_name":"KEY_ENOENT","error_description":"Not Found","opaque":6,"last_dispatched_to":"127.0.0.1:12000","last_dispatched_from":"127.0.0.1:58729","last_connection_id":"919c13c337816aba/96712d3a0387dc1e"})
2020/07/16 12:07:35 (GOCBCORE) Pipeline Client `127.0.0.1:12000/0xc0001b0230` preparing for new client loop
2020/07/16 12:07:35 (GOCBCORE) Pipeline Client `127.0.0.1:12000/0xc0001b0230` retrieving new client connection for parent 0xc0001b0190
2020/07/16 12:07:35 (GOCBCORE) Won't retry request. OperationID=waituntilready. Reason=CONNECTION_ERROR
--> WaitUntilReady Callback, err: document not found | {"status_code":1,"bucket":"beer-sample","error_name":"KEY_ENOENT","error_description":"Not Found","opaque":6,"last_dispatched_to":"127.0.0.1:12000","last_dispatched_from":"127.0.0.1:58729","last_connection_id":"919c13c337816aba/96712d3a0387dc1e"}
2020/07/16 12:07:35 (GOCBCORE) Pipeline Client `127.0.0.1:12000/0xc0001b0230` received close request
2020/07/16 12:07:38 (GOCBCORE) CCCPPOLL: Failed to retrieve CCCP config. ambiguous timeout
2020/07/16 12:07:38 (GOCBCORE) CCCPPOLL: Failed to retrieve config from any node.
Build couchbase-server-7.0.0-2631 contains cbft commit 643c013 with commit message:
GOCBC-868: Falling back to older gocbcore v8.0.3
Build couchbase-server-7.0.0-2631 contains cbgt commit 05b2181 with commit message:
GOCBC-868: Revert "Bumping up gocbcore version to >v9.0.3"
Abhi Dangeti, what is the expected behaviour here? The SDK cannot connect to the bucket, so it reports that the bucket doesn't exist (which is what the server is telling us) via WaitUntilReady.
In the case I've highlighted above - the bucket was not actually deleted. It was warming up after a server restart.
If you are suggesting that with the new set of changes we can't really differentiate between a missing bucket and a bucket that is not ready, this could be a big problem for us. D'you have any recommendations on how clients can differentiate between the two scenarios here?
Abhi Dangeti, I think that has always been the case. However, I think (I'm not certain, because I can't repro connecting to a bucket in warmup) that a missing bucket will be an `ErrAuthenticationFailure` and a bucket in warmup an `ErrDocumentNotFound`. I'm not entirely sure of that, though.
I see. What about WaitUntilReady with the new changes - does it return immediately only in case of ErrAuthenticationFailure or for both the above errors? If ErrDocumentNotFound indicates that the bucket is in warmup - I'd expect WaitUntilReady to block until the bucket becomes ready. Does that make sense?
I see what you mean. Can you clarify which user fts auths as? That seems to change the error code returned by the server in this scenario.
As a side note, you can effectively achieve this on your side using a custom retry strategy (see https://github.com/couchbase/gocbcore/blob/master/retry.go and https://github.com/couchbase/gocbcore/blob/master/retry_test.go#L53 for an example), which you could pass to the WaitUntilReady operation. The retry strategy could return an action containing a 0 duration (do not retry) for all errors other than ErrDocumentNotFound, which could use a non-zero duration to trigger retrying (so WaitUntilReady would not return an error). Note that by default the SDK applies a global retry strategy of fast fail (no retries in all but a couple of cases), which can be overridden per operation.
FTS auths as "admin". Could you let me know how the behavior is different based on the role?
Let me give the custom retry strategy a shot - sounds reasonable enough to me.
Build couchbase-server-7.0.0-2677 contains gocbcore commit a192800 with commit message:
GOCBC-868: Add fast fail waituntilready for non-default http bootstrap
Build couchbase-server-6.6.0-7892 contains gocbcore commit a192800 with commit message:
GOCBC-868: Add fast fail waituntilready for non-default http bootstrap
Hey Charles Dixon, so I've been testing retry strategies, but with the information at hand I don't think it's going to work for us. Here's why -
- I cannot achieve fail-right-away with retry strategies, because RetryReason/RetryRequest do not carry enough context for me to identify the error that's causing the retry; the WaitUntilReady callback is invoked only once, after all the allowed retries have completed. The RetryReason itself just says CONNECTION_ERROR, and that's not enough for me to differentiate between ErrDocumentNotFound (bucket warming up) and ErrAuthenticationFailure (bucket not found).
- Alternatively, I placed a loop around my WaitUntilReady(..) (with the default retry strategy) for the case where the error returned was ErrDocumentNotFound. While this works, I hate the look of it: a loop around a function called WaitUntilReady(..).
For the first bullet above, do correct me if I'm wrong and let me know if there's a way I can derive the information I'll need to make a decision on the retry action. If not I'll need another way from you for FTS to achieve this properly.
No, you're right, there is no way to make the distinction between errors within the retry reason. We could treat ErrDocumentNotFound as a NOT_READY reason and have it trigger a retry (so you wouldn't need your retry logic anymore), but we need to understand all of the implications of doing that first (e.g. whether any other bootstrap requests can trigger it, or whether it will impact any other teams who rely on the current behaviour for some reason).
Ok, thanks for confirming that Charles. So I'll look forward to a change here from you then, once you've figured the right thing to do.
Build couchbase-server-7.0.0-2818 contains cbgt commit c5969c6 with commit message:
GOCBC-868: Introducing RetryStrategy for gocbcore.Agents
Build couchbase-server-7.0.0-2819 contains gocbcore commit 9cd9897 with commit message:
GOCBC-868: Add bucket not found retry reason
Build couchbase-server-6.6.1-9153 contains gocbcore commit 9cd9897 with commit message:
GOCBC-868: Add bucket not found retry reason
Build sync_gateway-3.0.0-52 contains gocbcore commit 9cd9897 with commit message:
GOCBC-868: Add bucket not found retry reason
Build sync_gateway-3.0.0-52 contains gocbcore commit a192800 with commit message:
GOCBC-868: Add fast fail waituntilready for non-default http bootstrap
Build sync_gateway-3.0.0-52 contains gocbcore commit 2d1ed35 with commit message:
GOCBC-868: Expose a way to fast fail WaitUntilReady
Hi Abhi Dangeti can you give me a bit of detail on why you need this specific error at connect time? This is likely going to go to wider SDK team discussion so more info would be useful to aid in that discussion.