Details
-
Bug
-
Resolution: Unresolved
-
Major
-
6.6.2
-
None
-
Untriaged
-
1
-
Unknown
Description
This MB is spun out from MB-53076, when the following sequence showed that Indexing managed to reconnect to a KV node and perform DCP StreamRequests before the node had trafficEnabled by ns_server - and crucially a few seconds before the KV node was failed over and vBucket promoted elsewhere:
a. At 2022-07-21T17:39:12.120-07:00, memcached exited
2022-07-21T17:39:12.120-07:00, ns_log:0:info:message(ns_1@172.23.100.15) - Service 'memcached' exited with status 137. Restarting. Messages:
b. Due to memcached exit, indexer received a streamEnd. Considering vb:141 for our analysis, the last received seqno. is 4168
2022-07-21T17:39:12.111-07:00 [Info] TK StreamEnd MAINT_STREAM test3 141 77409272469274 4168
c. Indexer triggered stream repair due to stream end. For vb:141, indexer restarted the stream with last received seqno. 4168 and a snapshot of {4163, 4168}
2022-07-21T17:39:16.284-07:00 [Info] Indexer::handleKVStreamRepair Initiate Stream Repair MAINT_STREAM Bucket test3 StreamId 1
2022-07-21T17:39:16.284-07:00 [Info] Indexer::startBucketStream Stream: MAINT_STREAM Bucket: test3 SessionId 1 RestartTS bucket: test3, vbuckets: 1024 Crc64: 0 snapType NO_SNAP -
{vbno, vbuuid, seqno, snapshot-start, snapshot-end}
{ 141 46674030171a 4168 4163 4168}
d.For this stream request, memcached (on node 100.15) asked projector to rollback to 4109
2022-07-21T17:39:17.706652-07:00 WARNING 382: (test3) DCP (Producer) eq_dcpq:secidx:proj-test3-MAINT_STREAM_TOPIC_0765f2d280c663c382dea1a1b49ab1cd-2556451229124417097/1 - (vb:141) Stream request requires rollback to seqno:4109 because consumer ahead of producer - producer upper at 4109. Client requested seqnos:{4168,18446744073709551615} snapshot:{4163,4168} uuid:77409272469274
e. After this, the KV node 100.15 failed over
2022-07-21T17:39:18.234-07:00, failover:0:info:message(ns_1@172.23.100.15) - Starting failing over ['ns_1@172.23.100.15']
...
2022-07-21T17:39:19.308-07:00, ns_orchestrator:0:info:message(ns_1@172.23.100.15) - Failover completed successfully.
... Indexer goes on to connect to new node holding vb:141, issues StreamRequest, is told to rollback again ...
The problem here is that Indexing has had to perform an extra rollback / repair based on VBucket UUID & sequence numbers on the "just crashed" active node, before ns_server had decided this node is healthy and enabled traffic on it.
I believe this is a long-standing behaviour in KV - from the comments in the code:
* KV-engine has the following behaviour as warmup runs.
|
*
|
* Whilst the following phases are incomplete:
|
*
|
* Initialise
|
* CreateVBuckets
|
* LoadingCollectionCounts
|
* EstimateDatabaseItemCount
|
* LoadPreparedSyncWrites
|
* PopulateVBucketMap
|
*
|
* 1) setVBucket requests are queued (using the EWOULDBLOCK mechanism)
|
* 2) The VBucket map is empty. No operation can find a VBucket object to
|
* operate on. For CRUD operations externally clients will see
|
* 'temporary_failure' instead of 'not_my_vbucket'.
|
* 3) DCP Consumers cannot be created.
|
* 4) DCP Producers can be created, but stream-request will fail with
|
* 'not_my_vbucket'.
|
*
|
* On completion of PopulateVBucketMap:
|
*
|
* 1) All queued setVBucket requests are notified and all new setVBucket
|
* requests can be processed.
|
* 2) The VBucket map is populated and operations can now find VBucket objects.
|
* * DCP producer stream-requests can now be processed.
|
<cut>
|
Once PopulateVBucketMap step has completed, DCP StreamRequests can be (successfully) processed.
While this is desirable in terms of bringing replicas / other DCP consumers back online and consistent asap, it does have the downside seen the the above scenario - a DCP consumer could end up having to discard data / perform costly updates of it's own state based on information from a node which is not "ready" from ns_server's POV and is soon-to-be failed over.
Supportal
http://supportal.couchbase.com/snapshot/551490e6c919e74d343537e9e2e38887::0
https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.15.zip
https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.16.zip
https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.17.zip
https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.19.zip
https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.22.zip
https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.121.215.zip