Couchbase Server / MB-53081

DCP Consumers can successfully StreamRequest from KV node in warmup which is about-to-be failed over


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: 6.6.2
    • Fix Version: Morpheus
    • Component: couchbase-bucket
    • Labels: None
    • Triage: Untriaged
    • 1
    • Is this a Regression?: Unknown

    Description

      This MB is spun out from MB-53076, where the following sequence showed that Indexing managed to reconnect to a KV node and perform DCP StreamRequests before ns_server had enabled traffic on that node (trafficEnabled) - and, crucially, only a few seconds before the KV node was failed over and the vBucket was promoted elsewhere:

      a. At 2022-07-21T17:39:12.120-07:00, memcached exited (exit status 137, i.e. the process was killed with SIGKILL):

      2022-07-21T17:39:12.120-07:00, ns_log:0:info:message(ns_1@172.23.100.15) - Service 'memcached' exited with status 137. Restarting. Messages:

      b. Due to the memcached exit, the indexer received a StreamEnd. Considering vb:141 for this analysis, the last received seqno is 4168:

      2022-07-21T17:39:12.111-07:00 [Info] TK StreamEnd MAINT_STREAM test3 141 77409272469274 4168

      c. The indexer triggered stream repair due to the StreamEnd. For vb:141, it restarted the stream with the last received seqno 4168 and a snapshot of {4163, 4168}:

      2022-07-21T17:39:16.284-07:00 [Info] Indexer::handleKVStreamRepair Initiate Stream Repair MAINT_STREAM Bucket test3 StreamId 1
      2022-07-21T17:39:16.284-07:00 [Info] Indexer::startBucketStream Stream: MAINT_STREAM Bucket: test3 SessionId 1 RestartTS bucket: test3, vbuckets: 1024 Crc64: 0 snapType NO_SNAP -
          {vbno, vbuuid, seqno, snapshot-start, snapshot-end}
          {  141     46674030171a       4168       4163       4168}
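
      Each field of that RestartTS entry maps directly onto a DCP stream-request. As a point of reference only (the struct below is an illustrative sketch, not projector or KV-engine source; the field names follow the DCP protocol), the repair request for vb:141 carries:

      // Illustrative sketch of the parameters a DCP stream-request carries,
      // populated from the indexer's RestartTS entry for vb:141. Note the
      // vbuuid is logged in hex by the indexer (46674030171a) and in decimal
      // by memcached (77409272469274) - they are the same value.
      #include <cstdint>

      struct DcpStreamRequest {
          uint16_t vbucket;           // vbno
          uint64_t start_seqno;       // resume from the last seqno the consumer received
          uint64_t end_seqno;         // "no end" == UINT64_MAX (18446744073709551615)
          uint64_t vbucket_uuid;      // identifies the branch of vBucket history
          uint64_t snap_start_seqno;  // start of the last snapshot the consumer saw
          uint64_t snap_end_seqno;    // end of that snapshot
      };

      const DcpStreamRequest repairRequest{
          141,                // vbno
          4168,               // seqno
          UINT64_MAX,         // stream "to infinity"
          0x46674030171aULL,  // vbuuid (== 77409272469274)
          4163,               // snapshot-start
          4168,               // snapshot-end
      };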

      d. For this stream request, memcached (on node 100.15) asked the projector to roll back to seqno 4109:

      2022-07-21T17:39:17.706652-07:00 WARNING 382: (test3) DCP (Producer) eq_dcpq:secidx:proj-test3-MAINT_STREAM_TOPIC_0765f2d280c663c382dea1a1b49ab1cd-2556451229124417097/1 - (vb:141) Stream request requires rollback to seqno:4109 because consumer ahead of producer - producer upper at 4109. Client requested seqnos:{4168,18446744073709551615} snapshot:{4163,4168} uuid:77409272469274
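
      The "consumer ahead of producer" condition is simple to state: after restarting from disk, the producer's high seqno for vb:141 was only 4109, below the 4168 the consumer asked to resume from, so the producer tells the consumer to roll back to its own high seqno. A minimal sketch of just that check (illustrative; the real ep-engine rollback logic also compares the vbucket_uuid against the failover table and validates the snapshot range):

      #include <cstdint>
      #include <optional>

      // Returns the seqno the consumer must roll back to, or nullopt if the
      // stream-request can proceed. Only models the "consumer ahead of
      // producer" case seen in the log above.
      std::optional<uint64_t> checkRollback(uint64_t requestedStartSeqno,
                                            uint64_t producerHighSeqno) {
          if (requestedStartSeqno > producerHighSeqno) {
              // e.g. requested 4168, but the producer's upper seqno is only
              // 4109 after warmup -> roll back to 4109.
              return producerHighSeqno;
          }
          return std::nullopt;
      }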

      e. After this, the KV node 100.15 was failed over:

      2022-07-21T17:39:18.234-07:00, failover:0:info:message(ns_1@172.23.100.15) - Starting failing over ['ns_1@172.23.100.15'] 
      ...
      2022-07-21T17:39:19.308-07:00, ns_orchestrator:0:info:message(ns_1@172.23.100.15) - Failover completed successfully.
      

      ... the indexer goes on to connect to the new node holding vb:141, issues a StreamRequest, and is told to roll back again ...

      The problem here is that Indexing has had to perform an extra rollback / repair based on the VBucket UUID & sequence numbers of the "just crashed" active node, before ns_server had decided the node was healthy and enabled traffic on it.

      I believe this is a long-standing behaviour in KV - from the comments in the code:

       * KV-engine has the following behaviour as warmup runs.
       *
       * Whilst the following phases are incomplete:
       *
       *    Initialise
       *    CreateVBuckets
       *    LoadingCollectionCounts
       *    EstimateDatabaseItemCount
       *    LoadPreparedSyncWrites
       *    PopulateVBucketMap
       *
       *  1) setVBucket requests are queued (using the EWOULDBLOCK mechanism)
       *  2) The VBucket map is empty. No operation can find a VBucket object to
       *     operate on. For CRUD operations externally clients will see
       *     'temporary_failure' instead of 'not_my_vbucket'.
       *  3) DCP Consumers cannot be created.
       *  4) DCP Producers can be created, but stream-request will fail with
       *     'not_my_vbucket'.
       *
       * On completion of PopulateVBucketMap:
       *
       *  1) All queued setVBucket requests are notified and all new setVBucket
       *     requests can be processed.
       *  2) The VBucket map is populated and operations can now find VBucket objects.
       *     * DCP producer stream-requests can now be processed.
       <cut>
      

      Once the PopulateVBucketMap step has completed, DCP StreamRequests can be (successfully) processed.

      While this is desirable in terms of bringing replicas / other DCP consumers back online and consistent as soon as possible, it has the downside seen in the above scenario - a DCP consumer can end up having to discard data / perform costly updates of its own state based on information from a node which is not "ready" from ns_server's point of view and is soon to be failed over.
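
      For illustration only, one possible alternative (delaying DCP stream-requests until warmup has fully completed, as discussed in the comments below) might look something like this minimal sketch. The names and the simple flag-based state are hypothetical, not the actual KV-engine warmup / DCP code:

      enum class Status { Success, TemporaryFailure, NotMyVbucket };

      // Hypothetical warmup state; the real implementation tracks many more phases.
      struct WarmupState {
          bool vbucketMapPopulated = false;  // PopulateVBucketMap has finished
          bool warmupComplete = false;       // all warmup phases have finished
          bool trafficEnabled = false;       // ns_server has enabled traffic
      };

      Status handleStreamRequest(const WarmupState& warmup) {
          // Current behaviour (per the code comment above): stream-requests are
          // accepted as soon as the VBucket map has been populated.
          if (!warmup.vbucketMapPopulated) {
              return Status::NotMyVbucket;
          }
          // Possible alternative: return a temporary failure until warmup has
          // fully completed (or traffic has been enabled), so consumers such as
          // the indexer don't rebuild state from a node that may be about to be
          // failed over.
          if (!warmup.warmupComplete /* or: !warmup.trafficEnabled */) {
              return Status::TemporaryFailure;
          }
          return Status::Success;  // proceed with normal rollback / backfill checks
      }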

      Supportal
      http://supportal.couchbase.com/snapshot/551490e6c919e74d343537e9e2e38887::0

      https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.15.zip
      https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.16.zip
      https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.17.zip
      https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.19.zip
      https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.100.22.zip
      https://cb-engineering.s3.amazonaws.com/MB-53075_functional/collectinfo-2022-07-22T004733-ns_1%40172.23.121.215.zip


        Activity

          drigby Dave Rigby added a comment -

          CC Dave Finlay / Deepkaran Salooja / Varun Velamuri.

          Should KV delay accepting StreamRequests until some later point?

          One slightly confusing thing here is exactly why ns_server triggered an auto-failover - it explicitly told KV to create a bunch of buckets (post-crash), so why does it go on to fail over this node?

          dfinlay Dave Finlay added a comment -

          Hey Dave

          I haven't looked at the logs in detail - but this is how it works. ns_server doesn't know that the node will later be failed over.

          It's possible that:

          • the buckets will be ready before the auto failover timeout elapses
          • another node will become unreachable at which point auto-failover of any one node is not possible

          Etc. So the bucket is created immediately in memcached post-crash – and then failover will happen if all the conditions are met and the autofailover timeout has passed. (And there are actually a lot of conditions to check.)

          drigby Dave Rigby added a comment -

          Agreed - obviously one cannot predict the future, i.e. whether other nodes will have problems or whether the buckets will be ready in time.

          I guess the question here is - do we want to allow (non-replica?) DCP consumers to connect to KV this early in warmup?

          I'm not sure what the answer is - or even exactly what the alternative might be - it just "feels" like KV is not really helping DCP consumers here by telling them to roll back after it has just crashed / restarted, while it is essentially on a "ticking clock" to complete warmup within some small window before it gets failed over.

          Perhaps this should be marked as an Improvement - I don't think there's strictly a "Bug" here, but we could potentially increase the chances that Indexing et al don't need to roll back as much if this situation didn't occur...


          deepkaran.salooja Deepkaran Salooja added a comment -

          Dave Rigby, I think I would agree that it is probably better to let warmup finish before allowing non-replica DCP consumers to connect. A lot of the time, the vbuuids, snapshot markers etc. received in this state (especially if the node eventually fails over) have a good chance of getting the DCP consumer into a state where it needs some sort of rollback.

          dfinlay Dave Finlay added a comment - - edited

          Well, this is a pretty interesting question.

          If the vbucket is about to be failed over, waiting seems smart because the replica may be in better shape. If the vbucket is not going to be failed over, then waiting just adds extra time during which the vbucket is unavailable. I'm hesitant to make a call on what exactly is best here - still thinking - but I can see merit in changing the behavior to disallow DCP streams from connecting until the bucket is ready.


          People

            Assignee: drigby Dave Rigby
            Reporter: drigby Dave Rigby
            Votes: 0
            Watchers: 6
