Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Duplicate
Version: Cheshire-Cat
Triage: Untriaged
Description
Build: 7.0.0 build 4619
Test: Eventing Volume
- Create a 9-node cluster (Eventing: 3, KV: 4, Index: 1, Query: 1)
- Create 15 handlers (3 each of bucket op, timer, N1QL, SBM, curl; a representative timer handler is sketched after these steps)
- Deploy the bucket op, timer and N1QL handlers
- Rebalance in 1 eventing node
- Start CRUD on bucket_op where no handlers are deployed
- Verify handlers
- Load more data
- Add 2 KV nodes
- Deploy the curl and SBM handlers
- Rebalance out 2 KV nodes
- Verify all handlers
- Pause the bucket op and timer handlers
- Swap 2 KV nodes one after another
- Load more data
- Rebalance in 2 eventing nodes
- Resume the bucket op and timer handlers
- Verify all handlers
- Load more data
- Rebalance out 2 eventing nodes
- Verify all handlers
- Load more data
- Swap 2 eventing nodes one after another
- Verify all handlers
- Load all data
- Rebalance in 1 KV and 1 eventing node
- Verify all handlers
- Load all data
- Rebalance out 1 KV and 1 eventing node
- Verify all handlers
- Load all data
- Swap 1 KV and 1 eventing node
- Verify all handlers
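For reference, the timer handlers in this test copy each source document into the corresponding destination collection through a timer callback. A minimal sketch of what a handler like timers0_0 might look like (illustrative only: the 30-second delay and the dst_coll binding alias are assumptions, not the actual test code; dst_coll would be a binding to timer_op.scope_1.coll0):

function OnUpdate(doc, meta) {
    // Schedule a timer a short interval in the future; the context object carries
    // everything the callback needs to recreate the document in the destination.
    var fireAt = new Date();
    fireAt.setSeconds(fireAt.getSeconds() + 30);
    createTimer(TimerCallback, fireAt, meta.id, {"id": meta.id, "doc": doc});
}

function TimerCallback(context) {
    // Write the document into the destination collection via the binding alias.
    dst_coll[context.id] = context.doc;
}

Verification then compares document counts between timer_op.scope_0.collN (source) and timer_op.scope_1.collN (destination), as in the table further down.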
Observed that all verifications for the timer handlers failed due to data mismatch, and the following documents are seen in the metadata collections:
select count(*) from metadata.scope_1.coll0 where meta().id like 'eventing::%:sp' and sta != stp; --> 128
select count(*) from metadata.scope_1.coll1 where meta().id like 'eventing::%:sp' and sta != stp; --> 128
select count(*) from metadata.scope_1.coll2 where meta().id like 'eventing::%:sp' and sta != stp; --> 128
Handler | Source collection (doc count) | Destination collection (doc count)
timers0_0 | timer_op.scope_0.coll0 (4095000) | timer_op.scope_1.coll0 (4094498)
timers1_0 | timer_op.scope_0.coll1 (4091100) | timer_op.scope_1.coll1 (4089924)
timers2_0 | timer_op.scope_0.coll2 (4089600) | timer_op.scope_1.coll2 (4089127)
N1QL queries are failing with a CAS mismatch:
2021-03-05T13:20:27.900-08:00 [INFO] "Query failed: " {"message":"SDK error : LCB_ERR_CAS_MISMATCH (209) Query error : {\n\"requestID\": \"1bc36320-7bb6-4e5f-a47d-d33f7c235c62\",\n\"clientContextID\": \"4@n1ql0_0.js(OnUpdate)\",\n\"signature\": null,\n\"results\": [\n],\n\"errors\": [{\"code\":12009,\"msg\":\"DML Error, possible causes include CAS mismatch or concurrent modificationFailed to perform UPSERT - cause: dial tcp 172.23.105.25:11210: connect: cannot assign requested address\"}],\n\"status\": \"errors\",\n\"metrics\": {\"elapsedTime\": \"23.135053ms\",\"executionTime\": \"23.080655ms\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 15,\"errorCount\": 1}\n}\n","stack":"Error\n at N1QL (<anonymous>)\n at OnUpdate (n1ql0_0.js:4:21)"}
2021-03-05T13:20:27.875-08:00 [INFO] "Query failed: " {"message":"SDK error : LCB_ERR_CAS_MISMATCH (209) Query error : {\n\"requestID\": \"1570ce91-f811-469c-b4b3-0ac8f1b57cca\",\n\"clientContextID\": \"4@n1ql0_0.js(OnUpdate)\",\n\"signature\": null,\n\"results\": [\n],\n\"errors\": [{\"code\":12009,\"msg\":\"DML Error, possible causes include CAS mismatch or concurrent modificationFailed to perform UPSERT - cause: dial tcp 172.23.105.25:11210: connect: cannot assign requested address\"}],\n\"status\": \"errors\",\n\"metrics\": {\"elapsedTime\": \"18.964561ms\",\"executionTime\": \"18.922147ms\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 9,\"errorCount\": 1}\n}\n","stack":"Error\n at N1QL (<anonymous>)\n at OnUpdate (n1ql0_0.js:4:21)"}
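The stack in these entries points at an inline N1QL UPSERT on line 4 of n1ql0_0.js inside OnUpdate. A minimal sketch of such a handler, assuming an illustrative destination keyspace and value (not the actual test code):

function OnUpdate(doc, meta) {
    var id = meta.id;
    // Inline N1QL UPSERT into the destination collection; this is the statement
    // that surfaces LCB_ERR_CAS_MISMATCH (209) / error 12009 when it fails.
    UPSERT INTO n1ql_op.scope_1.coll0 (KEY, VALUE) VALUES ($id, 'n1ql_written');
}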
In this case I do not see the collection getting dropped, so I am fairly sure that CCBC-1400 is not the cause here. From the logs, the mismatch of 2 docs started somewhere between 2021-05-04T02:45:20 and 2021-05-04T03:05:20 on node .214, vbucket 289:
Documents on the source collection = 5424, while on the destination collection = 5422.
At this time vb 289 was owned by eventing node .186. Additionally, at 2021-05-04T02:45:20 two eventing nodes, .182 and .185, were being rebalanced out. This rebalance lasted a minute at most and ended by 2021-05-04T02:46:54.
By 2021-05-04T02:45:58 we got STREAMEND for this vb. At this time, seq no 7382 was the last sequence number sent from the Go producer side to the C++ consumer side:
2021-05-04T02:45:58.400-07:00 [Info] Consumer::processDCPEvents [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 got STREAMEND
2021-05-04T02:45:58.400-07:00 [Info] Consumer::sendVbFilterData [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 seqNo: 7382 sending filter data to C++
The C++ side acknowledged that, out of 7382, 7354 is the highest sequence number that has been processed:
2021-05-04T02:45:58.498-07:00 [Info] eventing-consumer [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 seqNo: 7354 skip_ack: 0 sending filter ack to Go
A filter is now set at 7354. The consumer notices that vb 289 should still be owned by the same consumer, so the DCP stream should be reclaimed. Hence, it opens a new stream:
2021-05-04T02:46:00.818-07:00 [Info] Consumer::dcpRequestStreamHandle [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 DCP stream start vbKvAddr: 172.23.97.214:11210 vbuuid: 60405539196205 startSeq: 7354 snapshotStart: 7354 snapshotEnd: 7354 Manifest id: c
This confirms no mutation was dropped as part of filtering: the new stream restarts at the last processed seqno (7354), so the mutations between 7354 and 7382 that the C++ consumer had not yet processed are re-delivered.
The metadata collection for this handler suggests no timers remain to be fired. Note the numbers before and after the markers in the attached metadata_items.png.
The only remaining possibility is that the test may be adding some documents that are duplicates.
It would help if: