Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Not a Bug
- Affects Version: 6.6.2
- Fix Version: None
- Triage: Untriaged
- 1
- Unknown
Description
Steps to reproduce:
- Create 4 buckets.
- Create indexes with replicas on each of the 4 buckets.
- Run pillowfight to continuously load data (the buckets have 1M, 1M, 1M and 3M items). Keep loading until the bucket resident ratio is under 10%.
- Run a shell script that issues request_plus scans continuously (a minimal sketch is included after this list).
- Run stress-ng with these params: stress-ng --vm 4 --vm-bytes 1G --metrics-brief --vm-keep --vm-locked -m 4 --aggressive --vm-populate (adjust the --vm-bytes param depending on the VM resources).
- Once enough stress-ng processes are running, the OOM killer kicks in. This can be verified by checking dmesg (dmesg -T | egrep -i 'killed process').
- There is a possibility that stress-ng itself gets spawned and killed, since the OOM-kill victim is chosen based on oom_score_adj. To make sure that memcached gets killed, run:
  echo 1000 > /proc/<memcached PID>/oom_score_adj
- Observe that the scans time out and that the index has rolled back to 0.
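For reference, a minimal sketch of the scan/OOM-kill driver described above. The query node address, credentials, bucket name and scan statement are placeholders and may differ from the actual test setup.

#!/usr/bin/env bash
# Continuously issue request_plus scans via the query REST API
# (node, credentials and statement below are placeholders).
while true; do
  curl -s -u Administrator:password "http://<query node>:8093/query/service" \
       --data-urlencode 'statement=SELECT COUNT(*) FROM test8 WHERE body IS NOT MISSING' \
       --data-urlencode 'scan_consistency=request_plus' > /dev/null
done &

# Bias the OOM killer towards memcached so that stress-ng itself is not the preferred victim.
echo 1000 > /proc/"$(pgrep -x memcached)"/oom_score_adj

# Once enough stress-ng processes are running, confirm the kill.
dmesg -T | egrep -i 'killed process'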
Logs and pcaps are attached. There were 2 instances of rollback observed; please use these timestamps for analysis:
Instance 1
Index node1 (172.23.106.159)
2022-07-26T03:26:54.738-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
Index node2 (172.23.106.163)
2022-07-26T03:26:58.186-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
Instance 2
Index node1 (172.23.106.159)
2022-07-26T05:06:12.658-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
Index node2 (172.23.106.163)
2022-07-26T05:06:10.805-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
Log bundles:
s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.105.36.zip
s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.105.37.zip
s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.156.zip
s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.159.zip
s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.163.zip
s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.204.zip
I have the packet capture files too and can attach them if necessary.
Attachments
Activity
a. Consider the rollback:
2022-07-26T05:06:10.805-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
b. Indexer did not have any snapshot to use, so it rolled back to zero:
2022-07-26T05:06:09.431-07:00 [Info] StorageMgr::handleRollback latestSnapInfo <nil> lastRollbackTs <nil>. Use latest snapshot.
c. The rollbackTs for many vbuckets is "0":
2022-07-26T05:06:09.430-07:00 [Info] StorageMgr::handleRollback rollbackTs is bucket: test8, vbuckets: 1024 Crc64: 0 snapType NO_SNAP -
{vbno, vbuuid, seqno, snapshot-start, snapshot-end}
{ 0 1a84ffdb1dbd 0 0 0}
{ 1 4fdea47cd9ba 0 0 0}
{ 2 20aa63115fd1 126103 0 0}
{ 3 f123fa709ea7 127215 0 0}
d. Taking vb:0 as an example, memcached was OOM-killed, which led it to create a new vbuuid on restart because of the unclean shutdown:
2022-07-26T05:02:58.678-07:00, ns_memcached:0:info:message(ns_1@172.23.105.36) - Control connection to memcached on 'ns_1@172.23.105.36' disconnected. Check logs for details.
2022-07-26T05:03:00.548-07:00, ns_log:0:info:message(ns_1@172.23.105.36) - Service 'memcached' exited with status 137. Restarting. Messages:
...
2022-07-26T05:03:08.066510-07:00 INFO (test8) Warmup::createVBuckets: vb:0 created new failover entry with uuid:29158530555325 and seqno:116360 due to unclean shutdown
e. Indexer received mutations with this vbuuid from memcached on 105.36 post restart, until the connection was closed:
2022-07-26T05:03:16.199803-07:00 INFO 148: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_3952cfedf2ceec6b184b836c0f396aaf-18233705252004069608/0 - (vb:0) Creating stream with start seqno 98199 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
...
2022-07-26T05:05:05.432874-07:00 WARNING 148: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_3952cfedf2ceec6b184b836c0f396aaf-18233705252004069608/0 - (vb:0) Stream closing, sent until seqno 109400 remaining items 0, reason: The stream closed early because the conn was disconnected
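As a sanity check, the new-failover-entry events for a given vbucket can be pulled from the KV node's memcached log with a quick grep (the log file name here is an assumption based on the cbcollect bundle layout):

grep 'created new failover entry with uuid' memcached.log | grep 'vb:0'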
f. The connection was closed at the time of failover:
2022-07-26T05:05:02.810-07:00, failover:0:info:message(ns_1@172.23.105.36) - Starting failing over ['ns_1@172.23.105.36']
2022-07-26T05:05:02.810-07:00, ns_orchestrator:0:info:message(ns_1@172.23.105.36) - Starting failover of nodes ['ns_1@172.23.105.36']. Operation Id = 2139917b8f5387cec83fac633b7e7b5a
g. Due to the failover, the vbuuid changed:
2022-07-26T05:05:04.916622-07:00 INFO (test8) VBucket::setState: transitioning vb:0 with high seqno:126312 from:replica to:active meta:{"topology":[["ns_1@172.23.105.37",null]]}
2022-07-26T05:05:04.916676-07:00 INFO (test8) KVBucket::setVBucketState: vb:0 created new failover entry with uuid:122286947444349 and seqno:125619
h. When the KV node failover happened, indexer restarted the stream with 1a84ffdb1dbd as the vbuuid and 109400 as the last received seqno:
2022-07-26T05:05:56.934-07:00 [Info] FEED[<=>MAINT_STREAM_TOPIC_3952cfedf2ceec6b184b836c0f396aaf(127.0.0.1:8091)] ##98 start-timestamp bucket: test8, vbuckets: 171 -
{vbno, vbuuid, seqno, snapshot-start, snapshot-end}
{ 0 1a84ffdb1dbd 109400 54044 116360}
{ 1 4fdea47cd9ba 92432 53304 113709}
i. When indexer made the stream request, the failover table did not contain UUID 29158530555325 (1a84ffdb1dbd in hex). Hence, memcached returned this error:
Failover table:
test8
vb_0:0:id: 122286947444349
vb_0:0:seq: 125619
vb_0:1:id: 216042336623651
vb_0:1:seq: 0
2022-07-26T05:05:56.981326-07:00 WARNING 61: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_3952cfedf2ceec6b184b836c0f396aaf-1290788532655393424/1 - (vb:0) Stream request requires rollback to seqno:0 because vBucket UUID not found in failover table, consumer and producer have no common history. Client requested seqnos:{109400,18446744073709551615} snapshot:{54044,116360} uuid:29158530555325
This resulted in a rollback to zero. The UUID created by the failed-over node (105.36) was not persisted in the failover table, so when the failover happened memcached did not have the UUID that indexer wanted and hence there was a rollback to zero. I checked some other vbuckets as well. The issue appears to be different from the customer issue.
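For comparison, the failover table for vb:0 can also be dumped live from the KV node with cbstats (node address and credentials below are placeholders):

cbstats <kv node>:11210 -u Administrator -p password -b test8 failovers | grep 'vb_0:'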
Varun Velamuri Is this similar to one of the other MBs raised this week, where if Indexer had retried with the previous UUID (and adjusted snapshot range) it could have avoided rolling back all the way to zero?
Dave Rigby - We could have avoided the rollback by using the previous VBUUID and snapshot range. The bigger problem I see is that indexer did not have a snapshot to retry with. This needs to be investigated further:
2022-07-26T05:06:09.431-07:00 [Info] StorageMgr::handleRollback latestSnapInfo <nil> lastRollbackTs <nil>. Use latest snapshot.
j. From the logs, the bucket "test8" was flushed at 2022-07-26T03:26:20.792-07:00. So, all the indexed data (including disk snapshots) was lost at that time, and the next disk snapshot could only be generated at around 2022-07-26T03:36, i.e. 10 minutes after stream start. I do not see any disk snapshot getting generated after the flush:
2022-07-26T03:26:20.792-07:00, ns_orchestrator:0:info:message(ns_1@172.23.105.36) - Flushing bucket "test8" from node 'ns_1@172.23.105.36'
k. After the stream started, indexer started receiving larger timestamps. From the projector logs on node 105.36, the avg. DCP snapshot size has grown from 22 to 392 - essentially indexer is receiving larger snapshots from DCP. This could be due to a slow KV (I am claiming slow KV because mutChLen at projector is zero - which means projector is waiting for data from KV):
2022-07-26T03:27:28.147-07:00 [Info] KVDT[<-test8<-127.0.0.1:8091 #MAINT_STREAM_TOPIC_3952cfedf2ceec6b184b836c0f396aaf] ##72 stats: {"hbCount":14,"eventCount":40059,"reqCount":342,"endCount":0,"snapStat.samples":3131,"snapStat.min":1,"snapStat.max":356,"snapStat.avg":22,"upsertCount":36586,"deleteCount":0,"exprCount":0,"ainstCount":1,"dinstCount":0,"tsCount":0,"mutChLen":0,"numDocsProcessed":36520,"numDocsPending":101094}
...
2022-07-26T03:36:40.381-07:00 [Info] KVDT[<-test8<-127.0.0.1:8091 #MAINT_STREAM_TOPIC_3952cfedf2ceec6b184b836c0f396aaf] ##72 stats: {"hbCount":287,"eventCount":3881623,"reqCount":342,"endCount":0,"snapStat.samples":12328,"snapStat.min":1,"snapStat.max":21854,"snapStat.avg":392,"upsertCount":3868953,"deleteCount":0,"exprCount":0,"ainstCount":1,"dinstCount":0,"tsCount":0,"mutChLen":0,"numDocsProcessed":3866756,"numDocsPending":7632366}
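The snapStat.avg growth over time can be extracted from the projector log with a one-liner like this (log file name is an assumption):

grep 'KVDT\[<-test8<-' projector.log | sed -E 's/^([^ ]+).*"snapStat.avg":([0-9]+).*/\1 \2/'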
l. A larger snapshot means that indexer will not be able to align the received seqno to the snapshot end. After a while, the size of the snapshots grew really large:
2022-07-26T04:19:12.914-07:00 [Info] MutationStreamReader::updateSnapInFilter Huge Snapshot Received for test8 48 MAINT_STREAM 3115-54850
...
2022-07-26T05:03:24.863-07:00 [Info] MutationStreamReader::updateSnapInFilter Huge Snapshot Received for test8 280 MAINT_STREAM 54800-118903
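To see how often these huge snapshots were reported over the run, a per-10-minute count (same idiom as the slow-scheduling grep further down in this thread) works on the indexer log (file name assumed):

grep 'Huge Snapshot Received for test8' indexer.log | grep -o '2022-..-..T..:.' | uniq -c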
m. Indexer could not align the timestamps to snapshots and hence no disk snapshot got persisted. The num_snapshots/num_commits stats are not changing on the indexer side, and num_nonalign_ts has grown drastically. This suggests that indexer did not get even a single timestamp in which all vbucket seqnos are aligned to snapshot boundaries:
2022-07-26T03:28:32.918-07:00 [Info] PeriodicStats = {"test8:idx8:num_snapshots":57067, "test8:idx8:num_commits":13,
"test8:num_nonalign_ts":293053}
...
2022-07-26T05:05:33.111-07:00 [Info] PeriodicStats = {"test8:idx8:num_snapshots":57067, "test8:idx8:num_commits":13, "test8:num_nonalign_ts":569809}
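The same stats can be tracked across the whole run from the PeriodicStats lines (indexer log file name assumed):

grep -o '"test8:idx8:num_commits":[0-9]*' indexer.log | uniq -c
grep -o '"test8:num_nonalign_ts":[0-9]*' indexer.log | tail -5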
n. As there was no disk snapshot, indexer could not roll back to a disk snapshot and hence it rolled back to zero.
This is another scenario where a slow KV can send potentially large snapshots, leaving indexer unable to align the seqnos to the snapshots and therefore having to roll back to zero (or to a very old disk snapshot). Even if one vbucket is not snapshot aligned, indexer will not generate a disk snapshot.
CC: Deepkaran Salooja, Amit Kulkarni, Dave Finlay, Dave Rigby
I think the case where the bucket got flushed and Indexer had to roll back to zero is completely fine.
This is when the bucket flush starts:
[ns_server:debug,2022-07-26T03:26:48.768-07:00,ns_1@172.23.105.36:ns_audit<0.28196.0>:ns_audit:handle_call:146]Audit flush_bucket: [{remote,{[{ip,<<"10.100.195.195">>},{port,49358}]}},
{sessionid,<<"634890a73e552079485242dc299216a91c34cdda">>},
{real_userid,{[{domain,builtin},
{user,<<"<ud>Administrator</ud>">>}]}},
{timestamp,<<"2022-07-26T03:26:48.768-07:00">>},
{bucket_name,<<"test8">>}]
This is when it finishes:
[ns_server:debug,2022-07-26T03:26:48.768-07:00,ns_1@172.23.105.36:ns_audit<0.28196.0>:ns_audit:handle_call:146]Audit flush_bucket: [{remote,{[{ip,<<"10.100.195.195">>},{port,49358}]}},
{sessionid,<<"634890a73e552079485242dc299216a91c34cdda">>},
{real_userid,{[{domain,builtin},
{user,<<"<ud>Administrator</ud>">>}]}},
{timestamp,<<"2022-07-26T03:26:48.768-07:00">>},
{bucket_name,<<"test8">>}]
Flush empties the bucket by deleting the vbuckets, which results in the creation of new vbuuids. It's not surprising that Indexing rolls back to zero a few seconds after the bucket flush finishes:
Index node1 (172.23.106.159)
2022-07-26T03:26:54.738-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
The second case at 5:06 is more interesting as it is the result of a failover induced by an OOM kill.
OOM kill:
2022-07-26T05:02:58.678-07:00, ns_memcached:0:info:message(ns_1@172.23.105.36) - Control connection to memcached on 'ns_1@172.23.105.36' disconnected. Check logs for details.
2022-07-26T05:03:00.548-07:00, ns_log:0:info:message(ns_1@172.23.105.36) - Service 'memcached' exited with status 137. Restarting. Messages:
Failover:
2022-07-26T05:05:02.810-07:00, ns_orchestrator:0:info:message(ns_1@172.23.105.36) - Starting failover of nodes ['ns_1@172.23.105.36']. Operation Id = 2139917b8f5387cec83fac633b7e7b5a
2022-07-26T05:05:05.290-07:00, ns_memcached:0:info:message(ns_1@172.23.105.36) - Shutting down bucket "test8" on 'ns_1@172.23.105.36' for deletion
2022-07-26T05:05:07.683-07:00, ns_memcached:0:info:message(ns_1@172.23.105.36) - Bucket "test6" loaded on node 'ns_1@172.23.105.36' in 124 seconds.
2022-07-26T05:05:08.608-07:00, failover:0:info:message(ns_1@172.23.105.36) - Failed over ['ns_1@172.23.105.36']: ok
2022-07-26T05:05:08.651-07:00, failover:0:info:message(ns_1@172.23.105.36) - Deactivating failed over nodes ['ns_1@172.23.105.36']
2022-07-26T05:05:08.690-07:00, auto_failover:0:info:message(ns_1@172.23.105.36) - Node ('ns_1@172.23.105.36') was automatically failed over. Reason: The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service.
Rollback to zero:
Index node1 (172.23.106.159)
2022-07-26T05:06:12.658-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
CPU is high on all 3 of the memcached nodes for the last 10 minutes before the failover – however it doesn't seem to be that high prior to that.
That said, I see quite a lot of slow scheduling for the DCP connection notifier task on the non-IO thread workers during this time.
$ grep "T0[345].*test8.*Slow scheduling for NON_IO task .DCP connection" memcached.log | grep -o 2022-..-..T..:. | uniq -c
50 2022-07-26T03:0
13 2022-07-26T03:1
14 2022-07-26T03:2
87 2022-07-26T03:3
19 2022-07-26T03:4
65 2022-07-26T03:5
79 2022-07-26T04:0
41 2022-07-26T04:1
13 2022-07-26T04:2
22 2022-07-26T04:3
45 2022-07-26T04:4
27 2022-07-26T04:5
33 2022-07-26T05:0
Some of these waits are pretty lengthy. Dave - is this evidence for KV being slow to get mutations to indexing? Is there a reason for the slowness?
From DaveF:
Dave - is this evidence for KV being slow to get mutations to indexing? Is there a reason for the slowness?
Looking at mortimer for the flush around 03:26, we see that test8 has a bucket flush at [1], followed by the bucket being repopulated with data. Nodes 105.37 and 106.156 hit the high watermark and successfully free memory, dropping back to the low watermark, and continue to oscillate as more data is added, but node 105.36 exceeds the high watermark and never comes back down below it:
Looking at a breakdown of memory, we see large amounts in CheckpointManager, but not in the disk queue - i.e. the bulk of the memory is outstanding items for DCP stream(s):
Looking at item counts, we see both replication and GSI have large number of items remaining:
Essentially the workload is generating more mutations than KV-Engine can transmit via DCP to replicas and indexing nodes; hence a backlog is formed which consumes most of the bucket memory. To attempt to alleviate that, KV will switch from in-memory to disk backfill - which will result in larger snapshots being sent:
2022-07-26T03:32:46.494248-07:00 INFO 64: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0 - (vb:264) Handling slow stream; state_ : in-memory, lastReadSeqno : 5279, lastSentSeqno : 3452, vBucketHighSeqno : 24649, isBackfillTaskRunning : False
2022-07-26T04:03:48.467911-07:00 INFO 64: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0 - (vb:264) ActiveStream::transitionState: Transitioning from in-memory to backfilling
2022-07-26T04:03:48.472522-07:00 INFO 64: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0 - (vb:264) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0" backfill:true, seqno:5873
2022-07-26T04:03:48.474445-07:00 INFO 64: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0 - (vb:264) Scheduling backfill from 5280 to 5872, reschedule flag : True
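The extent of the in-memory to backfilling switching across streams can be gauged with the same per-10-minute bucketing idiom used above:

grep 'Transitioning from in-memory to backfilling' memcached.log | grep -o '2022-..-..T..:.' | uniq -c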
Note that we have improved this behaviour in later versions - most notably in 7.1, by limiting the amount of memory which can be used by CheckpointManager (for holding outstanding DCP & disk mutations). That wouldn't avoid the switch to backfilling, but it would keep more memory free in the bucket for normal operations, and apply backpressure to the client sooner, telling them we are temporarily out of memory.
The later OOM-kill around 05:02 is essentially the same - the rate of ingress mutations is greater than the environment can egress to replication and indexer, resulting in large disk backfill snapshots being sent to indexing.
Note this setup only has:
- One AuxIO thread, which in 6.6.2 is responsible for all processing of in-memory Checkpoints into DCP messages - for replicas and indexing, across all buckets - plus all disk backfills.
- Two NonIO threads, which are running ItemCompressorTasks (active compression), ItemPaging, and notifying DCP connections that data is ready.
Essentially the system is not well-sized for the workload being run on it, and things are running slow / queues are backing up etc.
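For what it's worth, the thread-pool sizing on a node can be checked via the cbstats workload group (node address and credentials are placeholders; exact stat key names may vary by version):

cbstats <kv node>:11210 -u Administrator -p password -b test8 workload | grep -E 'num_auxio|num_nonio'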
Varun Velamuri Pavan PB Since this is marked for 7.1.2, can you please also update the due date on this? Thanks!
I don't think there is anything wrong with indexer in this case. Removing the 7.1.2 fix version for now.
Pavan PB - Resolving this as not-a-bug, as it turned out to be a sizing issue (see the comment from Dave Rigby above).
OOM-killer seen killing memcached on node 105.36 at:
There are 280 rollback responses sent to GSI from memcached on node 105.36 after this OOM-kill:
rg 'Stream request requires rollback' memcached.log | rg 2022-07-26T05: | rg secidx | wc -l
280
I'm not sure how to map the rollbacks seen from KV to indexing's "rollback to zero" - certainly KV didn't respond with "rollback to zero" for any of the failed stream requests post-OOM-kill:
rg 'Stream request requires rollback' memcached.log | rg 2022-07-26T05: | rg secidx | rg -c --include-zero seqno:0
0
A few examples of the GSI rollbacks we responded with:
2022-07-26T05:03:14.726864-07:00 WARNING 145: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-8649600389967487890/3 - (vb:3) Stream request requires rollback to seqno:121417 because consumer ahead of producer - producer upper at 121417. Client requested seqnos:{127824,18446744073709551615} snapshot:{127814,127824} uuid:265136827834023
2022-07-26T05:03:16.222055-07:00 WARNING 150: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_3952cfedf2ceec6b184b836c0f396aaf-18233705252004069608/2 - (vb:2) Stream request requires rollback to seqno:125392 because consumer ahead of producer - producer upper at 125392. Client requested seqnos:{126981,18446744073709551615} snapshot:{126981,126981} uuid:35916178612177