Details
-
Improvement
-
Resolution: Won't Do
-
Major
-
7.1.0
-
7.1.0-2438
-
1
Description
This issue is seen while running the XDCR volume tests with magma. It is seen on the destination cluster where there is another XDCR issue exists on this build i.e. https://issues.couchbase.com/browse/MB-50854. Here the xdcr backfill was getting refreshed infinitely and the data from the src xdcr is being sent. Please take a look at the xdcr defect for more details.
It looks like the data coming in to the XDCR destination is all temp documents and those were taking up the space in memory and hence filling it up. I am not sure how to simulate the same workload without XDCR.
From Slack:
can the seqno be that large?
|
SV @0x7f2070dee320 ... ....Cm temp: D seq:18446744073709551613 rev:129 cas:1646538741157724161 key:"cid:0x33:JnWsHZEdH-12910444, size:19" exp:1646538740 age:8 fc:2 vallen:0
|
8:52
|
cbstats output:
|
vb_0:ht_num_in_memory_items: 219671
|
vb_0:ht_num_in_memory_non_resident_items: 219671
|
vb_0:ht_num_items: 219671
|
Dave Rigby
|
The vbucket-details stats as printed by cbstats should make more sense.
|
Posted in kv-magma | Yesterday at 12:40 AM | View message
|
|
|
|
|
|
8:52
|
i am not sure how these numbers can be same... (edited)
|
|
Ben Huddleston:palm_tree: 11:31 PM
|
@Ritesh Agarwal That’s a temporary item which has a negative seqno which is being displayed as a positive value due to over/underflow. See temp: D saying that the item is temporary and deleted (edited)
|
11:33
|
Are all of the items in the hash table dump temporary items @Ritesh Agarwal?
|
|
Ritesh Agarwal 11:45 PM
|
no @benh, doesn't look like that:
|
[root@localhost ~]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b GleamBookUsers0 raw '_hash-dump 0' | more
|
vb:0: HashTable[0x7f21f958a518] with numItems:219671 numInMemory:219671 numDeleted:0 numNonResident:219671 numTemp:59581 numSystemItems:0 numPreparedSW:0 values:
|
SV @0x7f20da3e5cc0 .CJ ....Cm temp: seq:991175 rev:171 cas:1646532756115226624 key:"cid:0x27:QUUpyg86d-21996840, size:19" exp:0 age:10 fc:3 vallen:0
|
SV @0x7f21bd481a30 .CJ ....Cm temp: seq:897960 rev:164 cas:1646524103845216256 key:"cid:0x2d:va9INIC7g55-756921, size:19" exp:0 age:2 fc:3 vallen:0
|
SV @0x7f1d8e5f3ea0 .CJ ....Cm temp: seq:1010501 rev:172 cas:1646534569763405824 key:"cid:0xd:KfUzLzveYW-2297242, size:19" exp:0 age:10 fc:3 vallen:0
|
SV @0x7f218532bac0 .CJ ....Cm temp: seq:1134615 rev:183 cas:1646548835842129920 key:"cid:0x1a:qoVZfcefrCy-988315, size:19" exp:0 age:6 fc:3 vallen:0
|
SV @0x7f2070dee320 ... ....Cm temp: D seq:18446744073709551613 rev:129 cas:1646538741157724161 key:"cid:0x33:JnWsHZEdH-12910444, size:19" exp:1646538740 age:8 fc:2 vallen:0
|
11:45
|
ssh root@172.23.106.238 (edited)
|
|
Ben Huddleston:palm_tree: 11:46 PM
|
All of those are temporary, some just have seqnos :slightly_smiling_face:
|
11:46
|
Items being temporary vs not affects the tracking of some of those stats that you mentioned
|
|
Ritesh Agarwal 11:50 PM
|
temp: D
|
means deleted, right? but the others doesn't have 'D'
|
|
Ben Huddleston:palm_tree: 11:50 PM
|
temp: D means temp deleted. temp means temporary
|
|
Ritesh Agarwal 11:50 PM
|
ah okay
|
11:51
|
so when does these temp removed from memory?
|
|
Ben Huddleston:palm_tree: 11:51 PM
|
next item pager run generally
|
|
Ritesh Agarwal 11:52 PM
|
[root@localhost ~]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b GleamBookUsers0 raw '_hash-dump 0' | grep temp | wc -l
|
508094
|
11:54
|
i think because of these i am seeing a lot of Metadata overhead warning
|
11:54
|
although eviction is set to full eviction
|
|
Ben Huddleston:palm_tree: 11:56 PM
|
Well to set that numNonResident stat it looks like we consider deleted or temporary items as non-resident (as well as actually non-resident items but as you say you’re testing full eviction) (edited)
|
11:58
|
Actually you are hitting the residency check (edited)
|
|
Ritesh Agarwal 12:02 AM
|
so there is this xdcr issue(https://issues.couchbase.com/browse/MB-50854) where the backfill in getting restarted infinitely. And the cluster i am referring here is the xdcr destination cluster...
|
12:04
|
the items are coming to the xdcr dstn forever but looks like they all are ignored as they are all temp and then metadata warnings are showing up....(I am not able to connect the dots) (edited)
|
|
Ben Huddleston:palm_tree: 12:04 AM
|
:philosoraptor: so we have these counters
|
vb:0: HashTable[0x7f21f958a518] with numItems:554288 numInMemory:554288 numDeleted:0 numNonResident:554288 numTemp:29467 numSystemItems:0 numPreparedSW:0 values:
|
but everything (and more?) is temp in the dump
|
[root@localhost ~]# grep temp htdump.txt | wc -l
|
583755
|
12:04
|
Are you running anything at the same time?
|
|
Ritesh Agarwal 12:04 AM
|
nothing
|
12:04
|
scr: http://172.23.105.175:8091/ui/index.html#/buckets?commonBucket=GleamBookUsers0&scenarioZoom=minute&scenario=uezvmzrc0
|
dstn: http://172.23.106.250:8091/ui/index.html#/buckets?commonBucket=GleamBookUsers0&scenarioZoom=minute&scenario=bvycbn8zh
|
12:05
|
scr cluster is idle since 2 days
|
12:05
|
no cluster operations, no mutations
|
12:06
|
its xdcr src which is pushing data to this cluster
|
Ben Huddleston
|
Are you running anything at the same time?
|
Posted in kv-magma | Today at 12:04 AM | View message
|
|
Ben Huddleston:palm_tree: 12:07 AM
|
I suspect that numTemp is misleading because it checks via seqno and if an item has been fetched back into memory then it gets the seqno back
|
12:07
|
So numTemp is probably numTemp waiting for a fetch (edited)
|
12:08
|
this adds up
|
[root@localhost ~]# grep "temp: D" htdump.txt | wc -l
|
29467
|
vs
|
numTemp: 29467
|
(edited)
|
|
Ritesh Agarwal 12:12 AM
|
Okay
|
|
Ben Huddleston:palm_tree: 12:13 AM
|
The difference in items reported via the ht dump vs the overall stats may be due to that xdcr load you’ve got
|
|
Ritesh Agarwal 12:14 AM
|
Yea, right
|
|
Ben Huddleston:palm_tree: 12:14 AM
|
iirc it locks/unlocks to not trash the performance of everything else so you can’t expect the number of items it prints to match the count unless you have 0 load
|
12:15
|
In terms of your warning, I think it’s just temporary items contributing towards the stats making it display, assuming that the warning is backed from the stats you’ve been mentoining
|
|
Ritesh Agarwal 12:17 AM
|
Yea the items match is not something i am bothered about. I think its clear now that temp items come and then i do see that the % goes down also in the warning message so eviction is also happening and as in new temp items comes it gets filled again and warning shows up repeatedly. (edited)
|
12:18
|
And this keeps on happening forever due to the xdcr defect I mentioned
|
|
Dave Rigby 12:23 AM
|
I assume this is with Magma (given the channel)? If so then this might be worse than we’d see with Couchstore as there's no ep-engine level Bloom filter to avoid most meta bgfetches - and hence we end up with much more temp (deleted) items being loaded from disk.
|
12:25
|
@benh also, I recall some discussion on when we get rid of temp items - did we used to have code which (partially) removed them when we re-execute the frontend request after notify IO complete?
|
|
Ben Huddleston:palm_tree: 12:29 AM
|
In a couple of cases we do. E.g. if an item is temp deleted or temp non existent and we have the DELETE_TEMP flag set. In the general case, I don’t think we currently do. I can’t recall code in the past that did that, but it’s been a while since I looked at it
|
|
Dave Rigby 12:32 AM
|
gotcha. I suspect that quite a few places we probably could / should set DELETE_TEMP but don't currently; which might help with this kind of workload.
|
|
Dave Rigby 1:01 AM
|
@Ritesh Agarwal could you raise an MB got this? I don't believe it reaches the critical threshold needed for Neo (we’ll delete all these temp items when item pager next runs - say when the high watermark is hit), but we could potentially remove them earlier and generally have more memory free (and reduce those warnings).
|
|
Ritesh Agarwal 7:37 AM
|
Sure @daver, will raise one.
|
|
Dave Rigby 7:38 AM
|
Great, thanks.
|