Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-51365

High number(~417M) of temp items are seen in hash tables which leads to metadata warnings while eviction is set to full.

    XMLWordPrintable

Details

    • 1

    Description

      This issue is seen while running the XDCR volume tests with magma. It is seen on the destination cluster where there is another XDCR issue exists on this build i.e. https://issues.couchbase.com/browse/MB-50854. Here the xdcr backfill was getting refreshed infinitely and the data from the src xdcr is being sent. Please take a look at the xdcr defect for more details.

      It looks like the data coming in to the XDCR destination is all temp documents and those were taking up the space in memory and hence filling it up. I am not sure how to simulate the same workload without XDCR.

      From Slack:

      can the seqno be that large?
      SV @0x7f2070dee320 ... ....Cm temp: D  seq:18446744073709551613 rev:129 cas:1646538741157724161 key:"cid:0x33:JnWsHZEdH-12910444, size:19" exp:1646538740 age:8 fc:2 vallen:0
      8:52
      cbstats output:
       vb_0:ht_num_in_memory_items:              219671
       vb_0:ht_num_in_memory_non_resident_items: 219671
       vb_0:ht_num_items:                        219671
      Dave Rigby
      The vbucket-details stats as printed by cbstats should make more sense.
      Posted in kv-magma | Yesterday at 12:40 AM | View message
       
       
       
       
       
      8:52
      i am not sure how these numbers can be same... (edited) 
       
      Ben Huddleston:palm_tree:  11:31 PM
      @Ritesh Agarwal That’s a temporary item which has a negative seqno which is being displayed as a positive value due to over/underflow. See temp: D saying that the item is temporary and deleted (edited) 
      11:33
      Are all of the items in the hash table dump temporary items @Ritesh Agarwal?
       
      Ritesh Agarwal  11:45 PM
      no @benh, doesn't look like that:
      [root@localhost ~]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b GleamBookUsers0 raw '_hash-dump 0' | more
       vb:0: HashTable[0x7f21f958a518] with numItems:219671 numInMemory:219671 numDeleted:0 numNonResident:219671 numTemp:59581 numSystemItems:0 numPreparedSW:0 values: 
           SV @0x7f20da3e5cc0 .CJ ....Cm temp:    seq:991175 rev:171 cas:1646532756115226624 key:"cid:0x27:QUUpyg86d-21996840, size:19" exp:0 age:10 fc:3 vallen:0
           SV @0x7f21bd481a30 .CJ ....Cm temp:    seq:897960 rev:164 cas:1646524103845216256 key:"cid:0x2d:va9INIC7g55-756921, size:19" exp:0 age:2 fc:3 vallen:0
           SV @0x7f1d8e5f3ea0 .CJ ....Cm temp:    seq:1010501 rev:172 cas:1646534569763405824 key:"cid:0xd:KfUzLzveYW-2297242, size:19" exp:0 age:10 fc:3 vallen:0
           SV @0x7f218532bac0 .CJ ....Cm temp:    seq:1134615 rev:183 cas:1646548835842129920 key:"cid:0x1a:qoVZfcefrCy-988315, size:19" exp:0 age:6 fc:3 vallen:0
           SV @0x7f2070dee320 ... ....Cm temp: D  seq:18446744073709551613 rev:129 cas:1646538741157724161 key:"cid:0x33:JnWsHZEdH-12910444, size:19" exp:1646538740 age:8 fc:2 vallen:0
      11:45
      ssh root@172.23.106.238 (edited) 
       
      Ben Huddleston:palm_tree:  11:46 PM
      All of those are temporary, some just have seqnos :slightly_smiling_face:
      11:46
      Items being temporary vs not affects the tracking of some of those stats that you mentioned
       
      Ritesh Agarwal  11:50 PM
      temp: D
      means deleted, right? but the others doesn't have 'D'
       
      Ben Huddleston:palm_tree:  11:50 PM
      temp: D means temp deleted. temp means temporary
       
      Ritesh Agarwal  11:50 PM
      ah okay
      11:51
      so when does these temp removed from memory?
       
      Ben Huddleston:palm_tree:  11:51 PM
      next item pager run generally
       
      Ritesh Agarwal  11:52 PM
      [root@localhost ~]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b GleamBookUsers0 raw '_hash-dump 0' | grep temp | wc -l
      508094
      11:54
      i think because of these i am seeing a lot of Metadata overhead warning
      11:54
      although eviction is set to full eviction
       
      Ben Huddleston:palm_tree:  11:56 PM
      Well to set that numNonResident stat it looks like we consider deleted or temporary items as non-resident (as well as actually non-resident items but as you say you’re testing full eviction) (edited) 
      11:58
      Actually you are hitting the residency check (edited) 
       
      Ritesh Agarwal  12:02 AM
      so there is this xdcr issue(https://issues.couchbase.com/browse/MB-50854) where the backfill in getting restarted infinitely. And the cluster i am referring here is the xdcr destination cluster...
      12:04
      the items are coming to the xdcr dstn forever but looks like they all are ignored as they are all temp and then metadata warnings are showing up....(I am not able to connect the dots) (edited) 
       
      Ben Huddleston:palm_tree:  12:04 AM
      :philosoraptor:  so we have these counters
       vb:0: HashTable[0x7f21f958a518] with numItems:554288 numInMemory:554288 numDeleted:0 numNonResident:554288 numTemp:29467 numSystemItems:0 numPreparedSW:0 values:
      but everything (and more?) is temp in the dump
      [root@localhost ~]# grep temp htdump.txt | wc -l
      583755
      12:04
      Are you running anything at the same time?
       
      Ritesh Agarwal  12:04 AM
      nothing
      12:04
      scr: http://172.23.105.175:8091/ui/index.html#/buckets?commonBucket=GleamBookUsers0&scenarioZoom=minute&scenario=uezvmzrc0
      dstn: http://172.23.106.250:8091/ui/index.html#/buckets?commonBucket=GleamBookUsers0&scenarioZoom=minute&scenario=bvycbn8zh
      12:05
      scr cluster is idle since 2 days
      12:05
      no cluster operations, no mutations
      12:06
      its xdcr src which is pushing data to this cluster
      Ben Huddleston
      Are you running anything at the same time?
      Posted in kv-magma | Today at 12:04 AM | View message
       
      Ben Huddleston:palm_tree:  12:07 AM
      I suspect that numTemp is misleading because it checks via seqno and if an item has been fetched back into memory then it gets the seqno back
      12:07
      So numTemp is probably numTemp waiting for a fetch (edited) 
      12:08
      this adds up
      [root@localhost ~]# grep "temp: D" htdump.txt | wc -l
      29467
      vs
      numTemp: 29467
      (edited)
       
      Ritesh Agarwal  12:12 AM
      Okay
       
      Ben Huddleston:palm_tree:  12:13 AM
      The difference in items reported via the ht dump vs the overall stats may be due to that xdcr load you’ve got
       
      Ritesh Agarwal  12:14 AM
      Yea, right
       
      Ben Huddleston:palm_tree:  12:14 AM
      iirc it locks/unlocks to not trash the performance of everything else so you can’t expect the number of items it prints to match the count unless you have 0 load
      12:15
      In terms of your warning, I think it’s just temporary items contributing towards the stats making it display, assuming that the warning is backed from the stats you’ve been mentoining
       
      Ritesh Agarwal  12:17 AM
      Yea the items match is not something i am bothered about. I think its clear now that temp items come and then i do see that the % goes down also in the warning message so eviction is also happening and as in new temp items comes it gets filled again and warning shows up repeatedly.  (edited) 
      12:18
      And this keeps on happening forever due to the xdcr defect I mentioned
       
      Dave Rigby  12:23 AM
      I assume this is with Magma (given the channel)? If so then this might be worse than we’d see with Couchstore as there's no ep-engine level Bloom filter to avoid most meta bgfetches - and hence we end up with much more temp (deleted) items being loaded from disk.
      12:25
      @benh also, I recall some discussion on when we get rid of temp items - did we used to have code which (partially) removed them when we re-execute the frontend request after notify IO complete?
       
      Ben Huddleston:palm_tree:  12:29 AM
      In a couple of cases we do. E.g. if an item is temp deleted or temp non existent and we have the DELETE_TEMP flag set. In the general case, I don’t think we currently do. I can’t recall code in the past that did that, but it’s been a while since I looked at it
       
      Dave Rigby  12:32 AM
      gotcha. I suspect that quite a few places we probably could / should set DELETE_TEMP but don't currently; which might help with this kind of workload.
       
      Dave Rigby  1:01 AM
      @Ritesh Agarwal could you raise an MB got this? I don't believe it reaches the critical threshold needed for Neo (we’ll delete all these temp items when item pager next runs - say when the high watermark is hit), but we could potentially remove them earlier and generally have more memory free (and reduce those warnings).
       
      Ritesh Agarwal  7:37 AM
      Sure @daver, will raise one.
       
      Dave Rigby  7:38 AM
      Great, thanks.
      

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            ritesh.agarwal Ritesh Agarwal
            ritesh.agarwal Ritesh Agarwal
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty