Couchbase Server: MB-49037

AWS m6g.large rebalance hung due to backfilling paused


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 7.1.0
    • Fix Version/s: 7.1.0
    • Component/s: couchbase-bucket
    • Build number: 7.1.0-1361

      OS: Amazon Linux 2
      ARM instance: m6g.large

      2 vCPU
      8 GB memory
      40 GB EBS

    Description

      During rebalance performance tests on ARM AWS instances, the tests consistently hang. An example job can be found here, along with the logs:

      http://perf.jenkins.couchbase.com/job/Cloud-Tester/600/

       

      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-07T223241-ns_1%40ec2-3-219-56-9.compute-1.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-07T223241-ns_1%40ec2-3-223-6-164.compute-1.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-07T223241-ns_1%40ec2-44-195-22-82.compute-1.amazonaws.com.zip

       

      The rebalance seems to hang on 'still waiting for backfill on connection'; this message appears 115 times in the logs:

       

      [rebalance:debug,2021-10-07T22:35:41.445Z,ns_1@ec2-44-195-22-82.compute-1.amazonaws.com:<0.1108.3>:dcp_replicator:wait_for_data_move_on_one_node:192]Still waiting for backfill on connection "replication:ns_1@ec2-44-195-22-82.compute-1.amazonaws.com->ns_1@ec2-3-223-6-164.compute-1.amazonaws.com:bucket-1" bucket "bucket-1", partition 745, last estimate {0,0, <<"calculating-item-count">>}

      During this time memcached keeps returning <<"calculating-item-count">> with no estimate, and CPU usage also spikes.

       

      Attachments

        Issue Links


          Activity

            sean.corrigan Sean Corrigan created issue -
            drigby Dave Rigby made changes -
            Field Original Value New Value
            Component/s couchbase-bucket [ 10173 ]
            Component/s memcached [ 11621 ]
            drigby Dave Rigby made changes -
            Affects Version/s Neo [ 17615 ]
            drigby Dave Rigby made changes -
            Fix Version/s Neo [ 17615 ]
            drigby Dave Rigby made changes -
            Assignee Trond Norbye [ trond ] Daniel Owen [ owend ]
            drigby Dave Rigby added a comment -

            Sean Corrigan Do we know if this is specific to ARM-based AWS instances, or do you also see the same in equivalently sized x86 instances?


            sean.corrigan Sean Corrigan added a comment -

            Dave Rigby when the issue first arose, both x86 and ARM tests were hanging on the rebalance tests; however, this was investigated in MB-48825 (https://issues.couchbase.com/browse/MB-48825) and they turned out to be separate issues. The failure to calculate the item count seems isolated to the ARM instance.

            drigby Dave Rigby added a comment -

            Sean Corrigan URLs are returning an error - e.g. https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-07T223241-ns_1%40ec2-44-195-22-82.compute-1.amazonaws.com.zip

            <Error>
            <Code>InvalidObjectState</Code>
            <Message>
            The operation is not valid for the object's storage class
            </Message>
            <StorageClass>GLACIER</StorageClass>
            <RequestId>BYZM5DBAB2S2Q66A</RequestId>
            <HostId>
            16oQXHcvP+RoZPt1EkB6vCts97Spb8HhiPR/32aUEFSnVQxQ8PoGj8liZNq/vzSEvW1x2BR/XhY=
            </HostId>
            </Error>
            

            Looks like they might have been glaciered already?

            drigby Dave Rigby made changes -
            Link This issue relates to MB-48825 [ MB-48825 ]

            sean.corrigan Sean Corrigan added a comment -

            Dave Rigby My bad, will rerun the test and post new logs here.

            sean.corrigan Sean Corrigan added a comment -

            Dave Rigby Here are the new logs:
            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-20T120409-ns_1%40ec2-3-235-136-83.compute-1.amazonaws.com.zip
            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-20T120409-ns_1%40ec2-3-237-95-29.compute-1.amazonaws.com.zip
            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-20T120409-ns_1%40ec2-3-238-93-68.compute-1.amazonaws.com.zip
            drigby Dave Rigby made changes -
            Assignee Daniel Owen [ owend ] Dave Rigby [ drigby ]
            drigby Dave Rigby added a comment -

            Sean Corrigan I'm not seeing the aforementioned "Still waiting for backfill on connection" message in any of those log files - are you still seeing the issue in that run?


            sean.corrigan Sean Corrigan added a comment -

            Yes, it is still hanging, interestingly also at 16.x%. Here is the corresponding job for those logs:
            http://perf.jenkins.couchbase.com/job/Cloud-Tester/615/console
            drigby Dave Rigby made changes -
            Attachment Screenshot 2021-10-20 at 13.37.03.png [ 164992 ]
            drigby Dave Rigby added a comment -

            So this is why the backfill phase of the DCP stream created during rebalance is hanging:

            2021-10-20T12:03:53.951358+00:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
            

            Essentially we have stopped any backfilling tasks as memory usage is over 95% of the specified bucket quota. From looking at the prometheus data via promtimer, we can confirm that bucket memory used (mem_used) is well above the high watermark:

            Looking into where the memory is being used...
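            As an illustration of that pause condition, here is a minimal sketch (not the actual kv_engine source; the struct, function, and constant names are made up, and the 0.95 ratio is taken from the "over 95% of the specified bucket quota" description above):

            // Hypothetical sketch: backfill is suspended while bucket memory usage
            // is above ~95% of the bucket quota, and may resume once usage drops.
            #include <cstddef>
            #include <cstdio>

            struct BucketMemStats {
                size_t memUsed; // current bucket memory usage (bytes)
                size_t maxSize; // bucket quota (ep_max_size, bytes)
            };

            // Assumed threshold, per the comment above; the real config value may differ.
            constexpr double kBackfillPauseRatio = 0.95;

            bool shouldPauseBackfill(const BucketMemStats& stats) {
                return static_cast<double>(stats.memUsed) >
                       kBackfillPauseRatio * static_cast<double>(stats.maxSize);
            }

            int main() {
                const BucketMemStats stats{1030000000, 1073741824}; // example numbers (made up)
                std::printf("pause backfill: %s\n", shouldPauseBackfill(stats) ? "yes" : "no");
                return 0;
            }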

            drigby Dave Rigby made changes -
            Attachment Screenshot 2021-10-20 at 13.41.23.png [ 164993 ]
            drigby Dave Rigby made changes -
            Attachment Screenshot 2021-10-20 at 13.45.15.png [ 164994 ]
            drigby Dave Rigby made changes -
            Attachment Screenshot 2021-10-20 at 13.41.23.png [ 164993 ]
            drigby Dave Rigby made changes -
            Attachment Screenshot 2021-10-20 at 13.37.03.png [ 164992 ]
            drigby Dave Rigby made changes -
            Attachment Screenshot 2021-10-20 at 13.45.15.png [ 164994 ]
            drigby Dave Rigby made changes -
            drigby Dave Rigby made changes -
            drigby Dave Rigby made changes -
            drigby Dave Rigby added a comment -

            Including Paolo Cocchi's KV-engine dashboard which gives a good overview:

            Note that around half of the memory is consumed by replica checkpoints:

            drigby Dave Rigby added a comment -

            Sean Corrigan Still digging into the details, but I'd be surprised if this was actually ARM-specific. Do you have any logs available for the x86 instances? I suspect we will see similar there (but maybe just not quite enough to entirely hang...)


            sean.corrigan Sean Corrigan added a comment -

            Dave Rigby Yes, here are some logs for the m5.large x86 instance:
            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-15T114904-ns_1%40ec2-100-26-1-78.compute-1.amazonaws.com.zip
            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-15T114904-ns_1%40ec2-34-200-246-184.compute-1.amazonaws.com.zip
            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-15T114904-ns_1%40ec2-35-170-70-9.compute-1.amazonaws.com.zip

            I can also kick off a new run and leave it longer, as I don't think these had quite as long a run time as the m6 logs; that run had still hung to the degree that it was taking longer than the same test on lower-spec machines.
            drigby Dave Rigby added a comment -

            So we do have a number of replica checkpoints with reasonably large memory utilisation (there are many more; only the top entries are shown here):

            $ rg -N ':mem_usage:' stats.log | column -t | sort -k2 -n -r | head -n10
            vb_965:mem_usage:   4909126
            vb_910:mem_usage:   4638017
            vb_932:mem_usage:   4632771
            vb_924:mem_usage:   4564335
            vb_664:mem_usage:   4540006
            vb_1014:mem_usage:  4530871
            vb_953:mem_usage:   4506277
            vb_681:mem_usage:   4500811
            vb_579:mem_usage:   4492963
            vb_999:mem_usage:   4429508
            vb_582:mem_usage:   4391771
            vb_928:mem_usage:   4366402
            vb_912:mem_usage:   4362803
            vb_930:mem_usage:   4322427
            vb_598:mem_usage:   4299419
            vb_969:mem_usage:   4293061
            vb_966:mem_usage:   4286497
            vb_597:mem_usage:   4221753
            vb_1000:mem_usage:  4219598
            vb_589:mem_usage:   4194045
            

            Drilling into one example:

            vb_988:id_1:key_index_allocator_bytes:    0
            vb_988:id_1:queued_items_mem_usage:       3937966
            vb_988:id_1:snap_end:                     4733
            vb_988:id_1:snap_start:                   0
            vb_988:id_1:state:                        CHECKPOINT_OPEN
            vb_988:id_1:to_write_allocator_bytes:     113424
            vb_988:id_1:type:                         Disk
            vb_988:id_1:visible_snap_end:             4733
            vb_988:last_closed_checkpoint_id:         0
            vb_988:mem_usage:                         4051830
            vb_988:num_checkpoint_items:              4725
            vb_988:num_checkpoints:                   1
            vb_988:num_conn_cursors:                  1
            vb_988:num_items_for_persistence:         0
            vb_988:num_open_checkpoint_items:         4722
            vb_988:open_checkpoint_id:                1
            vb_988:persistence:cursor_checkpoint_id:  1
            vb_988:persistence:cursor_seqno:          4733
            vb_988:persistence:num_visits:            4
            vb_988:state:                             replica
            

            Note - we should be expelling (discarding from memory) items from these checkpoints as long as all cursors have processed them; and above we can see there's only one cursor (persistence) registered, so conceptually we should be able to expel here. For some reason there's still ~4MB of data in the checkpoint.
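            As a rough sketch of that expel rule (illustrative only; the function name and shape are assumptions, not the real CheckpointManager API):

            // Hypothetical sketch: only items already processed by *every* cursor are
            // safe to expel, i.e. everything at or below the lowest cursor seqno.
            #include <algorithm>
            #include <cstdint>
            #include <cstdio>
            #include <vector>

            uint64_t highestExpellableSeqno(const std::vector<uint64_t>& cursorSeqnos) {
                if (cursorSeqnos.empty()) {
                    // No registered cursors; assume nothing is expelled in this sketch.
                    return 0;
                }
                return *std::min_element(cursorSeqnos.begin(), cursorSeqnos.end());
            }

            int main() {
                // vb_988 above: the single persistence cursor is at seqno 4733, so
                // conceptually items up to 4733 - essentially the whole open checkpoint -
                // should be expellable, yet ~4MB remains resident in the checkpoint.
                std::printf("expellable up to seqno %llu\n",
                            static_cast<unsigned long long>(highestExpellableSeqno({4733})));
                return 0;
            }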

            drigby Dave Rigby made changes -
            Attachment x86 dashboard.png [ 165003 ]
            drigby Dave Rigby added a comment -

            I see the same behaviour on the x86 instances - not on all nodes, but 2 out of the 3 have their memory usage stuck above the high watermark, with replica checkpoints being the biggest issue:

            Assigning over to Paolo Cocchi given he's been working on checkpoint memory recovery recently.

            drigby Dave Rigby made changes -
            Assignee Dave Rigby [ drigby ] Paolo Cocchi [ paolo.cocchi ]
            drigby Dave Rigby made changes -
            Summary AWS ARM m6g.large Stuck Calculating Item Count AWS m6g.large rebalance hung due to backfilling paused

            sean.corrigan Sean Corrigan added a comment -

            Thanks Dave Rigby, appreciate you looking into this for us.
            owend Daniel Owen made changes -
            Rank Ranked higher
            owend Daniel Owen made changes -
            Link This issue is duplicated by MB-48825 [ MB-48825 ]
            owend Daniel Owen made changes -
            Link This issue relates to MB-48825 [ MB-48825 ]
            owend Daniel Owen added a comment -

            Hi Sean Corrigan
            I tried to download the logs e.g.

            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-20T120409-ns_1%40ec2-3-235-136-83.compute-1.amazonaws.com.zip
            

            But I am getting the following error:

            wget https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-20T120409-ns_1%40ec2-3-235-136-83.compute-1.amazonaws.com.zip
            --2021-11-05 10:17:40--  https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-20T120409-ns_1%40ec2-3-235-136-83.compute-1.amazonaws.com.zip
            Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.232.189
            Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.232.189|:443... connected.
            HTTP request sent, awaiting response... 403 Forbidden
            2021-11-05 10:17:41 ERROR 403: Forbidden.
            

            Could you double check the links (or provide new ones)? - thanks

            owend Daniel Owen made changes -
            Epic Link MB-38441 [ 123649 ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            drigby Dave Rigby added a comment -

            I have the logs locally if we can't un-glacier them...


            sean.corrigan Sean Corrigan added a comment -

            Dave Rigby I don't think I'm able to un-glacier them - we are currently having some issues in our cloud tests due to a bug in pypa setuptools. If you were able to share the logs you have downloaded locally, that would be extremely helpful.
            drigby Dave Rigby added a comment -

            Re-uploaded the 3 logs from 20th Oct:

            cbcollect_info_ns_1@ec2-3-235-136-83.compute-1.amazonaws.com_20211020-120410.zip
            cbcollect_info_ns_1@ec2-3-237-95-29.compute-1.amazonaws.com_20211020-120409.zip
            cbcollect_info_ns_1@ec2-3-238-93-68.compute-1.amazonaws.com_20211020-120410.zip

            Note these were re-compressed from my extracted cbcollect dirs, hence the slightly different filenames.

            owend Daniel Owen made changes -
            Link This issue relates to MB-49134 [ MB-49134 ]
            paolo.cocchi Paolo Cocchi made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            paolo.cocchi Paolo Cocchi made changes -
            Sprint KV 2021-Nov [ 1866 ]
            owend Daniel Owen made changes -
            Rank Ranked lower
            paolo.cocchi Paolo Cocchi added a comment - - edited

            In the latest logset uploaded by Dave, the charts for node ec2-3-235-136-83 show the same replica checkpoint memory usage as previously reported.

            Checkpoint memory state on stats.log:

            ep_checkpoint_memory:                                  475220869
             ep_checkpoint_memory_overhead:                         14284441
             ep_checkpoint_memory_ratio:                            0.5
             ep_checkpoint_memory_recovery_lower_mark:              0.6000000238418579
             ep_checkpoint_memory_recovery_upper_mark:              0.8999999761581421
             ep_checkpoint_memory_unreferenced:                     0
             ep_max_size:                                           1073741824
            

            • cm_quota: 536870912 (ep_max_size * ep_checkpoint_memory_ratio)
            • recovery_threshold: 483183820.8 (cm_quota * recovery_upper_mark)

            So here, from the checkpoint PoV, we aren't expected to recover anything.
            Also, we have a significant allocation in the HashTable that isn't being freed, which is unexpected as the global mem-usage has crossed the HWM. That is the problem here.

            The reason for replication getting stuck is that the global mem-usage has hit the Bucket Quota, which means that we are over the replication_threshold (99% of the Bucket Quota), so we have entered an unrecoverable OOM phase on DCP.
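            Spelling the arithmetic out (values taken from the stats above; illustrative code only):

            // Worked numbers from stats.log:
            //   ep_max_size = 1073741824, ep_checkpoint_memory_ratio = 0.5,
            //   ep_checkpoint_memory_recovery_upper_mark ~= 0.9,
            //   ep_checkpoint_memory = 475220869
            #include <cstdio>

            int main() {
                const double bucketQuota = 1073741824.0;
                const double cmQuota = bucketQuota * 0.5;       // 536870912
                const double recoveryThreshold = cmQuota * 0.9; // ~483183820.8
                const double checkpointMem = 475220869.0;

                // 475220869 < ~483183820.8, so checkpoint memory recovery does not
                // trigger, while overall mem_used sits above the 99%-of-quota
                // replication threshold, leaving DCP stuck in the OOM state.
                std::printf("cm_quota=%.0f recovery_threshold=%.1f\n", cmQuota, recoveryThreshold);
                std::printf("recovery triggered: %s\n",
                            checkpointMem >= recoveryThreshold ? "yes" : "no");
                std::printf("replication_threshold=%.0f\n", bucketQuota * 0.99);
                return 0;
            }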

            paolo.cocchi Paolo Cocchi made changes -
            Attachment MB-49037_dcp-backoff.png [ 168523 ]
            Attachment MB-49037_mem.png [ 168524 ]
            paolo.cocchi Paolo Cocchi made changes -
            Attachment MB-49037_ht-mem.png [ 168926 ]
            paolo.cocchi Paolo Cocchi added a comment - - edited

            The test is running with:

            ep_item_eviction_policy:                value_only
            

            But the HT doesn't appear to be fully taken up by metadata. Looking at the stats, I see that there are still items in the HashTable that would be eligible for ejection once persisted:

            stats.log confirms at vb-level:

             vb_0:ht_item_memory:                     412080
             vb_1:ht_item_memory:                     456520
             vb_2:ht_item_memory:                     484800
             vb_3:ht_item_memory:                     404000
             vb_4:ht_item_memory:                     464600
            ..
            

            Eg for vb_0:

            vb_0:num_items_for_persistence: 0
            

            Essentially we have ~ 400MB of resident/clean StoredValues (approximately equally split among active/replica vbuckets) that aren't being ejected.
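            As a back-of-the-envelope check of that ~400MB figure (assuming the default 1024 vbuckets; the per-vbucket values are the vb_0..vb_4 samples above):

            // Illustrative arithmetic only: ~400-480 KB of ht_item_memory per vbucket
            // across 1024 vbuckets gives roughly 420 MB in total.
            #include <cstdio>

            int main() {
                const double avgPerVbucket = 430000.0; // rough average of the samples above (bytes)
                const int numVbuckets = 1024;          // Couchbase default vbucket count
                std::printf("~%.0f MB\n", avgPerVbucket * numVbuckets / (1024.0 * 1024.0));
                return 0;
            }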

            Hi Sean Corrigan, some improvements in the HT stats have gone in recently (e.g. HT item counts exposed, https://review.couchbase.org/c/kv_engine/+/158100).
            That information would be very useful here; could you run this test again on the latest Neo and attach logs, please? Thanks

            paolo.cocchi Paolo Cocchi made changes -
            Attachment MB-49037_HT-ejection.png [ 169207 ]
            paolo.cocchi Paolo Cocchi added a comment - - edited

            No errors of any kind in the memcached logs, just a few instances of:

            2021-10-20T11:49:43.271054+00:00 WARNING (No Engine) Slow runtime for 'Paging out items.' on thread NonIoPool0: 28 ms
            2021-10-20T11:49:55.666988+00:00 WARNING (No Engine) Slow runtime for 'Paging out items.' on thread NonIoPool1: 36 ms
            

            Also, the Pager does keep running in the OOM window and ejects items from both active/replica up to a point. Then nothing is ejected; every ejection attempt fails:

            paolo.cocchi Paolo Cocchi made changes -
            Attachment MB-49037_HT-ejection.png [ 169207 ]
            paolo.cocchi Paolo Cocchi made changes -
            Attachment MB-49037_HT-ejection.png [ 169440 ]
            paolo.cocchi Paolo Cocchi added a comment - - edited

            Update

            The failure is observed on Neo 1361 and it's consistently reproducible.
            Live-debugging (http://perf.jenkins.couchbase.com/job/Cloud-Tester/796/console) reveals that we enter a state where the ItemPager finds only non-resident items in the HashTable, so nothing seems eligible for ejection at that point.
            That suggests that some HashTable stats might be broken or inconsistent (eg, ht_item_memory), and that actually there is just no resident item to eject in the state observed.

            Sean Corrigan has repeated the test on build 1695 and the test passes (http://perf.jenkins.couchbase.com/job/Cloud-Tester/790).
            Not clear if some collateral HT/Pager change in (1361, 1695] has fixed the issue. Or if something is just hiding the issue in recent builds.

            Details on live-debugging

            At every ItemPager run, the PagingVisitor touches all StoredValues in the HTs and hits this:

            bool HashTable::unlocked_ejectItem(const HashTable::HashBucketLock&,
                                               StoredValue*& vptr,
                                               EvictionPolicy policy) {
                if (vptr == nullptr) {
                    throw std::invalid_argument("HashTable::unlocked_ejectItem: "
                            "Unable to delete NULL StoredValue");
                }
             
                if (!vptr->eligibleForEviction(policy)) {
                    ++stats.numFailedEjects;
                    return false;         <-- !!
                }
                ..
            }
             
                bool eligibleForEviction(EvictionPolicy policy) const {
                    // Pending SyncWrite are always resident
                    if (isPending()) {
                        return false;
                    }
             
                    if (policy == EvictionPolicy::Value) {
                        return isResident() && !isDirty();         <-- !!
                    } else {
                        return !isDirty();
                    }
                }
            

            Example on vbid_891:

            (gdb) p *vptr
            $4 = {.., bySeqno = {
                value = {<std::__atomic_base<long>> = {static _S_alignment = 8, _M_i = 957}, static is_always_lock_free = true}}, lock_expiry_or_delete_or_complete_time = {
                lock_expiry = 0, delete_or_complete_time = 0}, exptime = 0, flags = 2, revSeqno = {counter = {_M_elems = "\001\000\000\000\000"}}, datatype = 3 '\003', 
              static dirtyIndex = 0, static deletedIndex = 1, static residentIndex = 2, static staleIndex = 3, bits = {static kBitsPerBlock = <optimized out>, 
                static kOne = <optimized out>, data_ = {_M_elems = {{<std::__atomic_base<unsigned char>> = {static _S_alignment = 1, _M_i = 0 '\000'}, 
                      static is_always_lock_free = true}}}}, ordered = 0 '\000', deletionSource = 0 '\000', committed = 0 '\000'}
            

            'bits = 000' indicates non-resident/non-dirty, i.e. the item has been persisted and its value already ejected from the HashTable.
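            Making that decode explicit (illustrative only; this assumes 'bits' is a plain bitset indexed by the dirtyIndex/deletedIndex/residentIndex values printed by gdb):

            #include <cstdint>
            #include <cstdio>

            int main() {
                const uint8_t bits = 0; // bits._M_i from the gdb dump above
                const int dirtyIndex = 0, deletedIndex = 1, residentIndex = 2;
                std::printf("dirty=%d deleted=%d resident=%d\n",
                            (bits >> dirtyIndex) & 1,
                            (bits >> deletedIndex) & 1,
                            (bits >> residentIndex) & 1);
                // Prints dirty=0 deleted=0 resident=0: not dirty and not resident, so
                // under value-only eviction eligibleForEviction() returns false and the
                // eject attempt takes the numFailedEjects path shown earlier.
                return 0;
            }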

            Vbucket stats are consistent with the fact that all items in the vbucket are non-resident/non-dirty:

                vb_891:eq_dcpq:replication:ns_1@ec2-34-205-37-111.compute-1.amazonaws.com->ns_1@ec2-3-92-87-232.compute-1.amazonaws.com:bucket-1:cursor_checkpoint_id: 54
                vb_891:eq_dcpq:replication:ns_1@ec2-34-205-37-111.compute-1.amazonaws.com->ns_1@ec2-3-92-87-232.compute-1.amazonaws.com:bucket-1:cursor_seqno:         3937
                vb_891:eq_dcpq:replication:ns_1@ec2-34-205-37-111.compute-1.amazonaws.com->ns_1@ec2-3-92-87-232.compute-1.amazonaws.com:bucket-1:num_items_for_cursor: 0
                vb_891:eq_dcpq:replication:ns_1@ec2-34-205-37-111.compute-1.amazonaws.com->ns_1@ec2-3-92-87-232.compute-1.amazonaws.com:bucket-1:num_visits:           0
                vb_891:id_54:key_index_allocator_bytes:                                                                                                                0
                vb_891:id_54:queued_items_mem_usage:                                                                                                                   263
                vb_891:id_54:snap_end:                                                                                                                                 3936
                vb_891:id_54:snap_start:                                                                                                                               3936
                vb_891:id_54:state:                                                                                                                                    CHECKPOINT_OPEN
                vb_891:id_54:to_write_allocator_bytes:                                                                                                                 48
                vb_891:id_54:type:                                                                                                                                     Memory
                vb_891:id_54:visible_snap_end:                                                                                                                         3936
                vb_891:last_closed_checkpoint_id:                                                                                                                      53
                vb_891:mem_usage:                                                                                                                                      751
                vb_891:num_checkpoint_items:                                                                                                                           1
                vb_891:num_checkpoints:                                                                                                                                1
                vb_891:num_conn_cursors:                                                                                                                               2
                vb_891:num_items_for_persistence:                                                                                                                      0        <-- !!
                vb_891:num_open_checkpoint_items:                                                                                                                      0
                vb_891:open_checkpoint_id:                                                                                                                             54
                vb_891:persistence:cursor_checkpoint_id:                                                                                                               54
                vb_891:persistence:cursor_seqno:                                                                                                                       3937
                vb_891:persistence:num_visits:                                                                                                                         47
                vb_891:state:                                                                                                                                          active
             
                vb_891:high_seqno:                      3936
                vb_891:ht_cache_size:                   391880
                vb_891:ht_item_memory:                  391880
                vb_891:ht_item_memory_uncompressed:     391880
                vb_891:ht_memory:                       26752
                vb_891:ht_size:                         3079
                vb_891:logical_clock_ticks:             39
                vb_891:max_cas:                         1637073329561272320
                vb_891:max_cas_str:                     2021-11-16T14:35:29.561272320
                vb_891:max_deleted_revid:               0
                vb_891:max_visible_seqno:               3936
                vb_891:might_contain_xattrs:            false
                vb_891:num_ejects:                      4112
                vb_891:num_items:                       3880
                vb_891:num_non_resident:                3880
            

            Apart from 'ht_item_memory: 391880', which suggests that we still have memory allocated for resident items.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-1730 contains kv_engine commit df37d73 with commit message:
            MB-49037: Add ep_ht_item_memory stat
            paolo.cocchi Paolo Cocchi added a comment - - edited

            ht_item_memory is actually computed as follows (per StoredValue):

            size_t size() const {
                return getObjectSize() + valuelen();
            }
             
            size_t StoredValue::getObjectSize() const {
                ..
                return sizeof(*this) + getKey().getObjectSize();
            }
            

            StoredValue::getObjectSize() (e.g. 63 bytes on macOS) is also what we report as ht_metadata.
            So ht_item_memory includes ht_metadata, and is therefore expected to be non-zero under Value Ejection even for a HashTable that contains only non-resident items.
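            A quick cross-check against the vb_891 numbers above (illustrative arithmetic only):

            // vb_891: ht_item_memory = 391880 bytes, num_items = 3880, all non-resident.
            // 391880 / 3880 = ~101 bytes per StoredValue, i.e. roughly the StoredValue
            // object plus the key - pure metadata, no resident values - so a non-zero
            // ht_item_memory here does not imply there are ejectable (resident) items.
            #include <cstdio>

            int main() {
                const double htItemMemory = 391880.0;
                const double numItems = 3880.0;
                std::printf("~%.0f bytes/item\n", htItemMemory / numItems); // ~101
                return 0;
            }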

            paolo.cocchi Paolo Cocchi added a comment - - edited

            Interesting point: what we call "Mem Used - HashTable" in stats accounts for SV metadata + Blob size:

            That means that in the Mem chart we are just seeing ~ 400MB of Metadata + ~ 400MB of Blobs -> ~ 800MB of allocation reported in the HashTable.

            That is misleading. Blobs are reference-counted objects. When a Blob is ejected from the HT (which is the case for all Blobs in this scenario) it shouldn't be accounted in HT mem-usage.
            Note that we still have ~ 400 MB of Blobs around as they are referenced in the replica checkpoints.
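            A generic illustration of the accounting pitfall (toy code, not kv_engine; the types and names are made up): once a reference-counted value has been ejected from the HashTable but is still referenced from a checkpoint, its bytes belong to the checkpoint accounting, not to "Mem Used - HashTable".

            #include <cstdio>
            #include <memory>
            #include <string>
            #include <vector>

            struct Blob {
                std::string value;
            };

            int main() {
                std::vector<std::shared_ptr<Blob>> checkpointRefs;
                std::shared_ptr<Blob> htRef = std::make_shared<Blob>(Blob{std::string(4096, 'x')});
                checkpointRefs.push_back(htRef); // checkpoint references the same blob

                htRef.reset(); // value ejected from the HashTable

                // Only the checkpoint reference keeps the blob alive now, so counting its
                // ~4 KB in both HashTable and checkpoint stats would double-account it.
                std::printf("blob refs after ejection: %ld\n", checkpointRefs[0].use_count());
                return 0;
            }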

            paolo.cocchi Paolo Cocchi made changes -
            Attachment MB-49037_b1695.png [ 169483 ]
            paolo.cocchi Paolo Cocchi added a comment -

            Summary

            The issue observed in build 1361 is a conjunction of:

            • The test using Value Ejection
            • All items already ejected
            • Replica checkpoints taking up to the entire CM Quota (i.e. 50% of the Bucket Quota in build 1361)
            • Replica checkpoint memory not being recovered as per default recovery thresholds (set in EP config)

            As per offline chat with Sean Corrigan, this test used to succeed at some build before 1361, so this was a regression in 1361.
            But 1361 falls in the middle of the "Improvements" window, so some relevant things have changed since then:

            1. checkpoint_max_size was buggy in 1361, fixed in 1574 - note that the checkpoint_max_size param is directly related to the effectiveness of memory recovery by Checkpoint Removal
            2. CM Quota was set to 50% (of Bucket Quota) in 1361, set to 30% recently

            Those changes produce a very different memory pattern in checkpoints in build 1695, which can be summarized as checkpoint mem recovery being much more effective and avoiding any OOM during the test:

            paolo.cocchi Paolo Cocchi added a comment -

            Hi Sean Corrigan, I'm resolving this as fixed in the more recent builds, thanks.

            paolo.cocchi Paolo Cocchi made changes -
            Triage Untriaged [ 10351 ] Triaged [ 10350 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]
            ritam.sharma Ritam Sharma made changes -
            Labels arm memcached arm memcached performance
            owend Daniel Owen made changes -
            Assignee Paolo Cocchi [ paolo.cocchi ] Daniel Owen [ owend ]
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              Assignee: owend Daniel Owen
              Reporter: sean.corrigan Sean Corrigan
              Votes: 0
              Watchers: 5
