Couchbase Server: MB-35889

Replication can get stuck when checkpoint memory overhead is very high

Details

    • KV-Engine MH 2nd Beta, KV Sprint 2020-April

    Description

      Build 6.5.0-4218

      Observed that replication gets stuck when the data service goes into a low resident ratio.
      We came across this issue while running some HiDD tests on a couchbase bucket.
      In this test we have 2 data nodes, load 250M docs, and the RR goes to 0.43%. After the load phase we wait for "ep_dcp_replica_items_remaining" to go to zero. It stays at ~19K and never becomes zero.
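The wait condition above can be sketched as a small monitor that declares the queue drained when the stat reaches zero and flags it as stuck when it stops changing. This is an illustrative sketch, not the actual test-harness code; the function names are hypothetical.

```python
# Hypothetical sketch of the wait/stuck check the test performs;
# `samples` is a time-ordered list of ep_dcp_replica_items_remaining readings.

def is_drained(samples):
    """Replication is done when the stat reaches zero."""
    return bool(samples) and samples[-1] == 0

def is_stuck(samples, window=5):
    """Flag the stat as stuck if the last `window` readings are
    identical and non-zero (e.g. ~19K in this ticket)."""
    if len(samples) < window:
        return False
    tail = samples[-window:]
    return tail[0] != 0 and all(s == tail[0] for s in tail)
```

In this run the monitor would report `is_stuck` once the stat plateaued at 19,360.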

      Job- http://perf.jenkins.couchbase.com/job/magma-hidd/441
      Logs-
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/replica_issue_couchbase/collectinfo-2019-09-10T055022-ns_1%40172.23.97.38.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/replica_issue_couchbase/collectinfo-2019-09-10T055022-ns_1%40172.23.97.39.zip
       

      Attachments

        Issue Links


          Activity

            mahesh.mandhare Mahesh Mandhare (Inactive) created issue -
            raju Raju Suravarjjala made changes -
            Field Original Value New Value
            Fix Version/s Mad-Hatter [ 15037 ]
            sarath Sarath Lakshman made changes -
            Labels Performance Performance hidd
            owend Daniel Owen added a comment -

            Hey Jim Walker, although the test was on a magma backend (and so not supported in MH), could you take a look just to ensure that it's not uncovering an issue we might hit in MH with the couchstore backend? Thanks

            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Jim Walker [ jwalker ]

            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

            Daniel Owen, we ran the above test with couchstore as the backend, to get benchmark numbers with couchstore.
            owend Daniel Owen added a comment -

            Ah thanks Mahesh Mandhare we will investigate.

            jwalker Jim Walker added a comment - - edited

            With memory used being so high, a number of "safety" systems have kicked in, and overall replication has been massively slowed down.

            On one node (cbcollect_info_ns_1@172.23.97.39_20190910-055023) we see that it is very slowly supplying DCP from a disk backfill. Here we can clearly see that the backfill is being paused due to the lack of memory for it to bring data in from disk:

            2019-09-09T22:33:26.328717-07:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
            2019-09-09T22:33:27.328810-07:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
            2019-09-09T22:33:28.328953-07:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
            2019-09-09T22:33:29.329039-07:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
            2019-09-09T22:33:30.329078-07:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
            

            We're backfilling because the severe lack of memory has caused DCP to drop cursors

            2019-09-09T22:33:21.000737-07:00 INFO (bucket-1) Triggering memory recovery as checkpoint_memory (3783 MB) exceeds cursor_dropping_checkpoint_mem_upper_mark (50%, 3072 MB). Attempting to free 4256 MB of memory.
            

            The same node is also unable to consume items

            2019-09-09T22:33:03.743991-07:00 WARNING 323: (bucket-1) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.97.38->ns_1@172.23.97.39:bucket-1 - vb:80 Got error 'no memory' while trying to process mutation with seqno:569034
            2019-09-09T22:33:03.789213-07:00 WARNING 323: (bucket-1) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.97.38->ns_1@172.23.97.39:bucket-1 - vb:70 Got error 'no memory' while trying to process mutation with seqno:546229
            2019-09-09T22:33:03.848016-07:00 WARNING 323: (bucket-1) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.97.38->ns_1@172.23.97.39:bucket-1 - vb:2 Got error 'no memory' while trying to process mutation with seqno:563000
            2019-09-09T22:33:03.854111-07:00 WARNING 323: (bucket-1) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.97.38->ns_1@172.23.97.39:bucket-1 - vb:18 Got error 'no memory' while trying to process mutation with seqno:557148
            2019-09-09T22:33:03.918303-07:00 WARNING 323: (bucket-1) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.97.38->ns_1@172.23.97.39:bucket-1 - vb:20 Got error 'no memory' while trying to process mutation with seqno:555992
            

            The other node is similar.

            Note that replication is possibly progressing; it's happening slowly as the various checks that avoid going above the bucket quota keep pausing and retrying.

            E.g. progress looks to be happening: vb:20 logged 'no memory' for seqno:555992, but at cbcollect time we can see in stats.log that vb:20 is now at 556321.
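The repeated "temporarily suspended" log lines above come from the backfill task snoozing itself whenever memory is too high. A toy model of that scheduling decision (names and threshold are illustrative, not the real kv_engine C++ API):

```python
# Toy model of the "DCP backfilling task temporarily suspended" decision:
# the task only runs when bringing another chunk in from disk would not
# push memory usage past a backfill threshold below the bucket quota.
# Names and the 0.9 headroom factor are illustrative assumptions.

def backfill_action(mem_used, quota, chunk_bytes, headroom=0.9):
    """Return 'run' if reading another chunk fits under the threshold,
    else 'snooze' (re-check later), mirroring the repeated INFO logs."""
    threshold = quota * headroom
    if mem_used + chunk_bytes <= threshold:
        return "run"
    return "snooze"
```

When mem_used stays pinned near the quota, as in this ticket, the task snoozes on every wakeup and the backfill makes almost no progress.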


            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

            Jim Walker,

            We see that "ep_dcp_replica_items_remaining" was stuck at 19,360 from 2019-09-09T22:33:29 until we aborted the run at 2019-09-10T02:42:02.

            We print this stat on the console of the job: http://perf.jenkins.couchbase.com/job/magma-hidd/441/consoleFull
            jwalker Jim Walker added a comment - - edited

            Node .39 is definitely a problem...

            The backfill task is perpetually being put to sleep because memory is too high (above quota), so this is one reason DCP appears to be paused. On that same node, checkpoints seem to be consuming a lot of resources.

            E.g. vb:0, which is a replica and only has the persistence cursor (similar for many others; the checkpoint can be seen to consume a lot of memory):

             vb_0:last_closed_checkpoint_id:                                                                11
             vb_0:mem_usage:                                                                                32701361
             vb_0:num_checkpoint_items:                                                                     364300
             vb_0:num_checkpoints:                                                                          1
             vb_0:num_conn_cursors:                                                                         1
             vb_0:num_items_for_persistence:                                                                0
             vb_0:num_open_checkpoint_items:                                                                364299
             vb_0:open_checkpoint_id:                                                                       12
             vb_0:persisted_checkpoint_id:                                                                  11
             vb_0:persistence:cursor_checkpoint_id:                                                         12
             vb_0:persistence:cursor_seqno:                                                                 566706
             vb_0:persistence:num_visits:                                                                   308
             vb_0:state:                                                                                    replica
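A quick back-of-envelope on the stats above, assuming vb_0:mem_usage is dominated by checkpoint metadata, shows why this overhead hurts at scale:

```python
# Back-of-envelope from the vb_0 stats above: ~32.7 MB of checkpoint
# memory for ~364K checkpoint items works out to roughly 90 bytes per
# item of checkpoint overhead on this single replica vbucket.

mem_usage = 32_701_361          # vb_0:mem_usage (bytes)
num_items = 364_300             # vb_0:num_checkpoint_items

bytes_per_item = mem_usage / num_items
print(round(bytes_per_item))    # ~90 bytes/item
```

With many replica vbuckets in a similar state, overheads like this add up to the multi-GB checkpoint_memory figure seen in the cursor-dropping log line.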
            

            jwalker Jim Walker added a comment -

            Suspect this could be caused by an issue fixed by https://issues.couchbase.com/browse/MB-35812 (http://review.couchbase.org/#/c/114376/), where checkpoint memory isn't being freed.

            This was committed in build 6.5.0-4244.

            Retry on the latest build.

            jwalker Jim Walker made changes -
            Assignee Jim Walker [ jwalker ] Mahesh Mandhare [ mahesh.mandhare ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            jwalker Jim Walker added a comment -

            Mahesh Mandhare it would be good to re-run on a later build (as per the prior comment). Some further digging shows that the expel fix I linked may not be the fix, as checkpoint overhead is quite high; further changes are needed to reduce checkpoint overhead (during expelling) - so re-opening to track.

            jwalker Jim Walker made changes -
            Assignee Mahesh Mandhare [ mahesh.mandhare ] Jim Walker [ jwalker ]
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            jwalker Jim Walker made changes -
            Status Reopened [ 4 ] In Progress [ 3 ]
            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

            Build 6.5.0-4282

            Tried the same test on build 6.5.0-4282 and we are seeing the same issue again. Replication stuck at 24,040 from 2019-09-17T03:08:19 until we aborted the job at 2019-09-17T08:57:27.

            Job- http://perf.jenkins.couchbase.com/job/magma-hidd/458/
            Logs-
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/replication_stuck/collectinfo-2019-09-17T155416-ns_1%40172.23.97.38.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/replication_stuck/collectinfo-2019-09-17T155416-ns_1%40172.23.97.39.zip
            drigby Dave Rigby made changes -
            Sprint KV-Engine MH 2nd Beta [ 872 ]
            drigby Dave Rigby made changes -
            Rank Ranked higher
            jwalker Jim Walker added a comment - A toy-build passed http://perf.jenkins.couchbase.com/job/magma-hidd/463/console
            jwalker Jim Walker added a comment -

            Some more work to tidy up the mem overhead accounting, split patch into 2 parts. Nearly done...

            james.harrison James Harrison made changes -
            Link This issue is duplicated by MB-35970 [ MB-35970 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-4357 contains kv_engine commit 0446ad5 with commit message:
            MB-35889: Use tracking allocator for Checkpoint memOverhead tracking
            jwalker Jim Walker made changes -
            Assignee Jim Walker [ jwalker ] Mahesh Mandhare [ mahesh.mandhare ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-4386 contains kv_engine commit 581e575 with commit message:
            MB-35889: Purge Checkpoint key indexes during expel and state change

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-4415 contains kv_engine commit 22b6ac3 with commit message:
            MB-35889: Tune the dcp cursor dropping tests

            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

            Build 6.5.0-4415

            Verified that replication no longer gets stuck for the same test.

            Job- http://perf.jenkins.couchbase.com/job/magma-hidd/482
            mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
            Assignee Mahesh Mandhare [ mahesh.mandhare ] Jim Walker [ jwalker ]
            jwalker Jim Walker added a comment - - edited

            Latest logs, per node:

            • .38 has DCP stuck at 19k; memory, however, seems OK. mem_used is below the low water mark, though note that ~50% of memory does appear to be ep_overhead.
            • .39 is not looking good though, with mem_used way over the high water mark and close to the max size.

            .38 is stuck because .39 is not processing DCP input and flow control has stopped the flow of replication.

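The stall between .38 and .39 is classic DCP flow control: the producer stops sending once unacked bytes fill the consumer's advertised buffer. A toy model of that mechanism (class and field names are illustrative, not the actual kv_engine implementation):

```python
# Toy DCP flow-control model: the producer pauses once unacked bytes
# reach the consumer's buffer size; if the consumer never acks (e.g. it
# is stuck retrying 'no memory'), the stream stalls, as seen between
# nodes .38 and .39. Names are illustrative assumptions.

class Producer:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size  # consumer's advertised buffer
        self.unacked = 0                # bytes sent but not yet acked

    def can_send(self, msg_bytes):
        return self.unacked + msg_bytes <= self.buffer_size

    def send(self, msg_bytes):
        if not self.can_send(msg_bytes):
            return False                # paused: waiting for buffer acks
        self.unacked += msg_bytes
        return True

    def ack(self, nbytes):              # consumer acknowledges processed bytes
        self.unacked = max(0, self.unacked - nbytes)

p = Producer(buffer_size=1024)
sent = sum(p.send(256) for _ in range(10))
print(sent)  # only 4 messages fit before the stream pauses
```

Once the consumer starts processing and acking again, `ack()` frees buffer space and the producer resumes, which is why the stream is stalled rather than broken.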
            jwalker Jim Walker added a comment -

            This issue is likely to be back because MB-36261 has had to revert part of the original fix made in this MB.

            jwalker Jim Walker added a comment -

            To improve this issue it's going to need much more work on checkpoints, specifically the Checkpoint::keyIndex, but that will have an impact on sync-replication.

            For now we will defer this to Cheshire-Cat, where we will be able to address such low RR situations and make such changes once we've gained more confidence in sync-replication (and can do more work on the keyIndex).

            jwalker Jim Walker made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]
            Fix Version/s Mad-Hatter [ 15037 ]
            drigby Dave Rigby made changes -
            Sprint KV-Engine MH 2nd Beta [ 872 ] KV-Engine MH 2nd Beta, KV-Engine MH 2nd Beta 2 [ 872, 910 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-4546 contains kv_engine commit 380547e with commit message:
            MB-36301: Partial Revert "MB-35889: Purge Checkpoint key indexes during expel and state change"

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-1009 contains kv_engine commit 380547e with commit message:
            MB-36301: Partial Revert "MB-35889: Purge Checkpoint key indexes during expel and state change"

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-4583 contains kv_engine commit 2d04589 with commit message:
            MB-36301: Revert "MB-35889: Use tracking allocator for Checkpoint memOverhead tracking"

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-1013 contains kv_engine commit 2d04589 with commit message:
            MB-36301: Revert "MB-35889: Use tracking allocator for Checkpoint memOverhead tracking"
            drigby Dave Rigby made changes -
            Sprint KV-Engine MH 2nd Beta, KV-Engine Mad-Hatter GA [ 872, 910 ] KV-Engine MH 2nd Beta [ 872 ]
            drigby Dave Rigby made changes -
            Rank Ranked higher
            owend Daniel Owen made changes -
            Epic Link MB-30659 [ 88207 ]
            ben.huddleston Ben Huddleston made changes -
            Assignee Jim Walker [ jwalker ] Ben Huddleston [ ben.huddleston ]
            ben.huddleston Ben Huddleston added a comment - - edited

            Status of this ticket

            We now track checkpoint memory in a better way, but we are not erasing items from the keyIndex (checkpoint overhead) due to sync write issues (MB-36261).

            Things to do

            • Drop the keyIndex when we close a checkpoint - MB-35970. This was done as part of this ticket but was reverted. We can probably revisit doing this.
            • Investigate usage of the keyIndex for replicas. This can grow O(n), which is problematic when we have disk checkpoints. I believe the only current usage is sync-write-based sanity checks. Perhaps we could erase non-sync-write items from this, or simplify it on replicas to use less memory.
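The second bullet is the idea that later shipped in 6.6.0 as "Don't add keys to Checkpoint indexes for Disk checkpoints". A toy model of it (class and field names are illustrative Python, not the actual kv_engine C++):

```python
# Toy model: a checkpoint only maintains a keyIndex when it may need it
# for de-dupe / sanity checks (Memory checkpoints); Disk checkpoints on
# a replica skip the index entirely, so the O(n) overhead disappears.
# Names are illustrative assumptions, not the real kv_engine code.

class Checkpoint:
    def __init__(self, source):         # source: "memory" or "disk"
        self.source = source
        self.items = []
        self.key_index = {}             # key -> position; memory only

    def queue(self, key, seqno):
        self.items.append((key, seqno))
        if self.source == "memory":
            self.key_index[key] = len(self.items) - 1

disk = Checkpoint("disk")
mem = Checkpoint("memory")
for i in range(1000):
    disk.queue(f"k{i}", i)
    mem.queue(f"k{i}", i)
print(len(disk.key_index), len(mem.key_index))  # 0 1000
```

The trade-off is exactly the one discussed in this ticket: without the index, a Disk checkpoint cannot detect a duplicate key, so it relies on each key arriving at most once in a Disk snapshot.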
            ben.huddleston Ben Huddleston made changes -
            Link This issue relates to MB-38012 [ MB-38012 ]
            ben.huddleston Ben Huddleston made changes -
            Summary Replication stuck if data service in low resident ratio Replication can get stuck when checkpoint memory overhead is very high

            ben.huddleston Ben Huddleston added a comment -

            It looks like it is fine to allow the expelling code to erase mutations from the committed key indexes of Disk checkpoints on replicas, as we only ever send/receive mutations (not commits) as part of Disk snapshots. For Disk checkpoints everything should be fine provided we only receive each key once. The level of confidence in this is very high for couchstore, but we don't yet have the bake time to have the same level of confidence in magma. The only issue we would see (and this is the case currently too) if we did have multiple mutations for the same key in a Disk snapshot on the replica would be incorrect stats accounting. If we had multiple prepares of the same key then we would throw an exception.

            We could consider allowing the replica to expel mutations from Memory checkpoints too, but this would result in stats being wrong whenever we de-dupe in the checkpoint manager on the replica. We should look more into what we can do with the stats to allow us to do this for Memory checkpoints, but the value in doing so is much less than for Disk checkpoints.
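The rule described in the comment above reduces to a small predicate: expel may erase a committed key-index entry only for Disk checkpoints, never for prepares or for Memory checkpoints. A sketch, with a hypothetical function name:

```python
# Toy model of the expel rule discussed above: committed (non-prepare)
# index entries in Disk checkpoints are safe to erase during expel,
# because each key should appear at most once in a Disk snapshot;
# prepares and Memory-checkpoint entries must be kept. The function
# name is a hypothetical illustration, not the kv_engine API.

def can_erase_index_entry(checkpoint_source, is_prepare):
    """True when expelling may drop this key-index entry."""
    return checkpoint_source == "disk" and not is_prepare
```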
            ben.huddleston Ben Huddleston made changes -
            Sprint KV-Engine MH 2nd Beta [ 872 ] KV-Engine MH 2nd Beta, KV Spint 2020-March [ 872, 1002 ]
            ben.huddleston Ben Huddleston made changes -
            Rank Ranked lower
            owend Daniel Owen made changes -
            Sprint KV-Engine MH 2nd Beta, KV Spint 2020-March [ 872, 1002 ] KV-Engine MH 2nd Beta, KV Sprint 2020-April [ 872, 1044 ]
            owend Daniel Owen made changes -
            Rank Ranked lower
            owend Daniel Owen made changes -
            Link This issue relates to CBSE-8284 [ CBSE-8284 ]
            drigby Dave Rigby made changes -
            Labels Performance hidd Performance candidate-for-6.6 hidd
            drigby Dave Rigby made changes -
            Fix Version/s 6.6.0 [ 16787 ]
            drigby Dave Rigby added a comment - - edited

            Daniel Owen / Dave Finlay / Shivani Gupta Proposing this for 6.6.0 - I believe the work required is small: just fixing some unforeseen issues in the previous 6.5.0 version of the patch. This would fix issues observed at customers where rebalance can hang, resulting in unavailability of replicas / inability to perform cluster maintenance.

            ben.huddleston Ben Huddleston added a comment - - edited

            For this change I'd propose that we only do the following:

            • Investigate usage of the keyIndex for replicas. This can grow O(n), which is problematic when we have disk checkpoints. I believe the only current usage is dedupe / sync-write-based sanity checks. We can simply not add keys to Disk checkpoints to avoid this issue.

            It would be nice to also drop key indexes when closing checkpoints (MB-35970), but this created a regression the last time we did it (MB-36301). We can probably do this with a bit more work, but I think that's more reasonable for Cheshire-Cat than 6.6.0.

            ben.huddleston Ben Huddleston made changes -
            Link This issue is duplicated by MB-35970 [ MB-35970 ]
            ben.huddleston Ben Huddleston made changes -
            Link This issue relates to MB-35970 [ MB-35970 ]
            owend Daniel Owen made changes -
            Affects Version/s 6.5.1 [ 16622 ]
            drigby Dave Rigby made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            till Till Westmann made changes -
            Link This issue blocks MB-38724 [ MB-38724 ]
            till Till Westmann made changes -
            Labels Performance candidate-for-6.6 hidd Performance approved-for-6.6.0 candidate-for-6.6 hidd

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.0-7646 contains kv_engine commit 2bd86cd with commit message:
            MB-35889: Don't add keys to Checkpoint indexes for Disk checkpoints
            drigby Dave Rigby made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]

            ben.huddleston Ben Huddleston added a comment -

            Jepsen tests caught a crash with this; we need to loosen up some assertions and add a test or two - MB-39435.
            ben.huddleston Ben Huddleston made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            ben.huddleston Ben Huddleston made changes -
            Link This issue causes MB-39435 [ MB-39435 ]
            ben.huddleston Ben Huddleston made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.0-7654 contains kv_engine commit 7579822 with commit message:
            MB-35889: Don't invalidate index entry for Disk checkpoint for expel
            wayne Wayne Siu made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Bo-Chun Wang [ bo-chun.wang ]
            ben.huddleston Ben Huddleston made changes -
            Link This issue is duplicated by MB-39440 [ MB-39440 ]
            wayne Wayne Siu made changes -
            Labels Performance approved-for-6.6.0 candidate-for-6.6 hidd Performance affects-cc-testing approved-for-6.6.0 candidate-for-6.6 hidd

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-2130 contains kv_engine commit 7579822 with commit message:
            MB-35889: Don't invalidate index entry for Disk checkpoint for expel

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-2130 contains kv_engine commit 2bd86cd with commit message:
            MB-35889: Don't add keys to Checkpoint indexes for Disk checkpoints
            bo-chun.wang Bo-Chun Wang added a comment -

            I did a run on build 6.6.0-7883. In this run, ep_dcp_replica_items_remaining was able to reach 0, so I am closing this ticket.

            http://perf.jenkins.couchbase.com/job/rhea-5node1/120/

            2020-07-21T13:43:40 [INFO] Monitoring DCP queues: bucket-1
            2020-07-21T13:43:40 [INFO] ep_dcp_replica_items_remaining reached 0
            2020-07-21T13:43:40 [INFO] ep_dcp_other_items_remaining reached 0
            2020-07-21T13:43:40 [INFO] Monitoring replica count match: bucket-1
            2020-07-21T13:43:40 [INFO] curr_items: 250000000, replica_curr_items: 250000000
            bo-chun.wang Bo-Chun Wang made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            richard.demellow Richard deMellow made changes -
            Link This issue relates to MB-41283 [ MB-41283 ]

            People

              bo-chun.wang Bo-Chun Wang
              mahesh.mandhare Mahesh Mandhare (Inactive)
              Votes: 0
              Watchers: 11
