  1. Couchbase Server
  2. MB-48834

Improve XDCR performance with Magma


Details

    Description

      I re-ran two existing XDCR tests with Magma. Compared to Couchstore, Magma performance is about 50% lower. I am opening this ticket to track XDCR+Magma performance improvement. All runs used build 7.1.0-1401.

       

      Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB

       

       

      Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 250M x 1KB

       

          Activity

            sarath Sarath Lakshman added a comment -

            Thanks Daniel Owen.

            If I understand correctly, to avoid a bgFetch on non-existent keys for value-only eviction, we need to keep a bloom filter to address the special case of deleted docs. Does "deleted doc" mean a tombstone document?

            In this specific XDCR test case, setWithMeta is the operation triggering the bgFetch.

            For couchstore, we rebuild the bloom filter every time full compaction happens. For magma, when a logically deleted doc is removed, we would have to remove it from the bloom filter as well, but a bloom filter does not support a remove operation. Since magma does not have periodic full compaction, we may not be able to rebuild the bloom filter in KV-Engine.

            Magma internally maintains a bloom filter per sstable for the key existence check. We could expose this bloom filter through a magma KeyMayExist API that only checks the in-memory bloom filter without any I/O. Essentially, when we queue a bgFetch, it checks against this bloom filter to respond not-found. I wonder whether directly exposing this API and avoiding the bgFetch queueing code path would help improve the throughput.
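The point above that a bloom filter cannot support removal can be illustrated with a minimal sketch. `ToyBloomFilter`, its size, and its hashing scheme are hypothetical and are not the KV-Engine or Magma implementation. Because several keys can set the same bit, clearing the bits of one key could make another key that is still present look absent, which is why a rebuild is the usual way to drop entries.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical toy Bloom filter, for illustration only.
class ToyBloomFilter {
public:
    // Set every bit position derived from the key.
    void add(const std::string& key) {
        for (std::size_t i = 0; i < kHashes; ++i) {
            bits.set(slot(key, i));
        }
    }

    // false: the key was definitely never added.
    // true: the key may have been added (false positives are possible).
    bool mayContain(const std::string& key) const {
        for (std::size_t i = 0; i < kHashes; ++i) {
            if (!bits.test(slot(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Deliberately no remove(): bits are shared between keys, so clearing
    // them for one key could introduce false negatives for other keys.

private:
    static constexpr std::size_t kBits = 1024;
    static constexpr std::size_t kHashes = 3;

    // Derive the i-th bit position by salting the key with the index.
    static std::size_t slot(const std::string& key, std::size_t i) {
        return std::hash<std::string>{}(key + '#' + std::to_string(i)) % kBits;
    }

    std::bitset<kBits> bits;
};
```

A lookup that returns false can be answered as not-found without touching disk; a lookup that returns true still needs the real read.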
            drigby Dave Rigby added a comment -

            Sarath Lakshman Does the Magma bloom filter track logically deleted (tombstone) keys, or just alive keys?

            If it's the latter, then I don't think exposing an API to query it directly would necessarily make much difference - the issue with SetWithMeta (and GetMeta) is that even if the item being compared has been deleted, we need to compare CAS (or revId), as the deleted item could still be the "winning" mutation. As such, it's not valid to check the bloom filter only to see if an alive item exists, as we also need deleted ones.

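To make the point about deleted items concrete, here is a simplified sketch of a metadata comparison in which a tombstone can still win. `ItemMeta` and `incomingWins` are hypothetical names, and the comparison order (revision sequence number first, then CAS) is an assumption for illustration; the real conflict resolution logic has more tie-breakers and a separate timestamp-based mode.

```cpp
#include <cstdint>

// Hypothetical, simplified item metadata.
struct ItemMeta {
    std::uint64_t cas;
    std::uint64_t revSeqno;
    bool deleted;  // true for a tombstone
};

// Returns true if the incoming mutation should replace the existing item.
// Assumed order: higher revSeqno wins, then higher CAS. The deleted flag is
// intentionally not consulted: a tombstone competes like any other item.
inline bool incomingWins(const ItemMeta& incoming, const ItemMeta& existing) {
    if (incoming.revSeqno != existing.revSeqno) {
        return incoming.revSeqno > existing.revSeqno;
    }
    return incoming.cas > existing.cas;
}
```

With an existing tombstone `{cas 200, revSeqno 5}` and an incoming live item `{cas 100, revSeqno 3}`, the tombstone wins, so the destination has to read the tombstone's metadata before it can decide.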

            sarath Sarath Lakshman added a comment -

            Got it. So the logically deleted document also takes part in conflict resolution/CAS.

            Magma's bloom filter tracks the existence of a doc key (it can be either a tombstone or an alive key), not just the alive keys.
            The KeyMayExist API will have to do a lookup in the memtable (we do not maintain a bloom filter for the memtable), then a binary search in the sstable lists in each level. This is certainly more time-consuming than the kv-engine full key range bloom filter. These checks already happen in magma::Get() during every bgFetch to identify the sstable to read. Adding this API will only help in skipping the bgFetch code path for the non-existent key check. We can give it a try.
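The lookup order described above (memtable first, then a binary search over the sstables of each level, then that sstable's filter) can be sketched as follows. All types and names here are hypothetical and do not reflect Magma's actual API. The sketch assumes each level holds sstables sorted by key range with no overlap, and it uses an exact key set as a stand-in for the per-sstable bloom filter, so unlike a real filter it never returns a false positive.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical sstable summary: key range plus a stand-in for its filter.
struct SSTable {
    std::string minKey;
    std::string maxKey;
    std::set<std::string> filterKeys;  // stand-in for an in-memory Bloom filter

    bool filterMayContain(const std::string& key) const {
        return filterKeys.count(key) > 0;
    }
};

// Assumed: tables within a level are sorted by key range and do not overlap.
using Level = std::vector<SSTable>;

struct ToyLsm {
    std::set<std::string> memtable;  // no filter here, so do a direct lookup
    std::vector<Level> levels;
};

// In-memory only existence check: no data block is read.
inline bool keyMayExist(const ToyLsm& db, const std::string& key) {
    if (db.memtable.count(key) > 0) {
        return true;
    }
    for (const Level& level : db.levels) {
        // Binary search for the first table whose maxKey is >= key.
        auto it = std::lower_bound(
                level.begin(), level.end(), key,
                [](const SSTable& t, const std::string& k) {
                    return t.maxKey < k;
                });
        if (it != level.end() && it->minKey <= key &&
            it->filterMayContain(key)) {
            return true;
        }
    }
    return false;
}
```

The cost grows with the number of levels, which matches the remark that this is slower than a single filter covering the full key range.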
            drigby Dave Rigby added a comment -

            Ok, so it sounds like Magma's Bloom filter behaves like the ep-engine Bloom filter in full-eviction mode, assuming the resident ratio is below bfilter_residency_threshold, which defaults to 10%:

            • see BloomFilterCallback::callback():

                          if (store.getItemEvictionPolicy() == EvictionPolicy::Value) {
                              /**
                               * VALUE-ONLY EVICTION POLICY
                               * Consider deleted items only.
                               */
                              if (isDeleted) {
                                  vb->addToTempFilter(key);
                              }
                          } else {
                              /**
                               * FULL EVICTION POLICY
                               * If vbucket's resident ratio is found to be less than
                               * the residency threshold, consider all items, otherwise
                               * consider deleted and non-resident items only.
                               */
                              bool residentRatioLessThanThreshold =
                                      vb->isResidentRatioUnderThreshold(
                                              store.getBfiltersResidencyThreshold());
                              if (residentRatioLessThanThreshold) {
                                  vb->addToTempFilter(key);
                              } else {
                                  if (isDeleted || !store.isMetaDataResident(vb, key)) {
                                      vb->addToTempFilter(key);
                                  }
                              }
                          }
              

            As you point out, there isn't the opportunity with Magma to rebuild the ep-engine level Bloom filters during full compaction, so it's not feasible to re-enable the ep-engine value-eviction delete-only Bloom filter when Magma is being used.

            One slightly surprising thing is the slowdown seen when performing the Magma bgFetch for the deleted item - assuming the key being looked up truly doesn't exist (which I'm assuming based on the workload being called "initial load"), I would expect Magma to be able to return that information quickly, if it is just checking memtables and (pinned IIRC?) Bloom filter block reads for each SST file. Yes, there is the overhead of queuing and managing the BGFetch task, but I wouldn't expect that to be too large.

            If you have the logs to hand, you could look at the bg_wait and bg_load histograms on the destination cluster - those are the times each BGFetcher task was waiting to run, and then how long it ran for. Assuming we won't be able to reduce the bg_load time much with a new API (it'll be doing essentially the same work, just without the task management), we should see how bg_load compares to bg_wait.

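As a rough illustration of what the two histograms measure, the sketch below splits one background fetch into its wait and load parts from three timestamps. `FetchTimings` and its field names are hypothetical and are not the KV-Engine data structures; only the definitions (time waiting to run, then time spent running) come from the comment above.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical per-fetch timestamps.
struct FetchTimings {
    Clock::time_point enqueued;  // fetch was queued
    Clock::time_point started;   // fetcher task began working on it
    Clock::time_point finished;  // result was available

    // bg_wait: time spent queued before the task ran.
    std::chrono::microseconds wait() const {
        return std::chrono::duration_cast<std::chrono::microseconds>(
                started - enqueued);
    }

    // bg_load: time the task spent doing the lookup.
    std::chrono::microseconds load() const {
        return std::chrono::duration_cast<std::chrono::microseconds>(
                finished - started);
    }
};
```

A synchronous existence check would remove the wait part entirely, while the load part would shrink only as far as the lookup itself becomes cheaper.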
            sarath Sarath Lakshman added a comment - - edited

            Looking at the stats, get_meta shows the biggest difference from couchstore (avg 2us for couchstore vs 45us for magma), with bg_wait + bg_load contributing avg 15us + 13us of that. I agree: unless we are able to add a much more lightweight KeyMayExist to magma that avoids some of the allocations and expensive work in magma::Get, as well as the magma-kvstore allocations, we won't see much difference. We could potentially save only the bg_wait time.

             
            Couchstore:GET_META
            [  0.00 -   0.00]us (0.0000%)          38|
            [  0.00 -   1.00]us (10.0000%)   94874274| ############################################
            [  1.00 -   1.00]us (20.0000%)          0|
            [  1.00 -   1.00]us (30.0000%)          0|
            [  1.00 -   1.00]us (40.0000%)          0|
            [  1.00 -   2.00]us (50.0000%)   73288014| #################################
            [  2.00 -   2.00]us (55.0000%)          0|
            [  2.00 -   2.00]us (60.0000%)          0|
            [  2.00 -   2.00]us (65.0000%)          0|
            [  2.00 -   2.00]us (70.0000%)          0|
            [  2.00 -   2.00]us (75.0000%)          0|
            [  2.00 -   2.00]us (77.5000%)          0|
            [  2.00 -   2.00]us (80.0000%)          0|
            [  2.00 -   2.00]us (82.5000%)          0|
            [  2.00 -   3.00]us (85.0000%)   18531107| ########
            [  3.00 -   3.00]us (87.5000%)          0|
            [  3.00 -   3.00]us (88.7500%)          0|
            [  3.00 -   3.00]us (90.0000%)          0|
             
            Couchstore:SET_WITH_META
            [  0.00 -   7.00]us (0.0000%)        1804|
            [  7.00 -  14.00]us (10.0000%)   32829907| ############################################
            [ 14.00 -  15.00]us (20.0000%)   18193265| ########################
            [ 15.00 -  16.00]us (30.0000%)   12680040| ################
            [ 16.00 -  18.00]us (40.0000%)   22889054| ##############################
            [ 18.00 -  20.00]us (50.0000%)   19768797| ##########################
            [ 20.00 -  21.00]us (55.0000%)    8738984| ###########
            [ 21.00 -  22.00]us (60.0000%)    8702046| ###########
            [ 22.00 -  23.00]us (65.0000%)    8219478| ###########
            [ 23.00 -  25.00]us (70.0000%)   13427851| #################
            [ 25.00 -  27.00]us (75.0000%)    9817161| #############
            [ 27.00 -  28.00]us (77.5000%)    4026131| #####
            [ 28.00 -  29.00]us (80.0000%)    3600142| ####
            [ 29.00 -  31.00]us (82.5000%)    6114320| ########
            [ 31.00 -  33.00]us (85.0000%)    4856487| ######
            [ 33.00 -  35.00]us (87.5000%)    3776872| #####
            [ 35.00 -  37.00]us (88.7500%)    2916561| ###
            [ 37.00 -  39.00]us (90.0000%)    2327134| ###
             
            Magma:GET_META
            [  0.00 -  15.00]us (0.0000%)           1|
            [ 15.00 -  33.00]us (10.0000%)   24933018| ###################################
            [ 33.00 -  37.00]us (20.0000%)   27659672| #######################################
            [ 37.00 -  39.00]us (30.0000%)   14992550| #####################
            [ 39.00 -  41.00]us (40.0000%)   15372522| ######################
            [ 41.00 -  45.00]us (50.0000%)   30580630| ############################################
            [ 45.00 -  45.00]us (55.0000%)          0|
            [ 45.00 -  47.00]us (60.0000%)   13801443| ###################
            [ 47.00 -  49.00]us (65.0000%)   12080239| #################
            [ 49.00 -  51.00]us (70.0000%)   10275817| ##############
            [ 51.00 -  53.00]us (75.0000%)    8639969| ############
            [ 53.00 -  53.00]us (77.5000%)          0|
            [ 53.00 -  55.00]us (80.0000%)    7182416| ##########
            [ 55.00 -  55.00]us (82.5000%)          0|
            [ 55.00 -  57.00]us (85.0000%)    5895673| ########
            [ 57.00 -  59.00]us (87.5000%)    4790305| ######
            [ 59.00 -  61.00]us (88.7500%)    3884094| #####
            [ 61.00 -  63.00]us (90.0000%)    3155043| ####
             
            bg_load (200195320 total)
                   0us -    3us : (  0.0001%)      167
                   3us -    8us : ( 15.4150%) 30859959 #####
                   8us -    9us : ( 26.2742%) 21739563 ###
                   9us -   10us : ( 37.6359%) 22745641 ###
                  10us -   11us : ( 47.5433%) 19834146 ###
                  11us -   12us : ( 55.4577%) 15844259 ##
                  12us -   12us : ( 55.4577%)        0
                  12us -   13us : ( 61.9125%) 12922259 ##
                  13us -   14us : ( 67.6509%) 11487933 ##
                  14us -   15us : ( 73.0881%) 10885056 #
                  15us -   16us : ( 78.2455%) 10324826 #
                  16us -   16us : ( 78.2455%)        0
                  16us -   17us : ( 82.9068%)  9331629 #
                  17us -   17us : ( 82.9068%)        0
                  17us -   18us : ( 86.8507%)  7895522 #
                  18us -   19us : ( 89.9693%)  6243306 #
                  19us -   19us : ( 89.9693%)        0
                  19us -   20us : ( 92.3060%)  4678051
             
             bg_wait (200195320 total)
                   0us -    1us : (  0.0000%)        9
                   1us -    9us : ( 13.9306%) 27888376 ####
                   9us -   10us : ( 22.4575%) 17070440 ##
                  10us -   11us : ( 32.3701%) 19844622 ###
                  11us -   12us : ( 43.3733%) 22027818 ###
                  12us -   13us : ( 53.8355%) 20944885 ###
                  13us -   14us : ( 63.2961%) 18939702 ###
                  14us -   14us : ( 63.2961%)        0
                  14us -   15us : ( 71.1109%) 15644865 ##
                  15us -   15us : ( 71.1109%)        0
                  15us -   16us : ( 77.1788%) 12147619 ##
                  16us -   17us : ( 81.7857%)  9222777 #
                  17us -   17us : ( 81.7857%)        0
                  17us -   18us : ( 85.2620%)  6959391 #
                  18us -   18us : ( 85.2620%)        0
                  18us -   19us : ( 87.8841%)  5249372
                  19us -   20us : ( 89.8834%)  4002531
                  20us -   21us : ( 91.4266%)  3089343
            
            

            couchstore: http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/70/console
            magma: http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/82/console

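The averages quoted in this comment can be turned into a quick back-of-the-envelope estimate. The sketch below uses only those three averages (45us get_meta, 15us bg_wait, 13us bg_load) and assumes, optimistically, that a synchronous check removes bg_wait completely and changes nothing else. `GetMetaBreakdown` is a hypothetical helper, not part of any Couchbase code.

```cpp
// Hypothetical helper holding average latencies in microseconds.
struct GetMetaBreakdown {
    double getMetaUs;
    double bgWaitUs;
    double bgLoadUs;

    // Fraction of get_meta time spent in the background fetch path.
    double bgShare() const {
        return (bgWaitUs + bgLoadUs) / getMetaUs;
    }

    // Estimated get_meta average if only the queueing wait is removed.
    double afterSkippingWaitUs() const {
        return getMetaUs - bgWaitUs;
    }
};
```

With the quoted numbers, the background fetch path accounts for 28us of the 45us average, and removing only the wait leaves about 30us, which is still far above the couchstore average of 2us.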

            People

              sarath Sarath Lakshman
              bo-chun.wang Bo-Chun Wang

                Gerrit Reviews

                  There are 2 open Gerrit changes
