Couchbase Server / MB-37349

cbbackupmgr backup has varying performance because of high fragmentation on Leto-server-04 due to having a faster disk


Details

    • Untriaged
    • Yes

    Description

Observing a ~90% drop in cbbackupmgr backup throughput in the RC4 build.

      Attachments


        Activity

sharath.sulochana Sharath Sulochana (Inactive) added a comment - edited

There are no cbbackupmgr changes in the changelog between RC3 and RC4. Triaging it further....

          owend Daniel Owen added a comment -

          Hi Sharath Sulochana. It looks likely that this is an environmental issue. Could you rerun?

          many thanks

owend Daniel Owen added a comment - edited

          The only KV changes between RC3 and RC4 are:

          Commit: 2a368c39481ff4d42c6f755bd7d185b9a57554ca in build: 6.5.0-4959
          MB-37294: Avoid starvation of low-pri VBs in Flusher::flushVB()

          Commit: dee7c17e96ca42fd2fd186a3c2e690c9d7231f68 in build: 6.5.0-4957
          MB-37150: Register cursor before backfilling after cursor dropping

          I don't believe either would be responsible for the significant performance drop we are seeing.


sharath.sulochana Sharath Sulochana (Inactive) added a comment -

Daniel Owen - I have runs scheduled. I will update once they are completed.

          owend Daniel Owen added a comment -

Looking at the backup logs for the SQLite runs for RC3 and RC4, it does look like it takes significantly longer to stream the data over DCP.

          RC3

          2019-12-13T16:16:57.624-08:00 (Plan) Data transfer completed after 4m5.716315682s
          2019-12-13T16:16:57.624-08:00 (Plan) Transferred Mutations: 100000000, Deletions: 0 total size of 31.24GB
          

          RC4

          2019-12-24T11:01:11.751-08:00 (Plan) Data transfer completed after 21m19.884981675s
          2019-12-24T11:01:11.751-08:00 (Plan) Transferred Mutations: 100000000, Deletions: 0 total size of 31.20GB
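
For reference, here is a quick back-of-the-envelope comparison of the average transfer rates implied by the two log excerpts above. This is a minimal sketch; the durations and sizes are copied from the log lines, and GB/MB are treated as decimal units:

# Rough throughput comparison from the two "Data transfer completed" lines above.
def mb_per_s(size_gb: float, seconds: float) -> float:
    """Average transfer rate in MB/s (decimal units)."""
    return size_gb * 1000 / seconds

rc3_seconds = 4 * 60 + 5.716    # 4m5.716s
rc4_seconds = 21 * 60 + 19.885  # 21m19.885s

rc3 = mb_per_s(31.24, rc3_seconds)
rc4 = mb_per_s(31.20, rc4_seconds)
print(f"RC3 ~{rc3:.0f} MB/s, RC4 ~{rc4:.0f} MB/s, drop ~{(1 - rc4 / rc3) * 100:.0f}%")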
          

          owend Daniel Owen added a comment -

768 vbuckets were successfully streamed in ~4 minutes, however the remaining 256 vbuckets are taking significantly longer (256 is one node's share of the 1024 active vbuckets in a 4-node cluster).
Looks like one of the nodes is running slow for some reason. Now need to identify which of the nodes (one way to map vbuckets to nodes is sketched below).
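
One way to work out which node the remaining vbuckets live on is to read the bucket's vBucketServerMap from the cluster REST API and count active vbuckets per server. A minimal sketch, assuming default ports and placeholder credentials (Administrator/password are illustrative, not values from this ticket):

# Sketch: map each vbucket to the node hosting its active copy using the
# bucket's vBucketServerMap. Host and credentials are placeholders.
import collections
import requests

resp = requests.get(
    "http://leto-srv-01.perf.couchbase.com:8091/pools/default/buckets/bucket-1",
    auth=("Administrator", "password"),
)
resp.raise_for_status()
vbmap = resp.json()["vBucketServerMap"]
servers = vbmap["serverList"]            # e.g. ["leto-srv-01...:11210", ...]

active_per_node = collections.Counter()
for vb, chain in enumerate(vbmap["vBucketMap"]):
    node = servers[chain[0]]             # chain[0] is the index of the active copy
    active_per_node[node] += 1
    if vb == 900:
        print("vb:900 is active on", node)

print(active_per_node)                   # expect 256 active vbuckets per node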

          owend Daniel Owen added a comment -

          Thanks Sharath Sulochana
          leto-srv-04 is the slow machine.

Focusing on vb:900, which was the last vbucket for cbbackupmgr to receive, we can see the following history:

          2019-12-24T07:51:14.556004-08:00 INFO (bucket-1) VBucket: created vb:900 with state:active initialState:dead lastSeqno:0 persistedRange:{0,0} max_cas:0 uuid:94765411349211 topology:null
          2019-12-24T07:51:14.556043-08:00 INFO (bucket-1) VBucket::setState: transitioning vb:900 with high seqno:0 from:active to:active meta:{"topology":[["ns_1@leto-srv-04.perf.couchbase.com","ns_1@leto-srv-02.perf.couchbase.com"]]}
          2019-12-24T07:51:14.912235-08:00 INFO 158: (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@leto-srv-04.perf.couchbase.com->ns_1@leto-srv-02.perf.couchbase.com:bucket-1 - (vb:900) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
          2019-12-24T07:51:14.912247-08:00 INFO 158: (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@leto-srv-04.perf.couchbase.com->ns_1@leto-srv-02.perf.couchbase.com:bucket-1 - (vb:900) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@leto-srv-04.perf.couchbase.com->ns_1@leto-srv-02.perf.couchbase.com:bucket-1" backfill:true, seqno:1
          2019-12-24T07:51:14.912253-08:00 INFO 158: (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@leto-srv-04.perf.couchbase.com->ns_1@leto-srv-02.perf.couchbase.com:bucket-1 - (vb:900) ActiveStream::transitionState: Transitioning from backfilling to in-memory
          2019-12-24T08:01:27.932778-08:00 WARNING (bucket-1) Slow runtime for 'Checkpoint Remover on vb:900' on thread nonIO_worker_4: 67 ms
          2019-12-24T09:00:25.751493-08:00 WARNING (bucket-1) Slow runtime for 'Checkpoint Remover on vb:900' on thread nonIO_worker_3: 124 ms
          2019-12-24T10:39:55.120151-08:00 INFO 163: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2019-12-24T10:39:51-08:00_71998_4 - (vb:900) Creating stream with start seqno 0 and end seqno 97655; requested end seqno was 97655, collections-manifest uid:none, sid:none
          2019-12-24T10:39:55.120175-08:00 INFO 163: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2019-12-24T10:39:51-08:00_71998_4 - (vb:900) Scheduling backfill from 1 to 97655, reschedule flag : False
          2019-12-24T10:39:55.163063-08:00 INFO 163: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2019-12-24T10:39:51-08:00_71998_4 - (vb:900) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 0, end 97655, and high completed --, max visible 97655
          2019-12-24T11:00:25.577643-08:00 INFO 163: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2019-12-24T10:39:51-08:00_71998_4 - (vb:900) Backfill complete, 0 items read from disk, 97655 from memory, last seqno read: 97655, pendingBackfill : False
          2019-12-24T11:01:07.101205-08:00 INFO 163: (bucket-1) DCP (Producer) eq_dcpq:cbbackupmgr_2019-12-24T10:39:51-08:00_71998_4 - (vb:900) Stream closing, sent until seqno 97655 remaining items 0, reason: The stream ended due to all items being streamed
          

          Scheduling backfill from 1 to 97655 at: 10:39:55
          Backfill completed at: 11:00:25

So the question is: why did the backfill take so long to complete?
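
To check whether other vbuckets on leto-srv-04 show the same gap, the memcached log can be scanned for the time between the "Scheduling backfill" and "Backfill complete" messages per stream and vbucket. A rough sketch, with the log path and message layout assumed from the excerpt above (treat it as illustrative rather than a supported parser):

# Sketch: measure backfill duration by pairing "Scheduling backfill" and
# "Backfill complete" lines per (stream, vbucket). Log path and message
# layout are assumed from the excerpt above.
import re
from datetime import datetime

LINE = re.compile(
    r"^(?P<ts>\S+) INFO \d+: \(bucket-1\) DCP \(Producer\) (?P<stream>\S+) - "
    r"\((?P<vb>vb:\d+)\) (?P<msg>.*)$"
)

pending = {}
with open("memcached.log") as f:
    for line in f:
        m = LINE.match(line.strip())
        if not m:
            continue
        ts = datetime.fromisoformat(m["ts"])
        key = (m["stream"], m["vb"])
        if m["msg"].startswith("Scheduling backfill"):
            pending[key] = ts
        elif m["msg"].startswith("Backfill complete") and key in pending:
            duration = (ts - pending.pop(key)).total_seconds()
            if duration > 60:
                print(f"{m['vb']} on {m['stream']}: backfill took {duration:.0f}s")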

pvarley Patrick Varley added a comment - edited

Fragmentation is much higher on the server side; we know that this has an effect on DCP streaming.

We should really update the test so that a manual compaction is triggered at the end of the load phase.

          Here is the side by side on showfast:

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_650-4947_backup_1edf&snapshot=leto_650-4959_backup_a7d9
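
If the test is updated as suggested above, the manual compaction could be kicked off per bucket through the REST API once the load phase finishes. A minimal sketch using the compactBucket controller endpoint, with a placeholder host and credentials (not values from this ticket):

# Sketch: request manual compaction of bucket-1 after the load phase.
# Host and credentials are placeholders.
import requests

resp = requests.post(
    "http://leto-srv-01.perf.couchbase.com:8091"
    "/pools/default/buckets/bucket-1/controller/compactBucket",
    auth=("Administrator", "password"),
)
resp.raise_for_status()
print("compaction requested, status", resp.status_code)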


sharath.sulochana Sharath Sulochana (Inactive) added a comment -

Daniel Owen, are you suspecting fragmentation to have an impact on only one node (leto-srv-04)? Or is it independent of that?

I did a quick ping test between the different nodes in the cluster. I don't see any network issue.


sharath.sulochana Sharath Sulochana (Inactive) added a comment -

Also, are you recommending compaction at the end of load for all tools tests?

          owend Daniel Owen added a comment -

Sharath Sulochana, does 04 have a different disk to all the others? It appears much faster. Just collecting stats...

          owend Daniel Owen added a comment -

Patrick Varley reminded me that 04 is the machine where we had the faster drive installed, see CBPS-680.
We can see that in the number of items in its disk queue being significantly lower and its fragmentation being significantly higher:

Disk queue for the 4 nodes: [chart attached]

Fragmentation for the 4 nodes: [chart attached]

So yes Sharath Sulochana, we need to perform compaction at the end of the load phase. This will ensure we have the same behaviour / performance for all nodes in the cluster (one way the test could verify that is sketched below).

          thanks
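
One way the test harness could confirm all four nodes are back in a comparable state before the backup phase is to read the docs fragmentation stat on each node. A rough sketch, assuming the per-node bucket stats endpoint and the couch_docs_fragmentation metric name; hosts and credentials are placeholders:

# Sketch: read couch_docs_fragmentation for bucket-1 on each node so the test
# can verify compaction levelled out fragmentation across the cluster.
# Hosts, credentials, and the endpoint/stat names are assumptions here.
from urllib.parse import quote
import requests

CLUSTER = "http://leto-srv-01.perf.couchbase.com:8091"
AUTH = ("Administrator", "password")
NODES = [f"leto-srv-0{i}.perf.couchbase.com" for i in range(1, 5)]

for node in NODES:
    node_id = quote(f"{node}:8091", safe="")   # node name as host:port, URL-encoded
    url = f"{CLUSTER}/pools/default/buckets/bucket-1/nodes/{node_id}/stats"
    samples = requests.get(url, auth=AUTH).json()["op"]["samples"]
    frag = samples["couch_docs_fragmentation"][-1]   # most recent sample
    print(f"{node}: couch_docs_fragmentation = {frag}%")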


sharath.sulochana Sharath Sulochana (Inactive) added a comment -

Daniel Owen, sure. I have enabled compaction for the next runs for comparison.

          owend Daniel Owen added a comment -

          great - many thanks Sharath Sulochana


pvarley Patrick Varley added a comment -

"Also, are you recommending compaction at the end of load for all tools tests?"

Yes, we did it for the XDCR tests a few weeks back; I have been meaning to do it since then.


sharath.sulochana Sharath Sulochana (Inactive) added a comment -

Enabling compaction after loading the documents has resolved the issue. Closing this ticket.


People

  Assignee: sharath.sulochana Sharath Sulochana (Inactive)
  Reporter: sharath.sulochana Sharath Sulochana (Inactive)
  Votes: 0
  Watchers: 3
