Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-48569

[Magma, 30TB, 1% DGM]: Indexer drain rate is extremely slow

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Duplicate
    • 7.1.0
    • 7.1.0
    • couchbase-bucket
    • 7.1.0-1312

    Description

      1. Create a 3 kv node and 1 index/n1ql node cluster
      2. Create magma bucket, 100 collections under default scope
      3. Load 5B items and upsert them
      4. Load another 5B items and upsert them as well
      5. Create 100 indexes on 100 collections and build them. Start 100 QPS
      6. Rebalance In 1 KV node. Rebalance failed
      7. At this moment the indexer drain rate is very slow and building indexes is taking ages.
      If we look at idx45, the remaining mutations are 10M while the drain rate is ~500/s

      For today, the cluster is live at: http://172.23.110.64:8091/ui/index.htm

      Note: The total drain rate across indexes is around 10k - 75k/s on a 72 core machine, ssd disk, 200GB indexer RAM
      Total Disk Writes are varying between 5-30Mbps

      QE Test

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job3.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.test_rebalance,nodes_init=3,graceful=True,skip_cleanup=True,num_items=100000000,num_buckets=1,bucket_names=GleamBook,doc_size=1536,bucket_type=membase,eviction_policy=fullEviction,iterations=5,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=100,pc=10,index_nodes=1,cbas_nodes=0,fts_nodes=0,ops_rate=200000,ramQuota=102400,doc_ops=create:update:delete:read,mutation_perc=100,rebl_ops_rate=50000,key_type=RandomKey -m rest'
      

      Attachments

        1. idx45.png
          311 kB
          Ritesh Agarwal

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Node 110.67 is the only indexer node.

            a. From ns_server.indexer_stats.log , the ts_queue_size stat for the keyspace INIT_STREAM_keyspace_GleamBookUsers0:_default:VolumeCollection5 is almost always "0". This means that indexer is catching up with all the mutations it is receiving from projector

            b. The mutation_queue_size for this stream is <1000 for most of the time

            c. Also, looking at projector latency stats from the time of cbcollect, projector is able to send mutations in <2ms of time. This means nothing is stuck on indexer side

            2021-09-23T00:28:00.951-07:00 projlat {"INIT_STREAM/172.23.110.64:8091/projector_latency":1532399,"INIT_STREAM/172.23.110.65:8091/projector_latency":1257879,"INIT_STREAM/172.23.110.66:8091/projector_latency":1763309}

            d. Looking at ENDP stats for each of the projectors at the time of cbcollect

            Node 172.23.110.64:

            2021-09-23T00:28:36.630-07:00 [Info] ENDP[<-(172.23.110.67:9103,1415)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":959883083,"upsertCount":490463993,"deleteCount":0,"upsdelCount":409161666,"syncCount":60227416,"beginCount":12226,"endCount":3676,"snapCount":6754,"flushCount":12522552,"collectionCreate":6156,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1196,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":10541,"latency.max":51845162353,"latency.avg":20396924,"latency.movingAvg":1210177,"endpChLen":0}

            Node 172.23.110.65:

            2021-09-23T00:27:56.635-07:00 [Info] ENDP[<-(172.23.110.67:9103,b452)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":1186039171,"upsertCount":606346512,"deleteCount":1871,"upsdelCount":519622318,"syncCount":60038097,"beginCount":12284,"endCount":3751,"snapCount":6820,"flushCount":13867551,"collectionCreate":6138,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1380,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":9757,"latency.max":51780373606,"latency.avg":7777851,"latency.movingAvg":1341601,"endpChLen":0}

            Node 172.23.110.66:

            2021-09-23T00:27:54.181-07:00 [Info] ENDP[<-(172.23.110.67:9103, 7ac)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":1155055689,"upsertCount":585126006,"deleteCount":3675,"upsdelCount":509897373,"syncCount":59998286,"beginCount":12276,"endCount":3751,"snapCount":6820,"flushCount":14300014,"collectionCreate":6138,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1364,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":8640,"latency.max":51718189901,"latency.avg":5483651,"latency.movingAvg":1876106,"endpChLen":0}

            The endpoint channel length is almost always "0" from endpoint stats. This means that Either (1) Projector is taking more time to process mutations (2) Memcached is taking long time to stream mutations

            e. Projector taking long time to process mutations can be eliminated as the projector latency is <2ms from (c)

            f. Looking at DCPT stats on node 110.64, the DCP connection's receive channel length is almost always "0" - This confirms that memcached is slow in sending mutations

            2021-09-23T00:28:36.620-07:00 [Info] DCPT[secidx:proj-GleamBookUsers0:_default:VolumeCollection5-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-7806952980894807309/3] ##3 stats: {"bytes":24318484283,"bufferacks":5914,"toAckBytes":32,"streamreqs":170,"snapshots":170,"mutations":16602714,"collectionCreate":85,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"seqnoAdvanced":170,"osoSnapshotStart":0,"osoSnapshotEnd":0,"streamends":85,"closestreams":0,"lastAckTime":3h56m11.511080795s,"lastNoopSend":18.363453791s,"lastNoopRecv":18.363475492s,"lastMsgSend":18.363453791s,"lastMsgRecv":18.363475492s,"rcvchLen":0,"incomingMsg":16603399}

            g. From memcached logs, looks like KV is having issues with backfill

            2021-09-23T00:28:21.828256-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool3: 366 ms
            2021-09-23T00:28:21.859324-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool7: 376 ms
            2021-09-23T00:28:22.146628-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 39 s
            2021-09-23T00:28:24.601576-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool7: 372 ms
            2021-09-23T00:28:24.676034-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 420 ms
            2021-09-23T00:28:25.310269-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 316 ms
            2021-09-23T00:28:25.618340-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 308 ms
            2021-09-23T00:28:25.854327-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool2: 38 s

            Daniel Owen, I do not see any issue on indexer side for slow mutation rate. Can you please take a look from KV side to see why the mutation transfer is very slow from memcached to projector.

            varun.velamuri Varun Velamuri added a comment - Node 110.67 is the only indexer node. a. From ns_server.indexer_stats.log , the ts_queue_size stat for the keyspace INIT_STREAM_keyspace_GleamBookUsers0:_default:VolumeCollection5 is almost always "0". This means that indexer is catching up with all the mutations it is receiving from projector b. The mutation_queue_size for this stream is <1000 for most of the time c. Also, looking at projector latency stats from the time of cbcollect, projector is able to send mutations in <2ms of time. This means nothing is stuck on indexer side 2021-09-23T00:28:00.951-07:00 projlat {"INIT_STREAM/172.23.110.64:8091/projector_latency":1532399,"INIT_STREAM/172.23.110.65:8091/projector_latency":1257879,"INIT_STREAM/172.23.110.66:8091/projector_latency":1763309} d. Looking at ENDP stats for each of the projectors at the time of cbcollect Node 172.23.110.64: 2021-09-23T00:28:36.630-07:00 [Info] ENDP[<-(172.23.110.67:9103,1415)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":959883083,"upsertCount":490463993,"deleteCount":0,"upsdelCount":409161666,"syncCount":60227416,"beginCount":12226,"endCount":3676,"snapCount":6754,"flushCount":12522552,"collectionCreate":6156,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1196,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":10541,"latency.max":51845162353,"latency.avg":20396924,"latency.movingAvg":1210177,"endpChLen":0} Node 172.23.110.65: 2021-09-23T00:27:56.635-07:00 [Info] ENDP[<-(172.23.110.67:9103,b452)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":1186039171,"upsertCount":606346512,"deleteCount":1871,"upsdelCount":519622318,"syncCount":60038097,"beginCount":12284,"endCount":3751,"snapCount":6820,"flushCount":13867551,"collectionCreate":6138,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1380,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":9757,"latency.max":51780373606,"latency.avg":7777851,"latency.movingAvg":1341601,"endpChLen":0} Node 172.23.110.66: 2021-09-23T00:27:54.181-07:00 [Info] ENDP[<-(172.23.110.67:9103, 7ac)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":1155055689,"upsertCount":585126006,"deleteCount":3675,"upsdelCount":509897373,"syncCount":59998286,"beginCount":12276,"endCount":3751,"snapCount":6820,"flushCount":14300014,"collectionCreate":6138,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1364,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":8640,"latency.max":51718189901,"latency.avg":5483651,"latency.movingAvg":1876106,"endpChLen":0} The endpoint channel length is almost always "0" from endpoint stats. This means that Either (1) Projector is taking more time to process mutations (2) Memcached is taking long time to stream mutations e. Projector taking long time to process mutations can be eliminated as the projector latency is <2ms from (c) f. Looking at DCPT stats on node 110.64, the DCP connection's receive channel length is almost always "0" - This confirms that memcached is slow in sending mutations 2021-09-23T00:28:36.620-07:00 [Info] DCPT[secidx:proj-GleamBookUsers0:_default:VolumeCollection5-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-7806952980894807309/3] ##3 stats: {"bytes":24318484283,"bufferacks":5914,"toAckBytes":32,"streamreqs":170,"snapshots":170,"mutations":16602714,"collectionCreate":85,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"seqnoAdvanced":170,"osoSnapshotStart":0,"osoSnapshotEnd":0,"streamends":85,"closestreams":0,"lastAckTime":3h56m11.511080795s,"lastNoopSend":18.363453791s,"lastNoopRecv":18.363475492s,"lastMsgSend":18.363453791s,"lastMsgRecv":18.363475492s,"rcvchLen":0,"incomingMsg":16603399} g. From memcached logs, looks like KV is having issues with backfill 2021-09-23T00:28:21.828256-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool3: 366 ms 2021-09-23T00:28:21.859324-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool7: 376 ms 2021-09-23T00:28:22.146628-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 39 s 2021-09-23T00:28:24.601576-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool7: 372 ms 2021-09-23T00:28:24.676034-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 420 ms 2021-09-23T00:28:25.310269-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 316 ms 2021-09-23T00:28:25.618340-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 308 ms 2021-09-23T00:28:25.854327-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool2: 38 s Daniel Owen , I do not see any issue on indexer side for slow mutation rate. Can you please take a look from KV side to see why the mutation transfer is very slow from memcached to projector.
            james.harrison James Harrison added a comment - - edited

            Looking at 172.23.110.66 there's some signs of MB-48473 - the purge seqno advancing in cases it doesn't need to, leading to streams being unnecessarily rollbacked.

            2021-09-22T19:13:07.919132-07:00 WARNING 39114: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.110.66->ns_1@172.23.110.69:GleamBookUsers0 - DCPBackfillBySeqnoDisk::create(): (vb:1023) cannot be scanned. Associated stream is set to dead state. startSeqno:19501080 < purgeSeqno:19676538. The vbucket state:active, collHigh-valid:false, collHigh:184
            46744073709551615
            2021-09-22T19:13:07.919350-07:00 WARNING 39114: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.110.66->ns_1@172.23.110.69:GleamBookUsers0 - (vb:1023) Stream closing, sent until seqno 19501079 remaining items 0, reason: The stream closed early because the vbucket rollback'ed
            ...
            2021-09-22T20:01:59.710346-07:00 WARNING 743: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-13969536484923249375/2 - DCPBackfillBySeqnoDisk::create(): (vb:685) cannot be scanned. Associated stream is set to dead state. startSeqno:19501651 < purgeSeqno:19748183. The vbucket state:active, collHigh-valid:tru
            e, collHigh:16624531
            2021-09-22T20:01:59.710361-07:00 WARNING 743: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-13969536484923249375/2 - (vb:685) Stream closing, sent until seqno 19501650 remaining items 0, reason: The stream closed early because the vbucket rollback'ed
            

            683 occurrences of this in the time the logs cover.

            Following one instance which affected a secidx stream:

            # start of the stream
            2021-09-22T14:36:50.885586-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-filter-size:1 sid:none
            2021-09-22T14:36:50.885618-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0" backfill:true, seqno
            :19579190
            2021-09-22T14:36:50.885668-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Scheduling backfill from 1 to 19579189, reschedule flag : False
             
            2021-09-22T14:36:56.907942-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 0, end 19593650, and high completed none, max visible none
            2021-09-22T14:36:56.908321-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::registerCursor name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0", backfill:false, seqno:19593651
             
            # cursor drop
            2021-09-22T16:14:58.823792-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Handling slow stream; state_ : backfilling, lastReadSeqno : 0, lastSentSeqno : 6977289, vBucketHighSeqno : 19723098, isBackfillTaskRunning : True
             
            # backfill completes, around 6h after scheduling
            2021-09-22T20:32:59.133290-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Backfill complete, 195937 items read from disk, 0 from memory, lastReadSeqno:19593650 lastSentSeqnoAdvance:19593650, pendingBackfill:True
            2021-09-22T20:32:59.133305-07:00 INFO [(GleamBookUsers0) magma_19/kvstore-883/rev-000000001]0x7f41dcd82200 Ending seqno iterator (16725978=>19593650) with seqno 19593650 restarted 71 times
             
            # reschedule backfill as the cursor was dropped
            2021-09-22T20:32:59.133385-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0" backfill:true, seqno
            :19882939
            2021-09-22T20:32:59.133397-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Scheduling backfill from 19593651 to 19882938, reschedule flag : True
             
            # backfill fails, purge seqno after start seqno
            2021-09-22T20:33:01.158951-07:00 WARNING 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - DCPBackfillBySeqnoDisk::create(): (vb:883) cannot be scanned. Associated stream is set to dead state. startSeqno:19593651 < purgeSeqno:19841164. The vbucket state:active, collHigh-valid:tr
            ue, collHigh:16754135
            2021-09-22T20:33:01.158962-07:00 WARNING 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Stream closing, sent until seqno 19593650 remaining items 0, reason: The stream closed early because the vbucket rollback'ed
             
            # recreate stream with a similar start seqno, which is still behind the start seqno - this wasn't rejected in streamRequest as the collection high seqno is behind the purge seqno
            2021-09-22T20:33:23.256552-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Creating stream with start seqno 19593650 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-filter-size:1 sid:none
            # schedule backfill
            2021-09-22T20:33:23.256566-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0" backfill:true, seqno
            :19882939
            2021-09-22T20:33:23.256574-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Scheduling backfill from 19593651 to 19882938, reschedule flag : False
             
            # backfilling (start probably adjusted to 0 as this is the first marker sent for this stream, so start = min(start, snap_start_seqno_) )
            2021-09-22T20:33:44.221879-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 0, end 19882938, and high completed none, max visible none
            2021-09-22T20:33:44.221900-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::registerCursor name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0", backfill:true, seqno:19882939
             
            # backfill ends near 1h45 later, read no items, sent a seqno advance
            2021-09-22T22:14:04.209638-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Backfill complete, 0 items read from disk, 0 from memory, lastReadSeqno:19882938 lastSentSeqnoAdvance:19882938, pendingBackfill:False
            2021-09-22T22:14:04.209650-07:00 INFO [(GleamBookUsers0) magma_19/kvstore-883/rev-000000001]0x7f34797f5380 Ending seqno iterator (19593651=>19882938) with seqno 19882938 restarted 0 times
             
            # on to in memory
            2021-09-22T22:14:04.209744-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::transitionState: Transitioning from backfilling to in-memory
            

            Interestingly, when the backfill was rejected it looks like it could have been allowed

            DCPBackfillBySeqnoDisk::create

            bool allowNonRollBackCollectionStream = false;
                if (collHigh.has_value()) {
                    allowNonRollBackCollectionStream =
                            stream->getStartSeqno() < scanCtx->purgeSeqno &&
                            stream->getStartSeqno() >= collHigh.value() &&
                            collHigh.value() <= scanCtx->purgeSeqno;
                }
             
                // Check startSeqno against the purge-seqno of the opened datafile.
                // 1) A normal stream request would of checked inside streamRequest, but
                //    compaction may have changed the purgeSeqno
                // 2) Cursor dropping can also schedule backfills and they must not re-start
                //    behind the current purge-seqno
                //
                // If allowNonRollBackCollectionStream is false then the purge seqno,
                // collections high seqno must have moved or the stream isn't a
                // collection stream, making it not valid to prevent the stream from
                // rolling back.
                //
                // If the startSeqno != 1 (a client 0 to n request becomes 1 to n) then
                // start-seqno must be above purge-seqno
                if (startSeqno != 1 && (startSeqno <= scanCtx->purgeSeqno) &&
                    !allowNonRollBackCollectionStream) {
            

            However, here there is a mix of the backfill startSeqno ( startSeqno ) and the stream start seqno ( stream->getStartSeqno() ). Worth investigating if this is a bug, and if the backfill was actually viable.
            The re-requested stream did not move the start seqno back (to zero or some other rollback point), but was accepted by a similar check during streamRequest:

            FailoverTable::needsRollback

                bool allowNonRollBackCollectionStream = false;
                if (maxCollectionHighSeqno.has_value()) {
                    allowNonRollBackCollectionStream =
                            start_seqno < purge_seqno &&
                            start_seqno >= maxCollectionHighSeqno.value() &&
                            maxCollectionHighSeqno.value() <= purge_seqno;
                }
             
                /* There may be items that are purged during compaction. We need
                   to rollback to seq no 0 in that case, only if we have purged beyond
                   start_seqno and if start_seqno is not 0 */
                if (start_seqno < purge_seqno && start_seqno != 0 &&
                    !allowNonRollBackCollectionStream) {
                    return std::make_pair(true,
                                          std::string("purge seqno (") +
                                                  std::to_string(purge_seqno) +
                                                  ") is greater than start seqno - "
                                                  "could miss purged deletions");
            

            This deals only in the stream start seqno, which is likely different now as the stream was closed and reopened.

            There's a reasonable number of backfills around, but not so many as to be concerned:

             ep_dcp_num_running_backfills:                                                                                                            2728
            

            james.harrison James Harrison added a comment - - edited Looking at 172.23.110.66 there's some signs of MB-48473 - the purge seqno advancing in cases it doesn't need to, leading to streams being unnecessarily rollbacked. 2021-09-22T19:13:07.919132-07:00 WARNING 39114: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.110.66->ns_1@172.23.110.69:GleamBookUsers0 - DCPBackfillBySeqnoDisk::create(): (vb:1023) cannot be scanned. Associated stream is set to dead state. startSeqno:19501080 < purgeSeqno:19676538. The vbucket state:active, collHigh-valid:false, collHigh:184 46744073709551615 2021-09-22T19:13:07.919350-07:00 WARNING 39114: (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.110.66->ns_1@172.23.110.69:GleamBookUsers0 - (vb:1023) Stream closing, sent until seqno 19501079 remaining items 0, reason: The stream closed early because the vbucket rollback'ed ... 2021-09-22T20:01:59.710346-07:00 WARNING 743: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-13969536484923249375/2 - DCPBackfillBySeqnoDisk::create(): (vb:685) cannot be scanned. Associated stream is set to dead state. startSeqno:19501651 < purgeSeqno:19748183. The vbucket state:active, collHigh-valid:tru e, collHigh:16624531 2021-09-22T20:01:59.710361-07:00 WARNING 743: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-13969536484923249375/2 - (vb:685) Stream closing, sent until seqno 19501650 remaining items 0, reason: The stream closed early because the vbucket rollback'ed 683 occurrences of this in the time the logs cover. Following one instance which affected a secidx stream: # start of the stream 2021-09-22T14:36:50.885586-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-filter-size:1 sid:none 2021-09-22T14:36:50.885618-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0" backfill:true, seqno :19579190 2021-09-22T14:36:50.885668-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Scheduling backfill from 1 to 19579189, reschedule flag : False   2021-09-22T14:36:56.907942-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 0, end 19593650, and high completed none, max visible none 2021-09-22T14:36:56.908321-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::registerCursor name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0", backfill:false, seqno:19593651   # cursor drop 2021-09-22T16:14:58.823792-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Handling slow stream; state_ : backfilling, lastReadSeqno : 0, lastSentSeqno : 6977289, vBucketHighSeqno : 19723098, isBackfillTaskRunning : True   # backfill completes, around 6h after scheduling 2021-09-22T20:32:59.133290-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Backfill complete, 195937 items read from disk, 0 from memory, lastReadSeqno:19593650 lastSentSeqnoAdvance:19593650, pendingBackfill:True 2021-09-22T20:32:59.133305-07:00 INFO [(GleamBookUsers0) magma_19/kvstore-883/rev-000000001]0x7f41dcd82200 Ending seqno iterator (16725978=>19593650) with seqno 19593650 restarted 71 times   # reschedule backfill as the cursor was dropped 2021-09-22T20:32:59.133385-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0" backfill:true, seqno :19882939 2021-09-22T20:32:59.133397-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Scheduling backfill from 19593651 to 19882938, reschedule flag : True   # backfill fails, purge seqno after start seqno 2021-09-22T20:33:01.158951-07:00 WARNING 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - DCPBackfillBySeqnoDisk::create(): (vb:883) cannot be scanned. Associated stream is set to dead state. startSeqno:19593651 < purgeSeqno:19841164. The vbucket state:active, collHigh-valid:tr ue, collHigh:16754135 2021-09-22T20:33:01.158962-07:00 WARNING 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Stream closing, sent until seqno 19593650 remaining items 0, reason: The stream closed early because the vbucket rollback'ed   # recreate stream with a similar start seqno, which is still behind the start seqno - this wasn't rejected in streamRequest as the collection high seqno is behind the purge seqno 2021-09-22T20:33:23.256552-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Creating stream with start seqno 19593650 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-filter-size:1 sid:none # schedule backfill 2021-09-22T20:33:23.256566-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0" backfill:true, seqno :19882939 2021-09-22T20:33:23.256574-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Scheduling backfill from 19593651 to 19882938, reschedule flag : False   # backfilling (start probably adjusted to 0 as this is the first marker sent for this stream, so start = min(start, snap_start_seqno_) ) 2021-09-22T20:33:44.221879-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 0, end 19882938, and high completed none, max visible none 2021-09-22T20:33:44.221900-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::registerCursor name "eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0", backfill:true, seqno:19882939   # backfill ends near 1h45 later, read no items, sent a seqno advance 2021-09-22T22:14:04.209638-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) Backfill complete, 0 items read from disk, 0 from memory, lastReadSeqno:19882938 lastSentSeqnoAdvance:19882938, pendingBackfill:False 2021-09-22T22:14:04.209650-07:00 INFO [(GleamBookUsers0) magma_19/kvstore-883/rev-000000001]0x7f34797f5380 Ending seqno iterator (19593651=>19882938) with seqno 19882938 restarted 0 times   # on to in memory 2021-09-22T22:14:04.209744-07:00 INFO 4614: (GleamBookUsers0) DCP (Producer) eq_dcpq:secidx:proj-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-14257740393263799446/0 - (vb:883) ActiveStream::transitionState: Transitioning from backfilling to in-memory Interestingly, when the backfill was rejected it looks like it could have been allowed DCPBackfillBySeqnoDisk::create bool allowNonRollBackCollectionStream = false; if (collHigh.has_value()) { allowNonRollBackCollectionStream = stream->getStartSeqno() < scanCtx->purgeSeqno && stream->getStartSeqno() >= collHigh.value() && collHigh.value() <= scanCtx->purgeSeqno; }   // Check startSeqno against the purge-seqno of the opened datafile. // 1) A normal stream request would of checked inside streamRequest, but // compaction may have changed the purgeSeqno // 2) Cursor dropping can also schedule backfills and they must not re-start // behind the current purge-seqno // // If allowNonRollBackCollectionStream is false then the purge seqno, // collections high seqno must have moved or the stream isn't a // collection stream, making it not valid to prevent the stream from // rolling back. // // If the startSeqno != 1 (a client 0 to n request becomes 1 to n) then // start-seqno must be above purge-seqno if (startSeqno != 1 && (startSeqno <= scanCtx->purgeSeqno) && !allowNonRollBackCollectionStream) { However, here there is a mix of the backfill startSeqno ( startSeqno ) and the stream start seqno ( stream->getStartSeqno() ). Worth investigating if this is a bug, and if the backfill was actually viable. The re-requested stream did not move the start seqno back (to zero or some other rollback point), but was accepted by a similar check during streamRequest : FailoverTable::needsRollback bool allowNonRollBackCollectionStream = false; if (maxCollectionHighSeqno.has_value()) { allowNonRollBackCollectionStream = start_seqno < purge_seqno && start_seqno >= maxCollectionHighSeqno.value() && maxCollectionHighSeqno.value() <= purge_seqno; }   /* There may be items that are purged during compaction. We need to rollback to seq no 0 in that case, only if we have purged beyond start_seqno and if start_seqno is not 0 */ if (start_seqno < purge_seqno && start_seqno != 0 && !allowNonRollBackCollectionStream) { return std::make_pair(true, std::string("purge seqno (") + std::to_string(purge_seqno) + ") is greater than start seqno - " "could miss purged deletions"); This deals only in the stream start seqno, which is likely different now as the stream was closed and reopened. There's a reasonable number of backfills around, but not so many as to be concerned: ep_dcp_num_running_backfills: 2728

            It doesn't appear indexing actually rolls back (I recall rollback gets mapped to StateChanged, so it may well just trigger a reconnect), so the rollback errors are likely not as big of an impediment as they seems at first glance (doesn't appear to actually restream everything).

            It does not appear OSO is enabled; backfills will read items in seqno order (and would have to anyway after cursor dropping, OSO only applies to the initial disk snapshot). Backfills yield after reading a certain number of bytes for the stream and items which are filtered out do not contribute to this. As a result, a backfill which filters out the majority of the items found by the backfill may run for a long time - this may explain the long runtimes logged. Long running backfills may starve other backfills (by monopolising the backfill manager task, and one of the auxio threads).

            james.harrison James Harrison added a comment - It doesn't appear indexing actually rolls back (I recall rollback gets mapped to StateChanged, so it may well just trigger a reconnect), so the rollback errors are likely not as big of an impediment as they seems at first glance (doesn't appear to actually restream everything). It does not appear OSO is enabled; backfills will read items in seqno order (and would have to anyway after cursor dropping, OSO only applies to the initial disk snapshot). Backfills yield after reading a certain number of bytes for the stream and items which are filtered out do not contribute to this. As a result, a backfill which filters out the majority of the items found by the backfill may run for a long time - this may explain the long runtimes logged. Long running backfills may starve other backfills (by monopolising the backfill manager task, and one of the auxio threads).
            drigby Dave Rigby added a comment -

            Backfills yield after reading a certain number of bytes for the stream and items which are filtered out do not contribute to this. As a result, a backfill which filters out the majority of the items found by the backfill may run for a long time - this may explain the long runtimes logged. Long running backfills may starve other backfills (by monopolising the backfill manager task, and one of the auxio threads).

            This is an interesting observation. Given that the majority of the runtime cost of a backfill is the disk IO (even if items are skipped), I think it's worth investigating if we can modify the way backfills yield to use the bytes actually read from disk (or perhaps a second item-based limit if the size isn't available).

            drigby Dave Rigby added a comment - Backfills yield after reading a certain number of bytes for the stream and items which are filtered out do not contribute to this. As a result, a backfill which filters out the majority of the items found by the backfill may run for a long time - this may explain the long runtimes logged. Long running backfills may starve other backfills (by monopolising the backfill manager task, and one of the auxio threads). This is an interesting observation. Given that the majority of the runtime cost of a backfill is the disk IO (even if items are skipped), I think it's worth investigating if we can modify the way backfills yield to use the bytes actually read from disk (or perhaps a second item-based limit if the size isn't available).

            Build couchbase-server-7.1.0-1672 contains kv_engine commit 4b4ff9a with commit message:
            MB-48569: Remove spurious warning when readLocalDoc fails

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1672 contains kv_engine commit 4b4ff9a with commit message: MB-48569 : Remove spurious warning when readLocalDoc fails

            MB-49685 has been raised to track improving backfill yielding.

            There are enough backfill manager tasks to saturate all 8 auxio threads:

            5345241    4  R   GleamBookUsers0      0:00.00  *0:01.47          1:04:18.32        5762  AuxIO   BackfillManagerTask                  0x00007f4c0d9d9c90  Backfilling items for a DCP Connection
            5345243    4  R   GleamBookUsers0     -0:07.35   0:15.59          1:06:24.77        5699  AuxIO   BackfillManagerTask                  0x00007f4c0da59410  Backfilling items for a DCP Connection
            5345245    4  R   GleamBookUsers0      0:00.00  *0:08.31          1:05:27.69        5711  AuxIO   BackfillManagerTask                  0x00007f4473480690  Backfilling items for a DCP Connection
            5345247    4  R   GleamBookUsers0     -0:01.88   0:15.33          1:06:06.65        5716  AuxIO   BackfillManagerTask                  0x00007f4c0d829410  Backfilling items for a DCP Connection
            5347744    4  R   GleamBookUsers0      0:00.00  *0:06.43          1:07:26.31        5777  AuxIO   BackfillManagerTask                  0x00007f4c0aa5c810  Backfilling items for a DCP Connection
            5347746    4  R   GleamBookUsers0     -0:03.60   0:19.31          1:08:12.56        5707  AuxIO   BackfillManagerTask                  0x00007f4c0b600b10  Backfilling items for a DCP Connection
            5347748    4  R   GleamBookUsers0      0:00.00  *0:08.98          1:08:21.47        5715  AuxIO   BackfillManagerTask                  0x00007f4aa1494790  Backfilling items for a DCP Connection
            5347750    4  R   GleamBookUsers0      0:00.00  *0:17.76          1:09:47.45        5714  AuxIO   BackfillManagerTask                  0x00007f4c04127410  Backfilling items for a DCP Connection
            5355823    4  R   GleamBookUsers0     -0:08.73   0:00.32          1:04:33.46        5753  AuxIO   BackfillManagerTask                  0x00007f46d8c1c790  Backfilling items for a DCP Connection
            5355824    4  R   GleamBookUsers0      0:00.00  *0:00.10          1:03:55.44        5750  AuxIO   BackfillManagerTask                  0x00007f4c0dc73810  Backfilling items for a DCP Connection
            5355826    4  R   GleamBookUsers0     -0:08.91   0:00.58          1:04:29.38        5757  AuxIO   BackfillManagerTask                  0x00007f4c0d6d7710  Backfilling items for a DCP Connection
            5355828    4  R   GleamBookUsers0      0:00.00  *0:11.83          1:04:35.16        5763  AuxIO   BackfillManagerTask                  0x00007f4a93f35790  Backfilling items for a DCP Connection
            5369225    4  R   GleamBookUsers0     -0:07.01   0:00.34            52:47.97        5722  AuxIO   BackfillManagerTask                  0x00007f4c186cb210  Backfilling items for a DCP Connection
            5369227    4  R   GleamBookUsers0      0:00.00  *0:08.02            52:20.05        5706  AuxIO   BackfillManagerTask                  0x00007f4a82cec710  Backfilling items for a DCP Connection
            5369229    4  R   GleamBookUsers0     -0:00.84   0:00.40            52:31.64        5708  AuxIO   BackfillManagerTask                  0x00007f4c186da110  Backfilling items for a DCP Connection
            5369231    4  R   GleamBookUsers0     -0:00.10   0:00.32            50:53.52        5722  AuxIO   BackfillManagerTask                  0x00007f4c0db86410  Backfilling items for a DCP Connection
            5371280    4  R   GleamBookUsers0     -0:02.69   0:00.62          1:00:06.14        5667  AuxIO   BackfillManagerTask                  0x00007f4c010d6910  Backfilling items for a DCP Connection
            5371282    4  R   GleamBookUsers0     -0:00.42   0:00.42            58:05.63        5657  AuxIO   BackfillManagerTask                  0x00007f4c0a73d590  Backfilling items for a DCP Connection
            5371284    4  R   GleamBookUsers0     -0:08.31   0:00.37          1:00:47.09        5658  AuxIO   BackfillManagerTask                  0x00007f4c060b1210  Backfilling items for a DCP Connection
            5371286    4  R   GleamBookUsers0     -0:01.88   0:00.23            56:36.30        5671  AuxIO   BackfillManagerTask                  0x00007f4c0b761d90  Backfilling items for a DCP Connection
            5373090    4  R   GleamBookUsers0     -0:08.69   0:00.22            59:57.31        5114  AuxIO   BackfillManagerTask                  0x00007f4c092b8f90  Backfilling items for a DCP Connection
            5373092    4  R   GleamBookUsers0     -0:06.43   0:00.34          1:03:27.88        5566  AuxIO   BackfillManagerTask                  0x00007f4c09578090  Backfilling items for a DCP Connection
            5373094    4  R   GleamBookUsers0     -0:06.78   0:00.22          1:00:57.73        5575  AuxIO   BackfillManagerTask                  0x00007f4c0d6bb490  Backfilling items for a DCP Connection
            5373096    4  R   GleamBookUsers0     -0:01.47   0:00.40          1:02:09.32        5572  AuxIO   BackfillManagerTask                  0x00007f48cc91fa10  Backfilling items for a DCP Connection
            5374328    4  R   GleamBookUsers0     -0:08.40   0:00.32            59:37.40        4215  AuxIO   BackfillManagerTask                  0x00007f4c02b1ee90  Backfilling items for a DCP Connection
            5374330    4  R   GleamBookUsers0     -0:01.24   0:00.35          1:04:27.94        5284  AuxIO   BackfillManagerTask                  0x00007f48cd106f10  Backfilling items for a DCP Connection
            5374332    4  R   GleamBookUsers0     -0:08.02   0:00.37          1:05:02.76        5280  AuxIO   BackfillManagerTask                  0x00007f4c0d6ee490  Backfilling items for a DCP Connection
            5374334    4  R   GleamBookUsers0     -0:01.60   0:00.27          1:05:31.21        5359  AuxIO   BackfillManagerTask                  0x00007f48cc88ed10  Backfilling items for a DCP Connection
            5375528    4  R   GleamBookUsers0     -0:03.31   0:00.28          1:04:28.85        4989  AuxIO   BackfillManagerTask                  0x00007f4c186c7d90  Backfilling items for a DCP Connection
            5375530    4  R   GleamBookUsers0     -0:02.11   0:00.36          1:05:31.10        5545  AuxIO   BackfillManagerTask                  0x00007f4c0d73a310  Backfilling items for a DCP Connection
            5375532    4  R   GleamBookUsers0     -0:08.98   0:00.15          1:05:39.84        5562  AuxIO   BackfillManagerTask                  0x00007f44334fd510  Backfilling items for a DCP Connection
            5375534    4  R   GleamBookUsers0     -0:02.47   0:00.21          1:07:04.19        5547  AuxIO   BackfillManagerTask                  0x00007f4473757690  Backfilling items for a DCP Connection
            

             ep_dcp_producer_count:                                                                                                                   43
             ep_num_auxio_threads:                                  8
            

            100 single collection, non-oso backfills for indexing would lead to all 10B items being read from disk 100 times, each time discarding 99% of the data, keeping only that of the relevant collection.
            The 500 items/s drain rate could then be considered to approximately reflect a 50,000 item/s backfilling rate.

            This inefficiency, combined with MB-49685, may be a sufficient cause for this low drain rate.

            I understand that indexing is facing some issues with OSO (MB-43796), however once this is resolved, it would be useful to test this scenario with OSO enabled.

            james.harrison James Harrison added a comment - MB-49685 has been raised to track improving backfill yielding. There are enough backfill manager tasks to saturate all 8 auxio threads: 5345241 4 R GleamBookUsers0 0:00.00 *0:01.47 1:04:18.32 5762 AuxIO BackfillManagerTask 0x00007f4c0d9d9c90 Backfilling items for a DCP Connection 5345243 4 R GleamBookUsers0 -0:07.35 0:15.59 1:06:24.77 5699 AuxIO BackfillManagerTask 0x00007f4c0da59410 Backfilling items for a DCP Connection 5345245 4 R GleamBookUsers0 0:00.00 *0:08.31 1:05:27.69 5711 AuxIO BackfillManagerTask 0x00007f4473480690 Backfilling items for a DCP Connection 5345247 4 R GleamBookUsers0 -0:01.88 0:15.33 1:06:06.65 5716 AuxIO BackfillManagerTask 0x00007f4c0d829410 Backfilling items for a DCP Connection 5347744 4 R GleamBookUsers0 0:00.00 *0:06.43 1:07:26.31 5777 AuxIO BackfillManagerTask 0x00007f4c0aa5c810 Backfilling items for a DCP Connection 5347746 4 R GleamBookUsers0 -0:03.60 0:19.31 1:08:12.56 5707 AuxIO BackfillManagerTask 0x00007f4c0b600b10 Backfilling items for a DCP Connection 5347748 4 R GleamBookUsers0 0:00.00 *0:08.98 1:08:21.47 5715 AuxIO BackfillManagerTask 0x00007f4aa1494790 Backfilling items for a DCP Connection 5347750 4 R GleamBookUsers0 0:00.00 *0:17.76 1:09:47.45 5714 AuxIO BackfillManagerTask 0x00007f4c04127410 Backfilling items for a DCP Connection 5355823 4 R GleamBookUsers0 -0:08.73 0:00.32 1:04:33.46 5753 AuxIO BackfillManagerTask 0x00007f46d8c1c790 Backfilling items for a DCP Connection 5355824 4 R GleamBookUsers0 0:00.00 *0:00.10 1:03:55.44 5750 AuxIO BackfillManagerTask 0x00007f4c0dc73810 Backfilling items for a DCP Connection 5355826 4 R GleamBookUsers0 -0:08.91 0:00.58 1:04:29.38 5757 AuxIO BackfillManagerTask 0x00007f4c0d6d7710 Backfilling items for a DCP Connection 5355828 4 R GleamBookUsers0 0:00.00 *0:11.83 1:04:35.16 5763 AuxIO BackfillManagerTask 0x00007f4a93f35790 Backfilling items for a DCP Connection 5369225 4 R GleamBookUsers0 -0:07.01 0:00.34 52:47.97 5722 AuxIO BackfillManagerTask 0x00007f4c186cb210 Backfilling items for a DCP Connection 5369227 4 R GleamBookUsers0 0:00.00 *0:08.02 52:20.05 5706 AuxIO BackfillManagerTask 0x00007f4a82cec710 Backfilling items for a DCP Connection 5369229 4 R GleamBookUsers0 -0:00.84 0:00.40 52:31.64 5708 AuxIO BackfillManagerTask 0x00007f4c186da110 Backfilling items for a DCP Connection 5369231 4 R GleamBookUsers0 -0:00.10 0:00.32 50:53.52 5722 AuxIO BackfillManagerTask 0x00007f4c0db86410 Backfilling items for a DCP Connection 5371280 4 R GleamBookUsers0 -0:02.69 0:00.62 1:00:06.14 5667 AuxIO BackfillManagerTask 0x00007f4c010d6910 Backfilling items for a DCP Connection 5371282 4 R GleamBookUsers0 -0:00.42 0:00.42 58:05.63 5657 AuxIO BackfillManagerTask 0x00007f4c0a73d590 Backfilling items for a DCP Connection 5371284 4 R GleamBookUsers0 -0:08.31 0:00.37 1:00:47.09 5658 AuxIO BackfillManagerTask 0x00007f4c060b1210 Backfilling items for a DCP Connection 5371286 4 R GleamBookUsers0 -0:01.88 0:00.23 56:36.30 5671 AuxIO BackfillManagerTask 0x00007f4c0b761d90 Backfilling items for a DCP Connection 5373090 4 R GleamBookUsers0 -0:08.69 0:00.22 59:57.31 5114 AuxIO BackfillManagerTask 0x00007f4c092b8f90 Backfilling items for a DCP Connection 5373092 4 R GleamBookUsers0 -0:06.43 0:00.34 1:03:27.88 5566 AuxIO BackfillManagerTask 0x00007f4c09578090 Backfilling items for a DCP Connection 5373094 4 R GleamBookUsers0 -0:06.78 0:00.22 1:00:57.73 5575 AuxIO BackfillManagerTask 0x00007f4c0d6bb490 Backfilling items for a DCP Connection 5373096 4 R GleamBookUsers0 -0:01.47 0:00.40 1:02:09.32 5572 AuxIO BackfillManagerTask 0x00007f48cc91fa10 Backfilling items for a DCP Connection 5374328 4 R GleamBookUsers0 -0:08.40 0:00.32 59:37.40 4215 AuxIO BackfillManagerTask 0x00007f4c02b1ee90 Backfilling items for a DCP Connection 5374330 4 R GleamBookUsers0 -0:01.24 0:00.35 1:04:27.94 5284 AuxIO BackfillManagerTask 0x00007f48cd106f10 Backfilling items for a DCP Connection 5374332 4 R GleamBookUsers0 -0:08.02 0:00.37 1:05:02.76 5280 AuxIO BackfillManagerTask 0x00007f4c0d6ee490 Backfilling items for a DCP Connection 5374334 4 R GleamBookUsers0 -0:01.60 0:00.27 1:05:31.21 5359 AuxIO BackfillManagerTask 0x00007f48cc88ed10 Backfilling items for a DCP Connection 5375528 4 R GleamBookUsers0 -0:03.31 0:00.28 1:04:28.85 4989 AuxIO BackfillManagerTask 0x00007f4c186c7d90 Backfilling items for a DCP Connection 5375530 4 R GleamBookUsers0 -0:02.11 0:00.36 1:05:31.10 5545 AuxIO BackfillManagerTask 0x00007f4c0d73a310 Backfilling items for a DCP Connection 5375532 4 R GleamBookUsers0 -0:08.98 0:00.15 1:05:39.84 5562 AuxIO BackfillManagerTask 0x00007f44334fd510 Backfilling items for a DCP Connection 5375534 4 R GleamBookUsers0 -0:02.47 0:00.21 1:07:04.19 5547 AuxIO BackfillManagerTask 0x00007f4473757690 Backfilling items for a DCP Connection ep_dcp_producer_count: 43 ep_num_auxio_threads: 8 100 single collection, non-oso backfills for indexing would lead to all 10B items being read from disk 100 times, each time discarding 99% of the data, keeping only that of the relevant collection. The 500 items/s drain rate could then be considered to approximately reflect a 50,000 item/s backfilling rate. This inefficiency, combined with MB-49685 , may be a sufficient cause for this low drain rate. I understand that indexing is facing some issues with OSO ( MB-43796 ), however once this is resolved, it would be useful to test this scenario with OSO enabled.

            Closing as duplicate of MB-48834, investigating magma backfills being slower than an equivalent couchstore backfill.

            MB-49685 also addresses a potential perf issue with collection filtered seqno order backfills. Once magma supports OSO backfills and GSI enables OSO, that issue may be less relevant.

            james.harrison James Harrison added a comment - Closing as duplicate of MB-48834 , investigating magma backfills being slower than an equivalent couchstore backfill. MB-49685 also addresses a potential perf issue with collection filtered seqno order backfills. Once magma supports OSO backfills and GSI enables OSO, that issue may be less relevant.

            Closing all Duplicates, Not a Bug, Incomplete, Duplicate

            ritam.sharma Ritam Sharma added a comment - Closing all Duplicates, Not a Bug, Incomplete, Duplicate

            People

              ritesh.agarwal Ritesh Agarwal
              ritesh.agarwal Ritesh Agarwal
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty