  1. Couchbase Server
  2. MB-62744

[System Test] Ingestion is slow - link state seems to change to "stopped" over and over


Details

    • Untriaged
    • 0
    • Unknown
    • Analytics Sprint 46, Analytics Sprint 47

    Description

      The cluster had run through one full cycle of the system test. It is a 4-node cluster (4 vCPUs + 64 GB) that ingested about 10 billion items.

      Workload -

      Type        Number of collections   Number of items (millions)   Total count (millions)
      Remote      80                      75                           6000
      Standalone  50                      8                            4000*
      Kafka       5                       10                           50

      *Some standalone collections have 8 million items and others have multiples of 8 million items. The total standalone doc count is 4000 million (4 billion) items.
      Number of links = 6 (2 remote + 2 external + 2 Kafka). 1 remote link and 1 Kafka link are active.
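As a quick cross-check of the "about 10 billion items" figure, the reported totals can be summed. This is only a sketch of the arithmetic using the numbers from the workload table; the dictionary name is illustrative and not part of the test framework.

```python
# Workload totals as reported in the table, in millions of items.
# The standalone total is the reported figure, not collections * items,
# because some standalone collections hold multiples of 8 million items.
reported_totals_millions = {"Remote": 6000, "Standalone": 4000, "Kafka": 50}

# The Remote and Kafka rows can be cross-checked as collections * items per collection.
assert 80 * 75 == reported_totals_millions["Remote"]
assert 5 * 10 == reported_totals_millions["Kafka"]

total_millions = sum(reported_totals_millions.values())
total_billions = total_millions / 1000
print(f"{total_millions} million items (~{total_billions:.2f} billion)")
```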

      The cluster went through scaling operations: from 4 to 8 to 16 to 32 nodes, then back to 8 and to 4 nodes.

      The second cycle was meant to repeat the same workload, but remote ingestion is very slow. After almost 18 hours, ingestion is still not complete. In comparison, during the first cycle, remote ingestion completed in around 6 to 8 hours.
      Ingestion is still incomplete for a number of datasets. Some examples -

      Database0cFsFELXI.scope0NPwGeHgC.remotedatasetCuxGntPc = 52928724
      Database0cFsFELXI.scope0NPwGeHgC.remotedatasetSdKQaBRi = 52934946
      Database0cFsFELXI.scope0NPwGeHgC.remotedatasetStiHRVEF = 52941446
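To put these counts in context, a rough progress estimate can be computed. This sketch assumes that each figure is the dataset's current item count and that every remote collection should reach 75 million items (the "Remote" row of the workload table). Neither assumption is confirmed by the logs, and `EXPECTED_ITEMS` and `percent_complete` are illustrative names.

```python
# Item counts quoted in the report for three remote datasets
# (dataset names shortened to their last component).
observed_counts = {
    "remotedatasetCuxGntPc": 52_928_724,
    "remotedatasetSdKQaBRi": 52_934_946,
    "remotedatasetStiHRVEF": 52_941_446,
}

# Assumption: each remote collection should reach 75 million items,
# per the "Remote" row of the workload table.
EXPECTED_ITEMS = 75_000_000


def percent_complete(count, expected=EXPECTED_ITEMS):
    """Return ingestion progress as a percentage of the expected item count."""
    return 100.0 * count / expected


for name, count in observed_counts.items():
    remaining = EXPECTED_ITEMS - count
    print(f"{name}: {percent_complete(count):.1f}% ingested, {remaining:,} items remaining")
```

Under those assumptions, the three datasets are each at roughly 70% of their expected size.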
      

      Before the second batch of collections was created, the link was disconnected. All the remote datasets were then created, and the link was reconnected.

      I see messages like these -

      on node 006

      "entityId":"linkIUWEhdXs/default1", "state":"STARTING", "prev state":"STOPPED", "suspended":false})
      2024-07-10T18:01:12.523+00:00
       
      "entityId":"linkIUWEhdXs/default1", "state":"STARTING", "prev state":"STOPPED", "suspended":false})
      2024-07-10T17:52:55.357+00:00 INFO CBAS.adapter.CouchbaseConnector [cbas:linkIUWEhdXs:default1:f8bbb0059527fb8c59160733f2baae59:0 idle connection watchdog] will notify CC on idle streams after 120 seconds
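To gauge how often the link flips state, transitions like the ones above could be counted from the analytics logs. This is a minimal sketch: the regular expression is written against the two excerpts quoted here, and the real log lines may carry more fields or a different layout. `count_transitions` is a hypothetical helper, not an existing tool.

```python
import re
from collections import Counter

# Pattern based only on the excerpts quoted in this report (an assumption
# about the log layout, not a documented format).
TRANSITION_RE = re.compile(
    r'"entityId":"(?P<entity>[^"]+)",\s*'
    r'"state":"(?P<state>[^"]+)",\s*'
    r'"prev state":"(?P<prev>[^"]+)"'
)


def count_transitions(lines):
    """Count (entity, previous state, new state) triples found in log lines."""
    counts = Counter()
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            counts[(match["entity"], match["prev"], match["state"])] += 1
    return counts


sample = [
    '"entityId":"linkIUWEhdXs/default1", "state":"STARTING", "prev state":"STOPPED", "suspended":false})',
    "2024-07-10T18:01:12.523+00:00",
    '"entityId":"linkIUWEhdXs/default1", "state":"STARTING", "prev state":"STOPPED", "suspended":false})',
]
transitions = count_transitions(sample)
for (entity, prev, state), n in transitions.items():
    print(f"{entity}: {prev} -> {state} x{n}")
```

Running the same function over a full log file from the cbcollect archives would show whether the STOPPED to STARTING cycle repeats at a regular interval.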
      
      

      I am unsure whether this indicates a problem. Also, the /analytics/status/ingestion API intermittently returns responses like this -

      {
          "links": [
              {
                  "name": "linkIUWEhdXs",
                  "status": "stopped",
                  "state": []
              }
          ]
      }
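When polling this endpoint repeatedly, the intermittent "stopped" responses could be flagged with a small check like the following. It is a sketch that only inspects the JSON shape shown above; fetching the response, authentication, and the polling interval are left out, and `stopped_links` is a hypothetical helper name.

```python
import json


def stopped_links(payload):
    """Return the names of links whose status is "stopped" in a status payload."""
    return [
        link["name"]
        for link in payload.get("links", [])
        if link.get("status") == "stopped"
    ]


# Response body copied from the report.
response_text = """
{
    "links": [
        {"name": "linkIUWEhdXs", "status": "stopped", "state": []}
    ]
}
"""
payload = json.loads(response_text)
print(stopped_links(payload))
```

Logging a timestamp alongside each non-empty result would make it possible to correlate the "stopped" responses with the state transitions in the node logs.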
      

      cbcollect ->

      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-11T090009-ns_1%40svc-da-node-006.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-11T090009-ns_1%40svc-da-node-008.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-11T090009-ns_1%40svc-da-node-016.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-11T090009-ns_1%40svc-da-node-022.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip

      Remote cluster logs ->

      https://cb-engineering.s3.amazonaws.com/RemoteClusterMB62863/collectinfo-2024-07-11T102339-ns_1%40svc-d-node-001.cbzexddeqouqo8iv.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/RemoteClusterMB62863/collectinfo-2024-07-11T102339-ns_1%40svc-d-node-002.cbzexddeqouqo8iv.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/RemoteClusterMB62863/collectinfo-2024-07-11T102339-ns_1%40svc-d-node-003.cbzexddeqouqo8iv.sandbox.nonprod-project-avengers.com.zip


            People

              umang.agrawal Umang
              michael.blow Michael Blow
