Couchbase Server / MB-52490

[30TB, 1% KV DGM, CBAS]: Rebalance-in of 1 KV node has been stuck for 35 hours. No movement in data/vBuckets.


Details

    • Type: Bug
    • Status: Reopened
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version: 7.1.1
    • Fix Version: Morpheus
    • Component: couchbase-bucket
    • Environment: Enterprise Edition 7.1.1 build 3067

    Description

      1. Create a 3-node KV cluster.
      2. Create a magma bucket with 1 replica and RAM quota = 200GB.
      3. Load 10B 1024-byte documents. This is 20TB of active + replica data and puts the bucket at 1% DGM (a sizing sketch follows these steps).
      4. Upsert the whole data set to create 50% fragmentation.
      5. Create 25 datasets on cbas ingesting data from different collections and let the ingestion start. Start a SQL++ load at 10 QPS asynchronously.
      6. Start an async CRUD data load:

        Read Start: 0
        Read End: 100000000
        Update Start: 0
        Update End: 100000000
        Expiry Start: 0
        Expiry End: 0
        Delete Start: 100000000
        Delete End: 200000000
        Create Start: 200000000
        Create End: 300000000
        Final Start: 200000000
        Final End: 300000000
        

      7. Rebalance in 1 KV node. The rebalance appears to have been stuck for hours...
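
      A rough sizing sketch behind steps 2-3 (assuming the 200GB figure is the bucket's total RAM quota, which is consistent with ramQuota=68267 MB x 3 nodes in the QE command below):

        # Back-of-envelope numbers behind "20TB of active + replica" and "1% DGM".
        # Treating RAM=200GB as the total bucket quota is an assumption here.
        num_docs = 10_000_000_000            # 10B documents
        doc_size = 1024                      # bytes per document
        replicas = 1

        active_bytes = num_docs * doc_size
        total_bytes = active_bytes * (1 + replicas)
        ram_quota_bytes = 200 * 1024**3      # ~200GB bucket RAM quota

        print(f"active data    : {active_bytes / 1e12:.1f} TB")         # ~10.2 TB
        print(f"active+replica : {total_bytes / 1e12:.1f} TB")          # ~20.5 TB
        print(f"resident (DGM) : {ram_quota_bytes / total_bytes:.1%}")  # ~1.0%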

      QE Test

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job3.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.ClusterOpsVolume,nodes_init=3,graceful=True,skip_cleanup=True,num_items=100000000,num_buckets=1,bucket_names=GleamBook,doc_size=1300,bucket_type=membase,eviction_policy=fullEviction,iterations=2,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=25,pc=10,index_nodes=0,cbas_nodes=1,fts_nodes=0,ops_rate=200000,ramQuota=68267,doc_ops=create:update:delete:read,mutation_perc=100,rebl_ops_rate=50000,key_type=RandomKey -m rest'
      

      Attachments

        Issue Links


          Activity

            Paolo Cocchi added a comment (edited)

            We seem to be hitting MB-44562 again here.

            We hit the max number of backfills that can run on a node:

             ep_dcp_max_running_backfills:                                                                                           4096
             ep_dcp_num_running_backfills:                                                                                           4096
            

            Replication streams 1010/1011 stay in the pending queue:

             eq_dcpq:replication:ns_1@172.23.110.67->ns_1@172.23.110.70:GleamBookUsers0:backfill_num_pending:                        2
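
             A minimal sketch of the saturation condition implied by these counters (illustrative "key: value" parsing of a stats dump, not a formal cbstats parser):

              # When running backfills reach the per-node limit, further stream
              # backfills are queued and show up as backfill_num_pending > 0.
              stats = {}
              with open("stats.log") as f:
                  for line in f:
                      key, sep, value = line.rpartition(":")
                      if sep and value.strip().isdigit():
                          stats[key.strip()] = int(value.strip())

              running = stats.get("ep_dcp_num_running_backfills", 0)
              limit = stats.get("ep_dcp_max_running_backfills", 0)
              if limit and running >= limit:
                  print(f"backfill limit saturated: {running}/{limit}")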
            

            We are close to 10k outbound streams on the node:

            cbcollect_info_ns_1@172.23.110.67_20220609-182552 % grep -E "stream_.*opaque" stats.log | wc -l
                9623
            

            Most of them are cbas streams:

            cbcollect_info_ns_1@172.23.110.67_20220609-182552 % grep -E "stream_.*opaque" stats.log | grep "replication" | wc -l
                 673
            cbcollect_info_ns_1@172.23.110.67_20220609-182552 % grep -E "stream_.*opaque" stats.log | grep "cbas" | wc -l
                8950
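
             A small sketch generalizing the greps above: counting the per-vbucket stream entries in stats.log by DCP connection type (the substring matching simply mirrors the greps; the exact stat-key format varies by producer name):

              import re
              from collections import Counter

              counts = Counter()
              with open("stats.log") as f:
                  for line in f:
                      if not re.search(r"stream_.*opaque", line):
                          continue
                      for conn_type in ("replication", "cbas", "fts"):
                          if conn_type in line:
                              counts[conn_type] += 1
                              break
                      else:
                          counts["other"] += 1

              for conn_type, n in counts.most_common():
                  print(conn_type, n)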
            

            Hey Ritesh Agarwal, could you ask the cbas team to have a look and check whether the high number of open streams is normal/expected behaviour, please?
            We had a similar problem in MB-44562 with FTS. At some point FTS had introduced a bug where stale streams were left open.

            Meanwhile, I'm reviewing some possible improvements in the way KV handles this kind of scenario.

            Update
            Ritesh Agarwal There's also an ongoing discussion in MB-51950 where the same CBAS behaviour (i.e., creating one stream per collection) has pushed a single node to creating ~125k streams. CBAS seems to have set the fix for that to Morpheus (MB-45591).


            Ritesh Agarwal added a comment

            Update from Michael Blow: it is normal / expected in 7.1.1 to have one vbucket stream per mapped collection.
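
            For context, a rough back-of-envelope that is consistent with the ~8,950 cbas streams seen on one node, assuming (hypothetically) the default 1,024 vbuckets spread roughly evenly across the 3 KV nodes and one stream per (active vbucket, mapped collection), with the 25 datasets from the description each mapped to a different collection:

              # Hypothetical check: default 1024 vbuckets, 3 KV nodes, 25 mapped
              # collections, one DCP stream per (active vbucket, collection).
              vbuckets_per_node = 1024 // 3        # ~341 active vbuckets per node
              mapped_collections = 25
              print(vbuckets_per_node * mapped_collections)  # 8525, same order as the 8950 observed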

            Paolo Cocchi added a comment (edited)

            Hi Ritesh Agarwal,
            we have a possible KV improvement for this up for review on Gerrit, and it would be good to verify it against the real scenario before it is merged.
            Could you please repeat this test on the toy build at http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/202206290/ ?
            Thanks


            Ritesh Agarwal added a comment

            Sure Paolo Cocchi, started the run.

            Ritesh Agarwal added a comment (edited)

            Hi Paolo Cocchi, I ran the test on the toy build you shared and there was no progress in the rebalance during the initial >8 hours or so, so the test failed. After checking the cluster, I saw that the rebalance did eventually finish in 16 hours of total time, although there were no data mutations on the cluster during the last 8 hours because the test had already finished.

            Here are the logs for checking why there was zero progress during the initial 8 hours of the rebalance:
            http://supportal.couchbase.com/snapshot/27e6fb34222441ce2447c176748afb16::0
            s3://cb-customers-secure/rebalance/2022-07-07/collectinfo-2022-07-07t184505-ns_1@172.23.110.64.zip
            s3://cb-customers-secure/rebalance/2022-07-07/collectinfo-2022-07-07t184505-ns_1@172.23.110.65.zip
            s3://cb-customers-secure/rebalance/2022-07-07/collectinfo-2022-07-07t184505-ns_1@172.23.110.66.zip
            s3://cb-customers-secure/rebalance/2022-07-07/collectinfo-2022-07-07t184505-ns_1@172.23.110.67.zip
            s3://cb-customers-secure/rebalance/2022-07-07/collectinfo-2022-07-07t184505-ns_1@172.23.110.68.zip


            People

              Assignee: Daniel Owen
              Reporter: Ritesh Agarwal
              Votes: 0
              Watchers: 9


                Gerrit Reviews

                  There are 5 open Gerrit changes
