Couchbase Server / MB-46099

130 node testing: Rebalance in 1 node in a 85 node cluster failed due to buckets_cleanup_failed


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Cheshire-Cat
    • Fix Version/s: 7.0.2
    • Component/s: ns_server
    • Environment: 7.0.0-5085

    Description

      Steps:
      1. Create an 85-node cluster.
      2. Create a bucket and 20 collections.
      3. Start the load: maxttl=60, durability=Majority.
      4. Rebalance in 172.23.96.89. The rebalance failed due to bad replicas and then buckets_cleanup_failed.

      QE test

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job4.ini -p bucket_storage=couchstore,bucket_eviction_policy=fullEviction,rerun=False -t volumetests.Magma.volume.SystemTestMagma,nodes_init=85,replicas=1,skip_cleanup=True,num_items=1000000000,num_buckets=1,bucket_names=GleamBook,doc_size=256,bucket_type=membase,compression_mode=off,iterations=100,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=info,skip_cleanup=True,key_size=18,randomize_doc_size=True,randomize_value=True,assert_crashes_on_load=True,doc_ops=update,maxttl=60,num_collections=20,durability=Majority,fragmentation=20,pc=2,crashes=10,sdk_client_pool=True -m rest'
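
      For reference, a minimal sketch (not part of the original test) of what step 4 amounts to against the ns_server REST API; the orchestrator address, credentials, and port 8091 are assumptions, and the actual test drives this through testrunner:

      import requests

      ORCH = "http://172.23.106.155:8091"   # orchestrator node (assumed)
      AUTH = ("Administrator", "password")  # assumed credentials

      # Step 4: add the incoming node, then rebalance the whole cluster.
      requests.post(f"{ORCH}/controller/addNode", auth=AUTH,
                    data={"hostname": "172.23.96.89",
                          "user": "Administrator",
                          "password": "password",
                          "services": "kv"}).raise_for_status()

      # knownNodes must list every node's otpNode name; nothing is ejected.
      nodes = requests.get(f"{ORCH}/pools/default", auth=AUTH).json()["nodes"]
      otp_nodes = ",".join(n["otpNode"] for n in nodes)
      requests.post(f"{ORCH}/controller/rebalance", auth=AUTH,
                    data={"knownNodes": otp_nodes,
                          "ejectedNodes": ""}).raise_for_status()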
      

      Attachments

        Issue Links


          Activity

            ritesh.agarwal Ritesh Agarwal created issue -
            ritesh.agarwal Ritesh Agarwal made changes -
            Field Original Value New Value
            Link to Log File, atop/blg, CBCollectInfo, Core dump http://supportal.couchbase.com/snapshot/28b5d7a41e42afa4b17f82f6d79bdfde::0

            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.95.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.99.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.96.89.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.163.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.164.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.53.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.56.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.62.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.120.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.124.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.125.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.170.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.123.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.124.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.127.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.128.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.129.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.131.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.135.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.48.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.208.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.240.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.241.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.242.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.243.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.245.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.247.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.248.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.249.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.123.10.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.123.11.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.123.12.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.123.4.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.123.5.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.123.6.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.123.7.zip
            ritesh.agarwal Ritesh Agarwal made changes -
            Summary: 130 node testing: Rebalance in 1 node in a 85 node cluster leads to bad replicators port rebalance → 130 node testing: Rebalance in 1 node in a 85 node cluster leads to bad replicators post rebalance
            ritesh.agarwal Ritesh Agarwal made changes -
            Link to Log File, atop/blg, CBCollectInfo, Core dump http://supportal.couchbase.com/snapshot/28b5d7a41e42afa4b17f82f6d79bdfde::0
            http://supportal.couchbase.com/snapshot/28b5d7a41e42afa4b17f82f6d79bdfde::1


            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.155.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.106.159.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.1.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.110.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.112.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.12.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.16.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.18.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.31.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.32.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.107.33.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.119.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.122.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.123.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.126.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.127.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.128.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.120.129.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.115.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.116.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.126.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.130.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.132.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.133.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.134.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.136.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.139.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.140.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.141.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.142.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.143.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.144.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.152.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.153.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.154.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.155.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.156.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.157.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.162.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.76.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.121.79.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.221.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.223.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.224.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.225.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.226.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.227.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.228.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.229.zip
            s3://cb-customers-secure/bad_replicas/2021-05-04/collectinfo-2021-05-04t121859-ns_1@172.23.122.231.zip
            ritesh.agarwal Ritesh Agarwal made changes -
            Summary: 130 node testing: Rebalance in 1 node in a 85 node cluster leads to bad replicators post rebalance → 130 node testing: Rebalance in 1 node in a 85 node cluster leads to bad replicators and rebalance failed.
            owend Daniel Owen added a comment - - edited

            Looking at the orchestrator (172.23.106.155), we see two different rebalance failures.
            Rebalance failure 1

            [user:info,2021-05-04T05:16:30.211-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.106.62','ns_1@172.23.121.76',
                                             'ns_1@172.23.123.6','ns_1@172.23.121.128',
                                             'ns_1@172.23.120.124','ns_1@172.23.122.226',
                                             'ns_1@172.23.121.136','ns_1@172.23.122.248',
                                             'ns_1@172.23.107.18','ns_1@172.23.122.243',
                                             'ns_1@172.23.121.153','ns_1@172.23.96.89',
                                             'ns_1@172.23.121.127','ns_1@172.23.120.125',
                                             'ns_1@172.23.120.129','ns_1@172.23.122.221',
                                             'ns_1@172.23.121.123','ns_1@172.23.121.131',
                                             'ns_1@172.23.122.225','ns_1@172.23.106.99',
                                             'ns_1@172.23.107.12','ns_1@172.23.123.5',
                                             'ns_1@172.23.107.110','ns_1@172.23.121.162',
                                             'ns_1@172.23.122.247','ns_1@172.23.121.154',
                                             'ns_1@172.23.120.128','ns_1@172.23.120.120',
                                             'ns_1@172.23.123.10','ns_1@172.23.123.4',
                                             'ns_1@172.23.120.123','ns_1@172.23.106.53',
                                             'ns_1@172.23.121.155','ns_1@172.23.121.152',
                                             'ns_1@172.23.107.112','ns_1@172.23.107.1',
                                             'ns_1@172.23.121.144','ns_1@172.23.123.7',
                                             'ns_1@172.23.106.164','ns_1@172.23.122.242',
                                             'ns_1@172.23.107.33','ns_1@172.23.106.159',
                                             'ns_1@172.23.121.130','ns_1@172.23.121.141',
                                             'ns_1@172.23.106.56','ns_1@172.23.121.133',
                                             'ns_1@172.23.122.223','ns_1@172.23.122.231',
                                             'ns_1@172.23.107.31','ns_1@172.23.106.155',
                                             'ns_1@172.23.121.48','ns_1@172.23.121.139',
                                             'ns_1@172.23.121.156','ns_1@172.23.122.229',
                                             'ns_1@172.23.122.245','ns_1@172.23.121.142',
                                             'ns_1@172.23.106.163','ns_1@172.23.122.240',
                                             'ns_1@172.23.121.134','ns_1@172.23.123.12',
                                             'ns_1@172.23.121.116','ns_1@172.23.122.228',
                                             'ns_1@172.23.120.126','ns_1@172.23.107.32',
                                             'ns_1@172.23.120.122','ns_1@172.23.121.157',
                                             'ns_1@172.23.107.16','ns_1@172.23.121.143',
                                             'ns_1@172.23.122.241','ns_1@172.23.121.140',
                                             'ns_1@172.23.121.115','ns_1@172.23.121.132',
                                             'ns_1@172.23.122.208','ns_1@172.23.121.135',
                                             'ns_1@172.23.120.170','ns_1@172.23.121.129',
                                             'ns_1@172.23.122.224','ns_1@172.23.121.124',
                                             'ns_1@172.23.121.126','ns_1@172.23.122.227',
                                             'ns_1@172.23.106.95','ns_1@172.23.122.249',
                                             'ns_1@172.23.120.127','ns_1@172.23.123.11',
                                             'ns_1@172.23.120.119','ns_1@172.23.121.79'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 41a4e6ef0de9120bda567a6716bde64a
            [user:error,2021-05-04T05:17:15.953-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:log_rebalance_completion:1405]Rebalance exited with reason bad_replicas.
            

            A bit more info:

            [user:info,2021-05-04T05:17:15.951-07:00,ns_1@172.23.106.155:<0.9679.17>:ns_rebalancer:verify_replication:879]Bad replicators after rebalance:
            Missing = [{'ns_1@172.23.106.155','ns_1@172.23.107.1',208},
                       {'ns_1@172.23.106.155','ns_1@172.23.107.1',209},
                       {'ns_1@172.23.106.155','ns_1@172.23.107.18',91},
                       {'ns_1@172.23.106.155','ns_1@172.23.107.18',328},
                       {'ns_1@172.23.106.155','ns_1@172.23.107.32',69},
                       {'ns_1@172.23.106.155','ns_1@172.23.120.122',109},
                       {'ns_1@172.23.106.155','ns_1@172.23.120.125',38},
                       {'ns_1@172.23.106.155','ns_1@172.23.120.128',868},
                       {'ns_1@172.23.106.155','ns_1@172.23.121.128',74},
                       {'ns_1@172.23.106.155','ns_1@172.23.121.140',520},
                       {'ns_1@172.23.106.155','ns_1@172.23.121.143',664},
                       {'ns_1@172.23.106.155','ns_1@172.23.121.156',278},
                       {'ns_1@172.23.106.164','ns_1@172.23.106.155',0},
                       {'ns_1@172.23.106.164','ns_1@172.23.106.155',1},
                       {'ns_1@172.23.120.123','ns_1@172.23.106.155',124},
                       {'ns_1@172.23.121.126','ns_1@172.23.106.155',532},
                       {'ns_1@172.23.121.133','ns_1@172.23.106.155',332},
                       {'ns_1@172.23.121.142','ns_1@172.23.106.155',119},
                       {'ns_1@172.23.121.153','ns_1@172.23.106.155',714},
                       {'ns_1@172.23.121.162','ns_1@172.23.106.155',700},
                       {'ns_1@172.23.122.224','ns_1@172.23.106.155',916},
                       {'ns_1@172.23.122.229','ns_1@172.23.106.155',892},
                       {'ns_1@172.23.122.245','ns_1@172.23.106.155',761},
                       {'ns_1@172.23.122.245','ns_1@172.23.106.155',810}]
            Extras = []
            [ns_server:info,2021-05-04T05:17:15.951-07:00,ns_1@172.23.106.155:rebalance_agent<0.815.0>:rebalance_agent:handle_down:290]Rebalancer process <0.9679.17> died (reason bad_replicas).
            

            The common node is 172.23.106.155.
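
            (Not from the original comment, just an illustrative check: the Missing triples above appear to be {source node, destination node, vbucket}, so grouping them by node shows the overlap.)

            from collections import Counter

            # A handful of the triples copied from the log excerpt above; the rest are elided.
            missing = [
                ("ns_1@172.23.106.155", "ns_1@172.23.107.1", 208),
                ("ns_1@172.23.106.155", "ns_1@172.23.107.18", 91),
                ("ns_1@172.23.106.164", "ns_1@172.23.106.155", 0),
                ("ns_1@172.23.122.245", "ns_1@172.23.106.155", 810),
            ]

            # Count how often each node appears as source or destination;
            # ns_1@172.23.106.155 shows up in every missing replication stream.
            counts = Counter(node for src, dst, _vb in missing for node in (src, dst))
            print(counts.most_common(3))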
            On 172.23.106.155 we see that memcached exited with status 137 (i.e. 128 + 9, SIGKILL), which means it was externally terminated (possibly using kill).

            [ns_server:info,2021-05-04T05:17:01.303-07:00,babysitter_of_ns_1@cb.local:<0.132.0>:supervisor_cushion:handle_info:54]Cushion managed supervisor for memcached failed:  {abnormal,137}
            

            Rebalance failure 2

            [user:info,2021-05-04T05:21:36.329-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.106.62','ns_1@172.23.121.76',
            [user:error,2021-05-04T05:21:52.145-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:log_rebalance_completion:1405]Rebalance exited with reason {buckets_cleanup_failed,
            

            owend Daniel Owen made changes -
            Assignee Dave Finlay [ dfinlay ] Daniel Owen [ owend ]
            sumedh.basarkod Sumedh Basarkod (Inactive) made changes -
            Environment: 5.0.0-5085 → 7.0.0-5085
            owend Daniel Owen added a comment -

            Hi Ritesh Agarwal. For the first rebalance failure the memcached process was externally terminated (with kill). It does not appear to be due to a Linux OOM. Does the test perform a kill of the memcached process?

            thanks

            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Ritesh Agarwal [ ritesh.agarwal ]
            ritesh.agarwal Ritesh Agarwal made changes -
            Description Steps:
            1. Create a 85 node cluster
            2. Create a bucket and 20 collections
            3. Start load: maxttl=60, durability=Majority
            4. Rebalance in 172.23.96.89. Rebalance failed due to bad replicas. buckets_cleanup_failed
            Steps:
            1. Create a 85 node cluster
            2. Create a bucket and 20 collections
            3. Start load: maxttl=60, durability=Majority
            4. Rebalance in 172.23.96.89. Rebalance failed due to bad replicas. buckets_cleanup_failed

            {code:title=QE test}
            guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job4.ini -p bucket_storage=couchstore,bucket_eviction_policy=fullEviction,rerun=False -t volumetests.Magma.volume.SystemTestMagma,nodes_init=85,replicas=1,skip_cleanup=True,num_items=1000000000,num_buckets=1,bucket_names=GleamBook,doc_size=256,bucket_type=membase,compression_mode=off,iterations=100,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=info,skip_cleanup=True,key_size=18,randomize_doc_size=True,randomize_value=True,assert_crashes_on_load=True,doc_ops=update,maxttl=60,num_collections=20,durability=Majority,fragmentation=20,pc=2,crashes=10,sdk_client_pool=True -m rest'
            {code}

            ritesh.agarwal Ritesh Agarwal added a comment -

            Hi Daniel Owen, you are right: the first rebalance was an intentional abort; we then resumed the rebalance and saw the second rebalance failure.

            Please note that when I restarted the test I did not see this issue, hence it is an intermittent problem.
            ritesh.agarwal Ritesh Agarwal made changes -
            Assignee Ritesh Agarwal [ ritesh.agarwal ] Daniel Owen [ owend ]
            owend Daniel Owen added a comment -

            OK, thanks Ritesh Agarwal. So the rebalance of interest is the second one, which failed due to buckets_cleanup_failed; I will take a look at it. Thanks.

            owend Daniel Owen made changes -
            Summary: 130 node testing: Rebalance in 1 node in a 85 node cluster leads to bad replicators and rebalance failed. → 130 node testing: Rebalance in 1 node in a 85 node cluster failed due to buckets_cleanup_failed
            owend Daniel Owen made changes -
            Component/s couchbase-bucket [ 10173 ]
            Component/s ns_server [ 10019 ]
            owend Daniel Owen made changes -
            Sprint KV-Engine CC Final Sprint [ 1548 ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            owend Daniel Owen made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            ritam.sharma Ritam Sharma made changes -
            Labels: 130nodetest → 130nodetest affects-cc-testing
            owend Daniel Owen added a comment -

            Looking at rebalance failure 2 on 172.23.106.155 we see

            [user:error,2021-05-04T05:21:52.145-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:log_rebalance_completion:1405]Rebalance exited with reason {buckets_cleanup_failed,
                                             ['ns_1@172.23.120.126',
                                              'ns_1@172.23.122.223']}.
            

            On node 172.23.120.126 the last entry in ns_server.debug.log is at 2021-05-04T05:21:04.468.
            On node 172.23.122.223 the last entry in ns_server.debug.log is at 2021-05-04T05:21:05.408.
            Therefore the logs don't cover the period when buckets_cleanup_failed is reported.

            However on 172.23.106.155 we do see

            [rebalance:error,2021-05-04T05:21:52.137-07:00,ns_1@172.23.106.155:<0.25724.23>:ns_rebalancer:maybe_cleanup_old_buckets:941]Failed to cleanup old buckets on node 'ns_1@172.23.120.126': {badrpc,
                                                                          {'EXIT',
                                                                           timeout}}
            [rebalance:error,2021-05-04T05:21:52.137-07:00,ns_1@172.23.106.155:<0.25724.23>:ns_rebalancer:maybe_cleanup_old_buckets:941]Failed to cleanup old buckets on node 'ns_1@172.23.122.223': {badrpc,
                                                                          {'EXIT',
                                                                           timeout}}
            

            I strongly suspect it is a duplicate of MB-45594.
            However, I will let the ns_server team confirm.

            owend Daniel Owen made changes -
            Status In Progress [ 3 ] Open [ 1 ]
            owend Daniel Owen made changes -
            Link This issue duplicates MB-45594 [ MB-45594 ]
            owend Daniel Owen made changes -
            Component/s ns_server [ 10019 ]
            Component/s couchbase-bucket [ 10173 ]
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Meni Hillel [ JIRAUSER25407 ]
            dfinlay Dave Finlay made changes -
            Link This issue is duplicated by MB-46067 [ MB-46067 ]
            dfinlay Dave Finlay made changes -
            Link This issue is duplicated by MB-45594 [ MB-45594 ]

            meni.hillel Meni Hillel (Inactive) added a comment -

            I agree. This seems quite similar to MB-45594 and a couple of other tickets with the same behavior. Assigning to Aliaksey for confirmation.
            meni.hillel Meni Hillel (Inactive) made changes -
            Assignee Meni Hillel [ JIRAUSER25407 ] Aliaksey Artamonau [ aliaksey artamonau ]

            Aliaksey Artamonau Aliaksey Artamonau (Inactive) added a comment - edited

            I've not had a lot of luck with logs from this ticket. But the information on MB-45594 was somewhat useful.

            The first observation is that these issues always occur during log collection. The cluster from MB-45594 fortunately has atop collection enabled with 10-second granularity. Looking at the atop archives, all three nodes that timed out had the disk partition that chronicle writes to at 100% utilization, and all of it seems to be cbcollect_info writing temporary files.

            That being said, I don't know if just that is enough to explain the timeouts: the faulty nodes lag behind by up to a minute compared to other nodes. And it's not like there are lots of mutations happening.

            I'll keep looking.

            In addition, I'll use this ticket to add more diagnostics in chronicle.


            Aliaksey Artamonau Aliaksey Artamonau (Inactive) added a comment -

            Some improvements to make:

            1. I did some basic testing, and I see that there's an order of magnitude difference in the time that fsync and fdatasync take on a system with a lot of writes generated by external processes (trying to simulate cbcollect_info here). I've seen fsync latency over 100ms, but fdatasync latency never broke 10ms. And I suspect that on the test clusters the fsync latency may actually be as high as 1 second (based on some circumstantial evidence). My speculation as to why there's such a significant difference between fsync and fdatasync is that the former needs to sync the filesystem journal. But it probably can only sync all of the journal, so the effects of outside writes are more visible. I remember that fdatasync posed some complexities, which is why I decided not to use it. But it may be worth reconsidering (see the sketch after this list).

            2. Currently there's no logic in chronicle that would try to coalesce adjacent writes on downstream nodes. And each write results in its own fsync. This makes some sense for nodes that are up to date: writes are batched on the leader, and each node needs to respond asap for those writes to be acknowledged. But for nodes that lag behind it doesn't make sense to continue fsyncing each individual write.
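
            A rough, illustrative sketch of the point-1 measurement (assumptions: Linux, Python's os module, and a separate process flooding the same partition with writes to stand in for cbcollect_info):

            import os, time

            def worst_sync_latency(path, sync_fn, iters=100):
                """Append small records, syncing after each write; return the worst latency seen."""
                worst = 0.0
                fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
                try:
                    for _ in range(iters):
                        os.write(fd, b"x" * 4096)
                        start = time.monotonic()
                        sync_fn(fd)                      # os.fsync vs os.fdatasync
                        worst = max(worst, time.monotonic() - start)
                finally:
                    os.close(fd)
                return worst

            print("fsync worst case:     %.1f ms" % (worst_sync_latency("/tmp/wal.fsync", os.fsync) * 1e3))
            print("fdatasync worst case: %.1f ms" % (worst_sync_latency("/tmp/wal.fdatasync", os.fdatasync) * 1e3))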

            lynn.straus Lynn Straus made changes -
            Due Date 25/May/21
            lynn.straus Lynn Straus made changes -
            Due Date 25/May/21 21/May/21
            lynn.straus Lynn Straus made changes -
            Due Date 21/May/21 24/May/21
            lynn.straus Lynn Straus made changes -
            Link This issue blocks MB-44894 [ MB-44894 ]
            lynn.straus Lynn Straus made changes -
            Labels: 130nodetest affects-cc-testing → 130nodetest affects-cc-testing approved-for-7.0.0
            dfinlay Dave Finlay made changes -
            Fix Version/s 7.0.1 [ 17104 ]
            Fix Version/s Cheshire-Cat [ 15915 ]
            lynn.straus Lynn Straus made changes -
            Labels: 130nodetest affects-cc-testing approved-for-7.0.0 → 130nodetest approved-for-7.0.0
            lynn.straus Lynn Straus made changes -
            Labels: 130nodetest approved-for-7.0.0 → 130nodetest
            meni.hillel Meni Hillel (Inactive) made changes -
            Fix Version/s 7.1.0 [ 17117 ]
            Fix Version/s 7.0.1 [ 17104 ]
            lynn.straus Lynn Straus made changes -
            Link This issue blocks MB-44894 [ MB-44894 ]
            meni.hillel Meni Hillel (Inactive) made changes -
            Fix Version/s Neo [ 17615 ]
            Fix Version/s 7.1.0 [ 17117 ]

            Aliaksey Artamonau Aliaksey Artamonau (Inactive) added a comment - edited

            Have there been any recent occurrences of this issue? If so, I'd be interested to look at the logs.

            wayne Wayne Siu made changes -
            Link This issue relates to MB-47169 [ MB-47169 ]
            artem Artem Stemkovski added a comment - This one? https://issues.couchbase.com/browse/MB-47169

            ritesh.agarwal Ritesh Agarwal added a comment -

            Hello Aliaksey Artamonau, we do not run the 130-node test very often, so it is difficult to say whether there have been any recent occurrences.
            MB-47169 may be another instance of this issue; I'm not too sure, but the error seems similar.


            Aliaksey Artamonau Aliaksey Artamonau (Inactive) added a comment -

            I've merged changes to chronicle to mitigate this. In addition, cbcollect_info (which is what triggered the issue in this test) should be less disruptive now. So I'm going to close the ticket.

            The investigation into the related MB-47169 is still ongoing.

            Aliaksey Artamonau Aliaksey Artamonau (Inactive) made changes -
            Fix Version/s 7.0.1 [ 17104 ]
            Fix Version/s Neo [ 17615 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.1-5960 contains ns_server commit 80e2872 with commit message:
            MB-46099 [cbcollect_info] Control the amount of outstanding IO.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-1096 contains ns_server commit 80e2872 with commit message:
            MB-46099 [cbcollect_info] Control the amount of outstanding IO.

            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.2 [ 18012 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.1 [ 17104 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Assignee Aliaksey Artamonau [ aliaksey artamonau ] Ritesh Agarwal [ ritesh.agarwal ]

            ritesh.agarwal Ritesh Agarwal added a comment -

            Not seeing it on Enterprise Edition 7.0.2 build 6683.
            Closing the defect.

            ritesh.agarwal Ritesh Agarwal made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              Assignee: ritesh.agarwal Ritesh Agarwal
              Reporter: ritesh.agarwal Ritesh Agarwal
              Votes: 0
              Watchers: 11

