Couchbase Server · MB-58136

[SizingIssue]: cbbackupmgr restore hangs while restoring 1B (1.8TB including replicas + fragmentation) KV items with close to 100M tombstones.


Details

    • Type: Bug
    • Resolution: Known Error
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: 7.2.1
    • Component/s: couchbase-bucket
    • Environment: couchbase-cloud-server-7.2.1-5878-v1.0.19
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      Cluster Config:
      3 KV nodes, 2 GSI/N1QL nodes
      c5.2xlarge instance types
      Data service RAM quota: 12800MB
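
      To confirm what quota the data service is actually running with, the cluster REST API can be queried directly. A minimal sketch, assuming Administrator credentials, jq on the path, and a placeholder hostname (not values from this ticket):

      # Sketch: confirm the KV memory quota and per-node memory the cluster reports.
      curl -s -u Administrator:password http://svc-d-node-001.example.com:8091/pools/default | \
          jq '{memoryQuota: .memoryQuota, nodes: [.nodes[] | {hostname, memoryTotal, memoryFree}]}'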

      Steps:

      1. Create a cluster with the above config.
      2. Create a bucket with 10 collections. Load 100M items into each collection.
      3. Create GSI indexes on 2 collections and let them build completely. Start a N1QL query workload.
      4. Start a 10k R:W workload with a 10s expiry.
      5. Scale the cluster up/down a few times by increasing/decreasing the nodes in a service group by 1.
      6. When there are 4 KV nodes and 3 GSI nodes, trigger a backup. The backup completes successfully.
      7. Flush the bucket. Indexes roll back to 0.
      8. Trigger a restore; the restore hangs at 668,309,473 items (steps 7 and 8 are sketched below).
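
      A minimal sketch of steps 7 and 8 (flush, then restore), assuming the bucket is named default0 and using placeholder cluster address, credentials, archive path and repo name; the exact cbbackupmgr invocation used in this run is visible in the ps output below:

      # Step 7: flush the bucket (flush must be enabled on the bucket).
      /opt/couchbase/bin/couchbase-cli bucket-flush \
          -c http://svc-d-node-001.example.com:8091 -u Administrator -p password \
          --bucket default0 --force

      # Step 8: restore the backup from the S3 archive into the flushed bucket.
      /opt/couchbase/bin/cbbackupmgr restore \
          -a s3://example-bucket/backups -r example-repo \
          -c couchbases://svc-d-node-001.example.com -u Administrator -p password \
          --obj-staging-dir /home/ec2-user/staging --purge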

      The restore of 1,000,630,970 mutations, which reportedly includes 99,369,030 tombstones, failed. The restore process can still be seen running on the backup node:

      sh-4.2$ ps -aef | grep restore
      ssm-user 12389 12381  0 21:25 pts/0    00:00:00 grep restore
      ec2-user 16381  2382 99 12:30 ?        09:09:17 /opt/couchbase/bin/cbbackupmgr restore -a s3://cbc-storage-3f11ad/backups/buckets/default0/cycles/553531be-b401-4d46-a1cd-b3b3f46a6db1 -r 553531be-b401-4d46-a1cd-b3b3f46a6db1 -c couchbases://svc-d-node-001.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-d-node-002.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-d-node-003.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-d-node-006.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-qi-node-004.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-qi-node-005.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-qi-node-007.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com -u couchbase-cloud-admin -p BJvgA%6@P!lXP7Sso2BDkpZE --obj-staging-dir /home/ec2-user/staging/553531be-b401-4d46-a1cd-b3b3f46a6db1 --auto-select-threads --json-progress restore --cacert /home/ec2-user/ca.pem --start start --end 2023-08-02T15_14_03.318274428Z --purge
      sh-4.2$ df -h
      Filesystem      Size  Used Avail Use% Mounted on
      devtmpfs        7.7G     0  7.7G   0% /dev
      tmpfs           7.7G     0  7.7G   0% /dev/shm
      tmpfs           7.7G  392K  7.7G   1% /run
      tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
      /dev/nvme0n1p1  1.0T   74G  951G   8% /
      tmpfs           1.6G     0  1.6G   0% /run/user/994
      sh-4.2$
      

      Bucket: Constant ops are coming in, as seen in the dashboard, but it is not clear where they are going since the item count is not increasing at all:


      It appears that the RAM quota is completely consumed and memory is not being released to accept further mutations. Given that the eviction policy is fullEviction, why is RAM not being freed up for the incoming traffic from the restore?
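
      One way to dig into this from a data node is to look at the bucket's KV memory stats and temporary-OOM counters with cbstats; a minimal sketch, with placeholder credentials and assuming the bucket is default0:

      # Sketch: inspect KV memory usage relative to the quota and high watermark.
      /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password \
          -b default0 memory | grep -E 'mem_used|ep_max_size|ep_mem_high_wat'

      # Temporary-failure rejections under memory pressure; the KV_TEMPORARY_FAILURE
      # retries in the backup log below would be consistent with this counter growing.
      /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password \
          -b default0 all | grep ep_tmp_oom_errors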

      Environment

      1. Database Type: Provisioned
      2. Is the database still running?: NO
      3. Environment: sandbox
      4. Organisation ID (aka Tenant ID): 82c310b9-1c07-468c-a36c-7423cde5f7ed
      5. Project ID: 3c81cb13-4ff7-474b-b729-3e29cbf6f738
      6. Cluster/Database ID: 80786aba-8cdb-47a8-a5b1-d6f0a73f11ad

      Server Logs:

      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-001.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-002.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-003.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-006.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-qi-node-004.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-qi-node-005.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-qi-node-007.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
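
      The cbcollect archives can be pulled down for analysis with the AWS CLI (assuming credentials with read access to the cb-customers-secure bucket), e.g.:

      aws s3 cp s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-001.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip .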
      

      Backup Logs:

      {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"SetMeta","Opaque":"58827611","TimeObserved":300000018315,"RetryReasons":["KV_TEMPORARY_FAILURE"],"RetryAttempts":125,"LastDispatchedTo":"svc-d-node-002.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com:11207","LastDispatchedFrom":"10.2.2.232:40004","LastConnectionID":"d31e460be1233941/cee443be6a947dcf","Internal":{"ResourceUnits":null}} -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:279
      

      This seems to be a KV (server) bug to me.
      cc: Raju Suravarjjala, Ritam Sharma, James Lee, Shelby Ramsey

      QE Test

      sudo guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/couchbase_capella_volume_2_new.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance,num_items=100000000,num_buckets=1,bucket_names=GleamBook,bucket_type=membase,iterations=1,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=20,gsi_nodes=2,cbas_nodes=2,fts_nodes=2,kv_nodes=3,n1ql_nodes=2,kv_disk=1000,n1ql_disk=50,gsi_disk=500,fts_disk=1000,cbas_disk=1000,kv_compute=c5.2xlarge,gsi_compute=c5.2xlarge,n1ql_compute=c5.2xlarge,fts_compute=c5.2xlarge,cbas_compute=c5.2xlarge,mutation_perc=20,key_type=CircularKey,capella_run=true,services=data-index:query,rebl_services=data-index:query,max_rebl_nodes=27,provider=AWS,region=us-east-1,type=GP3,size=1000,collections=10,ops_rate=100000,skip_teardown_cleanup=true,wait_timeout=14400,index_timeout=28800,runtype=dedicated,skip_init=true,rebl_ops_rate=10000,collections=10,expiry=true,vh_scaling=true,horizontal_scale=1 -m rest'
      


      People

        Assignee: Ritesh Agarwal
        Reporter: Ritesh Agarwal
        Votes: 0
        Watchers: 8
