Couchbase Server · MB-58136

[SizingIssue]: cbbackupmgr restore hangs while restoring 1B (1.8TB including replicas + fragmentation) KV items with close to 100M tombstones.


Details

    • Type: Bug
    • Resolution: Known Error
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: 7.2.1
    • Component/s: couchbase-bucket
    • Environment: couchbase-cloud-server-7.2.1-5878-v1.0.19
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      Cluster Config:
      3 KV nodes, 2 GSI/N1QL nodes
      c5.2xlarge instance types
      Data service RAM quota: 12800MB
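
      To confirm what quota the data service is actually running with, the cluster REST API can be queried directly. A minimal sketch, assuming Administrator credentials, jq on the path, and a placeholder hostname (not values from this ticket):

      # Sketch: confirm the KV memory quota and per-node memory the cluster reports.
      curl -s -u Administrator:password http://svc-d-node-001.example.com:8091/pools/default | \
          jq '{memoryQuota: .memoryQuota, nodes: [.nodes[] | {hostname, memoryTotal, memoryFree}]}'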

      Steps:

      1. Create a cluster with the above config.
      2. Create a bucket with 10 collections. Load 100M items into each collection.
      3. Create GSI indexes on 2 collections and let them build completely. Start a N1QL query workload.
      4. Start a 10k R:W workload with a 10s expiry.
      5. Scale the cluster up/down a few times by increasing/decreasing the nodes in a service group by 1.
      6. When there are 4 KV nodes and 3 GSI nodes, trigger a backup. The backup completes successfully.
      7. Flush the bucket. Indexes roll back to 0.
      8. Trigger a restore; the restore hangs at 668,309,473 items (steps 7 and 8 are sketched below).
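
      A minimal sketch of steps 7 and 8 (flush, then restore), assuming the bucket is named default0 and using placeholder cluster address, credentials, archive path and repo name; the exact cbbackupmgr invocation used in this run is visible in the ps output below:

      # Step 7: flush the bucket (flush must be enabled on the bucket).
      /opt/couchbase/bin/couchbase-cli bucket-flush \
          -c http://svc-d-node-001.example.com:8091 -u Administrator -p password \
          --bucket default0 --force

      # Step 8: restore the backup from the S3 archive into the flushed bucket.
      /opt/couchbase/bin/cbbackupmgr restore \
          -a s3://example-bucket/backups -r example-repo \
          -c couchbases://svc-d-node-001.example.com -u Administrator -p password \
          --obj-staging-dir /home/ec2-user/staging --purge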

      The restore of 1,000,630,970 mutations, which reportedly includes 99,369,030 tombstones, failed. The restore process can still be seen running on the backup node:

      sh-4.2$ ps -aef | grep restore
      ssm-user 12389 12381  0 21:25 pts/0    00:00:00 grep restore
      ec2-user 16381  2382 99 12:30 ?        09:09:17 /opt/couchbase/bin/cbbackupmgr restore -a s3://cbc-storage-3f11ad/backups/buckets/default0/cycles/553531be-b401-4d46-a1cd-b3b3f46a6db1 -r 553531be-b401-4d46-a1cd-b3b3f46a6db1 -c couchbases://svc-d-node-001.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-d-node-002.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-d-node-003.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-d-node-006.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-qi-node-004.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-qi-node-005.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com,svc-qi-node-007.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com -u couchbase-cloud-admin -p BJvgA%6@P!lXP7Sso2BDkpZE --obj-staging-dir /home/ec2-user/staging/553531be-b401-4d46-a1cd-b3b3f46a6db1 --auto-select-threads --json-progress restore --cacert /home/ec2-user/ca.pem --start start --end 2023-08-02T15_14_03.318274428Z --purge
      sh-4.2$ df -h
      Filesystem      Size  Used Avail Use% Mounted on
      devtmpfs        7.7G     0  7.7G   0% /dev
      tmpfs           7.7G     0  7.7G   0% /dev/shm
      tmpfs           7.7G  392K  7.7G   1% /run
      tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
      /dev/nvme0n1p1  1.0T   74G  951G   8% /
      tmpfs           1.6G     0  1.6G   0% /run/user/994
      sh-4.2$
      

      Bucket: Constant ops are coming in, as seen in the dashboard, but it is not clear where they are going since the item count is not increasing at all:


      It appears that the RAM quota is completely consumed and memory is not being released to accept further mutations. Given that the eviction policy is fullEviction, why is RAM not being freed up for the incoming traffic from the restore?
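
      One way to dig into this from a data node is to look at the bucket's KV memory stats and temporary-OOM counters with cbstats; a minimal sketch, with placeholder credentials and assuming the bucket is default0:

      # Sketch: inspect KV memory usage relative to the quota and high watermark.
      /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password \
          -b default0 memory | grep -E 'mem_used|ep_max_size|ep_mem_high_wat'

      # Temporary-failure rejections under memory pressure; the KV_TEMPORARY_FAILURE
      # retries in the backup log below would be consistent with this counter growing.
      /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password \
          -b default0 all | grep ep_tmp_oom_errors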

      Environment

      1. Database Type: Provisioned
      2. Is the database still running?: NO
      3. Environment: sandbox
      4. Organisation ID (aka Tenant ID): 82c310b9-1c07-468c-a36c-7423cde5f7ed
      5. Project ID: 3c81cb13-4ff7-474b-b729-3e29cbf6f738
      6. Cluster/Database ID: 80786aba-8cdb-47a8-a5b1-d6f0a73f11ad

      Server Logs:

      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-001.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-002.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-003.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-006.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-qi-node-004.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-qi-node-005.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
      s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-qi-node-007.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip
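
      The cbcollect archives can be pulled down for analysis with the AWS CLI (assuming credentials with read access to the cb-customers-secure bucket), e.g.:

      aws s3 cp s3://cb-customers-secure/restore/2023-08-03/collectinfo-2023-08-03t211604-ns_1@svc-d-node-001.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com.zip .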
      

      Backup Logs:

      {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"SetMeta","Opaque":"58827611","TimeObserved":300000018315,"RetryReasons":["KV_TEMPORARY_FAILURE"],"RetryAttempts":125,"LastDispatchedTo":"svc-d-node-002.tixzmd21xarhtlc2.sandbox.nonprod-project-avengers.com:11207","LastDispatchedFrom":"10.2.2.232:40004","LastConnectionID":"d31e460be1233941/cee443be6a947dcf","Internal":{"ResourceUnits":null}} -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:279
      

      This seems to be a KV (server) bug to me.
      cc: Raju Suravarjjala, Ritam Sharma, James Lee, Shelby Ramsey

      QE Test

      sudo guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/couchbase_capella_volume_2_new.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance,num_items=100000000,num_buckets=1,bucket_names=GleamBook,bucket_type=membase,iterations=1,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=20,gsi_nodes=2,cbas_nodes=2,fts_nodes=2,kv_nodes=3,n1ql_nodes=2,kv_disk=1000,n1ql_disk=50,gsi_disk=500,fts_disk=1000,cbas_disk=1000,kv_compute=c5.2xlarge,gsi_compute=c5.2xlarge,n1ql_compute=c5.2xlarge,fts_compute=c5.2xlarge,cbas_compute=c5.2xlarge,mutation_perc=20,key_type=CircularKey,capella_run=true,services=data-index:query,rebl_services=data-index:query,max_rebl_nodes=27,provider=AWS,region=us-east-1,type=GP3,size=1000,collections=10,ops_rate=100000,skip_teardown_cleanup=true,wait_timeout=14400,index_timeout=28800,runtype=dedicated,skip_init=true,rebl_ops_rate=10000,collections=10,expiry=true,vh_scaling=true,horizontal_scale=1 -m rest'
      


      People

        Assignee: Ritesh Agarwal
        Reporter: Ritesh Agarwal
        Votes: 0
        Watchers: 8
