Reproducible: Performance Regression and Very High Memory Consumption W/ CBBackup
Activity
Eric Cooper June 24, 2015 at 5:28 PM
Verified in 3.0.5-1749
Eric Cooper June 23, 2015 at 10:44 PM
I resurrected some automated backup test cases (it looks like they hadn't been run for a while) and ran them as follows:
./testrunner -i /Users/eric/servers/group2-2.ini t backup.opsbeforebackup.OpsBeforeBackupTests.CreateUpdateDeleteBeforeBackup,command_options="-single-node",standard_buckets=1,items=1000000,doc_ops=update\;create,replicas=0,item_flag=232216,max_verify=1000,skip_cleanup=False,GROUP=P1
Here is the backup output of the test case; it looks like we are backing up 2 million KVs:
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] [####################] 100.0% (1999998/estimated 1999998 msgs)
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] bucket: default, msgs transferred...
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] : total | last | per sec
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] byte : 511999488 | 511999488 | 855342.9
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] [####################] 100.0% (1999998/estimated 1999998 msgs)
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] bucket: standard_bucket0, msgs transferred...
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] : total | last | per sec
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] byte : 1023998976 | 1023998976 | 1595580.2
Note that although this is logged as an error, it is not really an error. I monitored memory for the backup while running 3.0.3 and 3.1.0 using ps -aux.
Here is 3.0.3:
root 11720 130 2.5 861924 98700 ? Ssl 15:04 17:26 python /opt/couchbase/lib/python/cbbackup http://Administrator:password@10.1.2.32:8091 /tmp/backup --single-node
and here is 3.0.5:
root 10177 73.6 0.5 635952 22628 ? Ssl 14:11 5:09 python /opt/couchbase/lib/python/cbrestore /tmp/backup http://Administrator:password@10.1.2.32:8091 -b default
The memory figures are the 5th and 6th columns (VSZ and RSS, in KB), and clearly 3.0.5 uses less memory.
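For reference, here is a minimal sketch of how that memory sampling could be automated; the 2-second interval and the log file path are arbitrary choices, not part of the original run:
# Run in a second shell while the backup is in progress; exits when cbbackup finishes.
# Each sample is a full `ps aux` line, so columns 5 and 6 give VSZ and RSS in KB.
while true; do
    ps aux | grep '[c]bbackup' || break   # the [c] trick keeps grep from matching itself
    sleep 2                               # arbitrary sampling interval
done > /tmp/cbbackup-mem.log              # arbitrary output path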
Abhishek Singh June 17, 2015 at 1:46 AM
Minor correction in the backup command: removed `-x`.
Anil Kumar June 16, 2015 at 11:03 PM (edited)
If you're trying to back up a large dataset and hit performance regressions or high memory consumption with the cbbackup tool in 3.0/3.0.1 CE, use the following workaround to mitigate the issue:
Steps:
Grab the cbbackupwrapper patch from S3:
> wget https://s3.amazonaws.com/support.couchbase.com/CBBackup-Patch/orig.patch
Apply the patch to the file /opt/couchbase/lib/python/cbbackupwrapper:
> patch -p1 < orig.patch
This patch enables the --sequential flag, which limits the number of vbucket backfills performed at a time during backup.
Sample command:
> /opt/couchbase/bin/cbbackupwrapper http://<cluster-ip>:8091 <backup_dir> -u <username> -p <password> --path /opt/couchbase/bin/ --sequential -n <no_vbucket_per_batch>
Note: The important thing here is the number you pass to the -n flag, which controls the number of vbuckets to backfill at a time. Tune it according to the memory headroom you have on your Couchbase node (by default, the batch size is 100).
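Putting the steps above into one script, a rough sketch of the full workaround; the backup directory, the batch size of 50, and the assumption that the patch applies cleanly from /opt/couchbase/lib/python are placeholders to adapt, not values from the original instructions:
# Fetch and apply the patch; the working directory is an assumption about
# how the patch paths resolve, so adjust the -p level as needed.
cd /opt/couchbase/lib/python
wget https://s3.amazonaws.com/support.couchbase.com/CBBackup-Patch/orig.patch
patch -p1 < orig.patch

# Run the wrapper with sequential backfills; host, credentials, backup
# directory, and the batch size of 50 are placeholder values.
/opt/couchbase/bin/cbbackupwrapper http://10.1.2.32:8091 /backup/$(date +%F) \
    -u Administrator -p password \
    --path /opt/couchbase/bin/ --sequential -n 50
A smaller -n value lowers peak memory at the cost of a longer backup, so it is worth starting low and raising it while watching RSS on the node.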
Eben Haber June 11, 2015 at 3:42 PM
Resolved in 3.0.x with Gerrit change 51792.
Resolved in Sherlock with Gerrit change 51658.
Description
This appears to be a continuation of the problem reported in CB-1407 / MB-12670 (https://couchbasecloud.atlassian.net/browse/MB-12670), which is listed as fixed in 3.0.2. In that bug, the customer reported extremely long runtime for cbbackup in V3.0.1 relative to the earlier runtime when the cluster ran on V2.2. In addition, the customer reported that in 3.0.1 all available RAM was consumed such that the OOM-killer was invoked and started killing Couchbase processes, resulting in node failover.
In this case, a different customer reported the same problem after moving from 2.5.1 to 3.0.3, and reported that in their case cbbackup was now causing frequent failovers, which in turn resulted from the OOM-killer being invoked.
While I have not been able to cause a failover or cause the OOM-killer to be invoked, I have reproduced the slow cbbackup runtime and much greater memory consumption in 3.0.3 relative to 2.5.2. To reproduce, I did the following:
1) Create a 6-node cluster on 2.5.2 with the beer-sample and default buckets. Populate the default bucket with 10 million 250-byte JSON items (I used cbworkloadgen).
2) Perform a cbbackup of the cluster under the Linux "time" command. I received the following results:
[morrie@ip-10-249-139-86 tmp]$ time cbbackup http://127.0.0.1:8091 /tmp/BACKUP -u Administrator -p password
...
done
real 8m20.214s
user 3m6.024s
sys 1m19.961s
Additionally, top showed VIRT 482M, RES 41M, SHR 4948, CPU approx 4.19.
3) Swap-rebalance upgrade of the cluster to 3.0.3.
4) Repeat the cbbackup on the same node. I got the following results:
time cbbackup http://127.0.0.1:8091 /tmp/BACKUP -u Administrator -p password
#################### 100.0% (7303/estimated 7303 msgs)
...
done
real 23m23.213s
user 6m4.847s
sys 3m40.270s
top showed VIRT 819M, RES 7288, SHR 4948, CPU approx 9:40.
top also showed overall memory consumption increasing continuously until the end of the cbbackup.
Note that the actual output of the two cbbackups was essentially the same size, and that the message statistics for both runs were the same.
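For anyone re-running this comparison, a hedged sketch that captures runtime and peak resident memory in a single pass, assuming GNU time is available at /usr/bin/time; the output path and credentials are placeholders:
# GNU time's -v report includes "Elapsed (wall clock) time" and
# "Maximum resident set size (kbytes)"; -o writes the report to a file
# so it is not mixed in with cbbackup's own progress output.
/usr/bin/time -v -o /tmp/cbbackup-time.txt \
    /opt/couchbase/bin/cbbackup http://127.0.0.1:8091 /tmp/BACKUP \
    -u Administrator -p password

grep -E 'Elapsed|Maximum resident' /tmp/cbbackup-time.txt
Running this once on 2.5.2 and once after the 3.0.3 upgrade gives a like-for-like comparison of both the runtime and the peak memory regression.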