Reproducible: Performance Regression and Very High Memory Consumption with cbbackup

Description

This appears to be a continuation of the problem reported in CB-1407 / MB-12670, which is listed as fixed in 3.0.2. In that bug, the customer reported an extremely long runtime for cbbackup on 3.0.1 relative to earlier runs when the cluster was on 2.2. The customer also reported that on 3.0.1 all available RAM was consumed, such that the OOM-killer was invoked and started killing Couchbase processes, resulting in node failover.

In this case, a different customer reported the same problem after moving from 2.5.1 to 3.0.3: cbbackup was now causing frequent failovers, which in turn resulted from the OOM-killer being invoked.

While I have not been able to cause a failover or an OOM-killer invocation, I have reproduced the slow cbbackup runtime and much greater memory consumption on 3.0.3 relative to 2.5.2. To reproduce I did the following:

1) Create a 6-node cluster on 2.5.2 with the beer-sample and default buckets. Populate the default bucket with 10 million 250-byte JSON items (I used cbworkloadgen).

2) Perform cbbackup of the cluster under the Linux "time" command. I received the following results:

[morrie@ip-10-249-139-86 tmp]$ time cbbackup http://127.0.0.1:8091 /tmp/BACKUP -u Administrator -p password

...

done

real 8m20.214s
user 3m6.024s
sys 1m19.961s
Additionally, top showed VIRT 482M, RES 41M, SHR 4948, CPU approx 4.19.

3) swap-rebalance upgrade of cluster to 3.0.3

4) Repeat cbbackup on the same node. I got the following results:
time cbbackup http://127.0.0.1:8091 /tmp/BACKUP -u Administrator -p password
#################### 100.0% (7303/estimated 7303 msgs)
...

done

real 23m23.213s
user 6m4.847s
sys 3m40.270s

top showed VIRT 819M, RES 7288, SHR 4948, CPU approx 9:40.

top also showed overall memory consumption continuously increasing until the end of the cbbackup run.
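The continuously growing footprint can be confirmed without watching top by sampling the backup process's resident set size periodically. A minimal sketch, assuming a Linux /proc filesystem (the pid, interval, and sample count are placeholders, not values from this ticket):

```python
import time

def rss_kib(pid):
    """Read resident set size (KiB) from /proc/<pid>/status (Linux-only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # field is reported in kB
    return 0

def watch(pid, interval=5.0, samples=10):
    """Print RSS every `interval` seconds; a monotonically rising series
    is the symptom observed here during cbbackup."""
    history = []
    for _ in range(samples):
        history.append(rss_kib(pid))
        print(f"pid {pid}: RSS {history[-1]} KiB")
        time.sleep(interval)
    return history
```

Pointing this at the cbbackup pid during a run would make the growth curve explicit rather than eyeballing top.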

Note that the actual output of the two cbbackups was essentially the same size and that the message statistics for both runs were the same; the wall-clock time, however, went from 8m20s to 23m23s, roughly a 2.8x slowdown.

Components

Affects versions

Fix versions

Environment

Reported by customer on 3.0.3:
OS Name: Linux 2.6.32-431.29.2.el6.x86_64
OS Version: Red Hat Enterprise Linux Server release 6.5 (Santiago)
CB Version: 3.0.3-1716-rel-enterprise
CB Packages: couchbase-server-3.0.3-1716.x86_64
Reproduced by Support on 3.0.3 on AWS (RightScale)

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None


Activity


Eric Cooper June 24, 2015 at 5:28 PM

Verified in 3.0.5-1749

Eric Cooper June 23, 2015 at 10:44 PM

I resurrected some automated backup test cases (they look like they hadn't been run for a while) and ran as follows:

./testrunner -i /Users/eric/servers/group2-2.ini t backup.opsbeforebackup.OpsBeforeBackupTests.CreateUpdateDeleteBeforeBackup,command_options="-single-node",standard_buckets=1,items=1000000,doc_ops=update\;create,replicas=0,item_flag=232216,max_verify=1000,skip_cleanup=False,GROUP=P1

Here is the backup output of the test case; it looks like we are backing up 2 million KVs:
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] [####################] 100.0% (1999998/estimated 1999998 msgs)
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] bucket: default, msgs transferred...
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] : total | last | per sec
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] byte : 511999488 | 511999488 | 855342.9
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] [####################] 100.0% (1999998/estimated 1999998 msgs)
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] bucket: standard_bucket0, msgs transferred...
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] : total | last | per sec
2015-06-23 15:25:28 | ERROR | MainProcess | test_thread | [remote_util.log_command_output] byte : 1023998976 | 1023998976 | 1595580.2

Note that though this is logged as an error, it is not really an error. I monitored memory for the backup while running 3.0.3 and 3.1.0 using ps aux.

Here is 3.0.3:
root 11720 130 2.5 861924 98700 ? Ssl 15:04 17:26 python /opt/couchbase/lib/python/cbbackup http://Administrator:password@10.1.2.32:8091 /tmp/backup --single-node

and here is 3.0.5:
root 10177 73.6 0.5 635952 22628 ? Ssl 14:11 5:09 python /opt/couchbase/lib/python/cbrestore /tmp/backup http://Administrator:password@10.1.2.32:8091 -b default

The memory figures are the 5th and 6th columns (VSZ and RSS), and clearly 3.0.5 uses less memory.
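For reference, in `ps aux` output the 5th and 6th whitespace-separated fields are VSZ and RSS, both in KiB. A small helper to pull them out of lines like the ones above (the sample lines are abbreviated copies of the output in this comment):

```python
def vsz_rss(ps_line):
    """Extract VSZ and RSS (5th and 6th columns, KiB) from a `ps aux` line."""
    fields = ps_line.split()
    return int(fields[4]), int(fields[5])

# Abbreviated from the ps output quoted above (commands truncated).
line_303 = ("root 11720 130 2.5 861924 98700 ? Ssl 15:04 17:26 "
            "python /opt/couchbase/lib/python/cbbackup ...")
line_305 = ("root 10177 73.6 0.5 635952 22628 ? Ssl 14:11 5:09 "
            "python /opt/couchbase/lib/python/cbrestore ...")

print(vsz_rss(line_303))  # (861924, 98700)
print(vsz_rss(line_305))  # (635952, 22628)
```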

Abhishek Singh June 17, 2015 at 1:46 AM

Minor correction in backup command, removed `-x`

Anil Kumar June 16, 2015 at 11:03 PM
Edited

If you're trying to back up a large dataset and hit performance regressions or high memory consumption with the cbbackup tool in 3.0/3.0.1 CE, use the following workaround to mitigate the issue.

Steps:

Grab cbbackupwrapper patch from S3:

> wget https://s3.amazonaws.com/support.couchbase.com/CBBackup-Patch/orig.patch

Apply the patch to file /opt/couchbase/lib/python/cbbackupwrapper:

> patch -p1 < orig.patch

This patch enables the --sequential flag, which limits the number of vbucket backfills performed at a time during the backup.

Sample command:

> /opt/couchbase/bin/cbbackupwrapper http://<cluster-ip>:8091 <backup_dir> -u <username> -p <password> --path /opt/couchbase/bin/ --sequential -n <no_vbucket_per_batch>

Note: the important thing here is the number passed to the -n flag, which controls how many vbuckets are backfilled at a time. Tune it according to the memory headroom on your Couchbase node (by default, the batch size is 100).
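The idea behind --sequential with -n is plain batching: rather than backfilling all 1024 vbuckets at once, process them n at a time so only one batch's worth of backfill data is resident at any moment. A simplified illustration (the `backfill` callable is a hypothetical stand-in for the real transfer logic, not the actual cbbackup code):

```python
def batches(vbuckets, n):
    """Yield successive groups of at most n vbucket ids."""
    for i in range(0, len(vbuckets), n):
        yield vbuckets[i:i + n]

def sequential_backup(num_vbuckets=1024, n=100, backfill=lambda batch: None):
    """Back up vbuckets n at a time, bounding peak memory to one batch."""
    for batch in batches(list(range(num_vbuckets)), n):
        # Hypothetical: stream this batch to disk, then release its buffers
        # before the next batch starts backfilling.
        backfill(batch)
```

A smaller n trades longer runtime for a lower memory ceiling, which is why the flag should be tuned to the node's headroom.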

Eben Haber June 11, 2015 at 3:42 PM

Resolved in 3.0.x with Gerrit change 51792.

Resolved in Sherlock with Gerrit change 51658.

Fixed

Details

Assignee

Reporter

Is this a Regression?

Yes

Triage

Triaged


Priority

Created May 6, 2015 at 8:41 AM
Updated October 26, 2023 at 3:25 AM
Resolved June 11, 2015 at 3:42 PM