Details
Bug
Resolution: Fixed
Critical
3.0.1, 3.0.3
Security Level: Public
couchbase server in Docker on CoreOS under AWS
Untriaged
Centos 64-bit
Unknown
Description
On Amazon EC2
Start up 2 completely fresh couchbase servers from:
https://github.com/couchbaselabs/couchbase-server-coreos
Ensure /var/lib/couchbase is mounted on EBS storage (formatted as ext4) and mapped into the couchbase docker container at /opt/couchbase/var/lib/couchbase.
Add both the beer sample and the game sample. Use 100MB for the memory size of the default bucket. Leave the bucket type as couchbase and change nothing else.
Add a second server inside the same VPC.
Click rebalance.
Rebalance fails partway through.
Rebalance exited with reason {badmatch,
{error,
}}
The problem also occurs when there is only a single bucket containing no documents. When publishing a view, the other server becomes unavailable and rebalance fails with the same error.
This happens in both 3.0.1 community edition and 3.0.3 enterprise edition.
Immediately after this occurs, CPU load on the down node is very high.
The culprit is beam.smp.
Attaching strace to it shows that it is trying over and over again to connect to memcached. It appears to connect and then get cut off. I end up with thousands of connections (literally 6000+) like this:
tcp 0 0 localhost:36527 localhost:11209 TIME_WAIT
tcp 0 0 localhost:55519 localhost:11209 TIME_WAIT
tcp 0 0 localhost:54337 localhost:11209 TIME_WAIT
tcp 0 0 localhost:32772 localhost:11209 TIME_WAIT
tcp 0 0 localhost:45226 localhost:11209 TIME_WAIT
tcp 0 0 localhost:55206 localhost:11209 TIME_WAIT
tcp 0 0 localhost:33358 localhost:11209 TIME_WAIT
tcp 0 0 localhost:55473 localhost:11209 TIME_WAIT
tcp 0 0 localhost:56703 localhost:11209 TIME_WAIT
tcp 0 0 localhost:38388 localhost:11209 TIME_WAIT
tcp 0 0 localhost:40668 localhost:11209 TIME_WAIT
tcp 0 0 localhost:54342 localhost:11209 TIME_WAIT
tcp 0 0 localhost:58936 localhost:11209 TIME_WAIT
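For anyone trying to reproduce this, the connection churn can be quantified rather than eyeballed. The following is a minimal sketch, assuming the output has the same six-column layout as the lines above (for example from `netstat -t`); the function name `count_time_wait` and the sample text are illustrative only, not part of any Couchbase tooling.

```python
from collections import Counter


def count_time_wait(netstat_output: str) -> Counter:
    """Count TIME_WAIT sockets per remote endpoint in netstat-style output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local address, remote address, state
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "TIME_WAIT":
            counts[fields[4]] += 1
    return counts


# Illustrative sample; on a live node, feed in the real netstat output instead.
sample = """\
tcp 0 0 localhost:36527 localhost:11209 TIME_WAIT
tcp 0 0 localhost:55519 localhost:11209 TIME_WAIT
tcp 0 0 localhost:54337 localhost:11209 ESTABLISHED
"""

for endpoint, n in count_time_wait(sample).most_common():
    print(f"{endpoint}: {n} sockets in TIME_WAIT")
```

Running this periodically on the affected node would show whether the count for localhost:11209 keeps growing while beam.smp is spinning.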
I've attached a collectdb from when it occurs (different buckets, but the same issue).
I'd like to point out that couchbase is running inside docker on CoreOS on AWS.
I've raised the ulimits inside the container, which now show:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 29972
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1048576
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
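The limits above are what the shell reports. As a cross-check, the same values can be read programmatically from inside the container. This is a minimal sketch using Python's standard `resource` module on Linux; the selection of limits and the helper names `fmt` and `read_limits` are my own choices for illustration. Note that `getrlimit` reports the core file size in bytes, while `ulimit -c` reports blocks.

```python
import resource

# A few of the limits from the listing above, mapped to their rlimit constants.
LIMITS = {
    "open files (-n)": resource.RLIMIT_NOFILE,
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "core file size (-c)": resource.RLIMIT_CORE,
}


def fmt(value: int) -> str:
    """Render an rlimit value the way ulimit does for infinite limits."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits() -> dict:
    """Return {name: (soft, hard)} for the current process."""
    return {
        name: tuple(fmt(v) for v in resource.getrlimit(res))
        for name, res in LIMITS.items()
    }


for name, (soft, hard) in read_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

This only reflects the limits of the process that runs the script. The limits actually applied to the running beam.smp or memcached process may differ, and on Linux they can be read from `/proc/<pid>/limits`.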