Details
Type: Bug
Resolution: Fixed
Priority: Major
Version: 1.8.1
Security Level: Public
Environment: 2 nodes with 15G physical memory each, mem quota = 12G. 3 clients were trying to load 7M items in total.
Description
From Mike:
This issue is an operational deadlock that we are already aware of and have fixed in 2.0. It is not a regression from 1.8. It is caused by items being loaded into Couchbase at a very fast rate: on a two-node cluster, each node surpasses 90% memory used. This causes the TAP consumers to tell the producers to back off, since they will not be accepting data. At the same time the item pager is running and trying to evict items, but it is unable to, because all of the values in memory are waiting in the checkpoint queues to be replicated.
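The deadlock above can be sketched in a few lines. This is an illustrative model only, not ep-engine code: the names (Item, run_pager, in_checkpoint) are made up for the example, but the logic mirrors the description, in that the pager skips any value still queued for replication, so when every resident value is waiting in a checkpoint queue it frees nothing.

```python
# Illustrative sketch of the eviction deadlock (hypothetical names,
# not actual ep-engine internals).

class Item:
    def __init__(self, key):
        self.key = key
        self.in_checkpoint = False  # still waiting to be replicated?

def run_pager(items, mem_used, mem_high_wat):
    """Try to free memory by evicting items not pinned by replication."""
    evicted = []
    for item in items:
        if mem_used <= mem_high_wat:
            break
        if item.in_checkpoint:
            # Cannot evict: the value is still queued for replication,
            # and the remote TAP consumer has asked the producer to back off,
            # so the queue is not draining.
            continue
        evicted.append(item.key)
        mem_used -= 1  # one memory unit per item, for simplicity
    return evicted, mem_used

# When all values are pinned in checkpoint queues, the pager makes no
# progress and memory stays above the high watermark: the deadlock.
items = [Item("k%d" % i) for i in range(5)]
for it in items:
    it.in_checkpoint = True
evicted, mem = run_pager(items, mem_used=5, mem_high_wat=3)
print(evicted, mem)  # -> [] 5
```

In 2.0 the cycle is broken because queued values can be released without waiting on the backed-off consumer; the sketch only shows why 1.8 cannot make progress.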
The fix can potentially be back-ported, but Chiyoung feels it may be too risky for 1.8.1.
From Ronnie:
The EC2 cluster had two nodes with a 12G mem quota each. When memory usage reached 90%, clients received backoff signals from the server (errno 134), and the cluster would not recover from that point, even after running for more than 12 hours. The highest cluster-wide ops rate was around 6k.
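For clients hitting the backoff signal, a common mitigation is to retry with exponential backoff and jitter rather than hammering the cluster. A minimal sketch, assuming a hypothetical `store(key, value)` client call that returns the server status code (134 being the temporary-failure / back-off status the loaders saw):

```python
import random
import time

TMP_FAIL = 134  # temporary-failure status: the server asks us to back off

def store_with_backoff(store, key, value, max_retries=10):
    """Retry a store on temporary failure, backing off exponentially.

    `store` is a hypothetical client function returning a status code;
    it stands in for whatever client library the loaders used.
    """
    delay = 0.05
    for _ in range(max_retries):
        status = store(key, value)
        if status != TMP_FAIL:
            return status
        # Jittered exponential backoff, so multiple loader clients
        # do not retry in lock-step against an already-full node.
        time.sleep(delay + random.uniform(0, delay))
        delay = min(delay * 2, 2.0)
    return TMP_FAIL  # give up; caller decides what to do
```

Note that in the deadlock described above, backoff alone cannot help: memory never drains, so retries fail indefinitely. This only smooths over transient pressure on a healthy cluster.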