We upgraded our performance clusters to CentOS 7.3 a few days ago.
Unfortunately, that upgrade caused a lot of trouble:
- There was a ~60% drop in DGM cases.
- KV latency in non-DGM cases became more inconsistent.
I started by analyzing the most basic case, the initial data load. I noticed that the drain rate became choppier on 3 boxes (see screenshot), while one server was working just fine.
I tried to examine IO performance using standalone benchmarks but I didn't manage to find anything interesting. Only read and write performance of Couchbase Server was affected.
Eventually I noticed a tiny difference between those boxes. The "bad" machines had kernel 3.10.0-514.6.2 and the "good" machine had 3.10.0-514.2.2. A few experiments confirmed that the upgrade from *.514.2.2 to *.514.6.2 caused all those problems.
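For anyone who wants to check their own boxes, here is a minimal sketch of that comparison. It only encodes the two kernel builds mentioned above. The helper names (`base_release`, `classify`) are hypothetical, and the assumption that the running release string carries a suffix like `.el7.x86_64` is mine, not something confirmed for every RHEL/CentOS build.

```python
import platform

# Kernel builds named in this report. The "bad"/"good" labels only reflect
# what was observed on these clusters, not an official advisory.
KNOWN_BAD = {"3.10.0-514.6.2"}
KNOWN_GOOD = {"3.10.0-514.2.2"}


def base_release(release: str) -> str:
    """Strip a trailing distro/arch suffix such as '.el7.x86_64' (assumed format)."""
    marker = release.find(".el")
    return release if marker == -1 else release[:marker]


def classify(release: str) -> str:
    """Return 'bad', 'good', or 'unknown' for a kernel release string."""
    base = base_release(release)
    if base in KNOWN_BAD:
        return "bad"
    if base in KNOWN_GOOD:
        return "good"
    return "unknown"


if __name__ == "__main__":
    # platform.release() reports the running kernel, like `uname -r`.
    running = platform.release()
    print(f"{running}: {classify(running)}")
```

Any build not listed is reported as "unknown" rather than guessed at, since only these two versions were actually compared.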
I downgraded our servers all the way to 3.10.0-317 and relaxed, until I started working with a setup provided by one of our partners. That setup has RHEL 7.3 with 3.10.0-514.6.2, and I am supposed to run some heavy DGM workloads...
RHEL/CentOS is a very conservative distribution. Who knows how long this issue will remain open? I think we had better find out what exactly happened before other people start hitting the same problem.