Please see below the issue we encountered with our latest version.
This passed QA in staging, but when it went to a live environment under full production load we saw the behavior described below. At first this occurred on a cluster we had just rebalanced. During that rebalance we saw a rise in CPU usage on the Couchbase client apps, which did not decrease after the rebalance had (successfully) completed, until we restarted those clients. We suspected that the problem below was directly related to the issues we saw during the rebalance, so we also tested on a different cluster that had not gone through any such rebalance. The results were the same. After searching for what had changed, we realized that one change in this version was an upgrade from Couchbase Client 1.0.1 and spymemcached 2.8.0 to 1.0.3 and 2.8.2. We then took that exact build, swapped the 1.0.3 jars for the 1.0.1 jars, and everything started behaving fine.
The reason we MUST have 1.0.3 in production is the following note from the 1.0.3 release notes (http://www.couchbase.com/docs/couchbase-sdk-java-1.0/couchbase-sdk-java-rn_1-0-3.html):
It was found that in the dependent spymemcached client library that errors encountered in optimized set operations would not be handled correctly and thus application code would receive unexpected errors during a rebalance. This has been worked around in this release by disabling optimization. This may have a negligible drop in throughput but shorter latencies.
We believe the issues mentioned above on the clients during the rebalance are exactly this.
1. Any ideas on the reason for this?
2. How would you advise us to proceed?
1. One server writes data to a memcached bucket. The TTL is about 30 minutes.
2. Another server tries to get this data but randomly fails (at about a 50% miss rate). We get nulls instead of real values. We use asyncGet and then Future.get() with a timeout of 5 seconds. We did not observe the timeout being reached.
Time period between (1) and (2) is less than a minute. We debugged (1) and saw that it is being written without errors.
No exceptions or errors.
The data cluster wasn't heavily loaded; other clients (1.0.1) were working with this bucket at the same time and operated properly.
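For reference, here is a minimal sketch of how the read path in (2) can tell a null result (a miss) apart from a timeout or a failed operation. It uses only java.util.concurrent; the class name GetOutcomeDemo, the Outcome enum and the classify helper are hypothetical names made up for this illustration, and CompletableFuture stands in for the future that asyncGet would return in the real application. It is not our production code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of any client library.
class GetOutcomeDemo {

    // The distinct ways a timed Future.get() can end.
    enum Outcome { HIT, MISS, TIMEOUT, FAILED, INTERRUPTED }

    // Waits up to the given timeout and reports how the get ended.
    // A null value is reported as MISS, separately from TIMEOUT and FAILED.
    static Outcome classify(Future<Object> future, long timeout, TimeUnit unit) {
        try {
            Object value = future.get(timeout, unit);
            return value == null ? Outcome.MISS : Outcome.HIT;
        } catch (TimeoutException e) {
            // Give up on the pending operation so it does not linger.
            future.cancel(false);
            return Outcome.TIMEOUT;
        } catch (ExecutionException e) {
            // The operation itself failed; e.getCause() holds the reason.
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for the caller.
            Thread.currentThread().interrupt();
            return Outcome.INTERRUPTED;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the future returned by asyncGet (assumption).
        Future<Object> hit = CompletableFuture.<Object>completedFuture("value");
        Future<Object> miss = CompletableFuture.<Object>completedFuture(null);
        Future<Object> pending = new CompletableFuture<Object>();

        System.out.println(classify(hit, 5, TimeUnit.SECONDS));
        System.out.println(classify(miss, 5, TimeUnit.SECONDS));
        System.out.println(classify(pending, 50, TimeUnit.MILLISECONDS));
    }
}
```

In our case every failed read would fall into the MISS branch: the future completes well within the 5 seconds and simply returns null.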
From: Ira Holtzer