  1. Couchbase Java Client
  2. JCBC-197

High CPU in Couchbase client apps after upgrading from Couchbase Client 1.0.1 and spy 2.8.0 to 1.0.3 and 2.8.2.

Details

    • Bug
    • Resolution: Incomplete
    • Major
    • 1.0.3
    • 1.0.3
    • Core
    • Security Level: Public
    • None

    Description

      Incident #2373

      Please see below the issue we encountered with our latest version.

      This passed QA on stage, but when it went to a live environment with full production load we saw the behavior described below. At first this occurred on a cluster we had just rebalanced. During that rebalance we saw a rise in the Couchbase client apps' CPU, which did not decrease after the rebalance had (successfully) finished, until we restarted those clients. We suspected that the problem below was directly related to the issues we saw during the rebalance, so we also tested it on a different cluster that had not undergone any such rebalance. The results were the same. After searching all over to see what had changed, we realized that one change in this version was the upgrade from Couchbase Client 1.0.1 and spy 2.8.0 to 1.0.3 and 2.8.2. We then took that exact build, swapped the 1.0.3 jars for the 1.0.1 jars, and everything started behaving fine.

      The reason we MUST have 1.0.3 on production is the following from 1.0.3's release notes (http://www.couchbase.com/docs/couchbase-sdk-java-1.0/couchbase-sdk-java-rn_1-0-3.html):

      It was found that in the dependent spymemcached client library that errors encountered in optimized set operations would not be handled correctly and thus application code would receive unexpected errors during a rebalance. This has been worked around in this release by disabling optimization. This may have a negligible drop in throughput but shorter latencies.

      We believe the issues mentioned above on the clients during the rebalance are exactly this.

      1. Any ideas on the reason for this?
      2. How would you advise us to proceed?

      Cheers,
      Ira

      Hi

      1. One server is putting data to a memcached bucket. TTL is about 30 minutes.
      2. Another server tries to get this data but randomly fails (at about a 50% miss rate). We are getting nulls instead of real values. We are using asyncGet and then Future.get() with a timeout of 5 seconds. We did not observe the timeout being reached.
      The time period between (1) and (2) is less than a minute. We debugged (1) and saw that the data is being written without errors.
      No exceptions or errors.
      The data cluster wasn't heavily loaded; other clients (1.0.1) were working with this bucket at the same time and operated properly.

      Sergey
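
      The read pattern described above (an asynchronous get followed by Future.get() with a 5-second timeout) can be sketched as follows, to show how a null result differs from a timeout. This is only an illustrative sketch: CompletableFuture stands in for the future that asyncGet returns, no Couchbase or spymemcached API is called, and the names GetOutcomeDemo, Outcome and classify are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GetOutcomeDemo {
    // Hypothetical labels for what a bounded wait on a get can produce.
    enum Outcome { HIT, MISS, TIMEOUT, ERROR }

    // Waits on the future for at most the given timeout and reports
    // whether it yielded a value, a null (miss), a timeout, or a failure.
    static Outcome classify(Future<Object> future, long timeout, TimeUnit unit) {
        try {
            Object value = future.get(timeout, unit);
            // A null here means the operation completed but returned no value.
            return value == null ? Outcome.MISS : Outcome.HIT;
        } catch (TimeoutException e) {
            // The wait expired before the operation completed.
            future.cancel(false);
            return Outcome.TIMEOUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.ERROR;
        } catch (ExecutionException e) {
            // The operation itself failed.
            return Outcome.ERROR;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the futures a client would return (assumption).
        Future<Object> hit = CompletableFuture.completedFuture("value");
        Future<Object> miss = CompletableFuture.completedFuture(null);
        Future<Object> never = new CompletableFuture<>();

        System.out.println(classify(hit, 5, TimeUnit.SECONDS));
        System.out.println(classify(miss, 5, TimeUnit.SECONDS));
        // Short timeout so the demo finishes quickly.
        System.out.println(classify(never, 50, TimeUnit.MILLISECONDS));
    }
}
```

      Counting these outcomes separately would make it clear that the roughly 50% failures reported here are completed gets that returned null, not expired waits.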

      From: Ira Holtzer

          People

            skumar Saran Kumar (Inactive)