  1. Couchbase Server
  2. MB-6904

System testing for XDCR: Entire cluster is gradually down/pending, seeing few cores on beam.smp - Sig Abort

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 2.0-beta-2
    • Affects Version/s: 2.0-beta-2
    • Component/s: XDCR
    • Security Level: Public
    • Labels: None
    • Environment: 2.0.0-1835-rel
      Ubuntu EC2
      10 (west coast) : 10 (south east)
      standard bucket :: west -> southeast
      default :: southeast -> west

    Description

      • Set up two unidirectional replications, one in each direction, between the two clusters listed in the environment above.
      • After loading and replicating 75M+ items on the standard bucket and 45M+ items on the default bucket,
        nodes on cluster 1 (west coast) started going down: first a few, then gradually all of them.
      • The couchbase-server process still appeared to be running on each of those nodes.
      • Some of these nodes had gone through a warmup (for unknown reasons); after the warmup, ns_server could not connect to the vbuckets, leaving many of them in a dead state.
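
To track which nodes drop out as the failure spreads, node status could be polled from the cluster REST API. The following is a minimal sketch, not something taken from this report: it assumes the `/pools/default` endpoint on port 8091 returns a JSON document with a `nodes` array whose entries carry `hostname` and `status` fields, and the helper names and credentials are hypothetical.

```python
import base64
import json
from urllib.request import Request, urlopen


def summarize_nodes(pool_info):
    """Group node hostnames by their reported status.

    Assumes pool_info is the decoded JSON from /pools/default, with a
    "nodes" list whose entries have "hostname" and "status" keys.
    Nodes without a status are grouped under "unknown".
    """
    by_status = {}
    for node in pool_info.get("nodes", []):
        status = node.get("status", "unknown")
        by_status.setdefault(status, []).append(node.get("hostname", "?"))
    return by_status


def fetch_pool_info(base_url, user, password, timeout=10):
    """Fetch /pools/default using HTTP basic auth (credentials are placeholders)."""
    request = Request(base_url.rstrip("/") + "/pools/default")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `summarize_nodes(fetch_pool_info(url, user, password))` periodically against one node of each cluster and logging the result with a timestamp would show the order in which nodes leave the healthy group.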

      The ATOP information on one of the nodes that went down (c1):
      24830 couchbas 20 0 3717m 3.4g 1956 S 14 23.4 2:53.76 memcached
      9590 couchbas 20 0 3095m 1.4g 1648 S 10 9.8 1040:28 beam.smp

      Was able to grab the diags of a few nodes before all of them went down (attached).

      c1: http://ec2-184-169-219-253.us-west-1.compute.amazonaws.com:8091/
      c2: http://ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:8091/

          People

            Assignee: damien damien (Inactive)
            Reporter: abhinav Abhi Dangeti