Couchbase Server / MB-6573

[longevity] rebalance failed due to error "Resetting rebalance status since it's not really running" when there are major page faults on some of the nodes in the cluster

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Environment:
      centos 6.2 64bit build 2.0.0-1697

      Description

      Cluster information:

      • 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 10 GB RAM and a 150 GB disk.
      • 8 GB RAM allocated to Couchbase Server on each node (80% of total system memory)
      • Disk format is ext3 on both the data and root partitions
      • Each server has its own drive; no disk is shared with other servers.
      • Load 7 million items into both buckets
      • Cluster has 2 buckets, default (3 GB) and saslbucket (3 GB)
      • Each bucket has one design doc with 2 views (d1 for default and d11 for saslbucket)
      • Maintain a load of about 10 K ops while querying views on both design docs

      10.3.121.13
      10.3.121.14
      10.3.121.15
      10.3.121.16
      10.3.121.17
      10.3.121.20
      10.3.121.22
      10.3.121.24
      10.3.121.25
      10.3.121.23

      Create a cluster with 10 nodes.
      Do a swap rebalance: add node 26 and remove node 25.
      Before and during the rebalance, the cluster does not go into swap.

      Rebalance failed with error "Resetting rebalance status since it's not really running"

      Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11ndoes-1697-reb-failed-reset-reb-20120907.tgz

      Link to atop of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-11nodes-1697-reb-failed-reset-reb-20120907.tgz


        Activity

        karan Karan Kumar (Inactive) added a comment -

        Unable to reproduce on build 1781.

        farshid Farshid Ghods (Inactive) added a comment -

        Karan,

        Should this be assigned to Aliaksey now?

        thuan Thuan Nguyen added a comment -

        Here is the link to the atop file for node 13 (around 1.4 GB in size): https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13
        This atop file is also stored on the local VM in the /tmp directory.

        The atop files of all other nodes are in the /tmp directory.

        All of these atop instances started on:
        drwx------ 2 root root 4.0K Sep 8 01:21 atop.d

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        The atop data starts at Sep 8 01:31:58, while the last message indicating memcached slowness is earlier, so the atop data does not cover it:

        696802:[ns_server:error,2012-09-08T1:24:43.330,ns_1@10.3.121.13:<0.6381.0>:ns_memcached:verify_report_long_call:274]call {stats,<<>>} took too long: 720860 us

        I'd like you guys to keep atop data for longer periods of time.
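
        The comparison above (last slow memcached call versus first atop sample) can be checked mechanically. The following is a minimal sketch, not part of ns_server or atop: the regular expression is inferred from the single log line quoted above, `parse_slow_call` is a hypothetical helper name, and the year of the atop start time is assumed to match the log line.

```python
import re
from datetime import datetime

# Layout inferred from the single "took too long" sample quoted above;
# this is an assumption, not an official ns_server log grammar.
SLOW_CALL = re.compile(
    r"\[ns_server:error,"
    r"(?P<y>\d{4})-(?P<mo>\d{2})-(?P<d>\d{2})"
    r"T(?P<h>\d{1,2}):(?P<mi>\d{2}):(?P<s>\d{2})\.(?P<ms>\d{3}),"
    r"(?P<node>[^:]+):.*?\]\s*call\s*(?P<call>.*?)\s*"
    r"took too long:\s*(?P<us>\d+) us",
    re.DOTALL,  # the message may be wrapped across several lines
)


def parse_slow_call(text):
    """Return (timestamp, node, call, duration_us), or None if no match."""
    m = SLOW_CALL.search(text)
    if m is None:
        return None
    ts = datetime(
        int(m["y"]), int(m["mo"]), int(m["d"]),
        int(m["h"]), int(m["mi"]), int(m["s"]),
        int(m["ms"]) * 1000,  # milliseconds -> microseconds
    )
    return ts, m["node"], m["call"], int(m["us"])


# The log line from the comment, including the leading grep line number.
sample = (
    "696802:[ns_server:error,2012-09-08T1:24:43.330,"
    "ns_1@10.3.121.13:<0.6381.0>:ns_memcached:verify_report_long_call:274]call\n"
    "{stats,<<>>}\n"
    "took too long: 720860 us"
)

# First atop sample reported in the comment; the year is assumed.
ATOP_START = datetime(2012, 9, 8, 1, 31, 58)

ts, node, call, duration_us = parse_slow_call(sample)
gap_seconds = (ATOP_START - ts).total_seconds()

print(f"{node}: call {call} took {duration_us / 1e6:.3f} s at {ts}")
print(f"atop data starts {gap_seconds:.2f} s after that message")
```

        A positive gap means the atop recording began only after the slow call was logged, which is why a longer atop retention window is being requested.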

        thuan Thuan Nguyen added a comment -

        Link to diags of all nodes on Sept 8th 2012: https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-rebalance-failed-bulk_set_vbucket_state_failed-20120908.tgz

          People

          • Assignee:
            karan Karan Kumar (Inactive)
            Reporter:
            thuan Thuan Nguyen
