Couchbase Server / MB-7115

Rebalance operation failed repeatedly while trying to rebalance in 5 nodes and rebalance out 3 nodes on a 5 node cluster, possibly because of an "Unable to listen" error on one of the nodes being rebalanced out.

    Details

      Description

      Scenario:

      • 10 node cluster with build 1942
      • Rebalance out 5 nodes (completed successfully); cluster is now at 5 nodes
      • Add 5 nodes (with build 1944) and mark 3 nodes for removal
      • Hit rebalance
      • Rebalance failed with reason:

      Rebalance exited with reason {badmatch,
       [{<0.26283.119>,
         {badmatch,{error,emfile}},
         [{ns_replicas_builder_utils,kill_a_bunch_of_tap_names,3},
          {misc,try_with_maybe_ignorant_after,2},
          {gen_server,terminate,6},
          {proc_lib,init_p_do_apply,3}]}]}
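
      emfile is the POSIX error for running out of file descriptors, so the badmatch above indicates the node exhausted its descriptor limit while killing off TAP replication streams. One way to gauge descriptor pressure is to compare a process's open descriptors against its limit; a minimal sketch in Python for a Linux host, where the PID is a hypothetical stand-in for the node's beam.smp process:

      import os

      def fd_usage(pid):
          # Count open descriptors and read the soft "Max open files" limit
          # from /proc; must run as the process owner or root.
          open_fds = len(os.listdir('/proc/%d/fd' % pid))
          soft_limit = None
          with open('/proc/%d/limits' % pid) as f:
              for line in f:
                  if line.startswith('Max open files'):
                      soft_limit = int(line.split()[3])
                      break
          return open_fds, soft_limit

      used, limit = fd_usage(12345)  # hypothetical beam.smp PID
      print('%d of %d descriptors in use' % (used, limit))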

      Tried rebalancing again, but it failed repeatedly:

      Rebalance exited with reason {{{badmatch,[{<18058.14511.0>,noproc}]},
        [{misc,sync_shutdown_many_i_am_trapping_exits,1},
         {misc,try_with_maybe_ignorant_after,2},
         {gen_server,terminate,6},
         {proc_lib,init_p_do_apply,3}]},
       {gen_server,call,
        [<0.11023.120>,
         {shutdown_replicator,
          'ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com'},
         infinity]}}

      Will shortly upload logs from one of the nodes that was in the cluster at the time of the rebalance failures.

      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

      Noticed this on one of the nodes being rebalanced out:
      Unable to listen on 'ns_1@ec2-122-248-217-156.ap-southeast-1.compute.amazonaws.com'.

      So failed over the node and tried rebalancing; the rebalance still failed.

      Then added that node back, left that particular node out of the rebalance operation, and the rebalance succeeded.
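
      For reference, the failover-and-rebalance sequence above maps onto the cluster REST API roughly as follows; a minimal sketch in Python using requests, where the admin credentials and the healthy node used as the REST entry point are placeholders:

      import requests

      BASE = 'http://<healthy-node>:8091'   # any node still healthy in the cluster
      AUTH = ('Administrator', 'password')  # placeholder credentials
      BAD = 'ns_1@ec2-122-248-217-156.ap-southeast-1.compute.amazonaws.com'

      # Hard failover of the node that reported "Unable to listen".
      requests.post(BASE + '/controller/failOver', auth=AUTH,
                    data={'otpNode': BAD})

      # Rebalance without it: knownNodes/ejectedNodes are comma-separated
      # otpNode names, taken here from /pools/default.
      nodes = requests.get(BASE + '/pools/default', auth=AUTH).json()['nodes']
      known = ','.join(n['otpNode'] for n in nodes)
      requests.post(BASE + '/controller/rebalance', auth=AUTH,
                    data={'knownNodes': known, 'ejectedNodes': BAD})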


        Activity

        Steve Yen added a comment -

        bug-scrub:

        ketaki reports it works

        Steve Yen added a comment -

        bug-scrub: 1949 didn't have the timeout fix? Please try again with 1950. thanks.

        Abhinav Dangeti added a comment -

        When this issue originally occurred, the clusters were in an XDCR setup, but replication had been deleted before starting the rebalance operation; the failures occurred on the destination cluster.

        To investigate whether this issue happens with or without XDCR:

        case1: Tried reproducing this with 2 ongoing unidirectional replications. On the destination cluster (7 nodes), rebalanced in 3 nodes (build 1949) and rebalanced out 5 nodes (build 1944).
        The source was a 7 node cluster on build 1944, with a fixed number of items on both buckets and no front end load.
        Rebalance on the destination cluster failed with the same reason, where the cluster was "unable to listen" to one of the nodes.

        case2: Deleted the replications on the source, cleaned the destination cluster, and loaded all nodes with build 1949.
        Created a 5 node cluster (not part of any XDCR).
        Created 2 buckets, and with ongoing front end load (a sketch of generating such load follows the list):

        • rebalanced out 3 nodes, rebalanced in 5 nodes :: rebalance operation completed successfully.
        • rebalanced out 5 nodes, rebalanced in 3 nodes :: rebalance operation completed successfully.
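
        As an illustration of the kind of front end load used above, a minimal sketch that keeps key-value mutations flowing during the rebalance, assuming the Couchbase Python SDK 2.x (host and bucket name are placeholders):

        from itertools import count
        from couchbase.bucket import Bucket  # Couchbase Python SDK 2.x

        bucket = Bucket('couchbase://<any-cluster-node>/default')

        # Endless stream of upserts so mutations are in flight while
        # the rebalance runs.
        for i in count():
            bucket.upsert('load-key-%d' % i, {'val': i})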
        Ketaki Gangal added a comment -

        Re-opening this for repro/observation on the XDCR-fixed changes.

        Aleksey Kondratenko (Inactive) added a comment -

        The emfile in this case (at least in the collectinfos from Tony) seems to be caused by CLOSE_WAIT sockets. My leading guess is that it's those hung is_missing_rev requests that we're seeing in MB-7129. Only in this environment it's not killing the node entirely, but rather exhausting file descriptors.
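
        A quick way to test this theory on an affected node is to tally TCP socket states and see whether CLOSE_WAIT dominates; a minimal sketch using Python's psutil (system-wide enumeration generally needs root):

        import psutil

        # Tally TCP connection states across the host; a CLOSE_WAIT count
        # approaching the descriptor limit supports the leaked-sockets theory.
        states = {}
        for conn in psutil.net_connections(kind='tcp'):
            states[conn.status] = states.get(conn.status, 0) + 1

        print(states.get(psutil.CONN_CLOSE_WAIT, 0), 'sockets in CLOSE_WAIT')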


          People

          • Assignee: Abhinav Dangeti
          • Reporter: Abhinav Dangeti
          • Votes: 0
          • Watchers: 1


          Gerrit Reviews

          There are no open Gerrit changes