Couchbase Server / MB-44348

Rebalance out fails with reason "shun_failed"

Details

    Description

      Build: 7.0.0-4454

      Scenario:

      • Initialize cluster with two nodes (kv, kv+index+n1ql)
      • Create couchbase bucket with replica=1
      • Rebalance in 2 nodes with doc CRUDs running in parallel (Success)
      • Rebalance in 2 more nodes with doc CRUDs (Success)
      • Final Cluster stats:

        +----------------+-----------------+------+------------+------------+----------------------+------------------+
        | Node           | Services        | CPU  | Mem_total  | Mem_free   | Swap_mem_used        | Active / Replica |
        +----------------+-----------------+------+------------+------------+----------------------+------------------+
        | 172.23.105.126 | kv              | 6.43 | 4201627648 | 3425701888 | 1048576 / 3758092288 | 4989 / 5108      |
        | 172.23.105.128 | kv              | 6.16 | 4201627648 | 3424706560 | 0 / 3758092288       | 5127 / 4934      |
        | 172.23.104.172 | index, kv, n1ql | 11.7 | 3947372544 | 3012943872 | 221184 / 3758092288  | 4982 / 5103      |
        | 172.23.105.127 | kv              | 4.88 | 4201627648 | 3397816320 | 0 / 3758092288       | 5075 / 5043      |
        | 172.23.105.158 | kv              | 5.79 | 4201631744 | 3393880064 | 0 / 3758092288       | 4936 / 4884      |
        | 172.23.104.158 | kv              | 14.8 | 4201676800 | 3443019776 | 1310720 / 3758092288 | 4891 / 4928      |
        +----------------+-----------------+------+------------+------------+----------------------+------------------+

      • Rebalance out all nodes
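
As a quick sanity check on the stats table above, the Active / Replica column can be summed across nodes. This is only a sketch: it assumes the column holds per-node item counts, and with replica=1 the two totals would then be expected to match. The values are copied by hand from the table; the name `totals` is illustrative.

```python
# Rows copied from the "Final Cluster stats" table: (node, active, replica).
# Assumption: the "Active / Replica" column holds per-node item counts.
ROWS = [
    ("172.23.105.126", 4989, 5108),
    ("172.23.105.128", 5127, 4934),
    ("172.23.104.172", 4982, 5103),
    ("172.23.105.127", 5075, 5043),
    ("172.23.105.158", 4936, 4884),
    ("172.23.104.158", 4891, 4928),
]


def totals(rows):
    """Return (total_active, total_replica) summed over all nodes."""
    active = sum(a for _, a, _ in rows)
    replica = sum(r for _, _, r in rows)
    return active, replica


active_total, replica_total = totals(ROWS)
print(f"active={active_total} replica={replica_total}")
```

Equal totals would suggest the cluster was balanced and fully replicated before the final rebalance out started.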

      Observation:

      During the final rebalance out of all nodes, the rebalance fails because memcached is killed with exit code 137. The following is logged:

      Service 'memcached' exited with status 137. Restarting. Messages:
      WARNING: Logging before InitGoogleLogging() is written to STDERR
      W0216 00:59:13.516377 22657 HazptrDomain.h:671] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
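
For reading the exit status: if 137 follows the common shell convention of 128 + signal number, it corresponds to signal 9 (SIGKILL). The sketch below decodes a status under that assumption. The report itself does not say what sent the signal, and `decode_exit_status` is an illustrative helper, not part of any Couchbase tool.

```python
import signal


def decode_exit_status(status):
    """Return the signal name if status looks like 128 + N, else None.

    Assumption: the reported status follows the 128 + signal-number
    convention used by POSIX shells.
    """
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # No signal with that number on this platform.
        return None


print(decode_exit_status(137))
```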

      Rebalance failure UI logs:

      Node 'ns_1@172.23.104.172' saw that node 'ns_1@172.23.105.158' went down. Details: [{nodedown_reason, connection_closed}]
      Node 'ns_1@172.23.104.158' saw that node 'ns_1@172.23.105.158' went down. Details: [{nodedown_reason, connection_closed}]
      Rebalance exited with reason shun_failed.
      Rebalance Operation Id = 9ebe83ad7194372a38613770f88d57a1
      Node 'ns_1@172.23.105.158' is leaving cluster.
      Node 'ns_1@172.23.104.172' saw that node 'ns_1@172.23.105.127' went down. Details: [{nodedown_reason, connection_closed}]
      Node 'ns_1@172.23.104.158' saw that node 'ns_1@172.23.105.127' went down. Details: [{nodedown_reason, connection_closed}]
      Node 'ns_1@172.23.105.158' saw that node 'ns_1@172.23.105.127' went down. Details: [{nodedown_reason, connection_closed}]
      Node 'ns_1@172.23.105.127' is leaving cluster.
      Node 'ns_1@172.23.104.172' saw that node 'ns_1@172.23.105.128' went down. Details: [{nodedown_reason, connection_closed}]
      Node 'ns_1@172.23.104.158' saw that node 'ns_1@172.23.105.128' went down. Details: [{nodedown_reason, connection_closed}]
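
To see at a glance which nodes were reported down and by whom, the "went down" lines above can be grouped by target node. This is a minimal sketch: the regular expression is written against the exact wording of these UI log lines, and the names `PATTERN` and `down_events` are illustrative.

```python
import re
from collections import defaultdict

# "went down" lines copied from the rebalance failure UI logs above.
LOG = """\
Node 'ns_1@172.23.104.172' saw that node 'ns_1@172.23.105.158' went down. Details: [{nodedown_reason, connection_closed}]
Node 'ns_1@172.23.104.158' saw that node 'ns_1@172.23.105.158' went down. Details: [{nodedown_reason, connection_closed}]
Node 'ns_1@172.23.104.172' saw that node 'ns_1@172.23.105.127' went down. Details: [{nodedown_reason, connection_closed}]
Node 'ns_1@172.23.104.158' saw that node 'ns_1@172.23.105.127' went down. Details: [{nodedown_reason, connection_closed}]
Node 'ns_1@172.23.105.158' saw that node 'ns_1@172.23.105.127' went down. Details: [{nodedown_reason, connection_closed}]
Node 'ns_1@172.23.104.172' saw that node 'ns_1@172.23.105.128' went down. Details: [{nodedown_reason, connection_closed}]
Node 'ns_1@172.23.104.158' saw that node 'ns_1@172.23.105.128' went down. Details: [{nodedown_reason, connection_closed}]
"""

PATTERN = re.compile(
    r"Node '([^']+)' saw that node '([^']+)' went down\. "
    r"Details: \[\{nodedown_reason, (\w+)\}\]"
)


def down_events(text):
    """Map each node that went down to a list of (observer, reason)."""
    events = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            observer, target, reason = match.groups()
            events[target].append((observer, reason))
    return dict(events)


for target, observers in down_events(LOG).items():
    print(target, "reported down by", [obs for obs, _ in observers])
```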

            People

              artem Artem Stemkovski
              ashwin.govindarajulu Ashwin Govindarajulu