Couchbase Server / MB-45064

Replicators sometimes get stuck during failover

Details

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/durability_volume.ini rerun=False -t bucket_collections.collections_quorum_loss.CollectionsQuorumLoss.test_quorum_loss_failover,nodes_init=5,bucket_spec=multi_bucket.buckets_all_membase_for_quorum_loss,replicas=3,failover_action=firewall,num_node_failures=3,quota_percent=80,GROUP=P1'

      Steps to Repro
      1. Create a 5-node initial cluster

      2021-03-18 05:53:37,832 | test  | INFO    | pool-1-thread-7 | [table_view:display:72] Rebalance Overview
      +----------------+----------+-----------------------+---------------+--------------+
      | Nodes          | Services | Version               | CPU           | Status       |
      +----------------+----------+-----------------------+---------------+--------------+
      | 172.23.105.215 | kv       | 7.0.0-4721-enterprise | 4.46293494705 | Cluster node |
      | 172.23.105.217 | None     |                       |               | <--- IN ---  |
      | 172.23.105.219 | None     |                       |               | <--- IN ---  |
      | 172.23.105.220 | None     |                       |               | <--- IN ---  |
      | 172.23.106.237 | None     |                       |               | <--- IN ---  |
      +----------------+----------+-----------------------+---------------+--------------+

      2. Create 2 buckets

      2021-03-18 05:55:56,053 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +-------------------------------------+-----------+----------+------------+-----+--------+------------+-----------+-----------+
      | Bucket                              | Type      | Replicas | Durability | TTL | Items  | RAM Quota  | RAM Used  | Disk Used |
      +-------------------------------------+-----------+----------+------------+-----+--------+------------+-----------+-----------+
      | A9%1Zc1YY4wOO3%lD-48-277000         | couchbase | 3        | none       | 0   | 175000 | 4194304000 | 576985160 | 844152192 |
      | DMOCEALabmUOiFZj6L_ccczdP-48-189000 | couchbase | 3        | none       | 0   | 175000 | 4194304000 | 637234472 | 802156547 |
      +-------------------------------------+-----------+----------+------------+-----+--------+------------+-----------+-----------+

      3. Induce a firewall failure on a majority (3) of the nodes

      2021-03-18 05:55:56,092 | test  | INFO    | MainThread | [collections_quorum_loss:test_quorum_loss_failover:261] Inducing failure firewall on nodes: [ip:172.23.105.217 port:8091 ssh_username:root, ip:172.23.105.219 port:8091 ssh_username:root, ip:172.23.105.220 port:8091 ssh_username:root]
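Why firewalling 3 of the 5 nodes amounts to quorum loss can be shown with simple majority arithmetic. This is a minimal sketch only: the helper names `majority` and `has_quorum` are hypothetical and are not taken from the test framework or from ns_server, and the sketch assumes a plain "more than half" rule.

```python
def majority(total_nodes: int) -> int:
    """Smallest node count that is strictly more than half of the cluster."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, failed_nodes: int) -> bool:
    """True if the surviving nodes still form a majority (assumed rule)."""
    return total_nodes - failed_nodes >= majority(total_nodes)


if __name__ == "__main__":
    total, failed = 5, 3  # nodes_init=5, num_node_failures=3 from the repro command
    print("majority needed:", majority(total))
    print("quorum retained:", has_quorum(total, failed))
```

With 5 nodes the majority is 3, so losing 3 nodes leaves only 2 survivors, which is why the test falls back to an explicit failover with `allowUnsafe=true`.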

      4. Perform a quorum (unsafe) failover of the above nodes

      2021-03-18 05:56:58,549 | test  | INFO    | MainThread | [collections_quorum_loss:test_quorum_loss_failover:266] Failing over nodes explicitly [ip:172.23.105.217 port:8091 ssh_username:root, ip:172.23.105.219 port:8091 ssh_username:root, ip:172.23.105.220 port:8091 ssh_username:root]
      2021-03-18 05:57:15,174 | test  | ERROR   | pool-1-thread-14 | [rest_client:_http_request:748] POST http://172.23.105.215:8091/controller/failOver body: otpNode=ns_1%40172.23.105.217&otpNode=ns_1%40172.23.105.219&otpNode=ns_1%40172.23.105.220&allowUnsafe=true headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==\n', 'Content-Type': 'application/x-www-form-urlencoded'} error: 500 reason: unknown ["Unexpected server error, request logged."] auth: Administrator:password
      2021-03-18 05:57:15,177 | test  | ERROR   | pool-1-thread-14 | [rest_client:fail_over:1291] [u'ns_1@172.23.105.217', u'ns_1@172.23.105.219', u'ns_1@172.23.105.220'] - Failover error: ["Unexpected server error, request logged."]

      The failover fails with an HTTP 500 "Unexpected server error, request logged." error.
      The cluster also becomes unusable: teardown fails, orphan buckets are left behind, and the UI shows errors such as "Unfinished failover of nodes was found".
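For anyone reproducing the failing call from step 4 outside the test framework, the logged request body and Authorization header can be rebuilt as below. This is a sketch based only on the log line above: the function names `build_failover_body` and `basic_auth_header` are hypothetical, and nothing here is the test framework's own `rest_client` code. It builds the strings without sending anything.

```python
import base64
from urllib.parse import urlencode


def build_failover_body(otp_nodes, allow_unsafe=True):
    """Form-encode one otpNode field per node, plus allowUnsafe if requested."""
    pairs = [("otpNode", node) for node in otp_nodes]
    if allow_unsafe:
        pairs.append(("allowUnsafe", "true"))
    # urlencode percent-encodes '@' as %40, matching the logged body
    return urlencode(pairs)


def basic_auth_header(user, password):
    """Value for the Authorization header using HTTP Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return "Basic " + token


if __name__ == "__main__":
    nodes = [
        "ns_1@172.23.105.217",
        "ns_1@172.23.105.219",
        "ns_1@172.23.105.220",
    ]
    # Target seen in the log: POST http://172.23.105.215:8091/controller/failOver
    print(build_failover_body(nodes))
    print(basic_auth_header("Administrator", "password"))
```

The resulting body would be sent with `Content-Type: application/x-www-form-urlencoded`, as in the logged headers.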

      Logs attached.

      Attachments

        1. conosoleText_qt.txt
          33 kB
          Sumedh Basarkod
        2. Screenshot 2021-03-18 at 6.40.38 PM.png
          353 kB
          Sumedh Basarkod

            People

              sumedh.basarkod Sumedh Basarkod (Inactive)