  Couchbase Server / MB-46253

[Backport to 6.6.3] Replicators sometimes get stuck during failover

Details

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/durability_volume.ini rerun=False -t bucket_collections.collections_quorum_loss.CollectionsQuorumLoss.test_quorum_loss_failover,nodes_init=5,bucket_spec=multi_bucket.buckets_all_membase_for_quorum_loss,replicas=3,failover_action=firewall,num_node_failures=3,quota_percent=80,GROUP=P1'
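
      The `-t` argument packs the test path and its parameters into one comma-separated string. A minimal sketch of how that string breaks down, assuming the first element is the test path and every remaining element is a `key=value` pair. `parse_test_spec` is a hypothetical helper for illustration, not part of testrunner:

```python
def parse_test_spec(spec):
    """Split a '-t' style string into (test_path, params).

    Assumption: the first comma-separated element is the test path and
    the rest are key=value pairs, as in the command above.
    """
    test_path, *pairs = spec.split(",")
    params = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        params[key] = value
    return test_path, params


# The '-t' value copied from the repro command.
spec = (
    "bucket_collections.collections_quorum_loss.CollectionsQuorumLoss."
    "test_quorum_loss_failover,nodes_init=5,"
    "bucket_spec=multi_bucket.buckets_all_membase_for_quorum_loss,"
    "replicas=3,failover_action=firewall,num_node_failures=3,"
    "quota_percent=80,GROUP=P1"
)

test_path, params = parse_test_spec(spec)
print(test_path)
for key, value in params.items():
    print(f"  {key} = {value}")
```

      The parameters that matter for this report are `nodes_init=5`, `replicas=3`, `failover_action=firewall` and `num_node_failures=3`.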

      Steps to Repro
      1. Create a 5-node cluster

      2021-03-18 05:53:37,832 | test  | INFO    | pool-1-thread-7 | [table_view:display:72] Rebalance Overview
      +----------------+----------+-----------------------+---------------+--------------+
      | Nodes          | Services | Version               | CPU           | Status       |
      +----------------+----------+-----------------------+---------------+--------------+
      | 172.23.105.215 | kv       | 7.0.0-4721-enterprise | 4.46293494705 | Cluster node |
      | 172.23.105.217 | None     |                       |               | <--- IN ---  |
      | 172.23.105.219 | None     |                       |               | <--- IN ---  |
      | 172.23.105.220 | None     |                       |               | <--- IN ---  |
      | 172.23.106.237 | None     |                       |               | <--- IN ---  |
      +----------------+----------+-----------------------+---------------+--------------+

      2. Create 2 buckets

      2021-03-18 05:55:56,053 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +-------------------------------------+-----------+----------+------------+-----+--------+------------+-----------+-----------+
      | Bucket                              | Type      | Replicas | Durability | TTL | Items  | RAM Quota  | RAM Used  | Disk Used |
      +-------------------------------------+-----------+----------+------------+-----+--------+------------+-----------+-----------+
      | A9%1Zc1YY4wOO3%lD-48-277000         | couchbase | 3        | none       | 0   | 175000 | 4194304000 | 576985160 | 844152192 |
      | DMOCEALabmUOiFZj6L_ccczdP-48-189000 | couchbase | 3        | none       | 0   | 175000 | 4194304000 | 637234472 | 802156547 |
      +-------------------------------------+-----------+----------+------------+-----+--------+------------+-----------+-----------+

      3. Induce a firewall on a majority (3) of the nodes

      2021-03-18 05:55:56,092 | test  | INFO    | MainThread | [collections_quorum_loss:test_quorum_loss_failover:261] Inducing failure firewall on nodes: [ip:172.23.105.217 port:8091 ssh_username:root, ip:172.23.105.219 port:8091 ssh_username:root, ip:172.23.105.220 port:8091 ssh_username:root]

      4. Perform a quorum failover of the above nodes

      2021-03-18 05:56:58,549 | test  | INFO    | MainThread | [collections_quorum_loss:test_quorum_loss_failover:266] Failing over nodes explicitly [ip:172.23.105.217 port:8091 ssh_username:root, ip:172.23.105.219 port:8091 ssh_username:root, ip:172.23.105.220 port:8091 ssh_username:root]
      2021-03-18 05:57:15,174 | test  | ERROR   | pool-1-thread-14 | [rest_client:_http_request:748] POST http://172.23.105.215:8091/controller/failOver body: otpNode=ns_1%40172.23.105.217&otpNode=ns_1%40172.23.105.219&otpNode=ns_1%40172.23.105.220&allowUnsafe=true headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==\n', 'Content-Type': 'application/x-www-form-urlencoded'} error: 500 reason: unknown ["Unexpected server error, request logged."] auth: Administrator:password
      2021-03-18 05:57:15,177 | test  | ERROR   | pool-1-thread-14 | [rest_client:fail_over:1291] [u'ns_1@172.23.105.217', u'ns_1@172.23.105.219', u'ns_1@172.23.105.220'] - Failover error: ["Unexpected server error, request logged."]
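
      The form body of the failing request in step 4 can be rebuilt from the log line above. A minimal sketch using only the standard library: the `otpNode` values and the `allowUnsafe=true` field are copied from the log, while `build_failover_body` is a hypothetical helper name. It only constructs the body and does not contact a cluster:

```python
from urllib.parse import urlencode


def build_failover_body(otp_nodes, allow_unsafe=False):
    """Build the form-encoded body seen in the POST /controller/failOver log line."""
    # One otpNode field per node being failed over.
    fields = [("otpNode", node) for node in otp_nodes]
    if allow_unsafe:
        # Field name and value as they appear in the logged request.
        fields.append(("allowUnsafe", "true"))
    return urlencode(fields)


# The three firewalled nodes from step 3.
failed_nodes = [
    "ns_1@172.23.105.217",
    "ns_1@172.23.105.219",
    "ns_1@172.23.105.220",
]

body = build_failover_body(failed_nodes, allow_unsafe=True)
print(body)
```

      Per the log, the body is sent to http://172.23.105.215:8091/controller/failOver with `Content-Type: application/x-www-form-urlencoded`.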

      The failover fails with the error "Unexpected server error, request logged."
      The cluster also becomes unusable: teardown fails, leaving orphan buckets, and the UI shows errors such as "Unfinished failover of nodes was found".
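
      Why step 3 leaves the cluster without a quorum can be checked with simple arithmetic. The sketch below assumes a plain majority rule (more than half of the nodes must be reachable). The function names are illustrative, and the actual quorum logic in the server may differ:

```python
def majority(total_nodes):
    """Smallest node count that is strictly more than half of the cluster."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes, reachable_nodes):
    """True if the reachable nodes form a majority (assumed rule)."""
    return reachable_nodes >= majority(total_nodes)


total = 5        # nodes_init=5
firewalled = 3   # num_node_failures=3
reachable = total - firewalled

print("majority needed:", majority(total))
print("reachable nodes:", reachable)
print("quorum available:", has_quorum(total, reachable))
```

      With 2 of 5 nodes reachable, the surviving side is below the majority of 3, which is consistent with the test passing `allowUnsafe=true` in the failover request.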

      Logs attached.

            People

              sumedh.basarkod Sumedh Basarkod (Inactive)
              meni.hillel Meni Hillel (Inactive)