  1. Couchbase Server
  2. MB-50422

MultiNodeFailover: Non-KV nodes got failed over when unable to failover KV node


Details

    Description

      Steps to reproduce:

      • Multi node cluster as follows

        +----------------+-------------+-----------------+-----------+----------+-----------------------+-------------------+
        | Node           | Services    | CPU_utilization | Mem_total | Mem_free | Swap_mem_used         | Active / Replica  |
        +----------------+-------------+-----------------+-----------+----------+-----------------------+-------------------+
        | 172.23.100.21  | n1ql        | 1.78078755957   | 3.91 GiB  | 3.13 GiB | 7.50 MiB / 3.50 GiB   | 0 / 0             |
        | 172.23.105.212 | index, n1ql | 2.76019245379   | 3.91 GiB  | 3.13 GiB | 57.22 MiB / 3.50 GiB  | 0 / 0             |
        | 172.23.108.238 | n1ql        | 0.778307808185  | 3.69 GiB  | 3.01 GiB | 0.0 Byte / 3.50 GiB   | 0 / 0             |
        | 172.23.105.244 | index       | 3.06918238994   | 3.91 GiB  | 3.22 GiB | 116.25 MiB / 3.50 GiB | 0 / 0             |
        | 172.23.105.245 | index       | 2.69453538152   | 3.91 GiB  | 3.20 GiB | 157.75 MiB / 3.50 GiB | 0 / 0             |
        | 172.23.105.155 | kv          | 16.5342219944   | 3.91 GiB  | 2.94 GiB | 115.80 MiB / 3.50 GiB | 0 / 0             |
        | 172.23.105.213 | index       | 5.10101010101   | 3.91 GiB  | 3.22 GiB | 64.89 MiB / 3.50 GiB  | 0 / 0             |
        | 172.23.100.22  | n1ql        | 1.65912518854   | 3.91 GiB  | 3.25 GiB | 134.75 MiB / 3.50 GiB | 0 / 0             |
        | 172.23.105.211 | kv          | 13.910158244    | 3.91 GiB  | 3.19 GiB | 146.50 MiB / 3.50 GiB | 0 / 0             |
        +----------------+-------------+-----------------+-----------+----------+-----------------------+-------------------+

      • Couchbase bucket, replica=1
      • Auto-failover settings - maxCount=5
      • Bring down 4 nodes

        +----------------+----------+-------------+----------------+
        | Node           | Services | Node status | Failover type  |
        +----------------+----------+-------------+----------------+
        | 172.23.105.213 | index    | active      | stop_couchbase |
        | 172.23.100.21  | n1ql     | active      | stop_couchbase |
        | 172.23.105.155 | kv       | active      | stop_couchbase |
        | 172.23.100.22  | n1ql     | active      | stop_couchbase |
        +----------------+----------+-------------+----------------+
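
For reference, the auto-failover setting from the steps above could be applied through the cluster REST API. The sketch below only builds the request and does not send it. It assumes the `/settings/autoFailover` endpoint on port 8091 with `enabled`, `timeout` and `maxCount` form fields; the 30-second timeout is taken from the test command later in this ticket, the helper name is hypothetical, and administrator credentials (required by a real cluster) are left out.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_auto_failover_request(host: str, timeout_s: int, max_count: int) -> Request:
    """Build (but do not send) a POST that enables auto-failover.

    Hypothetical helper for illustration; authentication is omitted.
    """
    body = urlencode(
        {"enabled": "true", "timeout": timeout_s, "maxCount": max_count}
    ).encode("ascii")
    return Request(
        f"http://{host}:8091/settings/autoFailover", data=body, method="POST"
    )


# One of the KV nodes from the cluster table above.
request = build_auto_failover_request("172.23.105.211", timeout_s=30, max_count=5)
```

Sending it would be a matter of adding a Basic auth header and passing `request` to `urllib.request.urlopen`.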

      Observation:

      The non-KV nodes .213, .21, and .22 were failed over, while the down KV node (.155) was not.

      Expected behavior:

      No failover should be allowed, since failing over the KV node is impossible here because it would cause data loss.
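
The expected rule can be sketched as an all-or-nothing check. This is only an illustration of the behavior described above, not the ns_server implementation; the `DownNode` type, the `kv_failover_safe` flag and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class DownNode:
    address: str
    services: FrozenSet[str]
    # Hypothetical flag: True if failing over this node would not lose data.
    kv_failover_safe: bool = True


def nodes_to_auto_failover(down_nodes: List[DownNode]) -> List[str]:
    """Return the addresses to fail over, or [] if failover must be refused.

    If any down KV node cannot be failed over safely, nothing is failed
    over, including the service-only (index / n1ql) nodes.
    """
    for node in down_nodes:
        if "kv" in node.services and not node.kv_failover_safe:
            return []
    return [node.address for node in down_nodes]


# The four nodes brought down in the steps to reproduce.
ticket_nodes = [
    DownNode("172.23.105.213", frozenset({"index"})),
    DownNode("172.23.100.21", frozenset({"n1ql"})),
    DownNode("172.23.105.155", frozenset({"kv"}), kv_failover_safe=False),
    DownNode("172.23.100.22", frozenset({"n1ql"})),
]
result = nodes_to_auto_failover(ticket_nodes)
print(result)
```

With the KV node marked as unsafe to fail over, the sketch returns an empty list, which is the outcome this ticket asks for.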

       


        Activity

          build-team Couchbase Build Team added a comment:

          Build couchbase-server-7.1.0-2271 contains ns_server commit 5c4bfac with commit message:
          MB-50422 disallow auto failing over service nodes if down kv nodes cannot

          build-team Couchbase Build Team added a comment:

          Build couchbase-server-7.1.0-2271 contains ns_server commit 44eb9d2 with commit message:
          MB-50422 correctly combine multiple failover actions into one

          ashwin.govindarajulu Ashwin Govindarajulu added a comment (edited):

          I was able to reproduce the scenario on a 7.1.0-2325 cluster, as follows:

           

          +----------------+-------------+
          | Node           | Services    |
          +----------------+-------------+
          | 172.23.108.240 | index       |
          | 172.23.108.79  | kv          |
          | 172.23.98.18   | n1ql        |
          | 172.23.108.58  | kv          |
          | 172.23.98.164  | n1ql        |
          | 172.23.108.60  | index       |
          | 172.23.109.71  | n1ql        |
          | 172.23.107.192 | index       |
          | 172.23.108.57  | index, n1ql |
          +----------------+-------------+

          +------------------------------------------------------------+-----------+-----------------+-------+
          | Bucket                                                     | Type      | Storage Backend | Items |
          +------------------------------------------------------------+-----------+-----------------+-------+
          | mREGMnmMrkiWM0xnA0UL2nWiNqXjYBDevh3lprCkXgHN1w6i5-1-455000 | couchbase | couchstore      | 50000 |
          +------------------------------------------------------------+-----------+-----------------+-------+
           
          Fail over 1 kv, 2 n1ql, and 1 index node:
           
          +----------------+----------+----------------+----------------+
          | Node           | Services | Node status    | Failover type  |
          +----------------+----------+----------------+----------------+
          | 172.23.107.192 | index    | inactiveFailed | stop_couchbase |
          | 172.23.108.58  | kv       | active         | stop_couchbase |
          | 172.23.98.164  | n1ql     | inactiveFailed | stop_couchbase |
          | 172.23.109.71  | n1ql     | inactiveFailed | stop_couchbase |
          +----------------+----------+----------------+----------------+

          Log Snapshot: http://supportal.couchbase.com/snapshot/f0f6b3384029ada1687b62c002e77f6b::0

          cbcollect file for 7.1.0-2325:
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.101.zip
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.102.zip
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.103.zip
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.104.zip
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.106.zip
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.107.zip
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.108.zip
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.109.zip
          https://cb-engineering.s3.amazonaws.com/non_kv_nodes_failed/collectinfo-2022-02-17T112244-ns_1%40172.23.136.110.zip

          Test case:

          guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.49394.ini -t failover.concurrent_failovers.ConcurrentFailoverTests.test_concurrent_failover,nodes_init=9,services_init=kv-kv-index:n1ql-index-index-index-n1ql-n1ql-n1ql,replicas=1,maxCount=5,timeout=30,failover_order=kv:index:n1ql:n1ql,failover_method=stop_couchbase,bucket_spec=single_bucket.default,num_items=100000'
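
To make the test parameters easier to read against the tables above, here is a small sketch that decodes `services_init` and `failover_order`. It assumes that `-` separates nodes and `:` separates services on one node in `services_init`, and that `:` separates nodes in `failover_order`; this reading is inferred from the cluster tables in this ticket rather than from the testrunner source, and the function name is hypothetical.

```python
from collections import Counter
from typing import List


def parse_services_init(value: str) -> List[List[str]]:
    """Split a services_init string into one list of services per node.

    Assumed format: '-' between nodes, ':' between services on a node.
    """
    return [node.split(":") for node in value.split("-")]


services_init = "kv-kv-index:n1ql-index-index-index-n1ql-n1ql-n1ql"
nodes = parse_services_init(services_init)

# Assumed format: ':' between the nodes to be brought down, in order.
failover_order = "kv:index:n1ql:n1ql".split(":")
failover_counts = Counter(failover_order)

for i, services in enumerate(nodes, start=1):
    print(f"node {i}: {', '.join(services)}")
print(dict(failover_counts))
```

Under these assumptions the string describes 9 nodes (2 kv, 1 index + n1ql, 3 index, 3 n1ql), and the failover order brings down 1 kv, 1 index and 2 n1ql nodes, which lines up with `nodes_init=9` and the failover table in this comment.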
          


          artem Artem Stemkovski added a comment:

          The provided log files do not match the nodes mentioned in the reproduction scenario. Please provide the correct log files.

          ashwin.govindarajulu Ashwin Govindarajulu added a comment (edited):

          Artem Stemkovski, sorry about the mismatch in the log file names. Please find the new logs below:

          +----------------+-------------+
          | Node           | Services    |
          +----------------+-------------+
          | 172.23.105.212 | index, n1ql |
          | 172.23.105.244 | index       |
          | 172.23.105.245 | index       |
          | 172.23.136.112 | n1ql        |
          | 172.23.105.155 | kv          |
          | 172.23.105.213 | index       |
          | 172.23.136.113 | n1ql        |
          | 172.23.136.105 | n1ql        |
          | 172.23.105.211 | kv          |
          +----------------+-------------+
          +----------------------------------------------------+-----------+-----------------+----------+-------+
          | Bucket                                             | Type      | Storage Backend | Replicas | Items |
          +----------------------------------------------------+-----------+-----------------+----------+-------+
          | qMOy2WDPQCPuoNOhi9JUVxNUGNdIrG_9p81x%v1A-59-999000 | couchbase | couchstore      | 1        | 50000 |
          +----------------------------------------------------+-----------+-----------------+----------+-------+

          Failover info:

          +----------------+----------+----------------+----------------+
          | Node           | Services | Node status    | Failover type  |
          +----------------+----------+----------------+----------------+
          | 172.23.136.112 | n1ql     | inactiveFailed | stop_couchbase |
          | 172.23.105.155 | kv       | active         | stop_couchbase |
          | 172.23.136.105 | n1ql     | inactiveFailed | stop_couchbase |
          | 172.23.105.213 | index    | inactiveFailed | stop_couchbase |
          +----------------+----------+----------------+----------------+

          Snapshot: http://supportal.couchbase.com/snapshot/fad0f2b249a8a7570bdf633e7645d441::0

          cbcollect info files:
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.105.155.zip
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.105.211.zip
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.105.212.zip
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.105.213.zip
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.105.244.zip
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.105.245.zip
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.136.105.zip
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.136.112.zip
          https://cb-engineering.s3.amazonaws.com/mb_50422/collectinfo-2022-02-18T021903-ns_1%40172.23.136.113.zip


          build-team Couchbase Build Team added a comment:

          Build couchbase-server-7.1.0-2391 contains ns_server commit d9447d2 with commit message:
          MB-50422 do not rely on issued warnings to figure out if any down kv

          ashwin.govindarajulu Ashwin Govindarajulu added a comment:

          Validated on build 7.1.0-2393-enterprise. Closing this ticket.

          People

            ashwin.govindarajulu Ashwin Govindarajulu
            ashwin.govindarajulu Ashwin Govindarajulu