Couchbase Server
MB-36910

[Volume] Doc ops stop after failover of a node.


Details

    Description

      Steps to Reproduce:

      1. Create a 4-node cluster.

        +--------------+----------+--------------+
        16:19:37 | Nodes        | Services | Status       |
        16:19:37 +--------------+----------+--------------+
        16:19:37 | 172.23.97.6  | kv       | Cluster node |
        16:19:37 | 172.23.97.4  | None     | <--- IN ---  |
        16:19:37 | 172.23.97.10 | None     | <--- IN ---  |
        16:19:37 | 172.23.97.7  | None     | <--- IN ---  |
        16:19:37 +--------------+----------+--------------+ 

      2. Create a bucket with compression=off, replicas = 1, eviction=valueOnly.
      3. Load 100M items in the bucket with durability=MAJORITY. Bucket Stats after this Step:

         +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        17:40:52 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used   |
        17:40:52 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        17:40:52 | GleamBookUsers | membase | 1        | 0   | 100000000 | 431270920192 | 273089101384 | 81521510935 |
        17:40:52 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
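With durability=MAJORITY, each write in Step 3 must be acknowledged by a majority of the copies of its vBucket (the active copy plus the configured replicas) before the server commits it. A minimal sketch of that quorum arithmetic (illustrative only, not Couchbase's implementation):

```python
def majority_acks(replicas: int) -> int:
    """Copies of a vBucket (active + replicas) that must acknowledge a
    durability=MAJORITY write before it is committed."""
    copies = replicas + 1   # one active copy plus the configured replicas
    return copies // 2 + 1  # strict majority of all copies

# With replicas=1 (Steps 2-7): 2 of 2 copies must ack every write,
# so losing either copy's node blocks durable writes.
# After the bump to replicas=2 (Step 8): 2 of 3 copies must ack,
# and the loss of a single node can still satisfy the quorum.
```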

      4. Rebalance In 1 node(172.23.97.3) with another 40M updates, 20M creates in parallel with durability=MAJORITY. Bucket Stats after this Step:

         +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        21:35:55 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used   |
        21:35:55 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        21:35:55 | GleamBookUsers | membase | 1        | 0   | 120000000 | 539088650240 | 299280923712 | 70423305467 |
        21:35:55 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+

      5. Rebalance Out 1 node(172.23.97.7) with another 40M updates, 20M creates, 20M deletes in parallel with durability=MAJORITY. Bucket Stats after this Step:

         +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        01:00:01 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used   |
        01:00:01 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        01:00:01 | GleamBookUsers | membase | 1        | 0   | 120000000 | 431270920192 | 272898375440 | 58203573362 |
        01:00:01 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+

      6. Rebalance In 2 nodes (172.23.97.7, 172.23.97.5) and Rebalance Out 1 node (172.23.97.3) with another 40M updates, 20M creates, 20M deletes in parallel with durability=MAJORITY. Bucket Stats after this Step:

         +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        04:19:21 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used   |
        04:19:21 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        04:19:21 | GleamBookUsers | membase | 1        | 0   | 120000000 | 539088650240 | 272818305944 | 64718980716 |
        04:19:21 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+

      7. Swap Rebalance 1 node (IN=172.23.97.3, OUT=172.23.97.4) with another 40M updates, 20M creates, 20M deletes in parallel with durability=MAJORITY. Bucket Stats after this Step:

         +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        07:17:52 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used   |
        07:17:52 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        07:17:52 | GleamBookUsers | membase | 1        | 0   | 120000000 | 539088650240 | 245219703984 | 67488163088 |
        07:17:52 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+

      8. Update the bucket replica from 1 to 2.
      9. Rebalance In 1 node(172.23.97.4) with another 40M updates, 20M creates, 20M deletes in parallel with durability=MAJORITY. Bucket Stats after this Step:

         +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        12:45:00 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used   |
        12:45:00 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+
        12:45:00 | GleamBookUsers | membase | 2        | 0   | 120000000 | 646906380288 | 319439571792 | 97107407634 |
        12:45:00 +----------------+---------+----------+-----+-----------+--------------+--------------+-------------+

      10. Rebalance the cluster. After the rebalance completes successfully, perform another 40M updates, 20M creates, 20M deletes with durability=MAJORITY.
      11. While Step 10 is in progress, stop the memcached process, then restart it after 20 seconds. Bucket Stats after Steps 10 & 11:

         +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+
        15:33:22 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used    |
        15:33:22 +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+
        15:33:22 | GleamBookUsers | membase | 2        | 0   | 120000000 | 646906380288 | 323341148496 | 103890043892 |
        15:33:22 +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+
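During the 20-second memcached outage in Step 11, in-flight KV operations fail transiently and the load generator has to retry them until the process is back. A generic retry-with-backoff sketch, with a hypothetical TransientError standing in for whatever retryable exception the client library raises:

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable KV failure (e.g. connection refused
    while memcached is restarting)."""

def with_retry(op, attempts=6, base_delay=0.05):
    """Run `op`, retrying with exponential backoff on transient failures."""
    for i in range(attempts):
        try:
            return op()
        except TransientError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))  # 0.05s, 0.1s, 0.2s, ...

# Simulated: the first two calls hit the restarting node, the third succeeds.
calls = {"n": 0}
def flaky_upsert():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("memcached restarting")
    return "SUCCESS"

print(with_retry(flaky_upsert))  # prints "SUCCESS" after two retries
```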

      12. Perform another 40M updates, 20M creates, 20M deletes with durability=MAJORITY.
      13. While Step 12 is in progress, failover a node(172.23.97.5).
      14. Rebalance Out the node failed over in Step 13 while Step 12 is still in progress.
      15. Wait for Step 12 to finish.
      16. Rebalance In 1 node(172.23.97.5). Bucket Stats after operations performed in Steps 12-16:

         +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+
        21:49:45 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used    |
        21:49:45 +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+
        21:49:45 | GleamBookUsers | membase | 2        | 0   | 120000000 | 646906380288 | 312651660488 | 137025893952 |
        21:49:45 +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+

      17. Perform another 40M updates, 20M creates, 20M deletes with durability=MAJORITY.
      18. While Step 17 is in progress, failover a node(172.23.97.5).
      19. Fully Recover the node failed over in Step 18 while Step 17 is in progress.
      20. Wait for Step 17 to finish.
      21. Bucket Stats after Steps 17-20:

         +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+
        01:12:23 | Bucket         | Type    | Replicas | TTL | Items     | RAM Quota    | RAM Used     | Disk Used    |
        01:12:23 +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+
        01:12:23 | GleamBookUsers | membase | 2        | 0   | 120000000 | 646906380288 | 307424522632 | 119972427126 |
        01:12:23 +----------------+---------+----------+-----+-----------+--------------+--------------+--------------+

      22. Perform another 40M updates, 20M creates, 20M deletes with durability=MAJORITY.
      23. While Step 22 is in progress, failover a node(172.23.97.5).

      The doc ops throughput dropped to 0 and stayed there for 9 hours.
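A flatline like the one above (throughput pinned at 0 for hours, not a brief dip) is the kind of condition a load generator can flag automatically instead of running on for 9 hours. A simple illustrative stall detector over periodic ops/sec samples (names and thresholds hypothetical):

```python
def stalled_for(samples, window):
    """Return True if the last `window` throughput samples are all zero,
    i.e. doc ops have flatlined rather than merely dipped."""
    if len(samples) < window:
        return False
    return all(s == 0 for s in samples[-window:])

# One sample per minute: a 2-minute dip to 0 is tolerated,
# but a 10-minute flatline (as after the failover in Step 23) is a stall.
ops_per_min = [4200, 3900, 0, 0, 4100] + [0] * 10
print(stalled_for(ops_per_min, window=10))  # prints "True"
```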

      CPU config of the slave:

      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                56
      On-line CPU(s) list:   0-55
      Thread(s) per core:    2
      Core(s) per socket:    14
      Socket(s):             2
      NUMA node(s):          2
      Vendor ID:             GenuineIntel
      CPU family:            6
      Model:                 79
      Model name:            Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
      Stepping:              1
      CPU MHz:               1200.750
      CPU max MHz:           3300.0000
      CPU min MHz:           1200.0000
      BogoMIPS:              4795.53
      Virtualization:        VT-x
      L1d cache:             32K
      L1i cache:             32K
      L2 cache:              256K
      L3 cache:              35840K
      NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
      NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
      Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d

      CPU config of the master node:

      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                56
      On-line CPU(s) list:   0-55
      Thread(s) per core:    2
      Core(s) per socket:    14
      Socket(s):             2
      NUMA node(s):          2
      Vendor ID:             GenuineIntel
      CPU family:            6
      Model:                 79
      Model name:            Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
      Stepping:              1
      CPU MHz:               2899.968
      CPU max MHz:           3300.0000
      CPU min MHz:           1200.0000
      BogoMIPS:              4794.82
      Virtualization:        VT-x
      L1d cache:             32K
      L1i cache:             32K
      L2 cache:              256K
      L3 cache:              35840K
      NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
      NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
      Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

       

      Attachments


        Activity

          People

            prateek.kumar Prateek Kumar (Inactive)
            prateek.kumar Prateek Kumar (Inactive)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
