Couchbase Server
MB-36564

[Volume] Failing over a node causes memcached crashes on 2 other nodes.


Details

    Description

      Steps to Reproduce:

      1. Create a 7-node cluster:

      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.106.134 | [u'kv']  | Cluster node |
      | 172.23.106.136 | None     | <--- IN ---  |
      | 172.23.106.137 | None     | <--- IN ---  |
      | 172.23.106.138 | None     | <--- IN ---  |
      | 172.23.105.168 | None     | <--- IN ---  |
      | 172.23.106.82  | None     | <--- IN ---  |
      | 172.23.106.83  | None     | <--- IN ---  |
      +----------------+----------+--------------+
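
      A minimal sketch of how step 1 could be scripted against the ns_server REST API. The actual test framework and credentials are not shown in this ticket; Administrator/password below is an assumption.

      import requests

      SEED = "http://172.23.106.134:8091"
      AUTH = ("Administrator", "password")  # assumed test credentials
      NEW_NODES = ["172.23.106.136", "172.23.106.137", "172.23.106.138",
                   "172.23.105.168", "172.23.106.82", "172.23.106.83"]

      # Add each remaining node with only the data (kv) service, as in the table above.
      for ip in NEW_NODES:
          requests.post(f"{SEED}/controller/addNode", auth=AUTH,
                        data={"hostname": ip, "user": AUTH[0],
                              "password": AUTH[1],
                              "services": "kv"}).raise_for_status()

      # Rebalance so the newly added nodes become active cluster members.
      otp_nodes = [n["otpNode"] for n in
                   requests.get(f"{SEED}/pools/default", auth=AUTH).json()["nodes"]]
      requests.post(f"{SEED}/controller/rebalance", auth=AUTH,
                    data={"knownNodes": ",".join(otp_nodes),
                          "ejectedNodes": ""}).raise_for_status()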
       

      2. Create a bucket with replicas=1, eviction policy=valueOnly, compression=off.
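
      A sketch of step 2 via the bucket REST endpoint. The bucket name GleamBookUsers is taken from the janitor error later in this ticket; the RAM quota is an assumption.

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      requests.post(f"{SEED}/pools/default/buckets", auth=AUTH,
                    data={"name": "GleamBookUsers",
                          "bucketType": "couchbase",
                          "ramQuotaMB": 1024,          # assumed per-node quota
                          "replicaNumber": 1,
                          "evictionPolicy": "valueOnly",
                          "compressionMode": "off"}).raise_for_status()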

      3. Load 50M docs with durability = MAJORITY.
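
      Steps 3, 11, 14, and 18 all drive the bucket with durability=MAJORITY writes. A minimal sketch, assuming the Couchbase Python SDK 3.x rather than the volume test's own loader:

      from couchbase.cluster import Cluster, ClusterOptions
      from couchbase.auth import PasswordAuthenticator
      from couchbase.durability import Durability, ServerDurability

      cluster = Cluster("couchbase://172.23.106.134",
                        ClusterOptions(PasswordAuthenticator("Administrator", "password")))
      collection = cluster.bucket("GleamBookUsers").default_collection()

      # Each write is acknowledged only after it is replicated to a majority
      # of the configured replicas (durability level MAJORITY).
      for i in range(50_000_000):
          collection.upsert(f"user-{i}", {"idx": i},
                            durability=ServerDurability(Durability.MAJORITY))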

      4. Rebalance In 1 node (172.23.106.85) with 10M creates, 20M updates with durability=MAJORITY in parallel.

      5. Rebalance Out 1 node (172.23.106.83) with 10M creates, 20M updates, 10M deletes with durability=MAJORITY in parallel.

      6. Rebalance In 2 nodes (172.23.106.83, 172.23.106.86) and Rebalance Out 1 node (172.23.106.82) with 10M creates, 20M updates, 10M deletes with durability=MAJORITY in parallel.

      7. Swap Rebalance 1 node (IN=172.23.106.82, OUT=172.23.105.168) with 10M creates, 20M updates, 10M deletes with durability=MAJORITY in parallel.
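
      Steps 4-7 are the same rebalance call with different knownNodes/ejectedNodes; as an example, the step-7 swap rebalance could look like this sketch (same SEED/AUTH assumptions as the step-1 sketch):

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      # Add the incoming node, then rebalance with the outgoing node ejected.
      requests.post(f"{SEED}/controller/addNode", auth=AUTH,
                    data={"hostname": "172.23.106.82", "user": AUTH[0],
                          "password": AUTH[1], "services": "kv"}).raise_for_status()

      nodes = requests.get(f"{SEED}/pools/default", auth=AUTH).json()["nodes"]
      requests.post(f"{SEED}/controller/rebalance", auth=AUTH,
                    data={"knownNodes": ",".join(n["otpNode"] for n in nodes),
                          "ejectedNodes": "ns_1@172.23.105.168"}).raise_for_status()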

      8. Update the Bucket replica number from 1 to 2.
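
      Step 8 is a bucket settings update; via REST it is a POST back to the bucket's own URI (sketch, same SEED/AUTH assumptions as above):

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      requests.post(f"{SEED}/pools/default/buckets/GleamBookUsers", auth=AUTH,
                    data={"replicaNumber": 2}).raise_for_status()
      # The extra replica is only materialized by the rebalances in steps 9-10.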

      9. Rebalance In 1 node (172.23.105.168) with 10M creates, 20M updates, 10M deletes with durability=MAJORITY in parallel.

      10. Rebalance the cluster.

      11. Perform 10M creates, 20M updates, 10M deletes with durability = MAJORITY.

      12. While Step 11 is in progress, stop the memcached process on 172.23.106.137.

      13. Sleep for 20 seconds before restarting the memcached process on 172.23.106.137. Step 11 was successfully completed.
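
      The ticket does not show how the test stops memcached in steps 12-13; one way to emulate it, assuming shell access on 172.23.106.137, is to freeze and resume the process (alternatively, kill it and let the babysitter restart it):

      import subprocess, time

      subprocess.check_call(["pkill", "-STOP", "memcached"])   # step 12: stop serving
      time.sleep(20)                                           # step 13: 20 second pause
      subprocess.check_call(["pkill", "-CONT", "memcached"])   # let memcached resume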

      14. Perform 10M creates, 20M updates, 10M deletes with durability = MAJORITY.

      15. While Step 14 is in progress, fail over a node (172.23.106.83).

      16. Rebalance Out the node failed over in Step 15. Step 14 was successfully completed. 
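
      A sketch of steps 15-16 against the REST API; the ticket does not say whether the failover was graceful or hard, so a hard failover is shown (same SEED/AUTH assumptions as above):

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      # Step 15: hard failover of 172.23.106.83.
      requests.post(f"{SEED}/controller/failOver", auth=AUTH,
                    data={"otpNode": "ns_1@172.23.106.83"}).raise_for_status()

      # Step 16: rebalance the failed-over node out of the cluster.
      nodes = requests.get(f"{SEED}/pools/default", auth=AUTH).json()["nodes"]
      requests.post(f"{SEED}/controller/rebalance", auth=AUTH,
                    data={"knownNodes": ",".join(n["otpNode"] for n in nodes),
                          "ejectedNodes": "ns_1@172.23.106.83"}).raise_for_status()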

      17. Rebalance In 1 node (172.23.106.83).

      18. Perform 10M creates, 20M updates, 10M deletes with durability = MAJORITY.

      19. While Step 18 is in progress, fail over a node (172.23.106.83).

      20. The failover could not complete properly, failing with this error:

      Janitor cleanup of "GleamBookUsers" failed after failover of ['ns_1@172.23.106.83']:
      {'EXIT',
       {{badmatch,
         {error,
          {failed_nodes,
           ['ns_1@172.23.106.137',
            'ns_1@172.23.106.134',
            'ns_1@172.23.106.82',
            'ns_1@172.23.106.136']}}},
        [{ns_janitor,cleanup_apply_config_body,4,
          [{file,"src/ns_janitor.erl"},{line,286}]},
         {ns_janitor,'-cleanup_apply_config/4-fun-0-',4,
          [{file,"src/ns_janitor.erl"},{line,209}]},
         {async,'-async_init/4-fun-1-',3,
          [{file,"src/async.erl"},{line,197}]}]}}

      Failover couldn't complete on some nodes:
      ['ns_1@172.23.106.83'] 

      21. Full recovery of the node failed over in Step 19 (172.23.106.83) was started.
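
      A sketch of the full-recovery add-back in step 21: setRecoveryType marks the failed-over node for recovery, and the next rebalance adds it back (same SEED/AUTH assumptions as above):

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      requests.post(f"{SEED}/controller/setRecoveryType", auth=AUTH,
                    data={"otpNode": "ns_1@172.23.106.83",
                          "recoveryType": "full"}).raise_for_status()
      # The node rejoins the cluster on the next rebalance.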

      22. While Steps 19 and 21 were in progress, memcached crashed on 2 nodes (172.23.106.82, 172.23.106.137).

      Crash Message on 172.23.106.82:

      Service 'memcached' exited with status 134. Restarting. Messages:
      2019-10-19T15:07:09.819701-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fbbcd8df000+0x8efd1]
      2019-10-19T15:07:09.819723-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fbbcd8df000+0x8f213]
      2019-10-19T15:07:09.819750-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0xd3098]
      2019-10-19T15:07:09.819765-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0xe6eef]
      2019-10-19T15:07:09.819774-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0x1375d5]
      2019-10-19T15:07:09.819782-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0x137b8d]
      2019-10-19T15:07:09.819788-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0x131574]
      2019-10-19T15:07:09.819794-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7fbbcf78b000+0x8f27]
      2019-10-19T15:07:09.819800-07:00 CRITICAL /lib64/libpthread.so.0() [0x7fbbcd1aa000+0x7dd5]
      2019-10-19T15:07:09.819829-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7fbbccddd000+0xfdead] 

      Crash Message on 172.23.106.137:

      Service 'memcached' exited with status 134. Restarting. Messages:
      2019-10-19T17:47:12.108628-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f641ff1f000+0x8efd1]
      2019-10-19T17:47:12.108642-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f641ff1f000+0x8f213]
      2019-10-19T17:47:12.388398-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0xd3098]
      2019-10-19T17:47:12.388430-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0xe6eef]
      2019-10-19T17:47:12.388439-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0x1375d5]
      2019-10-19T17:47:12.388473-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0x137b8d]
      2019-10-19T17:47:12.388482-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0x131574]
      2019-10-19T17:47:12.388501-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f6421dcb000+0x8f27]
      2019-10-19T17:47:12.388508-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f641f7ea000+0x7dd5]
      2019-10-19T17:47:12.388544-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f641f41d000+0xfdead] 

      23. Step 18 completed successfully.

      24. Rebalance the cluster.

      25. After Step 24, 172.23.106.137 was automatically failed over with this error:

      Node ('ns_1@172.23.106.137') was automatically failed over. Reason: The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service.  

       
