Couchbase Server
MB-36564

[Volume] Failing over a node causes memcached crashes on 2 other nodes.


Details

    Description

      Steps to Reproduce:

      1. Create a 7-node cluster:

      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.106.134 | [u'kv']  | Cluster node |
      | 172.23.106.136 | None     | <--- IN ---  |
      | 172.23.106.137 | None     | <--- IN ---  |
      | 172.23.106.138 | None     | <--- IN ---  |
      | 172.23.105.168 | None     | <--- IN ---  |
      | 172.23.106.82  | None     | <--- IN ---  |
      | 172.23.106.83  | None     | <--- IN ---  |
      +----------------+----------+--------------+
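
      A minimal sketch of how step 1 could be scripted against the ns_server REST API. The actual test framework and credentials are not shown in this ticket; Administrator/password below is an assumption.

      import requests

      SEED = "http://172.23.106.134:8091"
      AUTH = ("Administrator", "password")  # assumed test credentials
      NEW_NODES = ["172.23.106.136", "172.23.106.137", "172.23.106.138",
                   "172.23.105.168", "172.23.106.82", "172.23.106.83"]

      # Add each remaining node with only the data (kv) service, as in the table above.
      for ip in NEW_NODES:
          requests.post(f"{SEED}/controller/addNode", auth=AUTH,
                        data={"hostname": ip, "user": AUTH[0],
                              "password": AUTH[1],
                              "services": "kv"}).raise_for_status()

      # Rebalance so the newly added nodes become active cluster members.
      otp_nodes = [n["otpNode"] for n in
                   requests.get(f"{SEED}/pools/default", auth=AUTH).json()["nodes"]]
      requests.post(f"{SEED}/controller/rebalance", auth=AUTH,
                    data={"knownNodes": ",".join(otp_nodes),
                          "ejectedNodes": ""}).raise_for_status()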
       

      2. Create a bucket with replicas=1, eviction policy=valueOnly, compression=off.
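
      A sketch of step 2 via the bucket REST endpoint. The bucket name GleamBookUsers is taken from the janitor error later in this ticket; the RAM quota is an assumption.

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      requests.post(f"{SEED}/pools/default/buckets", auth=AUTH,
                    data={"name": "GleamBookUsers",
                          "bucketType": "couchbase",
                          "ramQuotaMB": 1024,          # assumed per-node quota
                          "replicaNumber": 1,
                          "evictionPolicy": "valueOnly",
                          "compressionMode": "off"}).raise_for_status()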

      3. Load 50M docs with durability = MAJORITY.
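
      Steps 3, 11, 14, and 18 all drive the bucket with durability=MAJORITY writes. A minimal sketch, assuming the Couchbase Python SDK 3.x rather than the volume test's own loader:

      from couchbase.cluster import Cluster, ClusterOptions
      from couchbase.auth import PasswordAuthenticator
      from couchbase.durability import Durability, ServerDurability

      cluster = Cluster("couchbase://172.23.106.134",
                        ClusterOptions(PasswordAuthenticator("Administrator", "password")))
      collection = cluster.bucket("GleamBookUsers").default_collection()

      # Each write is acknowledged only after it is replicated to a majority
      # of the configured replicas (durability level MAJORITY).
      for i in range(50_000_000):
          collection.upsert(f"user-{i}", {"idx": i},
                            durability=ServerDurability(Durability.MAJORITY))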

      4. Rebalance In 1 node (172.23.106.85) with 10M creates, 20M updates with durability=MAJORITY in parallel.

      5. Rebalance Out 1 node (172.23.106.83) with 10M creates, 20M updates, 10M deletes with durability=MAJORITY in parallel.

      6. Rebalance In 2 nodes (172.23.106.83, 172.23.106.86) and Rebalance Out 1 node (172.23.106.82) with 10M creates, 20M updates, 10M deletes with durability=MAJORITY in parallel.

      7. Swap Rebalance 1 node (IN=172.23.106.82, OUT=172.23.105.168) with 10M creates, 20M updates, 10M deletes with durability=MAJORITY in parallel.
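
      Steps 4-7 are the same rebalance call with different knownNodes/ejectedNodes; as an example, the step-7 swap rebalance could look like this sketch (same SEED/AUTH assumptions as the step-1 sketch):

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      # Add the incoming node, then rebalance with the outgoing node ejected.
      requests.post(f"{SEED}/controller/addNode", auth=AUTH,
                    data={"hostname": "172.23.106.82", "user": AUTH[0],
                          "password": AUTH[1], "services": "kv"}).raise_for_status()

      nodes = requests.get(f"{SEED}/pools/default", auth=AUTH).json()["nodes"]
      requests.post(f"{SEED}/controller/rebalance", auth=AUTH,
                    data={"knownNodes": ",".join(n["otpNode"] for n in nodes),
                          "ejectedNodes": "ns_1@172.23.105.168"}).raise_for_status()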

      8. Update the Bucket replica number from 1 to 2.
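
      Step 8 is a bucket settings update; via REST it is a POST back to the bucket's own URI (sketch, same SEED/AUTH assumptions as above):

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      requests.post(f"{SEED}/pools/default/buckets/GleamBookUsers", auth=AUTH,
                    data={"replicaNumber": 2}).raise_for_status()
      # The extra replica is only materialized by the rebalances in steps 9-10.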

      9. Rebalance In 1 node (172.23.105.168) with 10M creates, 20M updates, 10M deletes with durability=MAJORITY in parallel.

      10. Rebalance the cluster.

      11. Perform 10M creates, 20M updates, 10M deletes with durability = MAJORITY.

      12. While Step 11 is in progress, stop the memcached process on 172.23.106.137.

      13. Sleep for 20 seconds before restarting the memcached process on 172.23.106.137. Step 11 was successfully completed.
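
      The ticket does not show how the test stops memcached in steps 12-13; one way to emulate it, assuming shell access on 172.23.106.137, is to freeze and resume the process (alternatively, kill it and let the babysitter restart it):

      import subprocess, time

      subprocess.check_call(["pkill", "-STOP", "memcached"])   # step 12: stop serving
      time.sleep(20)                                           # step 13: 20 second pause
      subprocess.check_call(["pkill", "-CONT", "memcached"])   # let memcached resume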

      14. Perform 10M creates, 20M updates, 10M deletes with durability = MAJORITY.

      15. While Step 14 is in progress, fail over a node (172.23.106.83).

      16. Rebalance Out the node failed over in Step 15. Step 14 was successfully completed. 
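
      A sketch of steps 15-16 against the REST API; the ticket does not say whether the failover was graceful or hard, so a hard failover is shown (same SEED/AUTH assumptions as above):

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      # Step 15: hard failover of 172.23.106.83.
      requests.post(f"{SEED}/controller/failOver", auth=AUTH,
                    data={"otpNode": "ns_1@172.23.106.83"}).raise_for_status()

      # Step 16: rebalance the failed-over node out of the cluster.
      nodes = requests.get(f"{SEED}/pools/default", auth=AUTH).json()["nodes"]
      requests.post(f"{SEED}/controller/rebalance", auth=AUTH,
                    data={"knownNodes": ",".join(n["otpNode"] for n in nodes),
                          "ejectedNodes": "ns_1@172.23.106.83"}).raise_for_status()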

      17. Rebalance In 1 node (172.23.106.83).

      18. Perform 10M creates, 20M updates, 10M deletes with durability = MAJORITY.

      19. While Step 18 is in progress, fail over a node (172.23.106.83).

      20. The failover could not complete properly, failing with this error:

      Janitor cleanup of "GleamBookUsers" failed after failover of ['ns_1@172.23.106.83']:
      {'EXIT',
       {{badmatch,
         {error,
          {failed_nodes,
           ['ns_1@172.23.106.137',
            'ns_1@172.23.106.134',
            'ns_1@172.23.106.82',
            'ns_1@172.23.106.136']}}},
        [{ns_janitor,cleanup_apply_config_body,4,
          [{file,"src/ns_janitor.erl"},{line,286}]},
         {ns_janitor,'-cleanup_apply_config/4-fun-0-',4,
          [{file,"src/ns_janitor.erl"},{line,209}]},
         {async,'-async_init/4-fun-1-',3,
          [{file,"src/async.erl"},{line,197}]}]}}

      Failover couldn't complete on some nodes:
      ['ns_1@172.23.106.83'] 

      21. Full recovery of the node failed over in Step 19 (172.23.106.83) was started.
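
      A sketch of the full-recovery add-back in step 21: setRecoveryType marks the failed-over node for recovery, and the next rebalance adds it back (same SEED/AUTH assumptions as above):

      import requests
      SEED, AUTH = "http://172.23.106.134:8091", ("Administrator", "password")

      requests.post(f"{SEED}/controller/setRecoveryType", auth=AUTH,
                    data={"otpNode": "ns_1@172.23.106.83",
                          "recoveryType": "full"}).raise_for_status()
      # The node rejoins the cluster on the next rebalance.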

      22. While Steps 19 and 21 were in progress, memcached crashed on 2 nodes (172.23.106.82, 172.23.106.137).

      Crash Message on 172.23.106.82:

      Service 'memcached' exited with status 134. Restarting. Messages:
      2019-10-19T15:07:09.819701-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fbbcd8df000+0x8efd1]
      2019-10-19T15:07:09.819723-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fbbcd8df000+0x8f213]
      2019-10-19T15:07:09.819750-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0xd3098]
      2019-10-19T15:07:09.819765-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0xe6eef]
      2019-10-19T15:07:09.819774-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0x1375d5]
      2019-10-19T15:07:09.819782-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0x137b8d]
      2019-10-19T15:07:09.819788-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fbbc8265000+0x131574]
      2019-10-19T15:07:09.819794-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7fbbcf78b000+0x8f27]
      2019-10-19T15:07:09.819800-07:00 CRITICAL /lib64/libpthread.so.0() [0x7fbbcd1aa000+0x7dd5]
      2019-10-19T15:07:09.819829-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7fbbccddd000+0xfdead] 

      Crash Message on 172.23.106.137:

      Service 'memcached' exited with status 134. Restarting. Messages:
      2019-10-19T17:47:12.108628-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f641ff1f000+0x8efd1]
      2019-10-19T17:47:12.108642-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f641ff1f000+0x8f213]
      2019-10-19T17:47:12.388398-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0xd3098]
      2019-10-19T17:47:12.388430-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0xe6eef]
      2019-10-19T17:47:12.388439-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0x1375d5]
      2019-10-19T17:47:12.388473-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0x137b8d]
      2019-10-19T17:47:12.388482-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f641aa65000+0x131574]
      2019-10-19T17:47:12.388501-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f6421dcb000+0x8f27]
      2019-10-19T17:47:12.388508-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f641f7ea000+0x7dd5]
      2019-10-19T17:47:12.388544-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f641f41d000+0xfdead] 

      23. Step 18 completed successfully.

      24. Rebalance the cluster.

      25. After Step 24, 172.23.106.137 was automatically failed over with this error:

      Node ('ns_1@172.23.106.137') was automatically failed over. Reason: The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service.  

       
