Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 7.2.0
Affects Version/s: 7.2.0
Component/s: secondary-index
Labels:
Environment: Enterprise Edition 7.2.0 build 5232
Triage: Untriaged
Story Points: 0
Is this a Regression?: Unknown

Description

Build: 7.2.0-5232
Test: -test tests/2i/neo/test_neo_idx_clusterops_recovery.yml -scope tests/2i/neo/scope_neo_plasma_idx_dgm.yml
Scale: 3

An auto-failover of a node appears to have been attempted (the trigger is unclear), but it did not go through because the safety check for the index service failed.

The problematic node appears to be 172.23.97.109:

/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[user:info,2023-03-07T08:25:00.948-08:00,ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.

/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[user:info,2023-03-07T08:25:08.964-08:00,ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.

/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[ns_server:info,2023-03-07T08:25:08.965-08:00,ns_1@172.23.96.198:ns_log<0.25245.0>:ns_log:is_duplicate_log:156]suppressing duplicate log auto_failover:0([<<"Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.">>]) because it's been seen 1 times in the past 8.016095 secs (last seen 8.016095 secs ago

/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[user:info,2023-03-07T08:25:15.977-08:00,ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.

/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[ns_server:info,2023-03-07T08:25:15.977-08:00,ns_1@172.23.96.198:ns_log<0.25245.0>:ns_log:is_duplicate_log:156]suppressing duplicate log auto_failover:0([<<"Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.">>]) because it's been seen 2 times in the past 15.028912 secs (last seen 7.012817 secs ago
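When triaging a cbcollect bundle, it can help to count how often each node was refused auto-failover. The following is a minimal sketch, not part of ns_server or any Couchbase tool: the regular expression is inferred only from the log lines quoted above, and the function name `unsafe_failover_attempts` is hypothetical. It matches only entries emitted by `auto_failover:log_unsafe_node`, so the "suppressing duplicate log" lines that quote the same message are not counted twice.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpts in this ticket (an assumption, not a
# documented log format). Only auto_failover:log_unsafe_node entries match.
UNSAFE_RE = re.compile(
    r"\[user:info,(?P<ts>[^,]+),(?P<reporter>[^:]+):[^\]]*"
    r"auto_failover:log_unsafe_node:\d+\]"
    r"Could not automatically fail over node \('(?P<node>[^']+)'\) "
    r"due to operation being unsafe for service (?P<service>\w+)\."
)


def unsafe_failover_attempts(lines):
    """Count refused auto-failover attempts per (node, service)."""
    counts = Counter()
    for line in lines:
        m = UNSAFE_RE.search(line)
        if m:
            counts[(m["node"], m["service"])] += 1
    return counts


# Three lines copied from the excerpt above: two refusals and one
# duplicate-suppression notice that must be ignored.
SAMPLE = """
/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[user:info,2023-03-07T08:25:00.948-08:00,ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.
/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[user:info,2023-03-07T08:25:08.964-08:00,ns_1@172.23.96.198:<0.27073.0>:auto_failover:log_unsafe_node:670]Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.
/opt/couchbase/var/lib/couchbase/logs/info.log.1.gz:[ns_server:info,2023-03-07T08:25:08.965-08:00,ns_1@172.23.96.198:ns_log<0.25245.0>:ns_log:is_duplicate_log:156]suppressing duplicate log auto_failover:0([<<"Could not automatically fail over node ('ns_1@172.23.97.109') due to operation being unsafe for service index. Safety check failed.">>]) because it's been seen 1 times in the past 8.016095 secs (last seen 8.016095 secs ago
"""

if __name__ == "__main__":
    for (node, service), n in unsafe_failover_attempts(SAMPLE.splitlines()).items():
        print(f"{node} ({service}): {n} refused attempt(s)")
```

To run it against a real log, pass the lines of an uncompressed `info.log` in place of `SAMPLE.splitlines()`.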

The info.log on 172.23.97.109 shows these errors:

[ns_server:error,2023-03-07T08:24:46.533-08:00,ns_1@172.23.97.109:service_agent-index<0.30705.77>:service_agent:terminate:259]Terminating abnormally

[ns_server:error,2023-03-07T08:24:53.409-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match","61fe7b1db8796333"}] failed: {error,timeout}

[ns_server:error,2023-03-07T08:24:53.410-08:00,ns_1@172.23.97.109:service_status_keeper-index<0.13786.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status

[ns_server:error,2023-03-07T08:25:08.413-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match","61fe7b1db8796333"}] failed: {error,timeout}

[ns_server:error,2023-03-07T08:25:08.414-08:00,ns_1@172.23.97.109:service_status_keeper-index<0.13786.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status

[user:info,2023-03-07T08:25:21.336-08:00,ns_1@172.23.97.109:<0.5258.78>:menelaus_web_alerts_srv:global_alert:178]Warning: approaching low index resident percentage. Indexer RAM percentage on node "172.23.97.109" is 7%, which is under the threshold of 10%.

[ns_server:info,2023-03-07T08:25:23.328-08:00,ns_1@172.23.97.109:ns_config_rep<0.13579.0>:ns_config_rep:pull_one_node:421]Pulling config from: 'ns_1@172.23.97.66'

[ns_server:error,2023-03-07T08:25:23.417-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match","61fe7b1db8796333"}] failed: {error,timeout}

[ns_server:error,2023-03-07T08:25:23.418-08:00,ns_1@172.23.97.109:service_status_keeper-index<0.13786.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
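The getIndexStatus timeouts above recur at a steady cadence, which is easier to see once the timestamps are subtracted. The sketch below is only an illustration: the pattern is inferred from the log lines in this ticket, and `timeout_intervals` is a hypothetical helper, not part of any Couchbase tooling. It needs only the first physical line of each log entry.

```python
import re
from datetime import datetime

# Pattern inferred from the excerpts above (an assumption about the format).
TIMEOUT_RE = re.compile(
    r"\[ns_server:error,(?P<ts>[^,]+),[^\]]*rest_utils:get_json:\d+\]"
    r"Request to \(indexer\) getIndexStatus"
)


def timeout_intervals(lines):
    """Return the gaps, in seconds, between consecutive getIndexStatus timeouts."""
    stamps = [
        datetime.fromisoformat(m["ts"])
        for m in map(TIMEOUT_RE.search, lines)
        if m
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# Entry headers copied from the info.log excerpt on 172.23.97.109.
SAMPLE = [
    '[ns_server:error,2023-03-07T08:24:53.409-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",',
    '[ns_server:error,2023-03-07T08:24:53.410-08:00,ns_1@172.23.97.109:service_status_keeper-index<0.13786.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status',
    '[ns_server:error,2023-03-07T08:25:08.413-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",',
    '[ns_server:error,2023-03-07T08:25:23.417-08:00,ns_1@172.23.97.109:service_status_keeper_worker<0.13783.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus with headers [{"If-None-Match",',
]

if __name__ == "__main__":
    # Each gap in this sample is about 15 seconds.
    print(timeout_intervals(SAMPLE))
```

For the three timeouts quoted here, each gap is about 15 seconds, so the indexer on 172.23.97.109 failed to answer at least three consecutive status requests over roughly 30 seconds.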

cbcollect ->

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.105.122.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.106.171.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.106.176.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.106.30.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.96.198.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.96.230.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.96.245.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.100.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.108.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.109.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.66.zip

         url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1678209145/collectinfo-2023-03-07T171226-ns_1%40172.23.97.67.zip

Attachments

720_5298-mutation_queue_size.png (70 kB, 02/May/23 9:57 AM)
720_5298-ts_queue_size.png (72 kB, 02/May/23 9:57 AM)
720_5304-mut_queue_size.png (68 kB, 02/May/23 9:56 AM)
720_5304-ts_queue_size.png (79 kB, 02/May/23 9:56 AM)
N97_109_CPU_utilisation_mortimer.png (836 kB, 03/Apr/23 12:23 AM)
N97_109_indexer_cprof.svg (170 kB, 03/Apr/23 12:23 AM)
N97_109_indexer_mprof.svg (126 kB, 03/Apr/23 12:23 AM)
N97_109_memoryRss_vs_Quota.png (588 kB, 03/Apr/23 12:25 AM)

Gerrit Reviews

For Gerrit Dashboard: MB-55879
#	Subject	Branch	Project	Status	CR	V
189417,1	MB-55879 Change default config for minVbQueueLength	neo	indexing	Status: MERGED	+2	+1

People

Assignee: Shivansh Rustagi
Reporter: Pavan PB
Votes: 0
Watchers: 12

Dates

Due: 12/Apr/23
Created: 08/Mar/23 12:28 AM
Updated: 07/May/23 10:36 PM
Resolved: 12/Apr/23 2:26 AM

[System Test] Autofailover did not go through because of safety check failure