Couchbase Server / MB-51319

Some nodes in the cluster are rendered unusable post cleanup


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown
    • KV 2022-Feb, KV March-22

    Description

      Script to Repro

      There is no particular test that reproduces this; cleanup can fail after any test, rendering one or more nodes in the cluster unusable.
      

      Test-framework logs from before the node became unusable:

      2022-03-06 13:49:41,844 | test  | INFO    | MainThread | [basetestcase:log_setup_status:647] ========= BaseTestCase setup started for test #5 test_data_load_collections_with_graceful_failover_rebalance_out =========
      2022-03-06 13:50:22,311 | test  | INFO    | MainThread | [rest_client:monitorRebalance:1610] Rebalance done. Taken 11.0540001392 seconds to complete
      2022-03-06 13:50:22,319 | test  | INFO    | MainThread | [common_lib:sleep:23] Sleep 5 seconds. Reason: Wait after rebalance complete
      2022-03-06 13:50:27,359 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:27,362 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:32,378 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:32,380 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:37,394 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:37,395 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:42,404 | test  | ERROR   | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
      2022-03-06 13:50:42,404 | test  | ERROR   | MainThread | [cluster_ready_functions:cleanup_cluster:232] Can't create rest connection after rebalance out for ejected nodes, will retry after 10 seconds according to MB-8430: Unable to reach the host @ 172.23.100.35
      2022-03-06 13:50:42,411 | test  | INFO    | MainThread | [common_lib:sleep:23] Sleep 10 seconds. Reason: MB-8430
      2022-03-06 13:50:52,420 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:52,421 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:57,436 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:57,437 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:51:02,453 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:51:02,456 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:51:07,463 | test  | ERROR   | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
      Traceback (most recent call last):
        File "pytests/basetestcase.py", line 363, in setUp
          self.cluster_util.cluster_cleanup(cluster,
        File "pytests/basetestcase.py", line 363, in setUp
          self.cluster_util.cluster_cleanup(cluster,
        File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 169, in cluster_cleanup
          self.cleanup_cluster(cluster, master=cluster.master)
        File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 237, in cleanup_cluster
          rest = RestConnection(removed)
        File "lib/membase/api/rest_client.py", line 319, in __init__
          raise ServerUnavailableException(self.ip)
      ServerUnavailableException: Unable to reach the host @ 172.23.100.35
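
      For context, here is a minimal sketch of the retry behaviour the log lines above suggest; it is reconstructed from the logs, not from the actual rest_client/cluster_ready_functions source, and the function names, retry counts and sleeps below are illustrative assumptions. Each ejected node is polled with GET /nodes/self; a 404 ("Node is unknown to this cluster.") triggers a 5-second sleep and retry, three failures mark the node as broken, and cleanup_cluster retries once more after a 10-second sleep (the MB-8430 workaround) before raising ServerUnavailableException.

      # Sketch only; approximates the behaviour visible in the logs above,
      # not the framework's real implementation.
      import time
      import requests

      class ServerUnavailableException(Exception):
          pass

      def wait_for_node_self(ip, auth=("Administrator", "password"),
                             retries=3, delay=5):
          url = "http://%s:8091/nodes/self" % ip
          for _ in range(retries):
              resp = requests.get(url, auth=auth, timeout=30)
              if resp.status_code == 200:
                  return resp.json()
              # e.g. 404 "Node is unknown to this cluster." right after eject
              print("Error %s, %d seconds sleep before retry" % (resp.reason, delay))
              time.sleep(delay)
          print("Node %s:8091 is in a broken state!" % ip)
          raise ServerUnavailableException(ip)

      def connect_to_ejected_node(ip):
          try:
              return wait_for_node_self(ip)
          except ServerUnavailableException:
              # MB-8430 workaround seen in cleanup_cluster: one more attempt
              # after a 10-second sleep; the second failure propagates.
              time.sleep(10)
              return wait_for_node_self(ip)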
      

      ns_server log excerpt from node 172.23.100.35:

      [ns_server:debug,2022-03-06T21:12:03.979-08:00,ns_1@172.23.100.35:<0.17157.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
      [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
      [{ets,lookup,
            [ns_server_stats,{c,{<<"rest_request_enters">>,[]}}],
            [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
       {ns_server_stats,'-report_ns_server_lc_stats/1-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,257}]},
       {lists,foreach,2,[{file,"lists.erl"},{line,1342}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,180}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]system stats reporting exception: exit:{noproc,
                                              {gen_server,call,
                                               [ns_server_stats,get_stats]}}
      [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
       {ns_server_stats,report_system_stats,1,
                        [{file,"src/ns_server_stats.erl"},{line,188}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,182}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:debug,2022-03-06T21:12:04.110-08:00,ns_1@172.23.100.35:<0.17158.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
      [ns_server:error,2022-03-06T21:12:04.324-08:00,ns_1@172.23.100.35:<0.17166.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
      [{ets,safe_fixtable,
            [ns_server_stats,true],
            [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
       {ets,foldl,3,[{file,"ets.erl"},{line,625}]},
       {ns_server_stats,report_ns_server_hc_stats,1,
                        [{file,"src/ns_server_stats.erl"},{line,264}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,178}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:error,2022-03-06T21:12:04.651-08:00,ns_1@172.23.100.35:<0.17058.46>:menelaus_util:reply_server_error_before_close:210]Server error during processing: ["web request failed",
                                       {path,"/pools/default"},
                                       {method,'GET'},
                                       {type,exit},
                                       {what,
                                        {noproc,
                                         {gen_server,call,
                                          ['service_status_keeper-index',
                                           get_version]}}},
                                       {trace,
                                        [{gen_server,call,2,
                                          [{file,"gen_server.erl"},{line,239}]},
                                         {menelaus_web_pools,do_build_pool_info,4,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,211}]},
                                         {menelaus_web_pools,pool_info,6,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,106}]},
                                         {menelaus_web_pools,handle_pool_info_wait,
                                          5,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,118}]},
                                         {request_tracker,request,2,
                                          [{file,"src/request_tracker.erl"},
                                           {line,40}]},
                                         {menelaus_util,handle_request,2,
                                          [{file,"src/menelaus_util.erl"},
                                           {line,221}]},
                                         {mochiweb_http,headers,6,
                                          [{file,
                                            "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                           {line,153}]},
                                         {proc_lib,init_p_do_apply,3,
                                          [{file,"proc_lib.erl"},{line,226}]}]}]
      
      

      This may be another side effect of MB-49512, which is also hit frequently during cleanups when buckets are not dropped completely.
      cbcollect_info attached.
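
      For anyone triaging a node stuck in this state, a quick manual probe of the two REST endpoints that appear in the logs can help confirm the symptom. This is only a diagnostic sketch (host and credentials are placeholders), not part of the test suite.

      # Probe the endpoints seen in the logs: on the broken node, /nodes/self
      # keeps returning 404 "Node is unknown to this cluster." and the
      # /pools/default request fails server-side (see the noproc errors in
      # the ns_server log excerpt above).
      import requests

      NODE = "172.23.100.35"
      AUTH = ("Administrator", "password")

      for path in ("/nodes/self", "/pools/default"):
          url = "http://%s:8091%s" % (NODE, path)
          try:
              resp = requests.get(url, auth=AUTH, timeout=10)
              print(url, resp.status_code, resp.text[:200])
          except requests.RequestException as exc:
              print(url, "request failed:", exc)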

      Attachments

        1. consoleText_MB-51319.txt
          1.98 MB
        2. debug.log.1
          40.00 MB
        3. latest-diag.log
          1.52 MB
        4. memcached.log.000021.txt
          7.94 MB
        5. node35.gdb.log
          789 kB

        Activity

          Balakumaran.Gopal Balakumaran Gopal created issue -
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Summary changed from "Certain nodes are not accessible post cleanup of the cluster" to "Some nodes in the cluster are rendered unusable post cleanup"
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Description updated (minor wording change; the full old and new text duplicate the description above and are omitted)
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Attachment consoleText_MB-51319.txt [ 179076 ]
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Description updated (added the test-framework log excerpt; the full old and new text duplicate the description above and are omitted)
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Description +Script to Repro+
          {noformat}
          There is not a particular test that repro's this. Basically cleanup can fail for any test rendering one or more nodes in the cluster unusable.
          {noformat}

          Sequence of steps that led to the node being unusable.

          {noformat}
          2022-03-06 13:49:41,844 | test | INFO | MainThread | [basetestcase:log_setup_status:647] ========= BaseTestCase setup started for test #5 test_data_load_collections_with_graceful_failover_rebalance_out =========
          2022-03-06 13:50:22,311 | test | INFO | MainThread | [rest_client:monitorRebalance:1610] Rebalance done. Taken 11.0540001392 seconds to complete
          2022-03-06 13:50:22,319 | test | INFO | MainThread | [common_lib:sleep:23] Sleep 5 seconds. Reason: Wait after rebalance complete
          2022-03-06 13:50:27,359 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:27,362 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:50:32,378 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:32,380 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:50:37,394 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:37,395 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:50:42,404 | test | ERROR | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
          2022-03-06 13:50:42,404 | test | ERROR | MainThread | [cluster_ready_functions:cleanup_cluster:232] Can't create rest connection after rebalance out for ejected nodes, will retry after 10 seconds according to MB-8430: Unable to reach the host @ 172.23.100.35
          2022-03-06 13:50:42,411 | test | INFO | MainThread | [common_lib:sleep:23] Sleep 10 seconds. Reason: MB-8430
          2022-03-06 13:50:52,420 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:52,421 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:50:57,436 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:57,437 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:51:02,453 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:51:02,456 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:51:07,463 | test | ERROR | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
          Traceback (most recent call last):
            File "pytests/basetestcase.py", line 363, in setUp
              self.cluster_util.cluster_cleanup(cluster,
            File "pytests/basetestcase.py", line 363, in setUp
              self.cluster_util.cluster_cleanup(cluster,
            File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 169, in cluster_cleanup
              self.cleanup_cluster(cluster, master=cluster.master)
            File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 237, in cleanup_cluster
              rest = RestConnection(removed)
            File "lib/membase/api/rest_client.py", line 319, in __init__
              raise ServerUnavailableException(self.ip)
          ServerUnavailableException: Unable to reach the host @ 172.23.100.35
          {noformat}

          +172.23.100.35+
          {noformat}
          [ns_server:debug,2022-03-06T21:12:03.979-08:00,ns_1@172.23.100.35:<0.17157.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
          [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
          [{ets,lookup,
                [ns_server_stats,{c,{<<"rest_request_enters">>,[]}}],
                [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
           {ns_server_stats,'-report_ns_server_lc_stats/1-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,257}]},
           {lists,foreach,2,[{file,"lists.erl"},{line,1342}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,180}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]system stats reporting exception: exit:{noproc,
                                                  {gen_server,call,
                                                   [ns_server_stats,get_stats]}}
          [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
           {ns_server_stats,report_system_stats,1,
                            [{file,"src/ns_server_stats.erl"},{line,188}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,182}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          [ns_server:debug,2022-03-06T21:12:04.110-08:00,ns_1@172.23.100.35:<0.17158.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
          [ns_server:error,2022-03-06T21:12:04.324-08:00,ns_1@172.23.100.35:<0.17166.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
          [{ets,safe_fixtable,
                [ns_server_stats,true],
                [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
           {ets,foldl,3,[{file,"ets.erl"},{line,625}]},
           {ns_server_stats,report_ns_server_hc_stats,1,
                            [{file,"src/ns_server_stats.erl"},{line,264}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,178}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          [ns_server:error,2022-03-06T21:12:04.651-08:00,ns_1@172.23.100.35:<0.17058.46>:menelaus_util:reply_server_error_before_close:210]Server error during processing: ["web request failed",
                                           {path,"/pools/default"},
                                           {method,'GET'},
                                           {type,exit},
                                           {what,
                                            {noproc,
                                             {gen_server,call,
                                              ['service_status_keeper-index',
                                               get_version]}}},
                                           {trace,
                                            [{gen_server,call,2,
                                              [{file,"gen_server.erl"},{line,239}]},
                                             {menelaus_web_pools,do_build_pool_info,4,
                                              [{file,"src/menelaus_web_pools.erl"},
                                               {line,211}]},
                                             {menelaus_web_pools,pool_info,6,
                                              [{file,"src/menelaus_web_pools.erl"},
                                               {line,106}]},
                                             {menelaus_web_pools,handle_pool_info_wait,
                                              5,
                                              [{file,"src/menelaus_web_pools.erl"},
                                               {line,118}]},
                                             {request_tracker,request,2,
                                              [{file,"src/request_tracker.erl"},
                                               {line,40}]},
                                             {menelaus_util,handle_request,2,
                                              [{file,"src/menelaus_util.erl"},
                                               {line,221}]},
                                             {mochiweb_http,headers,6,
                                              [{file,
                                                "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                               {line,153}]},
                                             {proc_lib,init_p_do_apply,3,
                                              [{file,"proc_lib.erl"},{line,226}]}]}]

          {noformat}

          Maybe it is another side effect of MB-49512 which is also hit frequently during cleanups related to bucket not being dropped completely.
          cbcollect_info attached.
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Labels affects-neo-testing functional-test affects-neo-testing functional-test magma
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Assignee Balakumaran Gopal [ balakumaran.gopal ] Daniel Owen [ owend ]
          owend Daniel Owen made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          owend Daniel Owen added a comment - - edited

          Hi Balakumaran Gopal,

          On .35, in memcached.log, we see the last delete of the "default" bucket:

          2022-03-06T11:54:09.830406-08:00 INFO <none>: Delete bucket [default]. Notifying engine
          2022-03-06T11:54:09.837324-08:00 INFO <none>: Delete bucket [default]. Engine ready for shutdown
          2022-03-06T11:54:09.837327-08:00 INFO <none>: Delete bucket [default]. Wait for 6 clients to disconnect
          2022-03-06T11:54:09.838533-08:00 INFO <none>: Delete bucket [default]. Shut down the bucket
          2022-03-06T11:54:14.502380-08:00 INFO (default) Deleted KvBucket.
          2022-03-06T11:54:16.056152-08:00 INFO <none>: Delete bucket [default]. Clean up allocated resources
          2022-03-06T11:54:16.056200-08:00 INFO <none>: Delete bucket [default] complete
          

          Looking at the 2022-03-07/172.23.100.35.zip logs, the last create of the "default" bucket is:

          2022-03-06T11:55:30.125015-08:00 INFO 96: Create Couchbase bucket [default]
          

          It appears Warmup completed successfully

          2022-03-06T11:55:30.147945-08:00 INFO (default) Warmup::processCreateVBucketsComplete unblocking 1 cookie(s)
          2022-03-06T11:55:30.148367-08:00 INFO (default) Warmup completed: 0 keys and 0 values loaded in 4231 us (0 keys/s), mem_used now at 4.588072 MB (1084.349639130889 MB/s)
          

          However, we can see from ns_server.debug.log that ns_server appears stuck:

          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T12"  | wc -l
                 0
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T13"  | wc -l
              3710
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T14"  | wc -l
             15766
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T15"  | wc -l
             14758
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T16"  | wc -l
             14758
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T17"  | wc -l
             14632
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T18"  | wc -l
             13969
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T19" | wc -l
             13680
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T20" | wc -l
             13425
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T21" | wc -l
              5716
          
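          For reference, the same per-hour tally can be produced in a single pass; a minimal sketch, assuming GNU grep is available on the node:

          # count occurrences of the ensure_bucket message per date+hour
          grep 'Bucket "default" not found during ensure_bucket' ns_server.debug.log | grep -o '2022-03-06T[0-9][0-9]' | sort | uniq -c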

          From the KV side it appears traffic has been enabled:

          2022-03-06T11:55:31.106216-08:00 INFO (default) EventuallyPersistentEngine::enableTraffic() result true
          2022-03-06T11:55:31.106224-08:00 INFO (default) EventuallyPersistentEngine::handleTrafficControlCmd() Data traffic to persistence engine is enabled
          

          owend Daniel Owen added a comment -

          In the ns_server.debug.log we are seeing:

          [ns_server:error,2022-03-06T21:25:34.012-08:00,ns_1@172.23.100.35:<0.8764.47>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
          [{ets,lookup,
                [ns_server_stats,{c,{<<"rest_request_enters">>,[]}}],
                [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
           {ns_server_stats,'-report_ns_server_lc_stats/1-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,257}]},
           {lists,foreach,2,[{file,"lists.erl"},{line,1342}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,180}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          [ns_server:error,2022-03-06T21:25:34.012-08:00,ns_1@172.23.100.35:<0.8764.47>:ns_server_stats:report_prom_stats:172]system stats reporting exception: exit:{noproc,
                                                  {gen_server,call,
                                                   [ns_server_stats,get_stats]}}
          [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
           {ns_server_stats,report_system_stats,1,
                            [{file,"src/ns_server_stats.erl"},{line,188}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,182}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          

          The issue appears similar to MB-50962, where there was also a problem deleting a bucket, and we do see:

          [ns_server:info,2022-03-06T11:54:09.160-08:00,ns_1@172.23.100.35:ns_memcached-default<0.1374.23>:ns_memcached:delete_bucket:876]Deleting bucket "default" from memcached (force = true)
          [ns_server:error,2022-03-06T11:54:09.163-08:00,ns_1@172.23.100.35:ns_memcached-default<0.1374.23>:ns_memcached:delete_bucket:883]Failed to delete bucket "default": {error,{badmatch,{error,closed}}}
          

          However, the underlying cause looks to be different, as in this case KV believes it has successfully deleted the bucket.
          Currently this looks to be an ns_server issue.
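
          Both exceptions point at the ns_server_stats process (and its ETS table of the same name) no longer existing; a hypothetical way to confirm that, following the diag/eval pattern used later in this ticket:

          # hypothetical checks: both should return 'undefined' if the process/table is gone
          curl -s localhost:8091/diag/eval -u Administrator:password -d "whereis(ns_server_stats)"
          curl -s localhost:8091/diag/eval -u Administrator:password -d "ets:info(ns_server_stats, size)"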

          owend Daniel Owen made changes -
          Component/s couchbase-bucket [ 10173 ]
          Component/s ns_server [ 10019 ]
          owend Daniel Owen made changes -
          Status In Progress [ 3 ] Open [ 1 ]
          owend Daniel Owen made changes -
          Assignee Daniel Owen [ owend ] Dave Finlay [ dfinlay ]
          dfinlay Dave Finlay added a comment - - edited

          The problem looks to be that KV hasn't returned from deleting bucket2 after .35 was failed over at 12:56:14. (These logs are in latest-diag.log which I manually grabbed from the cluster and will attach.)

          2022-03-06T12:56:13.974-08:00, failover:0:info:message(ns_1@172.23.105.164) - Starting failing over ['ns_1@172.23.100.35']
          2022-03-06T12:56:13.974-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.164) - Starting failover of nodes ['ns_1@172.23.100.35']. Operation Id = 29f343f24f75cfd315eb971c4e193d8d
          2022-03-06T12:56:14.311-08:00, ns_memcached:0:info:message(ns_1@172.23.100.35) - Shutting down bucket "bucket1" on 'ns_1@172.23.100.35' for deletion
          2022-03-06T12:56:14.455-08:00, ns_memcached:0:info:message(ns_1@172.23.100.35) - Shutting down bucket "bucket2" on 'ns_1@172.23.100.35' for deletion
          2022-03-06T12:56:14.590-08:00, failover:0:info:message(ns_1@172.23.105.164) - Failed over ['ns_1@172.23.100.35']: ok
          2022-03-06T12:56:14.596-08:00, failover:0:info:message(ns_1@172.23.105.164) - Deactivating failed over nodes ['ns_1@172.23.100.35']
          2022-03-06T12:56:14.762-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.164) - Failover completed successfully.
          

          We can see that bucket shutdown has been initiated for bucket2.

          30 seconds later a "slow bucket stop" diag trace is triggered and we can see bucket2 is waiting for a response from memcached for the delete_bucket command. Note that the name of the process is 'ns_memcached-bucket2' with pid <0.3068.0>. (These traces are from debug.log.1 which I manually grabbed from the cluster and will attach.)

          [ns_server:error,2022-03-06T12:56:44.618-08:00,ns_1@172.23.100.35:<0.32666.16>:timeout_diag_logger:do_diag:123]
          {<0.3068.0>,
           [{backtrace,[<<"Program counter: 0x00007efd8184c578 (prim_inet:recv0/3 + 240)">>,
                        <<>>,<<"0x00007efcc046b8b0 53">>,<<>>,
                        <<"0x00007efcc046b8b8 #Port<0.201>">>,<<>>,
                        <<"0x00007efcc046b8c0 Return addr 0x00007efccae86f2c (mc_binary:recv/3 + 228)">>,
                        <<"y(0)     infinity">>,<<"y(1)     res">>,
                        <<"y(2)     #Port<0.201>">>,<<>>,
                        <<"0x00007efcc046b8e0 Return addr 0x00007efccae7b300 (mc_client_binary:cmd_vocal_recv/5 + 144)">>,
                        <<"y(0)     []">>,<<"y(1)     undefined">>,
                        <<"y(2)     undefined">>,<<"y(3)     134">>,<<>>,
                        <<"0x00007efcc046b908 Return addr 0x00007efccae7c1f4 (mc_client_binary:delete_bucket/3 + 420)">>,
                        <<"y(0)     []">>,<<"y(1)     []">>,<<>>,
                        <<"0x00007efcc046b920 Return addr 0x00007efccacc612c (ns_memcached:delete_bucket/4 + 372)">>,
                        <<"y(0)     false">>,<<"y(1)     []">>,
                        <<"y(2)     \"bucket2\"">>,<<"y(3)     []">>,
                        <<"y(4)     Catch 0x00007efccacc6157 (ns_memcached:delete_bucket/4 + 415)">>,
                        <<"y(5)     Catch 0x00007efccacc62a4 (ns_memcached:delete_bucket/4 + 748)">>,
                        <<>>,
                        <<"0x00007efcc046b958 Return addr 0x00007efccacc5a80 (ns_memcached:terminate/2 + 232)">>,
                        <<"y(0)     []">>,<<"y(1)     []">>,<<"y(2)     \"bucket2\"">>,
                        <<"y(3)     Catch 0x00007efccacc5b3f (ns_memcached:terminate/2 + 423)">>,
                        <<>>,
          ...
            {messages,[<<"{'$gen_call',{<0.31320.16>,[alias|#Ref<0.1760518760.1410138119.215166>]},warmed}">>,
                       <<"check_config">>,
                       <<"{'$gen_call',{<0.31792.16>,[alias|#Ref<0.1760518760.1409875976.139592>]},warmed}">>,
                       <<"{'$gen_call',{<0.32230.16>,[alias|#Ref<0.1760518760.1411448833.201399>]},warmed}">>]},
           {dictionary,[{'$initial_call',{ns_memcached,init,1}},
                         {'$ancestors',[<0.3067.0>,'single_bucket_kv_sup-bucket2',
                                        ns_bucket_sup,ns_bucket_worker_sup,
                                        ns_server_sup,ns_server_nodes_sup,<0.1752.0>,
                                        ns_server_cluster_sup,root_sup,<0.145.0>]}]},
            {registered_name,'ns_memcached-bucket2'},
          

          Approx 45 mins later .35 is added back into the cluster. This fails because we're still waiting to delete bucket2:

          2022-03-06T13:42:49.784-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.164) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.206','ns_1@172.23.106.177',
                                           'ns_1@172.23.100.35','ns_1@172.23.105.164'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.100.35'],  Delta recovery b\
          uckets = all; Operation Id = e3e3a85f0e65f39a1b9eae473faeeed4
          2022-03-06T13:44:49.789-08:00, ns_rebalancer:0:critical:message(ns_1@172.23.105.164) - Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.100.35',
                                                                   {'EXIT',
                                                                    {old_buckets_shutdown_wait_failed,
                                                                     ["default",
                                                                      "bucket2"]}}}]
          

          We know it's still the same process that's waiting to terminate, and that it's stuck on the same backtrace, as I ran this diag/eval on the cluster just now:

          # date; curl -s localhost:8091/diag/eval -u Administrator:password -d "Pid = whereis('ns_memcached-bucket2')"
          Mon Mar  7 05:25:03 PST 2022
          <0.3068.0>
          ...
          # date; curl -s localhost:8091/diag/eval -u Administrator:password -d "Pid = whereis('ns_memcached-bucket2'), erlang:process_info(Pid, backtrace)"
          Mon Mar  7 05:23:30 PST 2022
          {backtrace,<<"Program counter: 0x00007efd8184c578 (prim_inet:recv0/3 + 240)\n\n0x00007efcc046b8b0 53\n\n0x00007efcc046b8b8 #Port<0.201>\n\n0x00007efcc046b8c0 Return addr 0x00007efccae86f2c (mc_binary:recv/3 + 228)\ny(0)     infinity\ny(1)     res\ny(2)     #Port<0.201>\n\n0x00007efcc046b8e0 Return addr 0x00007efccae7b300 (mc_client_binary:cmd_vocal_recv/5 + 144)\ny(0)     []\ny(1)     undefined\ny(2)     undefined\ny(3)     134\n\n0x00007efcc046b908 Return addr 0x00007efccae7c1f4 (mc_client_binary:delete_bucket/3 + 420)\ny(0)     []\ny(1)     []\n\n0x00007efcc046b920 Return addr 0x00007efccacc612c (ns_memcached:delete_bucket/4 + 372)\ny(0)     false\ny(1)     []\ny(2)     \"bucket2\"\ny(3)     []\ny(4)     Catch 0x00007efccacc6157 (ns_memcached:delete_bucket/4 + 415)\ny(5)     Catch 0x00007efccacc62a4 (ns_memcached:delete_bucket/4 + 748)\n\n0x00007efcc046b958 Return addr 0x00007efccacc5a80 (ns_memcached:terminate/2 + 232)\ny(0)     []\ny(1)     []\ny(2)     \"bucket2\"\ny(3)     Catch 0x00007efccacc5b3f (ns_memcached:terminate/2 + 423)\n\n0x00007efcc046b980 Return addr 0x00007efd81a120f4 (gen_server:try_terminate/3 + 332)\ny(0)     []\ny(1)     []\ny(2)     []\ny(3)     Catch 0x00007efd81a1214a (gen_server:try_terminate/3 + 418)\n\n0x00007efcc046b9a8 Return addr 0x00007efd81a14738 (gen_server:terminate/10 + 192)\ny(0)     []\ny(1)     []\ny(2)     []\ny(3)     {state,0,0,0,{[],[]},{[],[]},{[],[]},warmed,{1646,596522,559857},\"bucket2\",[collections,json],#Port<0.201>,[{<0.3102.0>,#Ref<0.1760518760.1411383297.200086>},{<0.3099.0>,#Ref<0.1760518760.1409548295.56458>},{<0.3101.0>,#Ref<0.1760518760.1409548293.57702>},{<0.3100.0>,#Ref<0.1760518760.1409548295.56455>}],[],<0.3069.0>,false,10000}\ny(4)     ns_memcached\ny(5)     {'EXIT',<0.3067.0>,shutdown}\ny(6)     undefined\ny(7)     'ns_memcached-bucket2'\ny(8)     shutdown\ny(9)     [{gen_server,decode_msg,9,[{file,\"gen_server.erl\"},{line,481}]},{proc_lib,init_p_do_apply,3,[{file,\"proc_lib.erl\"},{line,226}]}]\ny(10)    shutdown\ny(11)    exit\n\n0x00007efcc046ba10 Return addr 0x00007efd819416d0 (proc_lib:init_p_do_apply/3 + 192)\ny(0)     []\ny(1)     []\ny(2)     Catch 0x00007efd819416f0 (proc_lib:init_p_do_apply/3 + 224)\n\n0x00007efcc046ba30 Return addr 0x00007efd818308c4 (unknown function)\n\n0x00007efcc046ba38 Return addr 0x00007efd818308c8 (<terminate process normally>)\n">>}
          

          The reason the node isn't accessible is that it has been removed from the cluster and has only partly shut down.

          In memcached we see similar evidence: bucket1 is deleted, and the deletion of bucket2 begins but never completes. (The log traces are from memcached.log.00021 which I manually grabbed and will attach.)

          2022-03-06T12:56:14.313614-08:00 INFO 2663: Delete bucket [bucket1]. Wait for 254 clients to disconnect
          2022-03-06T12:56:14.405602-08:00 INFO 2663: Delete bucket [bucket1] complete
          2022-03-06T12:56:14.456964-08:00 INFO 67: Delete bucket [bucket2].
          

          Will assign to KV to look at the deletion of bucket2. (Note the cluster is still up so it may be possible to get a backtrace from Memcached to understand where the delete bucket is stuck.)

          dfinlay Dave Finlay made changes -
          Assignee Dave Finlay [ dfinlay ] Daniel Owen [ owend ]
          dfinlay Dave Finlay made changes -
          Component/s couchbase-bucket [ 10173 ]
          dfinlay Dave Finlay made changes -
          Attachment latest-diag.log [ 179095 ]
          Attachment debug.log.1 [ 179094 ]
          Attachment memcached.log.000021.txt [ 179093 ]
          owend Daniel Owen added a comment - - edited

          Thank you Dave Finlay for your help and analysis.
          Looking at what happens to bucket2 on .35 we see

          2022-03-06T12:56:14.455819-08:00 INFO 67: Delete bucket [bucket2]. Notifying engine
          2022-03-06T12:56:14.455836-08:00 INFO (bucket2) Shutting down all DCP connections in preparation for bucket deletion.
          2022-03-06T12:56:14.455840-08:00 INFO (bucket2) Shutting down dcp connections!
          2022-03-06T12:56:14.455885-08:00 INFO 1433: (bucket2) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.35->ns_1@172.23.106.177:bucket2 - Destroying connection. Created 3646 s ago. Paused 18501 times, for 1h:0m:46s total. Details: {"Initializing": {"count":1, "duration":"55 us"},"ReadyListEmpty": {"count":18500, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.455934-08:00 INFO 1438: (bucket2) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.35->ns_1@172.23.105.206:bucket2 - Destroying connection. Created 3646 s ago. Paused 17701 times, for 1h:0m:46s total. Details: {"Initializing": {"count":1, "duration":"36 us"},"ReadyListEmpty": {"count":17700, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.455977-08:00 INFO 1436: (bucket2) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.35->ns_1@172.23.105.164:bucket2 - Destroying connection. Created 3646 s ago. Paused 16684 times, for 1h:0m:46s total. Details: {"Initializing": {"count":1, "duration":"37 us"},"ReadyListEmpty": {"count":16683, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.456010-08:00 INFO 1442: (bucket2) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.106.177->ns_1@172.23.100.35:bucket2 - Destroying connection. Created 3646 s ago. Paused 13323 times, for 1h:0m:46s total. Details: {"ReadyListEmpty": {"count":13323, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.456038-08:00 INFO 1432: (bucket2) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.206->ns_1@172.23.100.35:bucket2 - Destroying connection. Created 3646 s ago. Paused 13312 times, for 1h:0m:46s total. Details: {"ReadyListEmpty": {"count":13312, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.456060-08:00 INFO 1405: (bucket2) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:bucket2 - Destroying connection. Created 3646 s ago. Paused 13763 times, for 1h:0m:47s total. Details: {"ReadyListEmpty": {"count":13763, "duration":"1h:0m:47s"}}
          2022-03-06T12:56:14.456946-08:00 INFO 67: Delete bucket [bucket2]. Engine ready for shutdown
          2022-03-06T12:56:14.456964-08:00 INFO 67: Delete bucket [bucket2]. Wait for 255 clients to disconnect
          2022-03-06T12:56:14.461984-08:00 INFO 67: Delete bucket [bucket2]. Shut down the bucket
          2022-03-06T12:56:14.461999-08:00 INFO (bucket2) Shutting down dcp connections!
          2022-03-06T12:56:14.462005-08:00 INFO (bucket2) Attempting to stop flusher:0
          2022-03-06T12:56:14.462054-08:00 INFO (bucket2) Attempting to stop flusher:1
          2022-03-06T12:56:14.462057-08:00 INFO (bucket2) Attempting to stop flusher:2
          2022-03-06T12:56:14.462062-08:00 INFO (bucket2) Attempting to stop flusher:3
          2022-03-06T12:56:14.462131-08:00 INFO (bucket2) Attempting to stop flusher:4
          2022-03-06T12:56:14.462135-08:00 INFO (bucket2) Attempting to stop flusher:5
          2022-03-06T12:56:14.462140-08:00 INFO (bucket2) Attempting to stop flusher:6
          2022-03-06T12:56:14.462146-08:00 INFO (bucket2) Attempting to stop flusher:7
          2022-03-06T12:56:14.462161-08:00 INFO (bucket2) Stopping bg fetchers
          2022-03-06T12:56:14.462760-08:00 INFO (bucket2) KVBucket::deinitialize forceShutdown:true
          2022-03-06T12:56:14.475861-08:00 INFO (bucket2) MagmaKVStore: 0 deinitializing
          2022-03-06T12:56:14.475878-08:00 INFO [(bucket2) magma_0]Closing magma db (thread pool refcount 16)
          2022-03-06T12:56:14.478263-08:00 INFO [(bucket2) magma_0/kvstore-0/rev-000000001]KVStore::Shutdown
          ..
          2022-03-06T12:56:14.664401-08:00 INFO [(bucket2) magma_0]Completed closing magma db
          2022-03-06T12:56:14.665002-08:00 INFO (bucket2) MagmaKVStore: 0 deinitialized
          2022-03-06T12:56:14.665011-08:00 INFO (bucket2) MagmaKVStore: 1 deinitializing
          2022-03-06T12:56:14.665017-08:00 INFO [(bucket2) magma_1]Closing magma db (thread pool refcount 15)
          2022-03-06T12:56:14.666930-08:00 INFO [(bucket2) magma_1/kvstore-1/rev-000000001]KVStore::Shutdown
          ..
          2022-03-06T12:56:14.929878-08:00 INFO [(bucket2) magma_2]Completed closing magma db
          2022-03-06T12:56:14.930431-08:00 INFO (bucket2) MagmaKVStore: 2 deinitialized
          2022-03-06T12:56:14.930446-08:00 INFO (bucket2) MagmaKVStore: 3 deinitializing
          2022-03-06T12:56:14.930450-08:00 INFO [(bucket2) magma_3]Closing magma db (thread pool refcount 13)
          2022-03-06T12:56:14.932092-08:00 INFO [(bucket2) magma_3/kvstore-3/rev-000000001]KVStore::Shutdown
          

          We DO NOT see

          INFO [(bucket2) magma_3]Completed closing magma db
          INFO (bucket2) MagmaKVStore: 3 deinitialized
          

          owend Daniel Owen made changes -
          Component/s ns_server [ 10019 ]
          Component/s storage-engine [ 10175 ]
          owend Daniel Owen made changes -
          Assignee Daniel Owen [ owend ] Sarath Lakshman [ sarath ]
          owend Daniel Owen made changes -
          Assignee Sarath Lakshman [ sarath ] Paolo Cocchi [ paolo.cocchi ]
          owend Daniel Owen added a comment -

          Quick grep of the code shows:

          magma/db.cc:        LOG_INFO("Completed closing magma db");
          

          It looks like the issue is that we never reach "Completed", i.e. the magma db for shard 3 is never fully closed.
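
          A quick way to confirm which shard never finished is to compare how many shard closes were started versus completed; a hypothetical check against the memcached log attached earlier in this ticket:

          # hypothetical: shard closes started vs completed for bucket2
          grep -c '(bucket2) magma_[0-9]*]Closing magma db' memcached.log.000021.txt
          grep -c '(bucket2) magma_[0-9]*]Completed closing magma db' memcached.log.000021.txt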

          owend Daniel Owen added a comment -

          Hey Paolo Cocchi, could you take a look at the live cluster and do a gdb attach on 172.23.100.35 to see where we are hanging?

          thanks

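          For reference, a capture along these lines would produce a thread-backtrace file like the node35.gdb.log attached below; this is only a sketch, assuming gdb and the memcached debug symbols are installed on the node:

          # attach to the running memcached, dump all thread backtraces, then exit
          gdb -p "$(pgrep -x memcached)" -batch -ex "thread apply all bt" > node35.gdb.log 2>&1
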
          paolo.cocchi Paolo Cocchi made changes -
          Attachment node35.gdb.log [ 179104 ]
          paolo.cocchi Paolo Cocchi added a comment - - edited

          Last logging for bucket2:

          ..
          2022-03-06T12:56:14.929567-08:00 INFO [(bucket2) magma_2/kvstore-850/rev-000000001]KVStore::Shutdown
          2022-03-06T12:56:14.929878-08:00 INFO [(bucket2) magma_2]Completed closing magma db
          2022-03-06T12:56:14.930431-08:00 INFO (bucket2) MagmaKVStore: 2 deinitialized
          2022-03-06T12:56:14.930446-08:00 INFO (bucket2) MagmaKVStore: 3 deinitializing
          2022-03-06T12:56:14.930450-08:00 INFO [(bucket2) magma_3]Closing magma db (thread pool refcount 13)
          2022-03-06T12:56:14.932092-08:00 INFO [(bucket2) magma_3/kvstore-3/rev-000000001]KVStore::Shutdown
          2022-03-06T12:56:14.933891-08:00 INFO [(bucket2) magma_3/kvstore-11/rev-000000001]KVStore::Shutdown
          2022-03-06T12:56:14.935874-08:00 INFO [(bucket2) magma_3/kvstore-19/rev-000000001]KVStore::Shutdown
          2022-03-06T12:56:14.937778-08:00 INFO [(bucket2) magma_3/kvstore-27/rev-000000001]KVStore::Shutdown
          ..
          2022-03-06T12:56:15.054037-08:00 INFO [(bucket2) magma_3/kvstore-851/rev-000000001]KVStore::Shutdown
          

          So it seems that magma has just completed shutdown for shard-2, but shard-3 is still shutting down.

          Thread 25 seems stuck in magma::Magma::Impl::Close():

          Thread 25 (Thread 0x7f9e53fff700 (LWP 123256)):
          #0  0x00007f9ec7cb9a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1  0x00007f9ec5ea88bc in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x7f9ec01d8ba0)
              at /tmp/deploy/objdir/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
          #2  std::condition_variable::wait (this=this@entry=0x7f9ec01d8ba0, __lock=...) at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/condition_variable.cc:53
          #3  0x000000000098bfcc in magma::WaitGroup::Wait (this=this@entry=0x7f9ec01d8b68) at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/util/common.cc:74
          #4  0x000000000091e681 in magma::Magma::Impl::Close() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/magma/db.cc:204
          #5  0x000000000091e8c8 in magma::Magma::Close (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/magma/db.cc:216
          #6  0x0000000000887c1b in MagmaMemoryTrackingProxy::Close() ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/kvstore/magma-kvstore/magma-memory-tracking-proxy.cc:113
          #7  0x0000000000865d81 in MagmaKVStore::deinitialize (this=0x7f9ec016b400)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/kvstore/magma-kvstore/magma-kvstore.cc:734
          #8  0x000000000081a586 in operator() (shard=..., __closure=<synthetic pointer>)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/kvshard.h:72
          #9  forEachShard<EPBucket::deinitialize()::<lambda(KVShard&)> > (p=..., this=<optimized out>)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/vbucketmap.h:148
          #10 EPBucket::deinitialize (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:333
          #11 0x00000000006a30d2 in EventuallyPersistentEngine::destroyInner (this=0x7f9ec00f8000, force=<optimized out>)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_engine.cc:2256
          #12 0x00000000006a3162 in EventuallyPersistentEngine::destroy(bool) ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_engine.cc:207
          #13 0x00000000005c4eca in operator() (engine=<optimized out>, this=<optimized out>)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/include/memcached/engine.h:793
          #14 reset (__p=0x0, this=<optimized out>) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/unique_ptr.h:182
          #15 reset (__p=0x0, this=<optimized out>) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/unique_ptr.h:456
          #16 destroyEngine (force=true, this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/buckets.cc:62
          #17 BucketManager::destroy(Cookie*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool) ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/buckets.cc:436
          #18 0x000000000062cbbe in operator() (__closure=0x7f9e07b91420) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/char_traits.h:322
          #19 __invoke_impl<void, CreateRemoveBucketCommandContext::remove()::<lambda()>&> (__f=...) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/invoke.h:60
          #20 __invoke_r<void, CreateRemoveBucketCommandContext::remove()::<lambda()>&> (__fn=...) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/invoke.h:110
          #21 std::_Function_handler<void (), CreateRemoveBucketCommandContext::remove()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
              at /opt/gcc-10.2.0/include/c++/10.2.0/bits/std_function.h:291
          #22 0x000000000062b204 in operator() (this=0x7f9eb81b1608) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/std_function.h:248
          #23 OneShotTask::run (this=0x7f9eb81b1590) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/one_shot_task.h:50
          #24 0x0000000000a9fb89 in GlobalTask::execute(std::basic_string_view<char, std::char_traits<char> >) ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/globaltask.cc:98
          #25 0x0000000000a9927a in FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}::operator()() const (__closure=0x7f9e53fed650)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:309
          #26 0x0000000000aa0f3e in operator() (this=0x7f9e53fed650)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/cancellable_cpu_executor.cc:42
          #27 CancellableCPUExecutor::add(GlobalTask*, folly::Function<void ()>)::{lambda()#1}::operator()() const ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/cancellable_cpu_executor.cc:42
          #28 0x0000000000bf5a40 in operator() (this=0x7f9e53fed840)
              at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:416
          #29 folly::ThreadPoolExecutor::runTask(std::shared_ptr<folly::ThreadPoolExecutor::Thread> const&, folly::ThreadPoolExecutor::Task&&) (this=this@entry=0x7f9ec3f51d00, 
              thread=..., task=task@entry=<unknown type in /usr/lib/debug/opt/couchbase/bin/memcached-7.1.0-2440.x86_64.debug, CU 0xa389644, DIE 0xa40d579>)
              at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/ThreadPoolExecutor.cpp:97
          #30 0x0000000000be04da in folly::CPUThreadPoolExecutor::threadRun (this=0x7f9ec3f51d00, thread=...)
              at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/CPUThreadPoolExecutor.cpp:265
          #31 0x0000000000bf89f9 in __invoke_impl<void, void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__t=<optimized out>, __f=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:73
          #32 __invoke<void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__fn=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:95
          #33 __call<void, 0, 1> (__args=<optimized out>, this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:467
          #34 operator()<> (this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:551
          #35 folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) (p=...)
              at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:401
          #36 0x0000000000a98f74 in operator() (this=0x7f9ec43e1d40) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:49
          #37 operator() (__closure=0x7f9ec43e1d40) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:49
          #38 folly::detail::function::FunctionTraits<void ()>::callBig<CBRegisteredThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::detail::function::Data&) (p=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/server_build/tlm/deps/folly.exploded/include/folly/Function.h:401
          #39 0x00007f9ec5eadd40 in execute_native_thread_routine () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/thread.cc:80
          #40 0x00007f9ec7cb5ea5 in start_thread () from /lib64/libpthread.so.0
          #41 0x00007f9ec55f69fd in clone () from /lib64/libc.so.6
          

          void EPBucket::deinitialize() {
              stopFlusher();
              stopBgFetcher();
           
              KVBucket::deinitialize();
           
              // Perform a snapshot of the stats before shutting down so we can
              // persist the type of shutdown (stats.forceShutdown), and consequently
              // on the next warmup can determine if there was a clean shutdown - see
              // Warmup::cleanShutdown. This must be called after KVBucket::deinitialize()
              // to prevent a race condition with the StatSnap task as the deinitialize
              // function will cancel all running tasks.
              snapshotStats(true /*shuttingDown*/);
           
              // Now that we've stopped all of our tasks, stop any tasks the storage
              // layer may have created.
              vbMap.forEachShard([](KVShard& shard) {
                  shard.getRWUnderlying()->deinitialize();
              });
          }
          

          void MagmaKVStore::deinitialize() {
              logger->info("MagmaKVStore: {} deinitializing", configuration.getShardId());
           
              // Close shuts down all of the magma background threads (compaction is the
              // one that we care about here). The compaction callbacks require the magma
              // instance to exist so we must do this before we reset it.
              magma->Close();
           
              // Flusher should have already been stopped so it should be safe to destroy
              // the magma instance now
              magma.reset();
           
              logger->info("MagmaKVStore: {} deinitialized", configuration.getShardId());
          }
          

          void Magma::Impl::Close() {
              if (opened) {
                  LOG_INFO("Closing magma db (thread pool refcount {})",
                           threadPool.use_count());
                  heartbeatWg.Deactivate();
                  heartbeatToken.Cancel();
                  if (threadPool) {
                      threadPool->CancelTasks(threadPoolGroupId);
                  }
           
                  heartbeatWg.Wait();
           
                  kvstores.Shutdown(cfg.CheckpointOnShutdown, wal.get());
           
                  kvstoreWg.Wait();                      <--- !!
                  opened = false;
                  LOG_INFO("Completed closing magma db");
              }
              {
                  ExecutionEnvGuard envGuard(
                          ExecutionEnvRegistry::GetDefaultExecutionEnv());
                  threadPool.reset();
              }
          }
          

          That Wait() blocks on a condition variable internal to magma's WaitGroup; these are its semantics:

          KVStore::KVStore(KVStoreConfig& config,
                           Magma::KVStoreID kvID,
                           Magma::KVStoreRevision kvsRev)
              : cfg(config),
                kvID(kvID),
                kvsRev(kvsRev),
                purgeChecker(cfg.GetExpiryFragThreshold(), cfg.GetTimeNow),
                remover(config.GetConfig().FS,
                        config.GetThreadPool(),
                        config.GetThreadPoolGroupId(),
                        config.GetConfig().ExecutionEnv,
                        fileRemoverWg) {
          ..
              cfg.GetWaitGroup().Add(1);
          }
          

          KVStore::~KVStore() {
              // Wait for any pending flushes
              flusherWg.Deactivate();
              flusherWg.Wait();
           
              currSnapshot.reset();
              currDiskSnapshot.reset();
              keyIndex.reset();
              seqIndex.reset();
              localIndex.reset();
           
              fileRemoverWg.Deactivate();
              fileRemoverWg.Wait();
           
              {
                  auto cb = destroyCallback.lock();
                  if (*cb) {
                      LOG_INFO("KVStore::~KVStore Destroying {}", GetPath());
                      (*cb)(GetPath());
                  }
              }
           
              cfg.GetWaitGroup().Add(-1);
          }
          

          So it seems that we haven't reached the end of ~KVStore yet for at least one kvstore.
          Note that ~KVStore is in turn potentially stuck waiting for the completion of other operations, such as a flush or file removal.
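
          For reference, a minimal model of those Add()/Wait() semantics. This is a hypothetical sketch, not magma's actual WaitGroup implementation; it only illustrates why a Wait() on this group cannot return until every KVStore that called Add(1) in its constructor has called Add(-1) in its destructor.

          #include <condition_variable>
          #include <mutex>
          #include <thread>

          // Hypothetical stand-in for magma::WaitGroup, inferred from the KVStore
          // ctor/dtor quoted above; not the actual implementation.
          class WaitGroup {
          public:
              void Add(int delta) {
                  std::lock_guard<std::mutex> guard(mutex);
                  count += delta;
                  if (count == 0) {
                      cv.notify_all();
                  }
              }

              // Blocks on an internal condition variable until the count drops back
              // to zero, i.e. until every KVStore created against this config has
              // run its destructor (the Add(-1) above).
              void Wait() {
                  std::unique_lock<std::mutex> lock(mutex);
                  cv.wait(lock, [this] { return count == 0; });
              }

          private:
              std::mutex mutex;
              std::condition_variable cv;
              int count = 0;
          };

          int main() {
              WaitGroup wg;
              wg.Add(1);                              // KVStore constructor
              std::thread t([&wg] { wg.Add(-1); });   // ~KVStore running elsewhere
              wg.Wait();                              // Close()-style wait unblocks only after Add(-1)
              t.join();
          }

          Under this reading (assuming cfg.GetWaitGroup() hands back the same kvstoreWg that Close() waits on), kvstoreWg.Wait() in Magma::Impl::Close() only returns once ~KVStore has executed its final Add(-1) for every kvstore in the shard, so any outstanding shared_ptr to a kvstore keeps Close() blocked, which is exactly what the backtrace shows.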

          dfinlay Dave Finlay added a comment -

          Balakumaran Gopal, Ritam Sharma: my guess is the Magma folks may need to take a look at this live cluster. Please don't shut down the cluster for a while yet, until they get a chance to look at it.

          Balakumaran.Gopal Balakumaran Gopal added a comment - Sure Dave Finlay
          paolo.cocchi Paolo Cocchi made changes -
          Assignee Paolo Cocchi [ paolo.cocchi ] → Balakumaran Gopal [ balakumaran.gopal ]
          sarath Sarath Lakshman added a comment - - edited

          Thanks Paolo Cocchi.

          Looking at the shard, there are 65 kvstores and all of them have logged a shutdown message.

          # grep 'KVStore::Shutdown' /opt/couchbase/var/lib/couchbase/logs/memcached.log.000021.txt | grep bucket2 | grep magma_3  |wc -l
          65
          

          But the destructor has not executed for some of the kvstores, as their shared_ptrs are still held by somebody.

          Is there a possibility that kv-engine is still holding a seqtree iterator created through the NewSeqIterator API?

          owend Daniel Owen made changes -
          Assignee Balakumaran Gopal [ balakumaran.gopal ] → Srinath Duvuru [ srinath.duvuru ]
          sarath Sarath Lakshman made changes -
          Assignee Srinath Duvuru [ srinath.duvuru ] → Sarath Lakshman [ sarath ]
          owend Daniel Owen made changes -
          Sprint KV Post-Neo [ 2016 ]
          owend Daniel Owen made changes -
          Sprint KV Post-Neo [ 2016 ] → KV 2022-Feb [ 2002 ]
          owend Daniel Owen made changes -
          Rank Ranked higher
          owend Daniel Owen made changes -
          Sprint KV 2022-Feb [ 2002 ] → KV 2022-Feb, KV Post-Neo 2 [ 2002, 2050 ]
          paolo.cocchi Paolo Cocchi added a comment - - edited

          Hi Sarath Lakshman, as per offline chat, the backtrace doesn't show any backfill executing on any AuxIO thread, so that isn't expected to hold any magma::KVStore reference. Thanks

          owend Daniel Owen made changes -
          Component/s couchbase-bucket [ 10173 ]
          sarath Sarath Lakshman added a comment - - edited

          As part of MB-49512, we introduced cancellable tasks. During db shutdown, we cancel the tasks that are queued but have not yet been executed by the thread pool. This requires a cancel method to be set up for each task. When a purger task is scheduled, it holds a kvstore shared_ptr as part of the purger task context. Since no cancel method was set up for the purger task, the context is not reset when the task is cancelled during db shutdown. Magma::Close() then waits for all live kvstore refcounts to drop to zero.

          In this particular issue, a purger task was queued at the time of db shutdown and it ran into the above problem.
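
          To make the failure mode concrete, here is a purely illustrative sketch (Task, KVStoreStub and the cancel hook below are hypothetical names, not magma's actual API): a queued task whose closure owns a shared_ptr to a kvstore must drop that reference in its cancel hook; otherwise the kvstore outlives shutdown and the refcount wait in Close() never completes.

          #include <cassert>
          #include <functional>
          #include <memory>
          #include <vector>

          struct KVStoreStub {};  // stands in for a magma kvstore; purely illustrative

          // Illustrative cancellable task: run() does the work, cancel() must release
          // whatever context run() captured.
          struct Task {
              std::function<void()> run;
              std::function<void()> cancel;
          };

          int main() {
              std::vector<Task> queue;  // tasks scheduled but not yet executed
              std::weak_ptr<KVStoreStub> observer;

              {
                  auto kvstore = std::make_shared<KVStoreStub>();
                  observer = kvstore;

                  // Purger-like task: its context owns a shared_ptr to the kvstore.
                  auto ctx = std::make_shared<std::shared_ptr<KVStoreStub>>(kvstore);
                  queue.push_back(Task{[ctx] { /* purge using **ctx */ },
                                       [ctx] { ctx->reset(); }});  // fix: cancel drops the ref
              }

              // Shutdown: cancel queued tasks instead of running them. The task objects
              // (and their closures) may be retained for a while by the executor, so the
              // cancel hook is the only point where the captured kvstore reference is
              // guaranteed to be released.
              for (auto& task : queue) {
                  if (task.cancel) {
                      task.cancel();
                  }
              }

              // With the cancel hook the kvstore is destroyed and a Close()-style wait on
              // the kvstore refcount would return. Without it, the retained closure keeps
              // the kvstore alive and the wait blocks, as in the backtrace above.
              assert(observer.expired());
              return 0;
          }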

          srinath.duvuru Srinath Duvuru made changes -
          Due Date 10/Mar/22

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.0-2455 contains magma commit e499a90 with commit message:
          MB-51319 lsm: Make cancel method mandatory for scheduled tasks

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.0-2455 contains magma commit 4ef992f with commit message:
          MB-51319 tests: Add unit test for db shutdown during purging

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.0-2455 contains magma commit 42dbe00 with commit message:
          MB-51319 lsm: Fix setting up cancel method for purger task
          sarath Sarath Lakshman made changes -
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] → Resolved [ 5 ]

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1011 contains magma commit e499a90 with commit message:
          MB-51319 lsm: Make cancel method mandatory for scheduled tasks

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1011 contains magma commit 4ef992f with commit message:
          MB-51319 tests: Add unit test for db shutdown during purging

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1011 contains magma commit 42dbe00 with commit message:
          MB-51319 lsm: Fix setting up cancel method for purger task
          ritam.sharma Ritam Sharma made changes -
          Assignee Sarath Lakshman [ sarath ] → Balakumaran Gopal [ balakumaran.gopal ]

          Balakumaran.Gopal Balakumaran Gopal added a comment -

          Two runs are currently in progress. Even so, the weekly run over the weekend, where we run a couple of thousand jobs, would be a better way to validate this.

          http://172.23.121.80/job/temp_rebalance_magma_1/35/console
          http://172.23.121.80/job/temp_rebalance_magma/292/console

          Balakumaran.Gopal Balakumaran Gopal added a comment -

          Did not notice any occurrence of this issue in the runs we had on 7.1.0-2475.
          Logs: http://greenboard.sc.couchbase.com/#!/server/7.1.0/2475?platforms=CENTOS&bucket_storage=COUCH&features=MAGMA

          Marking this closed.
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Status Resolved [ 5 ] → Closed [ 6 ]

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal
            Votes: 0
            Watchers: 11


              Gerrit Reviews

                There are no open Gerrit changes
