Details
- Type: Bug
- Resolution: User Error
- Priority: Major
- Affects Version: 7.6.0
- Environment:
  - Operating System: Debian GNU/Linux 10 (buster)
  - Initial Version: Couchbase Enterprise Edition build 7.1.5-3878
  - Upgraded Version: Couchbase Enterprise Edition build 7.6.0-1980
- Triage: Untriaged
- Platform: Linux x86_64
Description
Steps to reproduce
- Created a 4-node cluster with the following services:
  - 172.23.108.224 - kv
  - 172.23.108.248 - kv, index, n1ql
  - 172.23.121.158 - kv, index, fts
  - 172.23.121.176 - kv, index
- Created a couchstore bucket named default and loaded some items into it
- Ran a few N1QL and search queries
- Gracefully failed over node 172.23.121.158
- Upgraded it to Couchbase Enterprise Edition build 7.6.0-1980
- Added it back using full recovery and started a rebalance; the rebalance succeeded
- Gracefully failed over node 172.23.121.176
- Upgraded it to Couchbase Enterprise Edition build 7.6.0-1980
- Added it back using full recovery
- The subsequent rebalance failed
The rebalance fails with:
2024-01-09T00:11:07.860-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.121.158) - Rebalance exited with reason {prepare_rebalance_failed, {error, {failed_nodes, [{'ns_1@172.23.121.176',{error,timeout}}]}}}.
Rebalance Operation Id = 9195d476d8c4f961b66ed05d9d3192b9
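For triage, the failed node and error can be pulled out of that orchestrator message mechanically. A minimal illustrative helper (not part of any Couchbase tooling; the message below is abridged to the reason term):

```python
import re

# Orchestrator message from the log above (abridged to the reason term).
MSG = ("Rebalance exited with reason {prepare_rebalance_failed, {error, "
       "{failed_nodes, [{'ns_1@172.23.121.176',{error,timeout}}]}}}.")

def failed_nodes(msg):
    # Each failed node appears as {'ns_1@<host>',{error,<reason>}}.
    return re.findall(r"\{'([^']+)',\{error,(\w+)\}\}", msg)

print(failed_nodes(MSG))  # [('ns_1@172.23.121.176', 'timeout')]
```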
No diag logs could be collected from node 172.23.121.176; the collection attempt failed with:
curl: (7) Failed to connect to 127.0.0.1 port 80 after 0 ms: Couldn't connect to server
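The curl error points at port 80, which is what an http:// URL resolves to when no port is given; ns_server's REST interface (including the diag endpoint) listens on 8091 by default, so the collection URL was presumably built without the REST port. A small illustration of the fallback:

```python
from urllib.parse import urlsplit

# With no explicit port, an http URL carries no port and clients fall back
# to the scheme default (80), matching the "port 80" in the curl error.
print(urlsplit("http://127.0.0.1").port)       # None -> curl defaults to 80
print(urlsplit("http://127.0.0.1:8091").port)  # 8091
```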
Observing ERROR REPORTS and CRASH REPORTS in babysitter logs on node 172.23.121.176
[ns_server:info,2024-01-09T00:01:03.729-08:00,babysitter_of_ns_1@cb.local:<0.133.0>:ns_port_server:log:226]ns_server<0.133.0>: 10278: Booted. Waiting for shutdown request
[ns_server:debug,2024-01-09T00:09:34.325-08:00,babysitter_of_ns_1@cb.local:goxdcr-goport<0.152.0>:goport:handle_eof:592]Stream 'stdout' closed
[ns_server:debug,2024-01-09T00:09:34.325-08:00,babysitter_of_ns_1@cb.local:goxdcr-goport<0.152.0>:goport:handle_eof:592]Stream 'stderr' closed
[ns_server:info,2024-01-09T00:09:34.327-08:00,babysitter_of_ns_1@cb.local:goxdcr-goport<0.152.0>:goport:handle_process_exit:573]Port exited with status 1.
[error_logger:error,2024-01-09T00:09:34.330-08:00,babysitter_of_ns_1@cb.local:<0.146.0>:ale_error_logger_handler:do_log:101]
=========================ERROR REPORT=========================
** Generic server <0.146.0> terminating
** Last message in was {<0.152.0>,{exit_status,1}}
** When Server state == {state,<0.152.0>,10459,
   {goxdcr,"/opt/couchbase/bin/goxdcr",
    ["-sourceKVAdminPort=8091","-xdcrRestPort=9998",
     "-isEnterprise=true","-ipv4=required","-ipv6=optional",
     "-caFile=/opt/couchbase/var/lib/couchbase/config/certs/ca.pem",
     "-clientCertFile=/opt/couchbase/var/lib/couchbase/config/certs/client_chain.pem",
     "-clientKeyFile=/opt/couchbase/var/lib/couchbase/config/certs/client_pkey.pem"],
    [via_goport,exit_status,stderr_to_stdout,
     {env,
      [{"GOTRACEBACK","single"},
       {"GOMAXPROCS","4"},
       {"CBAUTH_REVRPC_URL",
        {dynamic,
         {ns_ports_setup,remote_build_cbauth_revrpc_url,[goxdcr]}}}]}]},
   {ringbuffer,1488,1024,
    {[{<<"2024-01-09T00:09:34.321-08:00 WARN GOXDCR.Utils: GetAllMetadataFromCatalog(remoteCluster) took 15.501958857s\nMetadata service not available after 30"...>>, 160},
      {<<"2024-01-09T00:09:34.321-08:00 WARN GOXDCR.MetadataSvc: metakv.ListAllChildren failed. path=/remoteCluster/, err=Get \"http://127.0.0.1:8091/_meta"...>>, 354},
      {<<"2024-01-09T00:09:26.321-08:00 WARN GOXDCR.MetadataSvc: metakv.ListAllChildren failed. path=/remoteCluster/, err=Get \"http://127.0.0.1:8091/_"...>>, 487}],
     [{<<"2024-01-09T00:09:22.320-08:00 WARN GOXDCR.MetadataSvc: metakv.ListAllChildren failed. path=/remoteCluster/, err=Get \"http://127.0.0.1:8091/_meta"...>>, 487}]}},
   goxdcr,undefined,[],0}
** Reason for termination ==
** {abnormal,1}
[error_logger:error,2024-01-09T00:09:34.332-08:00,babysitter_of_ns_1@cb.local:<0.146.0>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
crasher:
  initial call: ns_port_server:init/1
  pid: <0.146.0>
  registered_name: []
  exception exit: {abnormal,1}
    in function gen_server:handle_common_reply/8 (gen_server.erl, line 1241)
  ancestors: [<0.145.0>,<0.144.0>,ns_child_ports_sup,ns_babysitter_sup, <0.115.0>]
  message_queue_len: 1
  messages: [{'EXIT',<0.152.0>,normal}]
  links: [<0.145.0>]
  dictionary: []
  trap_exit: true
  status: running
  heap_size: 10958
  stack_size: 28
  reductions: 60089
  neighbours:
[ns_server:info,2024-01-09T00:09:34.332-08:00,babysitter_of_ns_1@cb.local:<0.145.0>:supervisor_cushion:handle_info:66]Cushion managed supervisor for goxdcr failed: {abnormal,1}
[error_logger:error,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:<0.145.0>:ale_error_logger_handler:do_log:101]
=========================ERROR REPORT=========================
** Generic server <0.145.0> terminating
** Last message in was {die,{abnormal,1}}
** When Server state == {state,goxdcr,5000,-576460740445204035,undefined, infinity, #{always_delay => false}}
** Reason for termination ==
** {abnormal,1}
[error_logger:error,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:<0.145.0>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
crasher:
  initial call: supervisor_cushion:init/1
  pid: <0.145.0>
  registered_name: []
  exception exit: {abnormal,1}
    in function gen_server:handle_common_reply/8 (gen_server.erl, line 1241)
  ancestors: [<0.144.0>,ns_child_ports_sup,ns_babysitter_sup,<0.115.0>]
  message_queue_len: 0
  messages: []
  links: [<0.144.0>]
  dictionary: []
  trap_exit: true
  status: running
  heap_size: 6772
  stack_size: 28
  reductions: 11729
  neighbours:
[error_logger:error,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:<0.144.0>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
crasher:
  initial call: erlang:apply/2
  pid: <0.144.0>
  registered_name: []
  exception exit: {abnormal,1}
    in function restartable:loop/4 (src/restartable.erl, line 63)
  ancestors: [ns_child_ports_sup,ns_babysitter_sup,<0.115.0>]
  message_queue_len: 0
  messages: []
  links: [<0.134.0>]
  dictionary: []
  trap_exit: true
  status: running
  heap_size: 2586
  stack_size: 28
  reductions: 3083
  neighbours:
[error_logger:error,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:ns_child_ports_sup<0.134.0>:ale_error_logger_handler:do_log:101]
=========================SUPERVISOR REPORT=========================
supervisor: {local,ns_child_ports_sup}
errorContext: child_terminated
reason: {abnormal,1}
offender: [{pid,<0.144.0>},
           {id,{goxdcr,"/opt/couchbase/bin/goxdcr",
                ["-sourceKVAdminPort=8091","-xdcrRestPort=9998",
                 "-isEnterprise=true","-ipv4=required","-ipv6=optional",
                 "-caFile=/opt/couchbase/var/lib/couchbase/config/certs/ca.pem",
                 "-clientCertFile=/opt/couchbase/var/lib/couchbase/config/certs/client_chain.pem",
                 "-clientKeyFile=/opt/couchbase/var/lib/couchbase/config/certs/client_pkey.pem"],
                [via_goport,exit_status,stderr_to_stdout,
                 {env,
                  [{"GOTRACEBACK","single"},
                   {"GOMAXPROCS","4"},
                   {"CBAUTH_REVRPC_URL",
                    {dynamic,
                     {ns_ports_setup,remote_build_cbauth_revrpc_url,[goxdcr]}}}]},
                 {log,"goxdcr.log"}]}},
           {mfargs,
            {restartable,start_link,
             [{supervisor_cushion,start_link,
               [goxdcr,5000,infinity,ns_port_server,start_link,
                [#Fun<ns_child_ports_sup.2.17249174>]]},
              86400000]}},
           {restart_type,permanent},
           {significant,false},
           {shutdown,infinity},
           {child_type,worker}]
[ns_server:debug,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:<0.397.0>:supervisor_cushion:init:39]Starting supervisor cushion for goxdcr with delay of 5000
[error_logger:info,2024-01-09T00:09:34.336-08:00,babysitter_of_ns_1@cb.local:ale_dynamic_sup<0.96.0>:ale_error_logger_handler:do_log:101]
=========================PROGRESS REPORT=========================
supervisor: {local,ale_dynamic_sup}
started: [{pid,<0.399.0>},
          {id,'sink-goxdcr'},
          {mfargs,
           {ale_dynamic_sup,delay_death,
            [{ale_disk_sink,start_link,
              ['sink-goxdcr',
               "/opt/couchbase/var/lib/couchbase/logs/goxdcr.log",
               [{rotation,
                 [{compress,true},
                  {size,41943040},
                  {num_files,10},
                  {buffer_size_max,52428800}]}]]},
             1000]}},
          {restart_type,permanent},
          {significant,false},
          {shutdown,5000},
          {child_type,worker}]
[ns_server:debug,2024-01-09T00:09:34.585-08:00,babysitter_of_ns_1@cb.local:<0.396.0>:restartable:start_child:92]Started child process <0.397.0> MFA: {supervisor_cushion,start_link, [goxdcr,5000,infinity,ns_port_server,start_link, [#Fun<ns_child_ports_sup.2.17249174>]]}
[error_logger:info,2024-01-09T00:09:34.585-08:00,babysitter_of_ns_1@cb.local:ns_child_ports_sup<0.134.0>:ale_error_logger_handler:do_log:101]
=========================PROGRESS REPORT=========================
supervisor: {local,ns_child_ports_sup}
started: [{pid,<0.396.0>},
          {id,{goxdcr,"/opt/couchbase/bin/goxdcr",
               ["-sourceKVAdminPort=8091","-xdcrRestPort=9998",
                "-isEnterprise=true","-ipv4=required","-ipv6=optional",
                "-caFile=/opt/couchbase/var/lib/couchbase/config/certs/ca.pem",
                "-clientCertFile=/opt/couchbase/var/lib/couchbase/config/certs/client_chain.pem",
                "-clientKeyFile=/opt/couchbase/var/lib/couchbase/config/certs/client_pkey.pem"],
               [via_goport,exit_status,stderr_to_stdout,
                {env,
                 [{"GOTRACEBACK","single"},
                  {"GOMAXPROCS","4"},
                  {"CBAUTH_REVRPC_URL",
                   {dynamic,
                    {ns_ports_setup,remote_build_cbauth_revrpc_url,[goxdcr]}}}]},
                {log,"goxdcr.log"}]}},
          {mfargs,
           {restartable,start_link,
            [{supervisor_cushion,start_link,
              [goxdcr,5000,infinity,ns_port_server,start_link,
               [#Fun<ns_child_ports_sup.2.17249174>]]},
             86400000]}},
          {restart_type,permanent},
          {significant,false},
          {shutdown,infinity},
          {child_type,worker}]
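The pattern in these logs - goxdcr polls metakv for a bounded time, exits with status 1 when it stays unreachable, and supervisor_cushion restarts it after a fixed delay (5000 ms here) - can be reduced to a generic sketch. The Python below is an illustration of that restart loop, not ns_server code:

```python
import time

def supervise(child, cushion_s, max_restarts):
    """Restart `child` after a fixed cushion delay until it exits 0."""
    for attempt in range(max_restarts):
        if child(attempt) == 0:
            return attempt      # restarts needed before a clean run
        time.sleep(cushion_s)   # supervisor_cushion-style delay between starts
    return max_restarts

# Hypothetical child: its dependency (metakv in the logs) is unreachable for
# the first two starts, so it exits 1; once it comes back the child exits 0.
child = lambda attempt: 0 if attempt >= 2 else 1
print(supervise(child, cushion_s=0.0, max_restarts=5))  # 2
```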
|
Observing many "Accepted new connection" and "Connection down" messages in ns_server.debug.log on node 172.23.121.176
[ns_server:debug,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:cb_dist<0.227.0>:cb_dist:info_msg:1098]cb_dist: Accepted new connection from <0.231.0> DistCtrl #Port<0.4201>: {con, #Ref<0.2369029036.2466250753.8093>, inet_tcp_dist, undefined, undefined}
[ns_server:debug,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:net_kernel<0.229.0>:cb_dist:info_msg:1098]cb_dist: Accepting connection from <0.231.0> using module inet_tcp_dist
[ns_server:debug,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:cb_dist<0.227.0>:cb_dist:info_msg:1098]cb_dist: Updated connection: {con,#Ref<0.2369029036.2466250753.8093>, inet_tcp_dist,<0.32550.0>, #Ref<0.2369029036.2466250753.8096>}
[ns_server:debug,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:cb_dist<0.227.0>:cb_dist:info_msg:1098]cb_dist: Connection down: {con,#Ref<0.2369029036.2466250753.8093>, inet_tcp_dist,<0.32550.0>, #Ref<0.2369029036.2466250753.8096>}
[error_logger:info,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:net_kernel<0.229.0>:ale_error_logger_handler:do_log:101]
=========================NOTICE REPORT=========================
{net_kernel,{'EXIT',<0.32550.0>,{recv_challenge_reply_failed,{error,closed}}}}
[error_logger:info,2024-01-09T00:11:08.679-08:00,ns_1@cb.local:net_kernel<0.229.0>:ale_error_logger_handler:do_log:101]
Observing "failed with error not_leader" in ns_server.debug.log on node 172.23.121.176
[ns_server:error,2024-01-09T00:11:08.648-08:00,ns_1@cb.local:<0.31710.0>:leader_activities:report_error:944]Activity {default,{ns_janitor,["default"],cleanup}} failed with error not_leader
[error_logger:error,2024-01-09T00:11:08.649-08:00,ns_1@cb.local:cleanup_process<0.31708.0>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
crasher:
  initial call: ns_janitor_server:run_cleanup/2
  pid: <0.31708.0>
  registered_name: cleanup_process
  exception error: no match of right hand side value
    {leader_activities_error, {default,{ns_janitor,["default"],cleanup}}, not_leader}
    in function ns_janitor:run_buckets_cleanup_activity/3 (src/ns_janitor.erl, line 85)
    in call from ns_janitor:cleanup_buckets/2 (src/ns_janitor.erl, line 78)
    in call from ns_janitor:cleanup/2 (src/ns_janitor.erl, line 41)
    in call from ns_janitor_server:'-run_cleanup/2-fun-0-'/2 (src/ns_janitor_server.erl, line 180)
    in call from lists:foldl_1/3 (lists.erl, line 1355)
    in call from ns_janitor_server:run_cleanup/2 (src/ns_janitor_server.erl, line 178)
  ancestors: [ns_janitor_server,ns_orchestrator_child_sup, ns_orchestrator_sup,mb_master_sup,mb_master, leader_registry_sup,leader_services_sup,<0.778.0>, ns_server_sup,ns_server_nodes_sup,<0.298.0>, ns_server_cluster_sup,root_sup,<0.155.0>]
  message_queue_len: 0
  messages: []
  links: [<0.1378.0>]
  dictionary: []
  trap_exit: false
  status: running
  heap_size: 10958
  stack_size: 28
  reductions: 679
  neighbours:
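The crash itself is a badmatch: the janitor expects its cleanup activity to run, but the node lost leadership during the failed rebalance, so leader_activities returns not_leader instead. The gating pattern, reduced to a sketch (names here are illustrative, not ns_server's API):

```python
class NotLeader(Exception):
    """Raised when an activity is attempted on a node that lost leadership."""

def run_activity(is_leader, activity):
    # Leader-gated execution: only the current orchestrator may run cleanup.
    if not is_leader():
        raise NotLeader("not_leader")
    return activity()

try:
    run_activity(lambda: False, lambda: "cleanup ok")
except NotLeader as exc:
    print(exc)  # not_leader
```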
Testrunner script to reproduce
./testrunner -i /data/workspace/debian-p0-upgrade-vset00-00-new-with-services-from-715_2/testexec.47662.ini -p upgrade_version=7.6.0-1980,stop-on-failure=False,skip_cleanup=False,skip_init_check_cbserver=true,groups=simple,gsi_type=memory_optimized,upgrade_test=True,dgm_run=True,skip_log_scan=true,use_single_bucket=True,get-cbcollect-info=True,bucket_storage=couchstore,sirius_url=http://172.23.120.103:4000 -t newupgradetests.MultiNodesUpgradeTests.test_offline_upgrade_with_add_new_services,initial_version=7.1.5-3878,nodes_init=4,ddocs_num=3,pre_upgrade=create_replica_index,doc-per-day=1,dataset=default,after_upgrade_buckets_flush=True,groups=simple,initial-services-setting=kv-kv,n1ql,index-kv,index,after_upgrade_services_in=index,num_index_replica=2,disable_HTP=True,offline_failover_upgrade=True,upgrade_test=True,dgm_run=True,init_nodes=False |
Job name: debian-upgrade-new-with-services-from-715_2
Attachments
Issue Links
- relates to
  - MB-60324 dump-guts doesn't correctly handle ns_config keys with a purge timestamp (Closed)