Details
Description
Reinstalling Couchbase and then attempting to start it results in ns_server crashing.
Steps to reproduce
- Spawn a completely new Vagrant virtual machine running 7.0.0-4390 on CentOS 7.
- Stop and remove Couchbase Server:
systemctl stop couchbase-server
yum -y remove couchbase-server
- Download build 7.0.0-4390:
wget http://172.23.126.166/builds/latestbuilds/couchbase-server/cheshire-cat/4390/couchbase-server-enterprise-7.0.0-4390-centos7.x86_64.rpm
- Reinstall and start Couchbase Server:
yum -y install couchbase-server-enterprise-7.0.0-4390-centos7.x86_64.rpm
systemctl start couchbase-server
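For convenience, the steps above can be run as a single shell session (a sketch; it assumes root access on the Vagrant VM and that the internal build URL is reachable):
# tear down the existing installation
systemctl stop couchbase-server
yum -y remove couchbase-server
# fetch and reinstall build 7.0.0-4390, then start it
wget http://172.23.126.166/builds/latestbuilds/couchbase-server/cheshire-cat/4390/couchbase-server-enterprise-7.0.0-4390-centos7.x86_64.rpm
yum -y install couchbase-server-enterprise-7.0.0-4390-centos7.x86_64.rpm
systemctl start couchbase-server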
What happens:
The installation succeeds, but ns_server crashes. The output of 'systemctl status couchbase-server -l' shows the following:
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: client_info: undefined
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: label: {gen_server,terminate}
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: last_message: {die,{abnormal,3}}
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: log: []
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: name: <0.120.0>
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: reason: {abnormal,3}
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: state: {state,ns_server,5000,-576460751640103199,undefined,infinity}
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: 2021-02-10 13:22:44.346049 crash_report #{label=>{proc_lib,crash},report=>[[{initial_call,{supervisor_cushion,init,['Argument__1']}},{pid,<0.120.0>},{registered_name,[]},{error_info,{exit,{abnormal,3},[{gen_server,handle_common_reply,8,[{file,"gen_server.erl"},{line,751}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}},{ancestors,[<0.119.0>,child_ns_server_sup,ns_babysitter_sup,<0.105.0>]},{message_queue_len,0},{messages,[]},{links,[<0.119.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,1598},{stack_size,27},{reductions,8769}],[]]}
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: 2021-02-10 13:22:44.346634 crash_report #{label=>{proc_lib,crash},report=>[[{initial_call,{erlang,apply,['Argument__1','Argument__2']}},{pid,<0.119.0>},{registered_name,[]},{error_info,{exit,{abnormal,3},[{restartable,loop,4,[{file,"src/restartable.erl"},{line,69}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}},{ancestors,[child_ns_server_sup,ns_babysitter_sup,<0.105.0>]},{message_queue_len,0},{messages,[]},{links,[<0.118.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,10958},{stack_size,27},{reductions,3018}],[]]}
Feb 10 13:22:44 node1-cheshire-cat-testing-centos7.vagrants couchbase[7748]: 2021-02-10 13:22:44.347059 supervisor_report #{label=>{supervisor,child_terminated},report=>[{supervisor,{local,child_ns_server_sup}},{errorContext,child_terminated},{reason,{abnormal,3}},{offender,[{pid,<0.119.0>},{id,{ns_server,"/opt/couchbase/lib/erlang/bin/erl",["+A","16","-smp","enable","+sbt","u","+P","327680","+K","true","+swt","low","+sbwt","none","+MMmcs","30","+e102400","-setcookie","nocookie","-kernel","logger","[{handler, default, undefined}]","-user","user_io","-run","child_erlang","child_start","ns_bootstrap","--","-smp","enable","-kernel","logger","[{handler, default, undefined}]","inetrc","\"/opt/couchbase/etc/couchbase/hosts.cfg\"","dist_config_file","\"/opt/couchbase/var/lib/couchbase/config/dist_cfg\"","-proto_dist","cb","-epmd_module","cb_epmd","-ssl_dist_optfile","/opt/couchbase/etc/couchbase/ssl_dist_opts","-couch_ini","/opt/couchbase/etc/couchdb/default.ini","/opt/couchbase/etc/couchdb/default.d/capi.ini","/opt/couchbase/etc/couchdb/default.d/geocouch.ini","/opt/couchbase/etc/couchdb/local.ini"],[{env,[{"NS_SERVER_BABYSITTER_PID","7748"},{"CHILD_ERLANG_ENV_ARGS","[{ns_babysitter,\n [{cookiefile,\n \"/opt/couchbase/var/lib/couchbase/couchbase-server.babysitter.cookie\"},\n {pidfile,\"/opt/couchbase/var/lib/couchbase/couchbase-server.pid\"},\n {nodefile,\n \"/opt/couchbase/var/lib/couchbase/couchbase-server.babysitter.node\"}]},\n {ns_server,\n [{loglevel_stats,debug},\n {error_logger_mf_dir,\"/opt/couchbase/var/lib/couchbase/logs\"},\n {max_t,10},\n {loglevel_views,debug},\n {path_config_etcdir,\"/opt/couchbase/etc/couchbase\"},\n {path_config_secdir,\"/opt/couchbase/etc/security\"},\n {path_config_datadir,\"/opt/couchbase/var/lib/couchbase\"},\n {loglevel_cluster,debug},\n {loglevel_error_logger,debug},\n {loglevel_user,debug},\n {loglevel_ns_server,debug},\n {path_config_libdir,\"/opt/couchbase/lib\"},\n {loglevel_cbas,debug},\n {max_r,20},\n {config_path,\"/opt/couchbase/etc/couchbase/static_config\"},\n {disk_sink_opts_json_rpc,\n [{rotation,\n [{compress,true},\n {size,41943040},\n {num_files,2},\n {buffer_size_max,52428800}]}]},\n {loglevel_access,info},\n {loglevel_couchdb,info},\n {path_config_tmpdir,\"/opt/couchbase/var/lib/couchbase/tmp\"},\n {net_kernel_verbosity,10},\n {loglevel_rebalance,debug},\n {loglevel_default,debug},\n {nodefile,\"/opt/couchbase/var/lib/couchbase/couchbase-server.node\"},\n {loglevel_xdcr,debug},\n {loglevel_ns_doctor,debug},\n {path_config_bindir,\"/opt/couchbase/bin\"},\n {loglevel_menelaus,debug},\n {disk_sink_opts,\n [{rotation,\n [{compress,true},\n {size,41943040},\n {num_files,10},\n {buffer_size_max,52428800}]}]},\n {loglevel_mapreduce_errors,debug}]}]"},{"ERL_CRASH_DUMP","erl_crash.dump.1612963342.7748.ns_server"}]},exit_status,use_stdio,stream,eof]}},{mfargs,{restartable,start_link,[{supervisor_cushion,start_link,[ns_server,5000,infinity,ns_port_server,start_link,[#Fun<ns_child_ports_sup.4.12324699>]]},86400000]}},{restart_type,permanent},{shutdown,infinity},{child_type,worker}]}]}
Furthermore, I could not make http requests to 'localhost:8091' using curl or wget.
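For reference, a simple connectivity check against the REST port shows the failure (a sketch; 8091 is the standard Couchbase Server REST/admin port, and the exact error text may vary):
curl -v http://localhost:8091/pools
# expected to fail with "Connection refused" while ns_server is down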
The ns_server.error log file reports additional errors (see appendix).
A possible reason why this happens:
With help from Daniel Owen, I was able to narrow down what might be the cause.
After uninstallation, the 'chronicle' folder is left behind:
[root@node1-cheshire-cat-testing-centos7 ~]# tree /opt/couchbase/var/lib/couchbase/config/chronicle/
/opt/couchbase/var/lib/couchbase/config/chronicle/
├── logs
│   └── 0.log
└── snapshots
Deleting this folder after uninstalling and before reinstalling prevents the crash:
rm -rf /opt/couchbase/var/lib/couchbase/config/chronicle/
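Putting the pieces together, the following sequence (a sketch built from the commands above) reinstalls cleanly because the stale chronicle data is removed between the uninstall and the reinstall:
systemctl stop couchbase-server
yum -y remove couchbase-server
# the key step: drop the leftover chronicle config before reinstalling
rm -rf /opt/couchbase/var/lib/couchbase/config/chronicle/
yum -y install couchbase-server-enterprise-7.0.0-4390-centos7.x86_64.rpm
systemctl start couchbase-server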
What I expected to happen:
I expected ns_server to start successfully after reinstallation.
Logs:
Logs from the Vagrant VM (based on the steps to reproduce): output_vagrant.zip
I first observed this issue on a test VM; the logs for that run are in output_testing_vm.zip.
Steps to reproduce in a video format:
Appendix
The last 100 entries from 'ns_server.error.log':
[ns_server:error,2021-02-10T13:19:21.744Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:19:36.748Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,timeout}
[ns_server:error,2021-02-10T13:19:36.748Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:19:51.423Z,ns_1@10.112.210.101:<0.12005.0>:service_agent:wait_for_connection_loop:351]No connection with label "index-service_api" after 60000ms. Exiting.
[ns_server:error,2021-02-10T13:19:51.424Z,ns_1@10.112.210.101:service_agent-index<0.12004.0>:service_agent:handle_info:283]Linked process <0.12005.0> died with reason {no_connection, "index-service_api"}. Terminating
[ns_server:error,2021-02-10T13:19:51.424Z,ns_1@10.112.210.101:service_agent-index<0.12004.0>:service_agent:terminate:312]Terminating abnormally
[ns_server:error,2021-02-10T13:19:51.750Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,timeout}
[ns_server:error,2021-02-10T13:19:51.750Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:20:06.752Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,timeout}
[ns_server:error,2021-02-10T13:20:06.752Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:20:21.754Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,timeout}
[ns_server:error,2021-02-10T13:20:21.754Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:20:36.756Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,timeout}
[ns_server:error,2021-02-10T13:20:36.756Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:20:51.427Z,ns_1@10.112.210.101:<0.14861.0>:service_agent:wait_for_connection_loop:351]No connection with label "index-service_api" after 60000ms. Exiting.
[ns_server:error,2021-02-10T13:20:51.427Z,ns_1@10.112.210.101:service_agent-index<0.14860.0>:service_agent:handle_info:283]Linked process <0.14861.0> died with reason {no_connection, "index-service_api"}. Terminating
[ns_server:error,2021-02-10T13:20:51.427Z,ns_1@10.112.210.101:service_agent-index<0.14860.0>:service_agent:terminate:312]Terminating abnormally
[ns_server:error,2021-02-10T13:20:51.758Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,timeout}
[ns_server:error,2021-02-10T13:20:51.758Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:21:06.761Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,timeout}
[ns_server:error,2021-02-10T13:21:06.761Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:21:21.763Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,timeout}
[ns_server:error,2021-02-10T13:21:21.763Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:21:27.584Z,ns_1@10.112.210.101:service_agent-backup<0.505.0>:service_agent:handle_info:291]Lost json rpc connection for service backup, reason shutdown. Terminating.
[ns_server:error,2021-02-10T13:21:27.584Z,ns_1@10.112.210.101:service_agent-backup<0.505.0>:service_agent:terminate:312]Terminating abnormally
[ns_server:error,2021-02-10T13:21:27.595Z,ns_1@10.112.210.101:service_agent-eventing<0.509.0>:service_agent:handle_info:291]Lost json rpc connection for service eventing, reason shutdown. Terminating.
[ns_server:error,2021-02-10T13:21:27.596Z,ns_1@10.112.210.101:service_agent-eventing<0.509.0>:service_agent:terminate:312]Terminating abnormally
[ns_server:error,2021-02-10T13:21:27.601Z,ns_1@10.112.210.101:service_agent-fts<0.513.0>:service_agent:handle_info:291]Lost json rpc connection for service fts, reason shutdown. Terminating.
[ns_server:error,2021-02-10T13:21:27.601Z,ns_1@10.112.210.101:service_agent-fts<0.513.0>:service_agent:terminate:312]Terminating abnormally
[ns_server:error,2021-02-10T13:21:27.622Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,
  {econnrefused,
   [{lhttpc_client,send_request,1,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,220}]},
    {lhttpc_client,execute,9,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,169}]},
    {lhttpc_client,request,9,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,93}]}]}}
[ns_server:error,2021-02-10T13:21:27.622Z,ns_1@10.112.210.101:service_status_keeper-index<0.608.0>:service_status_keeper:handle_cast:132]Service service_index returned incorrect status
[ns_server:error,2021-02-10T13:21:27.622Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (fts) api/nsstatus failed: {error,
  {econnrefused,
   [{lhttpc_client,send_request,1,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,220}]},
    {lhttpc_client,execute,9,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,169}]},
    {lhttpc_client,request,9,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,93}]}]}}
[ns_server:error,2021-02-10T13:21:27.623Z,ns_1@10.112.210.101:service_status_keeper-fts<0.613.0>:service_status_keeper:handle_cast:132]Service service_fts returned incorrect status
[ns_server:error,2021-02-10T13:21:27.626Z,ns_1@10.112.210.101:service_status_keeper_worker<0.607.0>:rest_utils:get_json:59]Request to (eventing) api/v1/functions failed: {error,
  {econnrefused,
   [{lhttpc_client,send_request,1,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,220}]},
    {lhttpc_client,execute,9,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,169}]},
    {lhttpc_client,request,9,
     [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
      {line,93}]}]}}
[ns_server:error,2021-02-10T13:21:27.626Z,ns_1@10.112.210.101:service_status_keeper-eventing<0.618.0>:service_status_keeper:handle_cast:132]Service service_eventing returned incorrect status
[ns_server:error,2021-02-10T13:21:28.697Z,ns_1@10.112.210.101:<0.354.0>:ns_port_server:handle_info:157]Got unexpected exit signal from port: {'EXIT',<0.360.0>,normal}. Exiting.
[ns_server:error,2021-02-10T13:21:28.697Z,ns_1@10.112.210.101:prometheus_cfg<0.349.0>:prometheus_cfg:handle_info:376]Received exit from Prometheus port server - <0.354.0>: normal. Restarting Prometheus...
[ns_server:error,2021-02-10T13:21:29.082Z,ns_1@10.112.210.101:<0.19440.0>:prometheus:post_async:194]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bname%3D~%60kv_curr_items%7Ckv_curr_items_tot%7Ckv_vb_replica_curr_items%7Ckv_mem_used_bytes%7Ccouch_docs_actual_disk_size%7Ccouch_views_actual_disk_size%7Ckv_ep_db_data_size_bytes%7Ckv_vb_active_num_non_resident%7Ckv_ep_bg_fetched%60%7D+or+label_replace%28sum+by+%28bucket%2C+name%29+%28irate%28kv_ops%7Bop%3D%60get%60%7D%5B1m%5D%29%29%2C+%60name%60%2C%60cmd_get%60%2C+%60%60%2C+%60%60%29+or+label_replace%28irate%28kv_ops%7Bop%3D%60get%60%2Cresult%3D%60hit%60%7D%5B1m%5D%29%2C%60name%60%2C%60get_hits%60%2C%60%60%2C%60%60%29+or+label_replace%28sum+by+%28bucket%29+%28irate%28kv_cmd_lookup%5B1m%5D%29+or+irate%28kv_ops%7Bop%3D~%60set%7Cincr%7Cdecr%7Cdelete%7Cdel_meta%7Cget_meta%7Cset_meta%7Cset_ret_meta%7Cdel_ret_meta%60%7D%5B1m%5D%29%29%2C+%60name%60%2C+%60ops%60%2C+%60%60%2C+%60%60%29+or+sum+by+%28bucket%2C+name%29+%28%7Bname%3D~%60index_data_size%7Cindex_disk_size%7Ccouch_spatial_data_size%7Ccouch_spatial_disk_size%7Ccouch_views_data_size%60%7D%29&timeout=5s
Reason: {failed_connect,[{to_address,{"127.0.0.1",9123}},
                         {inet,[inet],econnrefused}]}
[ns_server:error,2021-02-10T13:21:29.131Z,ns_1@10.112.210.101:<0.19443.0>:prometheus:post_async:194]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bcategory%3D%60system-processes%60%2Cname%3D~%60sysproc_mem_resident%7Csysproc_mem_size%7Csysproc_cpu_utilization%7Csysproc_major_faults_raw%60%7D&timeout=5s
Reason: {failed_connect,[{to_address,{"127.0.0.1",9123}},
                         {inet,[inet],econnrefused}]}
[ns_server:error,2021-02-10T13:21:29.169Z,ns_1@10.112.210.101:<0.19448.0>:prometheus:post_async:194]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bcategory%3D%60system%60%2Cname%3D~%60sys_cpu_utilization_rate%7Csys_cpu_stolen_rate%7Csys_swap_total%7Csys_swap_used%7Csys_mem_total%7Csys_mem_free%7Csys_mem_limit%7Csys_cpu_cores_available%7Csys_allocstall%60%7D&timeout=5s
Reason: {failed_connect,[{to_address,{"127.0.0.1",9123}},
                         {inet,[inet],econnrefused}]}
[ns_server:error,2021-02-10T13:21:38.564Z,ns_1@10.112.210.101:prometheus_cfg<0.349.0>:prometheus_cfg:terminate:397]Terminate: shutdown
[ns_server:error,2021-02-10T13:22:42.736Z,ns_1@127.0.0.1:prometheus_cfg<0.358.0>:prometheus_cfg:terminate:397]Terminate: shutdown
[ns_server:error,2021-02-10T13:23:02.447Z,ns_1@127.0.0.1:prometheus_cfg<0.349.0>:prometheus_cfg:terminate:397]Terminate: shutdown
[ns_server:error,2021-02-10T13:23:22.128Z,ns_1@127.0.0.1:prometheus_cfg<0.349.0>:prometheus_cfg:terminate:397]Terminate: shutdown
[ns_server:error,2021-02-10T13:23:41.783Z,ns_1@127.0.0.1:prometheus_cfg<0.347.0>:prometheus_cfg:terminate:397]Terminate: shutdown
[ns_server:error,2021-02-10T13:24:01.892Z,ns_1@127.0.0.1:prometheus_cfg<0.352.0>:prometheus_cfg:terminate:397]Terminate: shutdown
Spoke at some length with Dave F late last week to reach a consensus. The way forward is to have the installers (for at least Windows MSI and Linux rpm and deb) explicitly remove the entire chronicle directory when doing a full uninstall, but not when doing an upgrade. This should be substantially easier and safer than attempting to rename/move the directory. I don't believe that any of the installer mechanisms we currently use have the concept of a "config directory" - certainly MSI doesn't - and so that concept would have to be manually implemented. Also, moving it aside like that would require changes in cbupgrade to move it back, making this a cross-team effort since cbupgrade is owned by Tools. Finally, as this bug proves, after uninstall those files are not only useless but potentially dangerous, and so the right logical action is to ensure they are deleted during uninstall.
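To illustrate the rpm side of that approach (a hypothetical sketch, not the actual installer change): rpm passes the number of package instances that will remain as the first argument to the %preun scriptlet, so the chronicle directory can be removed only on a full uninstall ($1 = 0) and left alone on an upgrade ($1 >= 1). The deb packaging can make the same distinction via the "remove"/"purge" vs "upgrade" arguments passed to its maintainer scripts.
%preun
if [ "$1" -eq 0 ]; then
    # Full uninstall (not an upgrade): delete the stale chronicle config so a
    # later reinstall starts from a clean state.
    rm -rf /opt/couchbase/var/lib/couchbase/config/chronicle
fi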
I am working on this ticket this week, but unless someone (Dave Finlay?) tells me otherwise, I'm planning to wait until after the beta refresh build to actually commit the changes. Installer changes are notoriously fragile, and I don't think we can count on QE running an exhaustive set of upgrade and uninstall/re-install scenarios in the next few days before beta refresh code freeze, so the potential risk of getting the changes in for beta refresh seems to outweigh any potential benefit.