Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-60128

[Upgrade][Community to Enterprise] : Rebalance exited with reason {service_rebalance_failed,fts,{worker_died,{'EXIT',<0.3522.3>,{rebalance_failed,{service_error,

    XMLWordPrintable

Details

    Description

      1. Installed Couchbase Community Edition 7.2.2-6401 and created a 3 node couchbase cluster with the following services
        1. Node 1 - 172.23.121.192 - kv, index,n1ql
        2. Node 2 - 172.23.121.100 - kv, index, n1ql, fts
        3. Node 3 - 172.23.121.189 - kv, index, n1ql, fts
      2. Created a couchstore bucket named default
      3. Created a fulltext-index index_default on 172.23.122.189
      4. Ran many fts queries using the fts index
      5. A new node 172.23.121.180 was installed and initialized with Couchbase Enterprise Edition 7.2.4-7059
      6. Started a swap rebalance on the cluster removing 172.23.121.192
      7. Cleaned and installed Couchbase Enterprise Edition 7.2.4-7059 on 172.23.121.192
      8. Started a swap rebalance on the cluster removing 172.23.121.100

      Rebalance fails

      2023-12-12T00:49:02.342-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.121.100) - Rebalance exited with reason {service_rebalance_failed,fts,                              {worker_died,                               {'EXIT',<0.3522.3>,                                {rebalance_failed,                                 {service_error,                                  <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Content-Length\":[]string{\"50\"}, \"Content-Type\":[]string{\"text/plain; charset=utf-8\"}, \"Date\":[]string{\"Tue, 12 Dec 2023 08:49:02 GMT\"}}, Body:(*http.bodyEOFSignal)(0xc00060fc80), ContentLength:50, TransferEncoding:[]string(nil), Close:false, Uncompressed:false, Trailer:http.Header(nil), Request:(*http.Request)(0xc006ab2c00), TLS:(*tls.ConnectionState)(0xc000138840)}, urlUUID: monitor.UrlUUID{Url:\"https://172.23.121.192:18094\", UUID:\"3e30e19303ea45e6204de316e5dba41a\"}, kind: /api/stats?partitions=true, err: <nil>">>}}}}}.Rebalance Operation Id = e862c38ea50b8c690e43318f8d49c2b1 

      Observing these CRASH REPORTS on 172.23.121.100

      [error_logger:error,2023-12-12T00:49:02.340-08:00,ns_1@172.23.121.100:service_rebalancer-fts-worker<0.3522.3>:ale_error_logger_handler:do_log:101]=========================CRASH REPORT=========================  crasher:    initial call: service_rebalancer:'-run_rebalance_worker/1-fun-0-'/0    pid: <0.3522.3>    registered_name: 'service_rebalancer-fts-worker'    exception exit: {rebalance_failed,                        {service_error,                            <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Conten"...>>}}      in function  service_rebalancer:wait_for_rebalance_completion_loop/2 (src/service_rebalancer.erl, line 178)    ancestors: ['service_rebalancer-fts',<0.3710.2>]    message_queue_len: 0    messages: []    links: [<0.3502.3>]    dictionary: []    trap_exit: false    status: running    heap_size: 2586    stack_size: 29    reductions: 14875  neighbours:
      [ns_server:error,2023-12-12T00:49:02.340-08:00,ns_1@172.23.121.100:service_rebalancer-fts<0.3502.3>:service_rebalancer:run_rebalance_worker:119]Worker terminated abnormally: {'EXIT',<0.3522.3>,                               {rebalance_failed,                                {service_error,                                 <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Content-Length\":[]string{\"50\"}, \"Content-Type\":[]string{\"text/plain; charset=utf-8\"}, \"Date\":[]string{\"Tue, 12 Dec 2023 08:49:02 GMT\"}}, Body:(*http.bodyEOFSignal)(0xc00060fc80), ContentLength:50, TransferEncoding:[]string(nil), Close:false, Uncompressed:false, Trailer:http.Header(nil), Request:(*http.Request)(0xc006ab2c00), TLS:(*tls.ConnectionState)(0xc000138840)}, urlUUID: monitor.UrlUUID{Url:\"https://172.23.121.192:18094\", UUID:\"3e30e19303ea45e6204de316e5dba41a\"}, kind: /api/stats?partitions=true, err: <nil>">>}}}[error_logger:error,2023-12-12T00:49:02.341-08:00,ns_1@172.23.121.100:service_rebalancer-fts<0.3502.3>:ale_error_logger_handler:do_log:101]=========================CRASH REPORT=========================  crasher:    initial call: misc:'-spawn_monitor/1-fun-0-'/0    pid: <0.3502.3>    registered_name: 'service_rebalancer-fts'    exception exit: {worker_died,                        {'EXIT',<0.3522.3>,                            {rebalance_failed,                                {service_error,                                    <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header"...>>}}}}      in function  service_rebalancer:run_rebalance/1 (src/service_rebalancer.erl, line 73)    ancestors: [<0.3710.2>]    message_queue_len: 0    messages: []    links: []    dictionary: []    trap_exit: false    status: running    heap_size: 6772    stack_size: 29    reductions: 14749  neighbours:
      [ns_server:error,2023-12-12T00:49:02.341-08:00,ns_1@172.23.121.100:service_agent-fts<0.3452.0>:service_agent:handle_info:234]Rebalancer <0.3502.3> died unexpectedly: {worker_died,                                          {'EXIT',<0.3522.3>,                                           {rebalance_failed,                                            {service_error,                                             <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Content-Length\":[]string{\"50\"}, \"Content-Type\":[]string{\"text/plain; charset=utf-8\"}, \"Date\":[]string{\"Tue, 12 Dec 2023 08:49:02 GMT\"}}, Body:(*http.bodyEOFSignal)(0xc00060fc80), ContentLength:50, TransferEncoding:[]string(nil), Close:false, Uncompressed:false, Trailer:http.Header(nil), Request:(*http.Request)(0xc006ab2c00), TLS:(*tls.ConnectionState)(0xc000138840)}, urlUUID: monitor.UrlUUID{Url:\"https://172.23.121.192:18094\", UUID:\"3e30e19303ea45e6204de316e5dba41a\"}, kind: /api/stats?partitions=true, err: <nil>">>}}}}[ns_server:info,2023-12-12T00:49:02.341-08:00,ns_1@172.23.121.100:rebalance_agent<0.3084.0>:rebalance_agent:handle_down:290]Rebalancer process <0.3710.2> died (reason {service_rebalance_failed,fts,                                            {worker_died,                                             {'EXIT',<0.3522.3>,                                              {rebalance_failed,                                               {service_error,                                                <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Content-Length\":[]string{\"50\"}, \"Content-Type\":[]string{\"text/plain; charset=utf-8\"}, \"Date\":[]string{\"Tue, 12 Dec 2023 08:49:02 GMT\"}}, Body:(*http.bodyEOFSignal)(0xc00060fc80), ContentLength:50, TransferEncoding:[]string(nil), Close:false, Uncompressed:false, Trailer:http.Header(nil), Request:(*http.Request)(0xc006ab2c00), TLS:(*tls.ConnectionState)(0xc000138840)}, urlUUID: monitor.UrlUUID{Url:\"https://172.23.121.192:18094\", UUID:\"3e30e19303ea45e6204de316e5dba41a\"}, kind: /api/stats?partitions=true, err: <nil>">>}}}}}).[ns_server:debug,2023-12-12T00:49:02.341-08:00,ns_1@172.23.121.100:leader_activities<0.3056.0>:leader_activities:handle_activity_down:450]Activity terminated with reason {shutdown,                                 {async_died,                                  {raised,                                   {exit,                                    {service_rebalance_failed,fts,                                     {worker_died,                                      {'EXIT',<0.3522.3>,                                       {rebalance_failed,                                        {service_error,                                         <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Content-Length\":[]string{\"50\"}, \"Content-Type\":[]string{\"text/plain; charset=utf-8\"}, \"Date\":[]string{\"Tue, 12 Dec 2023 08:49:02 GMT\"}}, Body:(*http.bodyEOFSignal)(0xc00060fc80), ContentLength:50, TransferEncoding:[]string(nil), Close:false, Uncompressed:false, Trailer:http.Header(nil), Request:(*http.Request)(0xc006ab2c00), TLS:(*tls.ConnectionState)(0xc000138840)}, urlUUID: monitor.UrlUUID{Url:\"https://172.23.121.192:18094\", UUID:\"3e30e19303ea45e6204de316e5dba41a\"}, kind: /api/stats?partitions=true, err: <nil>">>}}}}},                                    [{ns_rebalancer,                                      '-rebalance_topology_aware_service/4-fun-1-',                                      5,                                      [{file,"src/ns_rebalancer.erl"},                                       {line,395}]},                                     {misc,with_trap_exit,1,                                      [{file,"src/misc.erl"},{line,2714}]},                                     {ns_rebalancer,                                      '-rebalance_topology_aware_services/4-fun-1-',                                      4,                                      [{file,"src/ns_rebalancer.erl"},                                       {line,360}]},                                     {lists,filtermap,2,                                      [{file,"lists.erl"},{line,1321}]},                                     {ns_rebalancer,rebalance_services,2,                                      [{file,"src/ns_rebalancer.erl"},                                       {line,279}]},                                     {ns_rebalancer,rebalance_body,5,                                      [{file,"src/ns_rebalancer.erl"},                                       {line,526}]},                                     {async,'-async_init/4-fun-1-',3,                                      [{file,"src/async.erl"},                                       {line,191}]}]}}}}. Activity:{activity,<0.3709.2>,#Ref<0.1630041773.2127298564.210993>,default,          <<"8b80d44e1e68411e2044cbdcb74c5f22">>,          [rebalance],          majority,[]}[error_logger:error,2023-12-12T00:49:02.342-08:00,ns_1@172.23.121.100:<0.3706.2>:ale_error_logger_handler:do_log:101]=========================CRASH REPORT=========================  crasher:    initial call: erlang:apply/2    pid: <0.3706.2>    registered_name: []    exception exit: {service_rebalance_failed,fts,                     {worker_died,                      {'EXIT',<0.3522.3>,                       {rebalance_failed,                        {service_error,                         <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMin"...>>}}}}}      in function  ns_rebalancer:'-rebalance_topology_aware_service/4-fun-1-'/5 (src/ns_rebalancer.erl, line 395)      in call from misc:with_trap_exit/1 (src/misc.erl, line 2714)      in call from ns_rebalancer:'-rebalance_topology_aware_services/4-fun-1-'/4 (src/ns_rebalancer.erl, line 360)      in call from lists:filtermap/2 (lists.erl, line 1321)      in call from ns_rebalancer:rebalance_services/2 (src/ns_rebalancer.erl, line 279)      in call from ns_rebalancer:rebalance_body/5 (src/ns_rebalancer.erl, line 526)      in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191)    ancestors: [<0.29804.1>,ns_orchestrator_child_sup,ns_orchestrator_sup,                  mb_master_sup,mb_master,leader_registry_sup,                  leader_services_sup,<0.3053.0>,ns_server_sup,                  ns_server_nodes_sup,<0.2744.0>,ns_server_cluster_sup,                  root_sup,<0.146.0>]    message_queue_len: 0    messages: []    links: [<0.29804.1>]    dictionary: []    trap_exit: false    status: running    heap_size: 1598    stack_size: 29    reductions: 1381  neighbours:
      [user:error,2023-12-12T00:49:02.342-08:00,ns_1@172.23.121.100:<0.29804.1>:ns_orchestrator:log_rebalance_completion:1433]Rebalance exited with reason {service_rebalance_failed,fts,                              {worker_died,                               {'EXIT',<0.3522.3>,                                {rebalance_failed,                                 {service_error,                                  <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Content-Length\":[]string{\"50\"}, \"Content-Type\":[]string{\"text/plain; charset=utf-8\"}, \"Date\":[]string{\"Tue, 12 Dec 2023 08:49:02 GMT\"}}, Body:(*http.bodyEOFSignal)(0xc00060fc80), ContentLength:50, TransferEncoding:[]string(nil), Close:false, Uncompressed:false, Trailer:http.Header(nil), Request:(*http.Request)(0xc006ab2c00), TLS:(*tls.ConnectionState)(0xc000138840)}, urlUUID: monitor.UrlUUID{Url:\"https://172.23.121.192:18094\", UUID:\"3e30e19303ea45e6204de316e5dba41a\"}, kind: /api/stats?partitions=true, err: <nil>">>}}}}}.Rebalance Operation Id = e862c38ea50b8c690e43318f8d49c2b1[ns_server:debug,2023-12-12T00:49:02.342-08:00,ns_1@172.23.121.100:<0.29804.1>:auto_rebalance:retry_rebalance:58]Retry rebalance is not enabled. Failed Rebalance with Id e862c38ea50b8c690e43318f8d49c2b1 will not be retried.[ns_server:debug,2023-12-12T00:49:02.343-08:00,ns_1@172.23.121.100:service_agent-fts<0.3452.0>:service_agent:cleanup_service:496]Cleaning up stale tasks:[[{<<"rev">>,<<"MTQ=">>},  {<<"id">>,<<"prepare:1e45761aa655cce953347f2d516ad8c5">>},  {<<"type">>,<<"task-prepared">>},  {<<"status">>,<<"task-running">>},  {<<"isCancelable">>,true},  {<<"progress">>,100},  {<<"description">>,<<"prepare topology change">>},  {<<"extra">>,   {[{<<"topologyChange">>,      {[{<<"id">>,<<"1e45761aa655cce953347f2d516ad8c5">>},        {<<"currentTopologyRev">>,null},        {<<"type">>,<<"topology-change-rebalance">>},        {<<"keepNodes">>,         [{[{<<"nodeInfo">>,             {[{<<"nodeId">>,<<"22d6c0d40bf022fd886ea82513d24795">>},               {<<"priority">>,0},               {<<"opaque">>,null}]}},            {<<"recoveryType">>,<<"recovery-full">>}]},          {[{<<"nodeInfo">>,             {[{<<"nodeId">>,<<"3e30e19303ea45e6204de316e5dba41a">>},               {<<"priority">>,0},               {<<"opaque">>,null}]}},            {<<"recoveryType">>,<<"recovery-full">>}]}]},        {<<"ejectNodes">>,         [{[{<<"nodeId">>,<<"722e13ec968b63a09feb6eae765bb6e8">>},            {<<"priority">>,0},            {<<"opaque">>,null}]}]}]}}]}}]][ns_server:debug,2023-12-12T00:49:02.366-08:00,ns_1@172.23.121.100:chronicle_kv_log<0.2812.0>:chronicle_kv_log:log:59]update (key: counters, rev: {<<"2a1e08cda0a3efb5ebfe0e5a718fbc43">>,525})[{rebalance_fail,{1702370942,1}}, {rebalance_start,{1702370732,4}}, {rebalance_success,{1702370578,3}}][ns_server:debug,2023-12-12T00:49:02.368-08:00,ns_1@172.23.121.100:<0.3001.0>:terse_cluster_info_uploader:handle_info:53]Refreshing terse cluster info with <<"{\"rev\":642,\"nodesExt\":[{\"services\":{\"capi\":8092,\"fts\":8094,\"ftsGRPC\":9130,\"indexAdmin\":9100,\"indexHttp\":9102,\"indexScan\":9101,\"indexStreamCatchup\":9104,\"indexStreamInit\":9103,\"indexStreamMaint\":9105,\"kv\":11210,\"mgmt\":8091,\"n1ql\":8093,\"projector\":9999},\"thisNode\":true,\"hostname\":\"172.23.121.100\"},{\"services\":{\"capi\":8092,\"capiSSL\":18092,\"indexAdmin\":9100,\"indexHttp\":9102,\"indexHttps\":19102,\"indexScan\":9101,\"indexStreamCatchup\":9104,\"indexStreamInit\":9103,\"indexStreamMaint\":9105,\"kv\":11210,\"kvSSL\":11207,\"mgmt\":8091,\"mgmtSSL\":18091,\"n1ql\":8093,\"n1qlSSL\":18093,\"projector\":9999},\"hostname\":\"172.23.121.180\"},{\"services\":{\"capi\":8092,\"fts\":8094,\"ftsGRPC\":9130,\"indexAdmin\":9100,\"indexHttp\":9102,\"indexScan\":9101,\"indexStreamCatchup\":9104,\"indexStreamInit\":9103,\"indexStreamMaint\":9105,\"kv\":11210,\"mgmt\":8091,\"n1ql\":8093,\"projector\":9999},\"hostname\":\"172.23.121.189\"},{\"services\":{\"capi\":8092,\"capiSSL\":18092,\"fts\":8094,\"ftsGRPC\":9130,\"ftsGRPCSSL\":19130,\"ftsSSL\":18094,\"kv\":11210,\"kvSSL\":11207,\"mgmt\":8091,\"mgmtSSL\":18091,\"projector\":9999},\"hostname\":\"172.23.121.192\"}],\"clusterCapabilitiesVer\":[1,0],\"clusterCapabilities\":{\"n1ql\":[\"enhancedPreparedStatements\"]},\"revEpoch\":1}">>[ns_server:debug,2023-12-12T00:49:02.480-08:00,ns_1@172.23.121.100:ns_config_log<0.266.0>:ns_config_log:log_common:275]config change:rebalance_reports ->[{'_vclock',[{<<"722e13ec968b63a09feb6eae765bb6e8">>,{1,63869589778}},             {<<"89c1bd5de68261e148ca1df22ef17432">>,{2,63869588669}},             {<<"af6906da7a32633087b639bb2cea7758">>,{1,63869590142}}]}, {<<"c42c72db63ae4573233135d5309d3719">>,  [{node,'ns_1@172.23.121.180'},   {filename,"rebalance_report_20231212T084902.json"}]}, {<<"3fd428aa8e79f487eada7e1c683dcd1c">>,  [{node,'ns_1@172.23.121.100'},   {filename,"rebalance_report_20231212T084258.json"}]}, {<<"af2369f00c39d6ce7d5a78929979bf4d">>,  [{node,'ns_1@172.23.121.192'},   {filename,"rebalance_report_20231212T082429.json"}]}, {<<"61e0dbc06ea364d258466ff8eb809554">>,  [{node,'ns_1@172.23.121.192'},   {filename,"rebalance_report_20231212T082338.json"}]}][ns_server:debug,2023-12-12T00:49:02.480-08:00,ns_1@172.23.121.100:ns_config_log<0.266.0>:ns_config_log:log_common:275]config change:{local_changes_count,<<"af6906da7a32633087b639bb2cea7758">>} ->[{'_vclock',[{<<"af6906da7a32633087b639bb2cea7758">>,{15,63869590142}}]}][ns_server:info,2023-12-12T00:49:02.499-08:00,ns_1@172.23.121.100:<0.10564.3>:diag_handler:log_all_dcp_stats:199]logging dcp stats[ns_server:debug,2023-12-12T00:49:02.499-08:00,ns_1@172.23.121.100:chronicle_kv_log<0.2812.0>:chronicle_kv_log:log:59]update (key: rebalance_status, rev: {<<"2a1e08cda0a3efb5ebfe0e5a718fbc43">>,                                     526}){none,<<"Rebalance failed. See logs for detailed reason. You can try again.">>} 

       

      Marking this as a regression since this issue was not seen in 7.2.4-7014

       


       

      Testrunner script to reproduce

      ./testrunner -i /data/workspace/debian-p0-upgrade-vset00-00-ce_ee-new-from-72x-2b/testexec.101666.ini -p skip_init_check_cbserver=true,released_upgrade_version=7.2.4-7059,get-cbcollect-info=false,data_analysis=True,initial_build_type=community,ce_to_ee=true,skip_log_scan=true,upgrade_test=true,get-cbcollect-info=True,bucket_storage=couchstore,sirius_url=http://172.23.120.103:4000 -t upgrade.upgrade_tests.UpgradeTests.test_upgrade,items=20000,initial_version=7.2.0-5325,nodes_init=3,initialize_events=create_fts_index_query_compare,initial-services-setting=kv,index,n1ql-kv,n1ql,index,fts-kv,index,n1ql,fts,upgrade_services_in=same,after_events=rebalance_in-run_fts_query_and_compare,after_upgrade_services_in=kv,index,n1ql,fts,disable_HTP=True,upgrade_test=True,init_nodes=False

      Job name : debian-upgrade-ce_ee-new-from-72x-2b

      Job ref : http://cb-logs-qe.s3-website-us-west-2.amazonaws.com/7.2.4-7059/jenkins_logs/test_suite_executor/653291/

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            raghav.sk Raghav S K
            raghav.sk Raghav S K
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty