Couchbase Server / MB-47937

Add a retry around the /controller/addNode http request

Details

    • Task
    • Resolution: Fixed
    • Major
    • 7.1.0
    • 7.0.1, 7.1.0
    • Jepsen
    • None
    • 1

    Description

      What's the problem?

      It looks like we're hitting MB-45289 during cluster setup: the "/controller/addNode" endpoint occasionally returns an HTTP status code of 500 with the error message "Unexpected server error, request logged".

      What's the fix?

      Given that the comment in that MB suggests retrying on this error, I think it's best to simply add a retry with exponential backoff around the "/controller/addNode" HTTP request.
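
For illustration only, here is a minimal Python sketch of what a retry with exponential backoff could look like. It is not the actual change made to the Jepsen harness. The names HttpStatusError and retry_with_backoff, and all default values, are hypothetical and chosen just for this example.

```python
import time


class HttpStatusError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status, body=""):
        super().__init__(f"status {status}: {body}")
        self.status = status
        self.body = body


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Run call(); on an HTTP 500, wait and retry with exponential backoff.

    All parameter names and defaults are assumptions for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except HttpStatusError as err:
            # Only a 500 is retried; any other status propagates at once,
            # as does the final failure once the attempts are exhausted.
            if err.status != 500 or attempt == max_attempts:
                raise
            # The delay doubles after each failed attempt, up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

The sleep function is passed in as a parameter so the backoff schedule can be checked without actually waiting. A caller would wrap whatever function issues the addNode request and have it raise HttpStatusError on a non-success status.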

      Side notes

      Doesn't seem to affect CC tests.

      Appendix
      An extract from jepsen.log showing that the HTTP request failed with an HTTP status of 500.

      jepsen.log

      2021-08-13 15:34:19,463{GMT} INFO [jepsen node 172.28.128.183] couchbase.util: Adding node 172.28.128.184 to cluster
      2021-08-13 15:34:19,463{GMT} INFO [jepsen node 172.28.128.183] couchbase.util: Adding node 172.28.128.184 to cluster
      2021-08-13 15:34:19,561{GMT} WARN [jepsen node 172.28.128.183] couchbase.util: Rest call to http://172.28.128.183:8091/controller/addNode with params {:hostname http://172.28.128.184, :user Administrator, :password abc123, :services kv} threw exception.
      2021-08-13 15:34:19,566{GMT} INFO [jepsen node 172.28.128.183] couchbase.util: #error { :cause clj-http: status 500 {:cached nil, :request-time 86, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 1}, :streaming? true, :http-client #object[org.apache.http.impl.client.InternalHttpClient 0x400fe1ef "org.apache.http.impl.client.InternalHttpClient@400fe1ef"], :chunked? false, :reason-phrase "Internal Server Error", :headers {"X-Permitted-Cross-Domain-Policies" "none", "Server" "Couchbase Server", "Content-Type" "application/json", "X-Content-Type-Options" "nosniff", "Content-Length" "44", "X-Frame-Options" "DENY", "Connection" "close", "Pragma" "no-cache", "Expires" "Thu, 01 Jan 1970 00:00:00 GMT", "Date" "Fri, 13 Aug 2021 14:34:17 GMT", "X-XSS-Protection" "1; mode=block", "Cache-Control" "no-cache,no-store,must-revalidate"}, :orig-content-encoding nil, :status 500, :length 44, :body "[\"Unexpected server error, request logged.\"]", :trace-redirects []} :data {:cached nil, :request-time 86, :repeatable? false, :protocol-version {:name HTTP, :major 1, :minor 1}, :streaming? true, :http-client #object[org.apache.http.impl.client.InternalHttpClient 0x400fe1ef org.apache.http.impl.client.InternalHttpClient@400fe1ef], :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase Internal Server Error, :headers {X-Permitted-Cross-Domain-Policies none, Server Couchbase Server, Content-Type application/json, X-Content-Type-Options nosniff, Content-Length 44, X-Frame-Options DENY, Connection close, Pragma no-cache, Expires Thu, 01 Jan 1970 00:00:00 GMT, Date Fri, 13 Aug 2021 14:34:17 GMT, X-XSS-Protection 1; mode=block, Cache-Control no-cache,no-store,must-revalidate}, :orig-content-encoding nil, :status 500, :length 44, :body ["Unexpected server error, request logged."], :trace-redirects []}
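
As a rough illustration, the response in this extract could be recognized with a small predicate like the Python sketch below. The function name is_transient_add_node_failure is hypothetical. The status code and message text are taken from the log above; treating any other 500 body as non-transient is an assumption of this sketch.

```python
import json

# Message text copied from the :body field in the jepsen.log extract.
TRANSIENT_MESSAGE = "Unexpected server error, request logged."


def is_transient_add_node_failure(status, body):
    """Return True if (status, body) matches the failure seen in jepsen.log.

    The body is expected to be a JSON array of message strings, as in the
    44-byte response shown in the log.
    """
    if status != 500:
        return False
    try:
        messages = json.loads(body)
    except (TypeError, ValueError):
        # A missing body or a body that is not JSON is not treated as a match.
        return False
    return isinstance(messages, list) and TRANSIENT_MESSAGE in messages
```

Matching on the message as well as the status keeps unrelated 500 responses from being classified as this known failure.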

       
      A stack trace from ns_server.error.log. Although it is not identical, it seems to be related to node renaming.

      ns_server.error.log(node:172.28.128.183)

      [ns_server:error,2021-08-13T14:34:18.414Z,ns_1@172.28.128.183:<0.814.0>:menelaus_util:reply_server_error:205]Server error during processing: ["web request failed",
                                       {path,"/controller/addNode"},
                                       {method,'POST'},
                                       {type,exit},
                                       {what,
                                        {{{{{badmatch,
                                             {error,
                                              {conflict,
                                               {<<"6785d52df1cd6dd18a640f46c3233394">>,
                                                8}}}},
                                            [{chronicle_local,handle_rename,1,
                                              [{file,"src/chronicle_local.erl"},
                                               {line,152}]},
                                             {chronicle_local,handle_call,3,
                                              [{file,"src/chronicle_local.erl"},
                                               {line,96}]},
                                             {gen_server2,handle_call,3,
                                              [{file,"src/gen_server2.erl"},
                                               {line,214}]},
                                             {gen_server,try_handle_call,4,
                                              [{file,"gen_server.erl"},{line,661}]},
                                             {gen_server,handle_msg,6,
                                              [{file,"gen_server.erl"},{line,690}]},
                                             {proc_lib,init_p_do_apply,3,
                                              [{file,"proc_lib.erl"},{line,249}]}]},
                                           {gen_server,call,
                                            [chronicle_local,
                                             {rename,'ns_1@cb.local'}]}},
                                          {gen_server,call,
                                           [dist_manager,
                                            {adjust_my_address,"172.28.128.183",
                                             false,#Fun<ns_cluster.7.111409773>},
                                            infinity]}},
                                         {gen_server,call,
                                          [ns_cluster,
                                           {add_node_to_group,http,
                                            "172.28.128.184",8091,
                                            {"Administrator","abc123"},
                                            undefined,
                                            [kv]},
                                           240000]}}},
                                       {trace,
                                        [{gen_server,call,3,
                                          [{file,"gen_server.erl"},{line,223}]},
                                         {ns_cluster,add_node_to_group,6,
                                          [{file,"src/ns_cluster.erl"},{line,80}]},
                                         {menelaus_web_cluster,do_handle_add_node,
                                          2,
                                          [{file,"src/menelaus_web_cluster.erl"},
                                           {line,645}]},
                                         {request_throttler,do_request,3,
                                          [{file,"src/request_throttler.erl"},
                                           {line,58}]},
                                         {menelaus_util,handle_request,2,
                                          [{file,"src/menelaus_util.erl"},
                                           {line,216}]},
                                         {mochiweb_http,headers,6,
                                          [{file,
                                            "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                           {line,150}]},
                                         {proc_lib,init_p_do_apply,3,
                                          [{file,"proc_lib.erl"},{line,249}]}]}]
      

            People

              asad.zaidi Asad Zaidi (Inactive)