  1. Couchbase Server
  2. MB-46228

FTS is failing to shut down "stats" agents on nodes other than where an "index deletion" is received


Details

    • Untriaged
    • 1
    • No
    • KV-Engine CC Final Sprint

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,get-cbcollect-info=True,quota_percent=99,crash_warning=True,create_metakv_entries=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_rebalance_out,nodes_init=5,services_init=kv-fts-kv-kv-kv,nodes_failover=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=before,scrape_interval=5,rebalance_moves_per_node=32,quota_percent=80,skip_validations=True,GROUP=failover_with_collection_crud'
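The `-t` argument in the command above packs the test path and its parameters into one comma-separated string. As a minimal sketch of how to read it (the helper `parse_test_spec` is hypothetical and is not testrunner's own parser; `SPEC` is an abbreviated copy of the parameters above), the following splits the spec and shows that `services_init` lists one service per node:

```python
def parse_test_spec(spec):
    """Split a '-t' style spec into (test_path, params). Hypothetical helper."""
    test_path, *pairs = spec.split(",")
    params = {}
    for pair in pairs:
        # Each remaining element has the form key=value.
        key, _, value = pair.partition("=")
        params[key] = value
    return test_path, params


# Abbreviated copy of the -t argument from the repro command.
SPEC = (
    "bucket_collections.collections_rebalance.CollectionsRebalance."
    "test_data_load_collections_with_hard_failover_rebalance_out,"
    "nodes_init=5,services_init=kv-fts-kv-kv-kv,nodes_failover=2,"
    "scrape_interval=5,rebalance_moves_per_node=32,quota_percent=80"
)

test_path, params = parse_test_spec(SPEC)
# services_init is dash-separated, one entry per node.
services = params["services_init"].split("-")

print(test_path.rsplit(".", 1)[-1])
print(int(params["nodes_init"]), services)
```

Read this way, the spec asks for five nodes (four `kv`, one `fts`) and a failover of two of them.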
      

      Steps to Repro
      1) Create a 5-node cluster
      2021-05-10 20:21:19,243 | test | INFO | pool-5-thread-7 | [table_view:display:72] Rebalance Overview
      ----------------------------------------------------------------------------
      Nodes           Services  Version                CPU            Status
      ----------------------------------------------------------------------------
      172.23.98.196   kv        7.0.0-5133-enterprise  3.35251438579  Cluster node
      172.23.98.195   ['fts']                                         <--- IN ---
      172.23.121.10   ['kv']                                          <--- IN ---
      172.23.104.186  ['kv']                                          <--- IN ---
      172.23.120.206  ['kv']                                          <--- IN ---
      ----------------------------------------------------------------------------
      2) Create buckets/scopes/collections/data
      2021-05-10 20:25:08,065 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      ------------------------------------------------------------------------------------------
      Bucket   Type       Replicas  Durability  TTL  Items   RAM Quota   RAM Used   Disk Used
      ------------------------------------------------------------------------------------------
      bucket1  couchbase  3         none        0    3000    838860800   205419072  300403800
      bucket2  ephemeral  3         none        0    3000    838860800   316814512  136
      default  couchbase  3         none        0    500000  8388608000  755490896  590212284
      ------------------------------------------------------------------------------------------
      3) Set the following settings

      2021-05-10 20:25:16,298 | test  | INFO    | MainThread | [collections_rebalance:setUp:58] Changing scrape interval to 5
      2021-05-10 20:25:18,355 | test  | INFO    | MainThread | [cluster_ready_functions:set_rebalance_moves_per_nodes:129] Changed Rebalance settings: {u'rebalanceMovesPerNode': 32}
      

      4) Create metakv entries by creating and dropping 200 fts indexes

      2021-05-10 20:25:18,355 | test  | INFO    | MainThread | [collections_rebalance:setUp:78] Creating metakv entries start
      2021-05-10 20:27:37,470 | test  | INFO    | MainThread | [collections_rebalance:setUp:80] Creating metakv entries end
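
To put a number on how long the metakv entry creation took, the two timestamps above can be subtracted. A minimal sketch, assuming the test log's timestamp layout is `YYYY-MM-DD HH:MM:SS,mmm` as it appears in these lines (the helper name `elapsed_seconds` is illustrative):

```python
from datetime import datetime

# Timestamp layout as it appears in the test log lines (assumed from the excerpt).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two test-log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# "Creating metakv entries start" and "... end" from the log above.
metakv_seconds = elapsed_seconds(
    "2021-05-10 20:25:18,355",
    "2021-05-10 20:27:37,470",
)
print(f"{metakv_seconds:.3f}")  # prints 139.115
```

That is roughly 2 minutes 19 seconds for creating and dropping the FTS indexes.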
      

      5) Start CRUD on collections

      2021-05-10 20:27:37,474 | test  | INFO    | MainThread | [bucket_ready_functions:perform_tasks_from_spec:4651] Performing scope/collection specific operations
      2021-05-10 20:27:44,384 | test  | INFO    | MainThread | [bucket_ready_functions:perform_tasks_from_spec:4741] Done Performing scope/collection specific operations
      

      6) Start a hard failover of the selected nodes; the failover request fails as shown below.

      2021-05-10 20:27:44,589 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:388] Starting rebalance operation of type : hard_failover_rebalance_out
      2021-05-10 20:27:44,591 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:632] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.206 port:8091 ssh_username:root]
      2021-05-10 20:27:54,937 | test  | ERROR   | pool-5-thread-9 | [rest_client:_http_request:748] POST http://172.23.98.196:8091/controller/failOver body: otpNode=ns_1%40172.23.104.186&allowUnsafe=false headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==\n', 'Content-Type': 'application/x-www-form-urlencoded'} error: 500 reason: unknown ["Unexpected server error, request logged."] auth: Administrator:password
      2021-05-10 20:27:54,940 | test  | ERROR   | pool-5-thread-9 | [rest_client:fail_over:1276] ns_1@172.23.104.186 - Failover error: ["Unexpected server error, request logged."]
      ERROR
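
The failing request can be rebuilt from the logged URL and body. The sketch below only constructs the request and does not send it; authentication is deliberately left out, and the helper `build_failover_request` is a hypothetical name rather than part of testrunner's `rest_client`:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_failover_request(host, otp_node, allow_unsafe=False, port=8091):
    """Build (but do not send) a hard-failover POST like the one in the log."""
    body = urlencode({
        "otpNode": otp_node,
        "allowUnsafe": str(allow_unsafe).lower(),
    })
    return Request(
        f"http://{host}:{port}/controller/failOver",
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_failover_request("172.23.98.196", "ns_1@172.23.104.186")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

The encoded body matches the one logged by the test (`otpNode=ns_1%40172.23.104.186&allowUnsafe=false`), which may help when replaying the call by hand against a test cluster.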
      

      debug.log at the time of failure

      [ns_server:error,2021-05-10T20:27:54.907-07:00,ns_1@172.23.98.196:<0.25585.1>:menelaus_util:reply_server_error:206]Server error during processing: ["web request failed",
                                       {path,"/controller/failOver"},
                                       {method,'POST'},
                                       {type,exit},
                                       {what,
                                        {{function_clause,
                                          [{ns_orchestrator,rebalancing,
                                            [{request_janitor_run,
                                              {bucket,"bucket1"}},
                                             {rebalancing_state,<0.25739.1>,
                                              <0.25737.1>,[],[],[],[],undefined,
                                              ['ns_1@172.23.104.186'],
                                              undefined,failover,
                                              <<"5173d70bbd2afe55f32eb5e976d59df5">>,
                                              undefined,
                                              {<0.25585.1>,
                                               #Ref<0.2378229137.562823169.70811>}}],
                                            [{file,"src/ns_orchestrator.erl"},
                                             {line,887}]},
                                           {gen_statem,loop_state_callback,11,
                                            [{file,"gen_statem.erl"},{line,1120}]},
                                           {proc_lib,init_p_do_apply,3,
                                            [{file,"proc_lib.erl"},{line,249}]}]},
                                         {gen_statem,call,
                                          [{via,leader_registry,ns_orchestrator},
                                           {failover,['ns_1@172.23.104.186'],false},
                                           infinity]}}},
                                       {trace,
                                        [{gen,do_call,4,
                                          [{file,"gen.erl"},{line,177}]},
                                         {gen,do_for_proc,2,
                                          [{file,"gen.erl"},{line,238}]},
                                         {gen_statem,call_dirty,4,
                                          [{file,"gen_statem.erl"},{line,623}]},
                                         {menelaus_web_cluster,handle_failover,1,
                                          [{file,"src/menelaus_web_cluster.erl"},
                                           {line,782}]},
                                         {request_throttler,do_request,3,
                                          [{file,"src/request_throttler.erl"},
                                           {line,58}]},
                                         {menelaus_util,handle_request,2,
                                          [{file,"src/menelaus_util.erl"},
                                           {line,217}]},
                                         {mochiweb_http,headers,6,
                                          [{file,
                                            "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                           {line,150}]},
                                         {proc_lib,init_p_do_apply,3,
                                          [{file,"proc_lib.erl"},{line,249}]}]}]
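
In Erlang, a `function_clause` error means that no clause of the called function matched its arguments. In the trace above the function is `ns_orchestrator:rebalancing` and the first argument is a `request_janitor_run` event for `bucket1`, received while the state record shows a `failover` in progress. The Python sketch below is only an analogy for that failure mode, not a model of how `ns_orchestrator` is written; the handler table, the `stop_rebalance` event and the `NoMatchingClause` exception are all hypothetical:

```python
class NoMatchingClause(Exception):
    """Stand-in for Erlang's function_clause error (hypothetical name)."""


# (state, event) pairs this toy state machine knows how to handle.
# The entries are illustrative, not taken from ns_orchestrator.
HANDLERS = {
    ("idle", "request_janitor_run"): lambda: "janitor scheduled",
    ("idle", "failover"): lambda: "failover started",
    ("rebalancing", "stop_rebalance"): lambda: "rebalance stopped",
}


def handle(state, event):
    """Dispatch an event; raise if the current state has no matching clause."""
    try:
        handler = HANDLERS[(state, event)]
    except KeyError:
        raise NoMatchingClause(
            f"no clause for {event!r} in state {state!r}"
        ) from None
    return handler()


print(handle("idle", "request_janitor_run"))
try:
    handle("rebalancing", "request_janitor_run")
except NoMatchingClause as exc:
    print("error:", exc)
```

The second call fails because the event arrives in a state that has no handler for it, which is the shape of the error the trace reports.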
      

      cbcollect_info attached.

      Attachments

        1. kv_dcp_connection_count-stats.png
          57 kB
          Daniel Owen
        2. kv-collections.png
          118 kB
          Daniel Owen
        3. kv-cpu-utilization.png
          41 kB
          Daniel Owen
        4. kv-current-connections.png
          26 kB
          Daniel Owen
        5. kv-operations.png
          51 kB
          Daniel Owen
        6. MB-46228_test.log
          94 kB
          Balakumaran Gopal
        7. nutshell.txt
          41 kB
          Daniel Owen
        8. screenshot-1.png
          39 kB
          Dave Finlay
        9. screenshot-2.png
          47 kB
          Dave Finlay
        10. test.log
          36 kB
          Balakumaran Gopal


            People

              Balakumaran.Gopal Balakumaran Gopal