Couchbase Server: MB-30288

[Backport MB-30162] - rebalance in for index node stuck for 2 days - centos longevity

Details

    • Untriaged
    • No

    Description

      CentOS longevity run, build 5.5.0-2907: the rebalance appears to have been stuck for more than 2 days. The following trace is from diag.log on node 172.23.108.103:

      2018-06-16T22:29:14.189-07:00, ns_orchestrator:4:info:message(ns_1@172.23.108.103) - Starting rebalance, KeepNodes = ['ns_1@172.23.104.164','ns_1@172.23.104.61',
                                       'ns_1@172.23.106.188','ns_1@172.23.108.103',
                                       'ns_1@172.23.108.104','ns_1@172.23.96.145',
                                       'ns_1@172.23.96.148','ns_1@172.23.96.168',
                                       'ns_1@172.23.96.56','ns_1@172.23.97.238',
                                       'ns_1@172.23.97.239','ns_1@172.23.97.242',
                                       'ns_1@172.23.98.135','ns_1@172.23.99.11',
                                       'ns_1@172.23.99.20','ns_1@172.23.99.21',
                                       'ns_1@172.23.99.25'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes
       
      2018-06-16T22:29:16.227-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket WAREHOUSE
      2018-06-16T22:29:16.706-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "WAREHOUSE" rebalance appears to be swap rebalance
      2018-06-16T22:29:17.345-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket STOCK
      2018-06-16T22:29:18.357-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "STOCK" rebalance appears to be swap rebalance
      2018-06-16T22:29:18.827-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket ORDER_LINE
      2018-06-16T22:29:19.803-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "ORDER_LINE" rebalance appears to be swap rebalance
      2018-06-16T22:29:20.208-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket ORDERS
      2018-06-16T22:29:20.881-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "ORDERS" rebalance appears to be swap rebalance
      2018-06-16T22:29:21.510-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket NEW_ORDER
      2018-06-16T22:29:22.426-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "NEW_ORDER" rebalance appears to be swap rebalance
      2018-06-16T22:29:22.598-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket ITEM
      2018-06-16T22:29:23.222-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "ITEM" rebalance appears to be swap rebalance
      2018-06-16T22:29:23.698-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket HISTORY
      2018-06-16T22:29:24.618-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "HISTORY" rebalance appears to be swap rebalance
      2018-06-16T22:29:24.952-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket DISTRICT
      2018-06-16T22:29:25.688-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "DISTRICT" rebalance appears to be swap rebalance
      2018-06-16T22:29:25.951-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket CUSTOMER
      2018-06-16T22:29:26.506-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "CUSTOMER" rebalance appears to be swap rebalance
      2018-06-16T22:29:26.726-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket default
      2018-06-16T22:29:27.364-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.108.103) - Bucket "default" rebalance appears to be swap rebalance
      2018-06-16T22:38:28.407-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.104.61'). There was at least another node down.
      2018-06-16T22:38:28.411-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.108.103'). There was at least another node down.
      2018-06-16T22:38:28.465-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.108.104'). There was at least another node down.
      2018-06-16T22:38:28.466-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.96.145'). There was at least another node down.
      2018-06-16T22:38:28.467-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.96.168'). There was at least another node down.
      2018-06-16T22:38:28.468-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.97.238'). There was at least another node down.
      2018-06-16T22:38:28.525-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.97.239'). There was at least another node down.
      2018-06-16T22:38:28.564-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.99.20'). There was at least another node down.
      2018-06-16T22:38:28.616-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.99.21'). There was at least another node down.
      2018-06-16T22:38:28.620-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.99.25'). There was at least another node down.
      2018-06-17T23:12:27.315-07:00, menelaus_web:102:warning:client-side error report(ns_1@172.23.108.103) - Client-side error-report for user "Administrator" on node 'ns_1@172.23.108.103':
      User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15
      Got unhandled javascript error:
      message: The transition errored;
       
       
      2018-06-17T23:13:02.974-07:00, menelaus_web:102:warning:client-side error report(ns_1@172.23.108.103) - Client-side error-report for user "Administrator" on node 'ns_1@172.23.108.103':
      User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15
      Got unhandled javascript error:
      message: The transition errored;
       
       (repeated 1 times)
      2018-06-18T01:39:40.575-07:00, menelaus_web:102:warning:client-side error report(ns_1@172.23.108.103) - Client-side error-report for user "Administrator" on node 'ns_1@172.23.108.103':
      User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15
      Got unhandled javascript error:
      message: The transition errored;
       
       
      -------------------------------
       
       
      per_node_processes('ns_1@172.23.108.103') =
           {<0.32163.965>,
            [{registered_name,[]},
             {status,waiting},
             {initial_call,{proc_lib,init_p,5}},
             {backtrace,[<<"Program counter: 0x00007fa09d8c9920 (gen_server:loop/6 + 264)">>,
                         <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                         <<>>,
                         <<"0x00007fa098141618 Return addr 0x00007fa09d7f0210 (proc_lib:init_p_do_apply/3 + 56)">>,
                         <<"y(0)     []">>,<<"y(1)     10000">>,
                         <<"y(2)     dcp_proxy">>,
                         <<"y(3)     {state,#Port<0.736989>,{consumer,\"replication:ns_1@172.23.108.104->ns_1@172.23.108.103:CUSTOMER\",'ns_1@172.23.10">>,
                         <<"y(4)     <0.32163.965>">>,<<"y(5)     <0.5515.966>">>,
                         <<>>,
                         <<"0x00007fa098141650 Return addr 0x0000000000892548 (<terminate process normally>)">>,
                         <<"y(0)     Catch 0x00007fa09d7f0230 (proc_lib:init_p_do_apply/3 + 88)">>,
                         <<>>]},
             {error_handler,error_handler},
             {garbage_collection,[{min_bin_vheap_size,46422},
                                  {min_heap_size,233},
                                  {fullsweep_after,512},
                                  {minor_gcs,21}]},
             {heap_size,987},
             {total_heap_size,1974},
             {links,[<0.5515.966>,#Port<0.736989>]},
             {monitors,[]},
             {monitored_by,[<0.8120.0>]},
             {memory,16744},
             {messages,[]},
             {message_queue_len,0},
             {reductions,14978131},
             {trap_exit,false},
             {current_location,{gen_server,loop,6,
                                           [{file,"gen_server.erl"},{line,358}]}},
             {dictionary,[{'$ancestors',['dcp_replicator-CUSTOMER-ns_1@172.23.108.104',
                                         'dcp_sup-CUSTOMER',
                                         'single_bucket_kv_sup-CUSTOMER',
                                         ns_bucket_sup,ns_bucket_worker_sup,
                                         ns_server_sup,ns_server_nodes_sup,
                                         <0.167.0>,ns_server_cluster_sup,<0.89.0>]},
                          {'$initial_call',{dcp_proxy,init,1}}]}]}
           {<0.31048.483>,
            [{registered_name,[]},
             {status,waiting},
             {initial_call,{erlang,apply,2}},
             {backtrace,
                 [<<"Program counter: 0x00007fa047b8b810 (leader_lease_acquire_worker:loop/1 + 40)">>,
                  <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
                  <<"0x00007fa062af19b0 Return addr 0x00007fa063eec930 (async:'-async_init/4-fun-2-'/3 + 272)">>,
                  <<"y(0)     {state,<0.9200.0>,'ns_1@172.23.96.145',<<32 bytes>>,true,1529369273548,1529369283548,{backoff,500,15000,2,500},{">>,
                  <<>>,
                  <<"0x00007fa062af19c0 Return addr 0x0000000000892548 (<terminate process normally>)">>,
                  <<"y(0)     []">>,<<"y(1)     []">>,
                  <<"y(2)     Catch 0x00007fa063eec988 (async:'-async_init/4-fun-2-'/3 + 360)">>,
                  <<"y(3)     {<0.28987.483>,#Ref<0.0.316.55163>}">>,<<>>]},
             {error_handler,error_handler},
             {garbage_collection,
                 [{min_bin_vheap_size,46422},
                  {min_heap_size,233},
                  {fullsweep_after,512},
                  {minor_gcs,38}]},
             {heap_size,1598},
             {total_heap_size,1974},
             {links,[<0.28987.483>]},
             {monitors,[]},
             {monitored_by,[]},
             {memory,16632},
             {messages,[]},
             {message_queue_len,0},
             {reductions,22489112},
             {trap_exit,false},
             {current_location,
                 {leader_lease_acquire_worker,loop,1,
                     [{file,"src/leader_lease_acquire_worker.erl"},{line,61}]}},
             {dictionary,
                 [{'$async_role',executor},{'$async_controller',<0.28987.483>}]}]}
           {<0.30924.483>,
            [{registered_name,[]},
             {status,waiting},
             {initial_call,{inet_tcp_dist,do_accept,6}},
             {backtrace,[<<"Program counter: 0x00007fa063d6b6c0 (dist_util:con_loop/9 + 112)">>,
                         <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,
                         <<>>,
                         <<"0x00007fa061807ce0 Return addr 0x0000000000892548 (<terminate process normally>)">>,
                         <<"y(0)     []">>,
                         <<"y(1)     #Fun<inet_tcp_dist.getstat.1>">>,
                         <<"y(2)     #Fun<inet_tcp_dist.tick.1>">>,
                         <<"y(3)     {tick,1268894,2159172,2,2}">>,
                         <<"y(4)     normal">>,<<"y(5)     'ns_1@172.23.108.103'">>,
                         <<"y(6)     {net_address,{{172,23,96,145},59255},\"172.23.96.145\",tcp,inet}">>,
                         <<"y(7)     #Port<0.359168>">>,
                         <<"y(8)     'ns_1@172.23.96.145'">>,
                         <<"y(9)     <0.9129.0>">>,<<>>]},
             {error_handler,error_handler},
             {garbage_collection,[{min_bin_vheap_size,46422},
                                  {min_heap_size,233},
                                  {fullsweep_after,512},
                                  {minor_gcs,188}]},
             {heap_size,987},
             {total_heap_size,1363},
             {links,[<0.9129.0>,#Port<0.359168>]},
             {monitors,[]},
             {monitored_by,[]},
             {memory,11680},
             {messages,[]},
             {message_queue_len,0},
             {reductions,605395},
             {trap_exit,false},
             {current_location,{dist_util,con_loop,9,
                                          [{file,"dist_util.erl"},{line,454}]}},
             {dictionary,[]}]}
      

      Supportal: https://supportal.couchbase.com/snapshot/d728fa2425d291c708831355596cf470::0

      Cluster is live at 172.23.108.103:8091 for debugging.
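
For anyone triaging from the diag.log excerpt above, here is a minimal Python sketch that pulls out the rebalance start time, the buckets that began rebalancing, the nodes named in the "Could not auto-failover" messages, and the time between the rebalance start and the last parsed entry. The line shape is inferred only from the excerpt in this ticket, not from any ns_server specification, and the names `LINE_RE`, `parse_line`, `summarize` and `SAMPLE` are hypothetical helpers introduced for illustration.

```python
import re
from datetime import datetime

# Assumed line shape, inferred from the excerpt in this ticket only:
#   <ISO timestamp>, <module>:<code>:<level>:<type>(<node>) - <message>
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}), "
    r"(?P<module>\w+):(?P<code>\d+):(?P<level>\w+):(?P<type>[^()]+)"
    r"\((?P<node>[^)]+)\) - (?P<msg>.*)$"
)
FAILOVER_RE = re.compile(r"Could not auto-failover node \('([^']+)'\)")


def parse_line(line):
    """Return a dict for a log header line, or None for continuation lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["ts"] = datetime.fromisoformat(entry["ts"])
    return entry


def summarize(lines):
    """Summarize rebalance-related entries found in the given lines."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    start = next(
        (e["ts"] for e in entries if e["msg"].startswith("Starting rebalance")),
        None,
    )
    buckets = [
        e["msg"].rsplit(" ", 1)[-1]
        for e in entries
        if e["msg"].startswith("Started rebalancing bucket")
    ]
    blocked = [
        m.group(1)
        for e in entries
        for m in [FAILOVER_RE.search(e["msg"])]
        if m is not None
    ]
    last = entries[-1]["ts"] if entries else None
    elapsed = last - start if start is not None and last is not None else None
    return {
        "start": start,
        "buckets": buckets,
        "auto_failover_blocked": blocked,
        "elapsed_to_last_entry": elapsed,
    }


# A few lines copied from the excerpt in this ticket.
SAMPLE = [
    "2018-06-16T22:29:14.189-07:00, ns_orchestrator:4:info:message(ns_1@172.23.108.103) - Starting rebalance, KeepNodes = ['ns_1@172.23.104.164','ns_1@172.23.104.61',",
    "                                 'ns_1@172.23.106.188','ns_1@172.23.108.103',",
    "2018-06-16T22:29:16.227-07:00, ns_rebalancer:0:info:message(ns_1@172.23.108.103) - Started rebalancing bucket WAREHOUSE",
    "2018-06-16T22:38:28.407-07:00, auto_failover:3:info:message(ns_1@172.23.108.103) - Could not auto-failover node ('ns_1@172.23.104.61'). There was at least another node down.",
    "2018-06-18T01:39:40.575-07:00, menelaus_web:102:warning:client-side error report(ns_1@172.23.108.103) - Client-side error-report for user \"Administrator\" on node 'ns_1@172.23.108.103':",
]

if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```

To run it against a full log, pass the lines of the file instead of `SAMPLE`, for example `summarize(open("diag.log", encoding="utf-8", errors="replace"))`. Continuation lines, such as the wrapped KeepNodes list, do not match the pattern and are skipped.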

            People

              deepkaran.salooja Deepkaran Salooja