  1. Couchbase Server
  2. MB-45281

CE rebalance failing due to indexing service being unavailable/unresponsive

Details

    • Untriaged
    • Windows 64-bit
    • 1
    • Unknown

    Description

      What's the issue?
      This may be multiple separate issues; however, I think it's sensible to start with the initial rebalance failure and investigate from there. Please feel free to separate the issues if required.

      A user on the forums has a two-node cluster which failed its initial rebalance due to the indexing service; they're now unable to interact with the cluster in a useful/expected manner. The issue is manifesting with the following symptoms:
      1) Loading a sample bucket has failed (due to a timeout waiting for the bucket to report as healthy)
      2) The user is unable to insert documents into the created bucket
      3) ns_server appears to be timing out when communicating with the indexing service
      4) Successive rebalances are now failing
      5) Indexing appears to be failing to communicate with the projector

      From the logs we see a few interesting things worth noting:

      initial rebalance failure

      2021-03-25T12:41:40.158+02:00, ns_orchestrator:0:critical:message(ns_1@192.168.0.122) - Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<22661.10863.0>,
                                     {linked_process_died,<22661.10865.0>,
                                      {no_connection,"index-service_api"}}}}.
      

      ns_server request timeout

      [ns_server:error,2021-03-25T12:43:36.765+02:00,ns_1@192.168.0.122:service_status_keeper_worker<0.430.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus failed: {error,timeout}
      

      index service failing to connect/communicate with the projector

      2021-03-25T12:41:01.702+02:00 [Error] KVSender::closeMutationStream MAINT_STREAM  Error Received Post http://192.168.0.122:9999/adminport/shutdownTopicRequest: dial tcp 192.168.0.122:9999: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. from 192.168.0.122:9999
      

      memcached slow runtime during warmup

      2021-03-25T12:43:28.127751+02:00 WARNING (beer-sample) Slow runtime for 'Warmup - populate VB Map: shard 7' on thread reader_worker_0: 1079 us
      2021-03-25T12:43:39.673165+02:00 WARNING (beer-sample) Slow runtime for 'Running the ALL_DOCS api on vb:908' on thread reader_worker_2: 234 ms
      2021-03-25T12:43:39.950189+02:00 WARNING (beer-sample) Slow runtime for 'Running the ALL_DOCS api on vb:916' on thread reader_worker_2: 168 ms
      2021-03-25T12:43:40.602306+02:00 WARNING (beer-sample) Slow runtime for 'Running the ALL_DOCS api on vb:963' on thread reader_worker_0: 243 ms
      

      projector errors

      2021-03-25T12:40:40.699+02:00 [Info] pram[:9999] Request "/adminport/shutdownTopicRequest"
      2021-03-25T12:40:40.699+02:00 [Info] PROJ[:9999] ##1 doShutdownTopic() "MAINT_STREAM_TOPIC_0aff61dffe090601aa26977b9c56c153"
      2021-03-25T12:40:40.699+02:00 [Error] PROJ[:9999] ##1 acquireFeed(): projector.topicMissing
      2021-03-25T12:40:40.699+02:00 [Info] PROJ[:9999] ##1 doShutdownTopic() returns ...
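The excerpts above come from different log files, so their relative order is easy to lose. The following is a minimal sketch, assuming the relevant lines have been copied by hand into a list, that sorts them by their embedded timestamp. The `ENTRIES` list, the source labels, and the `timeline` helper are illustrative names only, not part of any Couchbase tooling, and the log lines are shortened copies of the ones quoted above.

```python
import re
from datetime import datetime

# ISO-8601 timestamp with optional 3- or 6-digit fraction and a UTC offset,
# matching the formats seen in the excerpts above.
TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{6}|\.\d{3})?[+-]\d{2}:\d{2}")

# Shortened copies of the quoted log lines; the labels are hypothetical.
ENTRIES = [
    ("ns_server (rebalance)",
     "2021-03-25T12:41:40.158+02:00, ns_orchestrator:0:critical:message(ns_1@192.168.0.122) - Rebalance exited with reason"),
    ("ns_server (getIndexStatus)",
     "[ns_server:error,2021-03-25T12:43:36.765+02:00,ns_1@192.168.0.122:service_status_keeper_worker<0.430.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus failed: {error,timeout}"),
    ("indexer",
     "2021-03-25T12:41:01.702+02:00 [Error] KVSender::closeMutationStream MAINT_STREAM  Error Received Post http://192.168.0.122:9999/adminport/shutdownTopicRequest"),
    ("projector",
     "2021-03-25T12:40:40.699+02:00 [Error] PROJ[:9999] ##1 acquireFeed(): projector.topicMissing"),
]


def timeline(entries):
    """Return (timestamp, source, line) tuples sorted by the first timestamp in each line.

    Lines without a recognisable timestamp are skipped.
    """
    events = []
    for source, line in entries:
        match = TS.search(line)
        if match is None:
            continue
        events.append((datetime.fromisoformat(match.group(0)), source, line))
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    for when, source, line in timeline(ENTRIES):
        print(f"{when.isoformat()}  {source:<27} {line[:80]}")
```

Going by the timestamps alone, the projector `topicMissing` error comes first, followed by the indexer's failed POST to port 9999, the rebalance failure, and finally the ns_server `getIndexStatus` timeout. Ordering does not by itself establish cause.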
      

          Activity

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment - edited

            ns_server.indexer.log on node 111 indicates it is trying and failing to contact nodeUUID 14f1c3ec9d692532da75848c53947345, but this is the local node:

            2021-03-25T12:37:16.423+02:00 [Info] Indexer started with command line: [d:/Couchbase/bin/indexer.exe -adminPort=9100 -scanPort=9101 -httpPort=9102 -streamInitPort=9103 -streamCatchupPort=9104 -streamMaintPort=9105 -vbuckets=1024 -cluster=127.0.0.1:8091 -storageDir=d:/Couchbase/var/lib/couchbase/data/@2i -diagDir=d:/Couchbase/var/lib/couchbase/crash -nodeUUID=14f1c3ec9d692532da75848c53947345 -ipv6=false -isEnterprise=false]
            ...
            2021-03-25T12:37:17.259+02:00 [Info] ClustMgr:handleSetLocalValue Key IndexerId Value 14f1c3ec9d692532da75848c53947345
            ...
            2021-03-25T12:37:19.421+02:00 [Error] DDLServiceMgr: notifyNewTopologyChange(): Failed to initialize metadata provider.  Error=DDLServiceMgr: Failed to initialize metadata provider.  Unknown host=map[14f1c3ec9d692532da75848c53947345:true].

            James Lee Even though this is the local host, some accesses to it can be done via REST. Possibly a firewall is blocking this loopback call.

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment -

            James Lee GSI team discussed this in scrum yesterday and recommend the user check that all of the Couchbase communication ports are open and not being blocked by a firewall or antimalware tool. The ports are documented here:

            https://docs.couchbase.com/server/current/install/install-ports.html
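The port check recommended above could be approximated with a small script run on the affected node. This is only a sketch under stated assumptions: the port list is limited to the ports that appear in the indexer command line and projector log lines quoted in this issue (the linked documentation lists the full set), the two host addresses are the ones seen in the logs, and `can_connect` and `check_ports` are hypothetical helper names. A successful TCP connect shows only that something is listening and reachable. It does not prove the service behind the port is healthy.

```python
import socket

# Ports taken from the log excerpts quoted in this issue, not the full documented list.
PORTS = {
    8091: "cluster manager (-cluster=127.0.0.1:8091)",
    9100: "indexer -adminPort",
    9101: "indexer -scanPort",
    9102: "indexer -httpPort",
    9103: "indexer -streamInitPort",
    9104: "indexer -streamCatchupPort",
    9105: "indexer -streamMaintPort",
    9999: "projector adminport",
}


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(hosts, ports=PORTS, timeout=2.0):
    """Try every port on every host and return {(host, port): reachable}."""
    results = {}
    for host in hosts:
        for port in sorted(ports):
            results[(host, port)] = can_connect(host, port, timeout)
    return results


if __name__ == "__main__":
    # Check both the loopback address and the node's LAN address from the logs.
    for (host, port), ok in check_ports(["127.0.0.1", "192.168.0.122"]).items():
        status = "reachable" if ok else "NOT reachable"
        print(f"{host}:{port:<5} {status:<14} {PORTS[port]}")
```

If a port is reachable via `127.0.0.1` but not via the node's own LAN address (or the reverse), that would be consistent with the firewall or antimalware hypothesis discussed above.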
            jeelan.poola Jeelan Poola added a comment -

            James Lee Any update on this one?

            james.lee James Lee added a comment -

            Jeelan Poola we've just had a four-day weekend in the UK, so the forum issue hasn't been updated. I'll drop an updated comment to the user today. Thank you for having a look at the logs.


            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment -

            2021-04-06 on the forum:

            jamesl33 Couchbase
            3d

            Hi @Rus_Adrian,

            This issue has now been looked at by the indexing team, please see MB-45281 for the full details.

            To summarize, we see that node 111 is trying and failing to contact the node with UUID 14f1c3ec9d692532da75848c53947345, however, this is the local node. So even though this is the local node, some accesses via the REST API appear to be failing. The likely cause for this will be due to a firewall rule blocking the loopback call.

            Please could you ensure all the required ports (documented here) are available and try again.

            Thanks in advance,
            James

            Rus_Adrian
            3d

            I will inform the network administrator about the firewall issue and see if this resolves it.
            Thank you very much @jamesl33!

            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Closing all non-fixed issues. Pls reopen if necessary

            People

              james.lee James Lee
              james.lee James Lee
              Votes: 0
              Watchers: 7
