  1. Couchbase Server
  2. MB-45281

CE rebalance failing due to indexing service being unavailable/unresponsive

Details

    • Untriaged
    • Windows 64-bit
    • 1
    • Unknown

    Description

      What's the issue?
      This may be multiple separate issues; however, I think it's sensible to start with the initial rebalance failure and investigate from there. Please feel free to separate the issues if required.

      A user on the forums has a two-node cluster which failed its initial rebalance due to the indexing service; they are now unable to interact with the cluster in a useful/expected manner. The issue is manifesting with the following symptoms:
      1) Loading a sample bucket has failed (due to a timeout waiting for the bucket to report as healthy)
      2) The user is unable to insert documents into the created bucket
      3) ns_server appears to be timing out when communicating with the indexing service
      4) Successive rebalances are now failing
      5) Indexing appears to be failing to communicate with the projector

      From the logs we see a few interesting things worth noting:

      initial rebalance failure

      2021-03-25T12:41:40.158+02:00, ns_orchestrator:0:critical:message(ns_1@192.168.0.122) - Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<22661.10863.0>,
                                     {linked_process_died,<22661.10865.0>,
                                      {no_connection,"index-service_api"}}}}.
      

      ns_server request timeout

      [ns_server:error,2021-03-25T12:43:36.765+02:00,ns_1@192.168.0.122:service_status_keeper_worker<0.430.0>:rest_utils:get_json:62]Request to (indexer) getIndexStatus failed: {error,timeout}
      

      index service failing to connect/communicate with the projector

      2021-03-25T12:41:01.702+02:00 [Error] KVSender::closeMutationStream MAINT_STREAM  Error Received Post http://192.168.0.122:9999/adminport/shutdownTopicRequest: dial tcp 192.168.0.122:9999: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. from 192.168.0.122:9999
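The `connectex` error above is a TCP connect timeout from the indexer to 192.168.0.122:9999. A quick way to confirm whether that port is reachable from the affected node is a plain TCP connect test. The following is a minimal sketch, not part of any Couchbase tooling; the function name `can_connect` and the default timeout are illustrative assumptions.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration only; it checks reachability at the
    TCP level and says nothing about whether the service behind the port is healthy.
    """
    try:
        # create_connection raises OSError (including timeouts and refusals) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, running `can_connect("192.168.0.122", 9999)` on each node would show whether the address and port from the log line accept connections. A `False` result would be consistent with a firewall rule or a process that is not listening, but it does not distinguish between the two.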
      

      memcached slow runtime during warmup

      2021-03-25T12:43:28.127751+02:00 WARNING (beer-sample) Slow runtime for 'Warmup - populate VB Map: shard 7' on thread reader_worker_0: 1079 us
      2021-03-25T12:43:39.673165+02:00 WARNING (beer-sample) Slow runtime for 'Running the ALL_DOCS api on vb:908' on thread reader_worker_2: 234 ms
      2021-03-25T12:43:39.950189+02:00 WARNING (beer-sample) Slow runtime for 'Running the ALL_DOCS api on vb:916' on thread reader_worker_2: 168 ms
      2021-03-25T12:43:40.602306+02:00 WARNING (beer-sample) Slow runtime for 'Running the ALL_DOCS api on vb:963' on thread reader_worker_0: 243 ms
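The warnings above mix units (`us` and `ms`), which makes them awkward to compare by eye: the first entry is roughly 1 ms, while the ALL_DOCS tasks run for 168 to 243 ms. The sketch below normalizes such lines to milliseconds. It assumes the line layout shown in this excerpt and only handles the two units seen here; the names `SlowTask` and `parse_slow_runtime` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Divisors to convert to milliseconds. Only the units that appear in the
# excerpt above are handled; any other unit is treated as "no match".
_DIVISOR_TO_MS = {"us": 1000.0, "ms": 1.0}

_SLOW_RE = re.compile(
    r"Slow runtime for '(?P<task>.+)' on thread (?P<thread>\S+): "
    r"(?P<value>\d+) (?P<unit>us|ms)\s*$"
)


class SlowTask(NamedTuple):
    task: str
    thread: str
    runtime_ms: float


def parse_slow_runtime(line: str) -> Optional[SlowTask]:
    """Parse one 'Slow runtime' warning line; return None if it does not match."""
    m = _SLOW_RE.search(line)
    if m is None:
        return None
    runtime_ms = int(m.group("value")) / _DIVISOR_TO_MS[m.group("unit")]
    return SlowTask(m.group("task"), m.group("thread"), runtime_ms)
```

Sorting the parsed entries by `runtime_ms` makes it easier to see which tasks dominate when the full memcached log is much longer than this excerpt.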
      

      projector errors

      2021-03-25T12:40:40.699+02:00 [Info] pram[:9999] Request "/adminport/shutdownTopicRequest"
      2021-03-25T12:40:40.699+02:00 [Info] PROJ[:9999] ##1 doShutdownTopic() "MAINT_STREAM_TOPIC_0aff61dffe090601aa26977b9c56c153"
      2021-03-25T12:40:40.699+02:00 [Error] PROJ[:9999] ##1 acquireFeed(): projector.topicMissing
      2021-03-25T12:40:40.699+02:00 [Info] PROJ[:9999] ##1 doShutdownTopic() returns ...
      

            People

              james.lee James Lee