Couchbase Server / MB-39165

[Collections] - Compactor daemon exits and doesn't recover while running collection tests


    Details

      Description

      Script to Repro

      ./testrunner -i /tmp/win10-bucket-ops.ini sdk_client_pool=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,replicas_for_failover=3,override_spec_params=durability;replicas,durability=MAJORITY,replicas=Bucket.ReplicaNum.TWO,bucket_spec=multi_bucket.buckets_for_rebalance_tests,data_load_stage=during
      

      Steps to Repro
      1. Create a 5-node cluster (172.23.104.186, 172.23.120.201, 172.23.121.10, 172.23.98.195, 172.23.98.196)
      2. Create 3 buckets: default, bucket1 and bucket2 (ephemeral)
      3. Create 100 collections on default, and 6 on bucket1 and bucket2

      After the collections are created successfully, I see the following exit repeatedly on .196 (172.23.98.196), and the buckets stay in warmup state forever.

      Compactor for database `default` (pid [{type,database},
      {important,true},
      {name,<<"default">>},
      {fa,
      {#Fun<compaction_daemon.4.9063353>,
      [<<"default">>,
      {config,
      {30,undefined},
      {30,undefined},
      undefined,false,false,
      {daemon_config,30,131072,20971520}},
      false,
      {[{type,bucket}]}]}}]) terminated unexpectedly: {timeout,
      {gen_server,
      call,
      [{'ns_memcached-default',
      'ns_1@172.23.98.196'},
      {raw_stats,
      <<"diskinfo">>,
      #Fun<compaction_daemon.18.9063353>,
      {<<"0">>,
      <<"0">>}},
      180000]}} 
      

      cbcollect_info attached.


          Activity

          James Harrison added a comment (edited)

          Unsurprisingly, given the condition of the node, stats.log is near useless:

          172.23.98.196

          ==============================================================================
          memcached stats all
          cbstats -a 127.0.0.1:11209 all -u @ns_server
          ==============================================================================
          Traceback (most recent call last):
            File "/opt/couchbase/lib/python/cbstats", line 962, in <module>
              main()
            File "/opt/couchbase/lib/python/cbstats", line 959, in main
              c.execute()
            File "/opt/couchbase/lib/python/clitool.py", line 71, in execute
              f[0](mc, *args[2:], **opts.__dict__)
            File "/opt/couchbase/lib/python/cbstats", line 38, in g
              f(*args, **kwargs)
            File "/opt/couchbase/lib/python/cli_auth_utils.py", line 67, in g
              mc.sasl_auth_plain(username, password)
            File "/opt/couchbase/lib/python/mc_bin_client.py", line 483, in sasl_auth_plain
              return self.sasl_auth_start('PLAIN', '\0'.join([foruser, user, password]))
            File "/opt/couchbase/lib/python/mc_bin_client.py", line 479, in sasl_auth_start
              return self._doCmd(memcacheConstants.CMD_SASL_AUTH, mech, data)
            File "/opt/couchbase/lib/python/mc_bin_client.py", line 298, in _doCmd
              return self._handleSingleResponse(opaque)
            File "/opt/couchbase/lib/python/mc_bin_client.py", line 291, in _handleSingleResponse
              cmd, opaque, cas, keylen, extralen, data = self._handleKeyedResponse(myopaque)
            File "/opt/couchbase/lib/python/mc_bin_client.py", line 276, in _handleKeyedResponse
              cmd, errcode, opaque, cas, keylen, extralen, rv = self._recvMsg()
            File "/opt/couchbase/lib/python/mc_bin_client.py", line 245, in _recvMsg
              data = self._socketRecv(MIN_RECV_PACKET - len(response))
            File "/opt/couchbase/lib/python/mc_bin_client.py", line 240, in _socketRecv
              raise TimeoutError(30)
          mc_bin_client.TimeoutError: Error: Operation timed out (30 seconds)
          Please check list of arguments (e.g., IP address, port number) passed or the connectivity to a server to be connected
          

          James Harrison added a comment

          The excessive connections and the slow operations are present from the start of the logs. Presumably, the sheer volume of HELO and Slow operation logging quickly causes the logs to wrap. If the rate of connections is due to some retry logic, I'd strongly suggest adding backoff.
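
To illustrate the backoff suggestion, here is a minimal sketch of capped exponential backoff with jitter. It is not taken from testrunner or any Couchbase SDK: `backoff_delays`, `connect_with_backoff`, the `connect` callable and all default values are hypothetical, and it assumes connection failures surface as `OSError` subclasses.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=6):
    """Yield capped exponential delays: base, base*factor, ... never above cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def connect_with_backoff(connect, attempts=6, sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds, waiting a jittered delay between tries.

    `connect` is a hypothetical zero-argument callable that opens a client
    connection and raises an OSError subclass (refused, reset, timed out)
    on failure. The last failure is re-raised once attempts are exhausted.
    """
    for attempt, delay in enumerate(backoff_delays(attempts=attempts), start=1):
        try:
            return connect()
        except OSError:
            if attempt == attempts:
                raise
            # Jitter: sleep a random fraction of the delay so that many
            # clients retrying at once do not reconnect in lockstep.
            sleep(delay * rng())
```

With the defaults, a client that keeps failing waits at most 0.5 s, 1 s, 2 s, 4 s and 8 s between its six attempts instead of reconnecting immediately, which should reduce both the connection rate and the volume of HELO/SASL_AUTH logging.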

          James Harrison added a comment (edited)

          Filtering out lines referencing:

          • HELO
          • Slow operation
          • SASL_AUTH
          • User authenticated as
          • connection reset by peer
          • The connected bucket is being deleted.. closing connection
          • User [<ud>Administrator</ud>] not found

          leaves a trivial amount of information in the logs.
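
For anyone repeating this filtering, a minimal sketch follows. The substrings are the ones listed above; the helper name `drop_noise` and the plain case-sensitive substring match are assumptions, not the exact method used here.

```python
# Substrings taken from the list above; any line containing one is dropped.
NOISE_PATTERNS = (
    "HELO",
    "Slow operation",
    "SASL_AUTH",
    "User authenticated as",
    "connection reset by peer",
    "The connected bucket is being deleted.. closing connection",
    "User [<ud>Administrator</ud>] not found",
)


def drop_noise(lines, patterns=NOISE_PATTERNS):
    """Return only the lines that contain none of the noise substrings."""
    return [line for line in lines if not any(p in line for p in patterns)]
```

It can be applied to an open log file, for example `drop_noise(open(path))`, where `path` points at whichever memcached log file is being examined.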

          I'd recommend investigating why so many client connections are being made - as it stands, the volume of connections itself has not been ruled out as a potential cause (or at least a contributor).

          Once the number of connections is more reasonable, there will likely be more information to work from. If the issue persists, it will be much easier to investigate.

          Balakumaran Gopal added a comment

          Closing this out as we are not seeing this in the latest builds.

          Dave Rigby added a comment

          Reopening to set the correct resolution - given all the reported issues seemed to be environmental and no code was changed, "Fixed" is not a suitable resolution - "Cannot Reproduce" is the correct resolution if the issue went away "by itself".


            People

            Assignee: Balakumaran Gopal
            Reporter: Balakumaran Gopal