  1. Couchbase Server
  2. MB-13963

net_tick_timeout on clean cluster - no buckets created etc. (on cluster of 130 nodes)

Details

    • Untriaged
    • Unknown

    Description

      One node (52-17-12-151) repeatedly suffers net_tick_timeout with multiple different nodes.

      However, after concluding that a node has gone down (due to net_tick_timeout), it sees that node again almost immediately and reports that it "came up". For example:

      [user:warn,2015-03-16T16:34:34.880,ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com:ns_node_disco<0.4999.0>:ns_node_disco:handle_info:175]Node 'ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com' saw that node 'ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com' went down. Details: [{nodedown_reason, net_tick_timeout}]

      [user:info,2015-03-16T16:34:34.887,ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com:ns_node_disco<0.4999.0>:ns_node_disco:handle_info:169]Node 'ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com' saw that node 'ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com' came up. Tags: []
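
The two entries above are only 7 ms apart (16:34:34.880 and 16:34:34.887). To measure how quickly each "went down" is followed by a "came up" across a whole log, something like the following could be used. This is a minimal sketch, not part of ns_server: it assumes the ns_node_disco line format quoted in this report, and the names EVENT_RE, parse_events and flap_durations are hypothetical helpers introduced for illustration.

```python
import re
from datetime import datetime

# Assumed format: the ns_node_disco user-log lines quoted in this report.
EVENT_RE = re.compile(
    r"\[user:\w+,(?P<ts>[^,]+),(?P<observer>[^:]+):ns_node_disco"
    r".*?saw that node '(?P<peer>[^']+)' (?P<event>went down|came up)"
)

# The two log entries quoted above, one per line.
SAMPLE = (
    "[user:warn,2015-03-16T16:34:34.880,"
    "ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com:"
    "ns_node_disco<0.4999.0>:ns_node_disco:handle_info:175]"
    "Node 'ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com' saw that node "
    "'ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com' went down. "
    "Details: [{nodedown_reason, net_tick_timeout}]\n"
    "[user:info,2015-03-16T16:34:34.887,"
    "ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com:"
    "ns_node_disco<0.4999.0>:ns_node_disco:handle_info:169]"
    "Node 'ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com' saw that node "
    "'ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com' came up. Tags: []\n"
)


def parse_events(text):
    """Return (timestamp, observer, peer, event) tuples in log order."""
    events = []
    for m in EVENT_RE.finditer(text):
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        events.append((ts, m.group("observer"), m.group("peer"), m.group("event")))
    return events


def flap_durations(events):
    """Pair each 'went down' with the next 'came up' for the same observer and peer.

    Returns (observer, peer, seconds_down) tuples.
    """
    pending = {}
    flaps = []
    for ts, observer, peer, event in events:
        key = (observer, peer)
        if event == "went down":
            pending[key] = ts
        elif key in pending:
            down_ts = pending.pop(key)
            flaps.append((observer, peer, (ts - down_ts).total_seconds()))
    return flaps
```

Calling flap_durations(parse_events(SAMPLE)) pairs the two quoted entries into a single flap for the 52-17-12-151 / 52-17-15-202 pair. Running the same functions over a full log would show whether every disconnect on this node is followed by such a near-immediate reconnect.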

      This occurs repeatedly, and only on this node. The other nodes (e.g. 52-17-15-202) are up and running, and report as follows:

      [user:warn,2015-03-16T16:34:34.896,ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com:ns_node_disco<0.5208.0>:ns_node_disco:handle_info:175]Node 'ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com' saw that node 'ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com' went down. Details: [{nodedown_reason, connection_closed}]

      After the node is ejected from the cluster, no more net_tick_timeouts are observed.


      UPDATE
      =======
      Loaded the game-sim sample and created the default bucket (then deleted the game-sim sample). With the system left quiet (i.e. no ops), net_tick_timeouts are now starting to appear, this time from 52-17-15-193.

      [user:warn,2015-03-16T17:32:55.877,ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com:ns_node_disco<0.5644.0>:ns_node_disco:handle_info:175]Node 'ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com' saw that node 'ns_1@10.0.0.43' went down. Details: [{nodedown_reason,net_tick_timeout}]

      The other node (orchestrator) is up and running and has the corresponding message.
      [user:warn,2015-03-16T17:32:55.816,ns_1@10.0.0.43:ns_node_disco<0.4153.0>:ns_node_disco:handle_info:175]Node 'ns_1@10.0.0.43' saw that node 'ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com' went down. Details: [{nodedown_reason, connection_closed}]
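
In both incidents one side logs net_tick_timeout while its peer logs connection_closed. To check whether that asymmetry holds across a full log, the reasons can be tallied per reporting node. The following is a rough sketch under the same assumption about the line format quoted in this report; REASON_RE and reasons_by_observer are hypothetical names, and the pattern tolerates whitespace or line breaks inside the Details list.

```python
import re
from collections import Counter

# Assumed format: the ns_node_disco "went down" lines quoted in this report.
# \s* allows the Details list to be wrapped across lines.
REASON_RE = re.compile(
    r"\[user:\w+,[^,]+,(?P<observer>[^:]+):ns_node_disco"
    r"[^\n]*?went down\. Details: \[\s*\{nodedown_reason,\s*(?P<reason>\w+)\}"
)

# The two "went down" entries quoted above, one per line.
SAMPLE_DOWN = (
    "[user:warn,2015-03-16T17:32:55.877,"
    "ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com:"
    "ns_node_disco<0.5644.0>:ns_node_disco:handle_info:175]"
    "Node 'ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com' saw that node "
    "'ns_1@10.0.0.43' went down. Details: [{nodedown_reason,net_tick_timeout}]\n"
    "[user:warn,2015-03-16T17:32:55.816,ns_1@10.0.0.43:"
    "ns_node_disco<0.4153.0>:ns_node_disco:handle_info:175]"
    "Node 'ns_1@10.0.0.43' saw that node "
    "'ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com' went down. "
    "Details: [{nodedown_reason, connection_closed}]\n"
)


def reasons_by_observer(text):
    """Count (observer, nodedown_reason) pairs found in the log text."""
    counts = Counter()
    for m in REASON_RE.finditer(text):
        counts[(m.group("observer"), m.group("reason"))] += 1
    return counts
```

For SAMPLE_DOWN this yields one net_tick_timeout attributed to 52-17-15-193 and one connection_closed attributed to 10.0.0.43. If a single observer accounts for all net_tick_timeout entries in a larger log, that would be consistent with the problem being local to that node, though it would not by itself identify the cause.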

      The issue always appears to be with module core ns_node_disco005 and ns_node_disco004.

      Uploaded the logs, see https://s3.amazonaws.com/cb-customers/owend-couchbase/collectinfo-2015-03-16T180110-ns_1%4010.0.0.43.zip

      During the collection process additional net_tick_timeouts were seen from 52-17-15-193.

          People

            owend Daniel Owen