Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Fix Version/s: feature-backlog
Affects Version/s: 3.0.2
Component/s: ns_server
Security Level: Public
Labels:
- large-cluster-support
- ns_server

Triage: Untriaged
Is this a Regression?: Unknown

Description

One node (52-17-12-151) repeatedly suffers net_tick_timeout with multiple different nodes.

However, after concluding that a node has gone down (due to net_tick_timeout), it almost immediately sees that node again, claiming that it "came up". For example:

[user:warn,2015-03-16T16:34:34.880,ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com:ns_node_disco<0.4999.0>:ns_node_disco:handle_info:175]Node 'ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com' saw that node 'ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com' went down. Details: [{nodedown_reason, net_tick_timeout}]

[user:info,2015-03-16T16:34:34.887,ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com:ns_node_disco<0.4999.0>:ns_node_disco:handle_info:169]Node 'ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com' saw that node 'ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com' came up. Tags: []

This occurs repeatedly, and only on this node. The other nodes (e.g. 52-17-15-202) are up and running, and report as follows:

[user:warn,2015-03-16T16:34:34.896,ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com:ns_node_disco<0.5208.0>:ns_node_disco:handle_info:175]Node 'ns_1@ec2-52-17-15-202.eu-west-1.compute.amazonaws.com' saw that node 'ns_1@ec2-52-17-12-151.eu-west-1.compute.amazonaws.com' went down. Details: [{nodedown_reason, connection_closed}]

After the node is ejected from the cluster, no more net_tick_timeouts are observed.
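
The down/up flapping described above can be picked out of the logs mechanically. The following is a minimal sketch, not part of ns_server or any Couchbase tooling. It assumes the ns_node_disco line layout seen in the excerpts above, with the timestamp, reporting node, and "saw that node ... went down / came up" text all on one line. The names `EVENT_RE`, `parse_events`, `find_flaps` and the 1-second window are hypothetical choices made for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the excerpts in this report:
# [user:<level>,<timestamp>,<reporting node>:ns_node_disco<pid>:...]Node '...'
#   saw that node '<peer>' went down|came up
EVENT_RE = re.compile(
    r"\[user:\w+,(?P<ts>[\d\-T:.]+),(?P<reporter>[^:]+):ns_node_disco"
    r".*?saw that node '(?P<peer>[^']+)' (?P<event>went down|came up)"
)


def parse_events(lines):
    """Yield (timestamp, reporter, peer, event) for each matching log line."""
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            yield ts, m.group("reporter"), m.group("peer"), m.group("event")


def find_flaps(lines, window=timedelta(seconds=1)):
    """Return (reporter, peer, gap) where 'came up' follows 'went down' within window."""
    last_down = {}  # (reporter, peer) -> time of the most recent 'went down'
    flaps = []
    for ts, reporter, peer, event in parse_events(lines):
        key = (reporter, peer)
        if event == "went down":
            last_down[key] = ts
        elif key in last_down:
            gap = ts - last_down.pop(key)
            if gap <= window:
                flaps.append((reporter, peer, gap))
    return flaps
```

Fed the "went down" and "came up" lines that 52-17-12-151 logged for 52-17-15-202 above, `find_flaps` returns a single entry with a gap of 7 ms. Lines from several nodes can be mixed in one input, since events are keyed by the (reporting node, peer) pair.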

UPDATE
=======
Loaded the game-sim sample and created the default bucket (then deleted the game-sim sample). With the system left quiet (i.e. no ops), net_tick_timeouts are now starting to appear, this time from 52-17-15-193.

[user:warn,2015-03-16T17:32:55.877,ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com:ns_node_disco<0.5644.0>:ns_node_disco:handle_info:175]Node 'ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com' saw that node 'ns_1@10.0.0.43' went down. Details: [{nodedown_reason,net_tick_timeout}]

The other node (orchestrator) is up and running and has the corresponding message.
[user:warn,2015-03-16T17:32:55.816,ns_1@10.0.0.43:ns_node_disco<0.4153.0>:ns_node_disco:handle_info:175]Node 'ns_1@10.0.0.43' saw that node 'ns_1@ec2-52-17-15-193.eu-west-1.compute.amazonaws.com' went down. Details: [{nodedown_reason, connection_closed}]

The issue always appears to be with module codes ns_node_disco005 and ns_node_disco004.

Uploaded the logs, see https://s3.amazonaws.com/cb-customers/owend-couchbase/collectinfo-2015-03-16T180110-ns_1%4010.0.0.43.zip

During the collection process additional net_tick_timeouts were seen from 52-17-15-193.

People

Assignee: Daniel Owen

Reporter: Daniel Owen

Votes: 0

Watchers: 2

Dates

Created: 16/Mar/15 10:32 AM

Updated: 04/Dec/15 3:14 AM

Resolved: 04/Dec/15 3:14 AM

net_tick_timeout on clean cluster - no buckets created etc. (on cluster of 130 nodes)
