Details: Bug | Resolution: Fixed | Critical | 5.5.0 | Untriaged | Unknown
Description
As reported in this comment: https://issues.couchbase.com/browse/MB-28710?focusedCommentId=261403&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-261403, running the system test leaves thousands of sockets in the TIME_WAIT state. From nutshell on node .99.25:
                                   Established        TIME_WAIT          Total
TCP Connections in State          Conns  Hosts      Conns  Hosts      Conns  Hosts   SendQ (Bytes)
--------------------------------------------------------------------------------
Port 8091 (cluster mgmt)             23      1          5      1         29      2        0
Port 8093 (N1QL)                    473      2          0      0        505      3        0
Port 18091 (cluster mgmt SSL)         0      0          0      0          1      1        0
Port 18093 (N1QL SSL)                 0      0          0      0          1      1        0
==Total==                          1021     17       7764      7       8839     19      354
Looking in couchbase.log we see:
Proto Recv-Q Send-Q Local Address            Foreign Address          State       PID/Program name
tcp        0      0 172.23.99.25:41411       172.23.97.238:11210      TIME_WAIT   -
tcp        0      0 172.23.99.25:56182       172.23.99.21:11210       TIME_WAIT   -
tcp        0      0 172.23.99.25:33245       172.23.97.239:11210      TIME_WAIT   -
tcp        0      0 172.23.99.25:58658       172.23.97.239:11210      TIME_WAIT   -
tcp        0      0 172.23.99.25:60702       172.23.99.21:11210       TIME_WAIT   -
tcp        0      0 172.23.99.25:49630       172.23.108.104:11210     TIME_WAIT   -
tcp        0      0 172.23.99.25:52370       172.23.99.22:11210       TIME_WAIT   -
tcp        0      0 172.23.99.25:52947       172.23.99.22:11210       TIME_WAIT   -
tcp        0      0 172.23.99.25:42461       172.23.97.238:11210      TIME_WAIT   -
tcp        0      0 172.23.99.25:48095       172.23.99.22:11210       TIME_WAIT   -
tcp        0      0 172.23.99.25:58853       172.23.97.239:11210      TIME_WAIT   -
tcp        0      0 172.23.99.25:48984       172.23.108.104:11210     TIME_WAIT   -
tcp        0      0 172.23.99.25:42261       172.23.97.238:11210      TIME_WAIT   -
...
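A quick way to confirm where this churn is pointed is to tally TIME_WAIT sockets by remote port. The sketch below is purely illustrative (it is not part of nutshell or the product); it parses /proc/net/tcp, where a state value of 06 means TIME_WAIT, and counts sockets per remote port:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Tally TIME_WAIT sockets by remote port from /proc/net/tcp.
// Field 4 (st) is the socket state in hex; 06 == TIME_WAIT.
func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[int64]int{} // remote port -> TIME_WAIT socket count
	sc := bufio.NewScanner(f)
	sc.Scan() // skip the header line
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 || fields[3] != "06" {
			continue
		}
		// rem_address has the form HEXIP:HEXPORT, e.g. EE61171A:2BCA.
		rem := strings.Split(fields[2], ":")
		port, err := strconv.ParseInt(rem[1], 16, 32)
		if err != nil {
			continue
		}
		counts[port]++
	}
	for port, n := range counts {
		fmt.Printf("remote port %d: %d TIME_WAIT sockets\n", port, n)
	}
}

Run on .25, this should show the bulk of the TIME_WAIT sockets pointed at remote port 11210, matching the netstat output above.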
Since these are outbound connections from .25 to the memcached port (11210) on other nodes, they must be query connections: query is the only service running on this node, and ns_server always connects to 11209. It seems wrong that we have so many connections being opened and closed so quickly that we end up with an enormous number of them in this state.
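For context on the mechanics: the side that closes a TCP connection first is the one that holds the socket in TIME_WAIT for roughly 2*MSL, so a client that dials and closes a fresh connection per request will accumulate exactly this pattern. The sketch below is purely illustrative (it is not the cbq-engine or gomemcached code); it contrasts per-request dialing with reusing connections from a small fixed-size pool, which is the usual way to avoid this kind of churn. The address in main is hypothetical, taken from the logs only as an example.

package main

import (
	"net"
	"time"
)

// dialPerRequest opens and closes a new connection for every call.
// Because the client closes first, each call leaves a client-side
// socket in TIME_WAIT, so a bursty workload piles up thousands of them.
func dialPerRequest(addr string, do func(net.Conn) error) error {
	c, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer c.Close()
	return do(c)
}

// pool is a minimal fixed-size connection pool: idle connections are
// reused across requests instead of being closed after each one.
type pool struct {
	addr  string
	conns chan net.Conn
}

func newPool(addr string, size int) *pool {
	return &pool{addr: addr, conns: make(chan net.Conn, size)}
}

func (p *pool) withConn(do func(net.Conn) error) error {
	var c net.Conn
	select {
	case c = <-p.conns: // reuse an idle connection if one is available
	default:
		var err error
		if c, err = net.DialTimeout("tcp", p.addr, 5*time.Second); err != nil {
			return err
		}
	}
	if err := do(c); err != nil {
		c.Close() // drop the connection on error to be safe
		return err
	}
	select {
	case p.conns <- c: // return the connection to the pool for reuse
	default:
		c.Close() // pool is full; close the surplus connection
	}
	return nil
}

func main() {
	// Hypothetical usage against one of the data nodes seen in the logs.
	p := newPool("172.23.99.21:11210", 8)
	_ = p.withConn(func(c net.Conn) error { return nil })
}

Whether cbq-engine actually pools its memcached connections, and how large that pool is under this workload, is exactly what needs to be confirmed in the investigation below.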
My recommendation is that in addition to looking at what's happening between cbq-engine and memcached, we should also look at the queries that are running (we should connect with the QE guys on this) and figure out if there's something pathological about the workload.
Full logs for the test, including for .25:
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.99.25.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.106.188.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.108.103.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.108.104.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.96.56.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.96.6.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.97.238.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.97.239.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.97.242.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.99.20.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.99.21.zip
https://s3.amazonaws.com/bugdb/jira/mar21/collectinfo-2018-03-21T155904-ns_1%40172.23.99.22.zip