Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: 5.5.0
- Environment:
  Cluster: atlas_setupA
  OS: CentOS 7
  CPU: E5-2680 v3 (48 vCPU)
  Memory: 256 GB
  Disk: Samsung PM863
- Triage: Untriaged
- Operating System: CentOS 64-bit
- Is this a Regression?: Unknown
- Sprint: FTS Sprint Mar-30-2018
Description
Tested on 5.5.0-2126 with http2 enabled.
Test specs: low-frequency terms, 600 client threads, 3-node FTS cluster, 6 pindexes (2 per node), Java SDK
Behavior:
With up to 400 client threads the system behaves as expected: cbft isn't CPU bound, the number of TCP connections between nodes stays consistent, and the cluster produces about 35K q/sec. So it seems like there is room for additional load. The maximum throughput expected for the 3-node setup is 120K q/sec assuming linear scalability (I can get 40K q/sec on a single node).
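For reference, the linear-scalability expectation above is simple arithmetic (all numbers taken from the test run; the 3x factor is just the node count):

```python
# Rough scaling arithmetic for the throughput numbers quoted above.
single_node_qps = 40_000   # measured on a single node
nodes = 3
expected_qps = single_node_qps * nodes   # linear-scalability ceiling
observed_qps = 35_000                    # measured at 400 client threads

print(f"expected ceiling: {expected_qps} q/sec")               # 120000
print(f"observed:         {observed_qps} q/sec")
print(f"utilisation:      {observed_qps / expected_qps:.0%}")  # 29%
```

So at 400 threads the cluster is running at under a third of its linear-scaling ceiling, which is why pushing to 600 threads looked safe.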
But when the number of client threads is increased to 600, the following happens:
1) randomly, one of the nodes (node A) becomes pinned at 100% CPU (FlameGraph attached)
2) node A starts accumulating connections to node B and node C in "TIME_WAIT" state
3) node B and node C start accumulating connections to node A in "FIN_WAIT2" state
So it seems that node A stops sending the FIN needed to close its connections to the other nodes (I'm looking at https://kb.iu.edu/d/ajmi)
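One way to quantify the accumulation described above is to bucket connections by TCP state on each node. A minimal sketch, assuming the standard output layout of `ss -tan` (header line, then one line per socket with the state in the first column); the sample data is illustrative, not from the actual cluster:

```python
from collections import Counter

def count_tcp_states(ss_output: str) -> Counter:
    """Count connections per TCP state from `ss -tan`-style output.

    Skips the header line, then tallies the first column of each
    socket line (ESTAB, TIME-WAIT, FIN-WAIT-2, ...).
    """
    states = Counter()
    for line in ss_output.splitlines()[1:]:  # skip header
        fields = line.split()
        if fields:
            states[fields[0]] += 1
    return states

# Illustrative sample mimicking what node A would show:
sample = """State      Recv-Q Send-Q Local Address:Port Peer Address:Port
TIME-WAIT  0      0      10.0.0.1:9130      10.0.0.2:42788
TIME-WAIT  0      0      10.0.0.1:9130      10.0.0.3:42790
ESTAB      0      0      10.0.0.1:9130      10.0.0.2:42800
"""
print(count_tcp_states(sample))  # Counter({'TIME-WAIT': 2, 'ESTAB': 1})
```

Sampling this per node over time would show whether the TIME_WAIT count on node A (and FIN_WAIT2 on B/C) grows monotonically once CPU saturates.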
At this point I don't know whether the 100% CPU prevents closing those connections, or whether the attempt to close them is what drives the CPU to 100%.
Looking at the CPU profile, I assume it's the latter.
Finally, once a node goes to 100% CPU, overall FTS throughput drops dramatically.
The behavior is similar to what we had before the http2 implementation. In that case, any client call beyond the 100-connection pool opened a new connection, which then stayed in TIME_WAIT until timeout. If the load is high enough, the number of such connections keeps growing until all available ports are taken.
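The port-exhaustion dynamic of that pre-http2 behavior can be put in rough numbers. A back-of-the-envelope sketch assuming stock Linux defaults (ephemeral range 32768-60999 and a 60 s TIME_WAIT hold are assumptions; verify `net.ipv4.ip_local_port_range` on the actual nodes):

```python
# Back-of-the-envelope: how fast can a node churn new outbound
# connections to one peer before TIME_WAIT exhausts the ephemeral range?
# Assumed stock Linux defaults -- verify on the test cluster.
port_range = 60999 - 32768 + 1   # net.ipv4.ip_local_port_range default
time_wait_secs = 60              # TCP_TIMEWAIT_LEN, compiled into the kernel

max_new_conns_per_sec = port_range / time_wait_secs
print(f"{port_range} ephemeral ports / {time_wait_secs}s TIME_WAIT "
      f"= ~{max_new_conns_per_sec:.0f} sustainable new connections/sec")
```

Anything above that sustained rate of fresh connections to a single peer will eventually run the ephemeral range dry, which matches the old failure mode described above.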
Attachments
Issue Links
- relates to: MB-29218 [FTS] low frequency term query throughput doesn't scale with more nodes (Reopened)