Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: 5.5.0
- Environment:
  Cluster: atlas_setupA
  OS: CentOS 7
  CPU: E5-2680 v3 (48 vCPU)
  Memory: 256 GB
  Disk: Samsung PM863
- Triage: Untriaged
- Operating System: CentOS 64-bit
- Is this a Regression?: Unknown
- Sprint: FTS Sprint Mar-30-2018
Description
Tested on 5.5.0-2126 with http2 enabled.
Test specs: low-frequency terms, 600 client threads, 3-node FTS cluster, 6 pindexes (2 per node), Java SDK
Behavior:
With up to 400 client threads the system behaves as expected: cbft isn't CPU bound, the number of TCP connections between nodes stays consistent, and the cluster produces about 35K q/sec. So it seems like there is room for additional load. The maximum throughput expected for the 3-node setup is 120K q/sec assuming linear scalability (I can get 40K q/sec on a single node).
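For reference, the linear-scalability expectation above is simple arithmetic (all numbers taken from the test run; the 3x factor is just the node count):

```python
# Rough scaling arithmetic for the throughput numbers quoted above.
single_node_qps = 40_000   # measured on a single node
nodes = 3
expected_qps = single_node_qps * nodes   # linear-scalability ceiling
observed_qps = 35_000                    # measured at 400 client threads

print(f"expected ceiling: {expected_qps} q/sec")               # 120000
print(f"observed:         {observed_qps} q/sec")
print(f"utilisation:      {observed_qps / expected_qps:.0%}")  # 29%
```

So at 400 threads the cluster is running at under a third of its linear-scaling ceiling, which is why pushing to 600 threads looked safe.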
But when the number of client threads is increased to 600, the following happens:
1) randomly, one of the nodes (node A) becomes pinned at 100% CPU (FlameGraph attached)
2) node A starts accumulating connections to node B and node C in "TIME_WAIT" state
3) node B and node C start accumulating connections to node A in "FIN_WAIT2" state
So it seems that node A stops sending the FIN needed to close its connections to the other nodes (I'm looking at https://kb.iu.edu/d/ajmi)
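One way to quantify the accumulation described above is to bucket connections by TCP state on each node. A minimal sketch, assuming the standard output layout of `ss -tan` (header line, then one line per socket with the state in the first column); the sample data is illustrative, not from the actual cluster:

```python
from collections import Counter

def count_tcp_states(ss_output: str) -> Counter:
    """Count connections per TCP state from `ss -tan`-style output.

    Skips the header line, then tallies the first column of each
    socket line (ESTAB, TIME-WAIT, FIN-WAIT-2, ...).
    """
    states = Counter()
    for line in ss_output.splitlines()[1:]:  # skip header
        fields = line.split()
        if fields:
            states[fields[0]] += 1
    return states

# Illustrative sample mimicking what node A would show:
sample = """State      Recv-Q Send-Q Local Address:Port Peer Address:Port
TIME-WAIT  0      0      10.0.0.1:9130      10.0.0.2:42788
TIME-WAIT  0      0      10.0.0.1:9130      10.0.0.3:42790
ESTAB      0      0      10.0.0.1:9130      10.0.0.2:42800
"""
print(count_tcp_states(sample))  # Counter({'TIME-WAIT': 2, 'ESTAB': 1})
```

Sampling this per node over time would show whether the TIME_WAIT count on node A (and FIN_WAIT2 on B/C) grows monotonically once CPU saturates.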
At this point I don't know whether the 100% CPU prevents closing those connections, or whether the attempt to close them is what drives the CPU to 100%.
Looking at the CPU profile, I assume it's the latter.
Finally, once a node goes to 100% CPU, overall FTS throughput drops dramatically.
The behavior is similar to what we had before the http2 implementation. In that case, any client call beyond the 100-connection pool opened a new connection, which then stayed in TIME_WAIT until timeout. If the load is high enough, the number of such connections keeps growing until all available ports are taken.
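The port-exhaustion dynamic of that pre-http2 behavior can be put in rough numbers. A back-of-the-envelope sketch assuming stock Linux defaults (ephemeral range 32768-60999 and a 60 s TIME_WAIT hold are assumptions; verify `net.ipv4.ip_local_port_range` on the actual nodes):

```python
# Back-of-the-envelope: how fast can a node churn new outbound
# connections to one peer before TIME_WAIT exhausts the ephemeral range?
# Assumed stock Linux defaults -- verify on the test cluster.
port_range = 60999 - 32768 + 1   # net.ipv4.ip_local_port_range default
time_wait_secs = 60              # TCP_TIMEWAIT_LEN, compiled into the kernel

max_new_conns_per_sec = port_range / time_wait_secs
print(f"{port_range} ephemeral ports / {time_wait_secs}s TIME_WAIT "
      f"= ~{max_new_conns_per_sec:.0f} sustainable new connections/sec")
```

Anything above that sustained rate of fresh connections to a single peer will eventually run the ephemeral range dry, which matches the old failure mode described above.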
Attachments
Issue Links
- relates to: MB-29218 [FTS] low frequency term query throughput doesn't scale with more nodes (Reopened)