  1. Couchbase Server
  2. MB-28700

[FTS] cbft is leaking TCP connections on multi-node environment under load (http2)


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: 5.5.0
    • Fix Version/s: 5.5.0
    • Component/s: fts
    • Cluster: atlas_setupA
      OS: CentOS 7
      CPU: E5-2680 v3 (48 vCPU)
      Memory: 256 GB
      Disk: Samsung PM863

    Description

      Tested on 5.5.0-2126 with http2 enabled.

      Test specs: low frequency terms, 600 client threads, 3 node FTS cluster, 6 pindexes (2 per node), Java SDK

      Behavior:

      With up to 400 client threads the system behaves as expected: cbft isn't CPU bound, the number of TCP connections between nodes stays stable, and the cluster delivers about 35K q/sec. So there seems to be room for additional load. The maximum throughput expected for a 3-node setup is 120K q/sec, assuming linear scalability (I can get 40K q/sec on a single node).

      But when the number of client threads is increased to 600, the following happens:

      1) one of the nodes, seemingly at random (node A), goes to a constant 100% CPU (FlameGraph attached)

      2) node A starts accumulating connections to node B and node C in the "TIME_WAIT" state

      3) node B and node C start accumulating connections to node A in the "FIN_WAIT2" state

      So it seems that node A stops sending the FIN needed to close its connections to the other nodes (I'm going by the TCP state descriptions at https://kb.iu.edu/d/ajmi).

      At this point I don't know whether the 100% CPU prevents those connections from being closed, or whether the attempt to close those connections is what causes the 100% CPU.

      Looking at the CPU profile, I assume it's #2.

       

      Finally, once a node goes to 100% CPU, overall FTS throughput drops dramatically.

       

      The behavior is similar to what we had before the http2 implementation. In that case any client call beyond the 100-connection pool caused a new connection to be opened, which then stayed in TIME_WAIT until it timed out. If the load is high enough, the number of such connections keeps growing until all are taken.

       

        

            People

              abhinav Abhi Dangeti
              oleksandr.gyryk Alex Gyryk (Inactive)