Couchbase Server / MB-44502

Disproportionate resource utilization (CPU) in a multi-search-node cluster with non-default (>6) index partitions and index replicas



    Description

      Observing disproportionate CPU utilization in a multi-search-node cluster with non-default (>6) index partitions and index replicas.

      Test/Cluster setup

      • Nodes: 10
      • KV/Search: 5:5
      • Total docs: 20M
      • Number of search indexes: 1
      • Number of fields: 6
      • Index partitions: 16 (see the index-definition sketch below)
      • Index replicas: 1
      • Workers: 20
      • Query type: 1_conjuncts_2_disjuncts
      • Number of cores: 16
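      For reference, a minimal sketch of an index definition carrying these plan parameters through the Search service REST API. The node address, bucket name, and mapping below are placeholders and are not taken from this cluster; the real definition would also map the six indexed fields.

      # sketch only: placeholder source bucket and mapping; plan parameters match the setup above
      curl -u Administrator:password -XPUT \
        'http://172.23.96.107:8094/api/index/perf_fts_index' \
        -H 'Content-Type: application/json' \
        -d '{
              "type": "fulltext-index",
              "name": "perf_fts_index",
              "sourceType": "couchbase",
              "sourceName": "bucket-1",
              "planParams": { "indexPartitions": 16, "numReplicas": 1 },
              "params": { "mapping": { "default_mapping": { "enabled": true, "dynamic": true } } }
            }'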


      Attachments

        cpu_profiles.zip

        Activity

          abhinav Abhinav Dangeti added a comment - Sharath Sulochana  I'll need some logs and CPU profiles to be able to go further on this.

          abhinav Abhinav Dangeti added a comment - I just tried logging in to the cluster to see if it's still up.

          Seems like the CPU usage is back down now. Sharath Sulochana, can you confirm these for me:

          • The cluster is in exactly the same state it was in at the end of your test.
          • The high CPU usage you observed was during or immediately after an index build.

          If both of the above are confirmed, it's quite possible that merging (compaction) tasks were using some CPU after the index build, and the CPU usage settling back down to zero is indicative of those compaction tasks having completed.

          sharath.sulochana Sharath Sulochana (Inactive) added a comment - Abhinav Dangeti,

          Those numbers are from right after the CPU usage settled down to zero (or negligible). I don't think the compaction process consuming CPU had any impact, considering there were no mutations during the test (these are read-only search queries).

          Attached (cpu_profiles.zip) are CPU profiles captured during the run for the 105 and 107 machines. I looked into the CPU profiles and don't see anything significant, but please take another look.

          Also, I tried capturing some goroutine profiles to see if there are any blocked goroutines on either node. It looks like the profiles are corrupted. Is this a known issue for goroutine profiles? Here is the API used for it:

          curl http://172.23.96.107:8094/debug/pprof/goroutine?debug=2 -u Administrator:password > goroutine_profile_107.pprof

           

          go tool pprof --pdf goroutine_profile_105.pprof > goroutine_profile_105.pdf
          goroutine_profile_105.pprof: parsing profile: unrecognized profile format
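          One likely explanation for the parse failure (an assumption, not confirmed in this ticket): ?debug=2 returns a plain-text stack dump rather than the binary pprof format, so go tool pprof cannot read it. A sketch of a capture that pprof should accept, reusing the same node and credentials:

          # debug=0 (or omitting the parameter) returns the binary protobuf profile that pprof expects
          curl -u Administrator:password 'http://172.23.96.107:8094/debug/pprof/goroutine?debug=0' > goroutine_profile_107.pprof
          go tool pprof --pdf goroutine_profile_107.pprof > goroutine_profile_107.pdf
          # the debug=2 output is still useful, but read it as plain text rather than feeding it to pprof
          curl -u Administrator:password 'http://172.23.96.107:8094/debug/pprof/goroutine?debug=2' > goroutine_stacks_107.txt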
          

          Let's sync up to go through some more details.

           


          sharath.sulochana Sharath Sulochana (Inactive) added a comment - edited

          Update:

          Had a detailed call with Abhinav Dangeti and ran some experiments. As suspected at the beginning, the disproportionate resource utilization comes down to the type of traffic run in the customer environment (the text present in the docs, and the cardinality of the search terms and search query terms). Internally, this maps onto the way the data is partitioned across the search nodes.

           

          Here is a snapshot of the index data distribution across the search nodes in the cluster.

          Search node 172-23-96-105 has a single partition of ~104GB, and the partitions vary considerably in size. Since the client's search terms fall in these partitions, this node is expected to do more work and hence show higher CPU utilization.

          This underlying issue could be causing disproportionate resource utilization across nodes, depending on the kind of query traffic.

          We are looking at a couple more possible design considerations here:

          1. Distribution of data partitions 
          2. Load balancing across multiple search nodes 

           

          [root@172-23-96-107 @fts]# du -sh *
          4.0K cbft.uuid
          19G perf_fts_index_47fb8b485a4580a3_103fc5fb.pindex
          33G perf_fts_index_47fb8b485a4580a3_1076212c.pindex
          30G perf_fts_index_47fb8b485a4580a3_56a5d570.pindex
          54G perf_fts_index_47fb8b485a4580a3_5b084477.pindex
          57G perf_fts_index_47fb8b485a4580a3_7332d0c7.pindex
          61G perf_fts_index_47fb8b485a4580a3_80d40edf.pindex
           
          [root@172-23-96-105 @fts]# du -sh *
          4.0K cbft.uuid
          26G perf_fts_index_47fb8b485a4580a3_103fc5fb.pindex
          92G perf_fts_index_47fb8b485a4580a3_1d526e8f.pindex
          64G perf_fts_index_47fb8b485a4580a3_244a4eff.pindex
          49G perf_fts_index_47fb8b485a4580a3_627da171.pindex
          35G perf_fts_index_47fb8b485a4580a3_7dc1913f.pindex
          104G perf_fts_index_47fb8b485a4580a3_84003db0.pindex
          27G perf_fts_index_47fb8b485a4580a3_f47365c5.pindex
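          For quick comparison between nodes, a generic shell sketch (not part of the original run) that totals the pindex footprint on a node and picks out the largest single partition:

          # grand total of all pindex directories on this node
          du -csh *.pindex | tail -n 1
          # largest single pindex (size reported in 1K blocks)
          du -s *.pindex | sort -n | tail -n 1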
          

          Another thing noticed while running this exercise: compaction after index creation has a significant impact on overall throughput and performance.

          I will file separate tickets for each of these findings, and we will have to enhance performance test coverage based on these design considerations.

           



          keshav Keshav Murthy added a comment - See the last comment.

          People

            abhinav Abhinav Dangeti
            sharath.sulochana Sharath Sulochana (Inactive)
            Votes: 0
            Watchers: 9

