Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-19714

GSI indexes: high CPU utilization after load-cycle

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 4.1.2
    • 4.1.0
    • secondary-index
    • None
    • Untriaged
    • Ubuntu 64-bit
    • Unknown

    Description

      There are GSI's combinations that cause following situation: after lots of documents upsertions, high CPU utilization remains for nodes with GSI indexes for indefinite time (observed and confirmed: ~1h 30 min) even if there are no more upsertions.

      Related forum topic: https://forums.couchbase.com/t/issue-reproducible-gsi-problem-high-after-load-cpu-utilization/8374/2

      How to reproduce:
      1.4.1.0-EE GA (5005), 3 nodes (4CPU / 4G RAM) [4.1.2-6033 also affected]
      2.All nodes have all services enabled (data,index,query)
      3.Establish a cluster with "default" bucket (256Mb, full eviction, no password, 1 replica, view index replicas, flush=enable)
      4. Now, establish ssh connection to all nodes, run htop, and just watch for a while, to ensure, that total observed CPU utilization load is ~ [0..5]% for each node
      5. Run the following code (this will take a while ~ 12 min) or write your own:

      package highcpuafterload;
       
      import com.couchbase.client.java.Bucket;
      import com.couchbase.client.java.Cluster;
      import com.couchbase.client.java.CouchbaseCluster;
      import com.couchbase.client.java.document.JsonDocument;
      import com.couchbase.client.java.document.json.JsonObject;
      import com.couchbase.client.java.env.CouchbaseEnvironment;
      import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;
      import com.couchbase.client.java.query.N1qlQuery;
      import java.util.LinkedList;
      import java.util.concurrent.Phaser;
       
      public class BombardaMaxima extends Thread {
       
          private final int tid;
          
          // configure here
          private static final int threads = 20;
          private static final int docsPerThread = 50000;
          private static final int docTTLms = 86400 * 1000;
          private static final int dumpToConsoleEachNDocs = 1000;
          
          private static final Phaser phaser = new Phaser(threads + 1);
          private static final CouchbaseEnvironment ce;
          private static final Cluster cluster;
          private static final String bucket = "default";
          
          static {
              ce = DefaultCouchbaseEnvironment.create();
              final LinkedList<String> nodes = new LinkedList();
              nodes.add("A.node");
              nodes.add("B.node");
              nodes.add("C.node");
              cluster = CouchbaseCluster.create(ce, nodes);
              final Bucket b = cluster.openBucket(bucket);
       
              
              // three indexes for query
              final String iQA = "CREATE INDEX iQA ON `default`(a, b) WHERE a is valued USING GSI";
              final String iQB = "CREATE INDEX iQB ON `default`(a, b) WHERE a is valued USING GSI";
              final String iQC = "CREATE INDEX iQC ON `default`(a, b) WHERE a is valued USING GSI";
       
              final String iQX = "CREATE INDEX iQX ON `default`(a, c) WHERE a is valued USING GSI";
              final String iQY = "CREATE INDEX iQY ON `default`(a, c) WHERE a is valued USING GSI";
              final String iQZ = "CREATE INDEX iQZ ON `default`(a, c) WHERE a is valued USING GSI";
              
              b.query(N1qlQuery.simple(iQA));        
              b.query(N1qlQuery.simple(iQB));        
              b.query(N1qlQuery.simple(iQC));        
              b.query(N1qlQuery.simple(iQX));        
              b.query(N1qlQuery.simple(iQY));        
              b.query(N1qlQuery.simple(iQZ));        
              
          }
          public BombardaMaxima(final int tid) {
              this.tid = tid;
          }
          public final void run() {
              try {
                  Bucket b = null;
                  synchronized(cluster) { b = cluster.openBucket(bucket); }
                  final long stm = System.currentTimeMillis();
                  final JsonObject jo = JsonObject
                          .empty()
                          .put("a", stm)
                          .put("b", stm)
                          .put("c", stm);
                  for(int i = 0; i< docsPerThread; i++) { 
                      b.upsert(JsonDocument.create(
                              tid + ":" + System.currentTimeMillis(), 
                              (int)((System.currentTimeMillis() + docTTLms) / 1000),
                              jo)
                      );
                      if (i % dumpToConsoleEachNDocs == 0) System.out.println("T[" + tid + "] = " + i);
                  }
              } catch(final Exception e) {
                  e.printStackTrace();
              } finally {
                  phaser.arriveAndAwaitAdvance();
              }   
          }
          public static void main(String[] args) {
              for(int i = 0; i< threads; i++) new BombardaMaxima(i).start();
              phaser.arriveAndAwaitAdvance();
              System.out.println("DONE");
          }
          
      }
      

      6. Wait code run to complete (you can also wait 10, 15, 20 minutes more after compete, if you wish)
      7. Now watch htop again: you'll see, that CPU utilization is ~ [15...40]% for all nodes.
      8. Now restart one of nodes, wait for a while for "node initialization to pass", and, watch for htop once again (this node load is [0...5]% now)
      9. restart second node and watch for htop once again (this node load is [0...5]% now)
      10. restart last node and watch for htop once again (this node load is [0...5]% now)

      Additional info (between runs below bucket is deleted and recreated + all servers are restarted):
      1. No GSI = No problem. With removed (from code) GSI creation, there is no problem at all. So, the problem is definitely with GSI's.
      2. There is no (at least for 1 try) problem with only one index (iQA, for example)
      3. There is a strange effect with 3 indexes (iQA, iQB, iQC):
      1) first run: only one node is affected with "post-load-high-cpu-utilization".
      2) second run: all three nodes are affected
      3) third run: no nodes affected at all
      4. 6 confirmed all-nodes-affected runs with all 6 GSI (iQA, iQB, iQC, iQX, iQY, iQZ)

      Also, atop reports about ~80% CPU for "indexer" process on each affected node.
      Seems like "something stuck within indexer", but indexer.log has no explicit errors (log level is too low ?)

      P.S. And please, think about possible port of this fix (if any) for 4.1.X branch

      Attachments

        1. 140.pprof.svg
          95 kB
        2. 73.pprof.svg
          91 kB
        3. load-graph.png
          load-graph.png
          29 kB

        Issue Links

          For Gerrit Dashboard: MB-19714
          # Subject Branch Project Status CR V

          Activity

            People

              prataprc Pratap Chakravarthy (Inactive)
              egrep Georgy Repin
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty