  1. Couchbase Server
  2. MB-60816

[1536 dim]: 4 million FTS docs with 120G total memory quota & 6 FTS nodes. FTS crashed due to a merger error.

Details

    Description

      Test Config:

      6 FTS nodes
      4 FTS indexes
      36 partitions per index - 4 partitions per index per node
      0 Replicas

      Dataset

      • 2 million KV docs: 1M in each of 2 collections
      • 4 million docs in FTS
      • Vector dimension: 1536
      • 1 query thread issuing FTS vector queries against the FTS indexes at random
      • 2000 ops - pure upserts
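
For a rough sense of scale, the raw vector payload implied by the dataset above can be estimated with simple arithmetic. This is only a sketch: it assumes float32 components (4 bytes each), reads the "120G" in the title as 120 GB split evenly across the 6 FTS nodes, and ignores all index, segment, and merge overhead, so real memory usage will be higher.

```python
# Back-of-the-envelope estimate of the raw vector data in this test.
NUM_DOCS = 4_000_000     # docs in FTS, from the dataset list
DIMS = 1536              # vector dimension, from the dataset list
BYTES_PER_FLOAT = 4      # assumption: float32 components
FTS_NODES = 6            # from the test config
TOTAL_QUOTA_GB = 120     # from the issue title; assumed to be GB, split evenly

raw_bytes = NUM_DOCS * DIMS * BYTES_PER_FLOAT
raw_gb = raw_bytes / 1e9
per_node_gb = raw_gb / FTS_NODES
quota_per_node_gb = TOTAL_QUOTA_GB / FTS_NODES

print(f"raw vector data:      {raw_gb:.3f} GB")
print(f"raw vectors per node: {per_node_gb:.3f} GB")
print(f"quota per node:       {quota_per_node_gb:.1f} GB")
```

Under these assumptions the raw vectors alone come to about 24.6 GB, or roughly 4.1 GB per node against an assumed 20 GB per-node quota. The estimate says nothing about what the merger allocates while rewriting segments.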

      Steps:

      1. Start 1 query thread issuing FTS vector queries against the FTS indexes at random
      2. Start KV load of 2000 ops pure upserts
      3. Add 1 fts node and rebalance - passed
      4. Remove 1 fts node and rebalance - passed
      5. Remove 1 fts node, add 2 fts nodes and rebalance - passed
      6. Swap rebalance 1 fts node - passed
      7. Failover 1 node and rebalance out - passed
      8. Failover 1 node and do full recovery & rebalance - passed
      9. Stop the kv workload
      10. Start KV load of 6000 ops
      11. After a while there were a couple of OOM kills - ignored here, as they are tracked in a separate bug
      12. FTS exited with status 1 on 172.23.107.220
      13. Stopped the test and left the cluster as is for hours. Memory usage on 172.23.107.240 stays constantly high and keeps raising warnings. Capella would get a lot of alerts if the same scenario were run on Cloud.

      Node 220

      Service 'fts' exited with status 1. Restarting. Messages:
      2024-02-15T20:58:32.721-08:00 [WARN] slow-query: index: default0VolumeCollection0_fts_idx_0, username: <ud>Administrator</ud>, query: <ud>{"explain": false, "query": {"match_none": {}}, "knn": [{"k": 41, "field": "embedding", "vector": [0.9270686507225037, 0.31093931198120117, 0.8670985698699951, 0.3526496887207031, 0.7519625425338745, ...</ud>, resultset bytes: 25011, duration: 8.470553869s, err: <nil> -- rest.(*QueryHandler).ServeHTTP() at rest_index.go:393
      2024-02-15T20:58:46.332-08:00 [WARN] slow-query: index: default0VolumeCollection0_fts_idx_0, username: <ud>Administrator</ud>, query: <ud>{"explain": false, "query": {"match_none": {}}, "knn": [{"k": 28, "field": "embedding", "vector": [0.5896949172019958, 0.0023876428604125977, 0.4888734221458435, 0.9877305626869202, 0.9136510491371155...</ud>, resultset bytes: 22608, duration: 11.101730373s, err: <nil> -- rest.(*QueryHandler).ServeHTTP() at rest_index.go:393
      2024-02-15T20:58:46.460-08:00 [FATA] scorch AsyncError, path: /data/@fts/default0VolumeCollection1_fts_idx_0_22a546b778cf2baa_909bb146.pindex/store, treating this as fatal, err: merging err: merging failed: , stack dump: /data/@fts/dumps/1708059526.fts.stack.dump.txt -- main.initBleveOptions.func2() at init_bleve.go:113
      

      scorch AsyncError, path: /data/@fts/default0VolumeCollection1_fts_idx_0_22a546b778cf2baa_909bb146.pindex/store, err: merging err: merging failed:
       
      goroutine 3079648 [running]:
      runtime/pprof.writeGoroutineStacks({0x1dd45a0, 0xc00049bfe0})
      	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.21.6/go/src/runtime/pprof/pprof.go:703 +0x6a
      runtime/pprof.writeGoroutine({0x1dd45a0?, 0xc00049bfe0?}, 0xc0316bbb80?)
      	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.21.6/go/src/runtime/pprof/pprof.go:692 +0x25
      runtime/pprof.(*Profile).WriteTo(0x15d2d01?, {0x1dd45a0?, 0xc00049bfe0?}, 0x94?)
      	/home/couchbase/.cbdepscache/exploded/x86_64/go-1.21.6/go/src/runtime/pprof/pprof.go:329 +0x146
      main.dumpStack({0x7ffeb2d6741c?, 0x24?}, {0xc00c78e460, 0x92})
      	cbft/cmd/cbft/stack_dump.go:59 +0x47e
      main.initBleveOptions.func2({0x1dd4aa0?, 0xc0592e15d0}, {0xc009f428a0, 0x55})
      	cbft/cmd/cbft/init_bleve.go:110 +0xd1
      github.com/blevesearch/bleve/v2/index/scorch.(*Scorch).fireAsyncError(...)
      	/home/couchbase/.cbdepscache/gomodcache/pkg/mod/github.com/blevesearch/bleve/v2@v2.3.11-0.20240215163640-2d3c3ebf7a5b/index/scorch/scorch.go:188
      github.com/blevesearch/bleve/v2/index/scorch.(*Scorch).mergerLoop(0xc00e834480)
      	/home/couchbase/.cbdepscache/gomodcache/pkg/mod/github.com/blevesearch/bleve/v2@v2.3.11-0.20240215163640-2d3c3ebf7a5b/index/scorch/merge.go:97 +0x35b
      created by github.com/blevesearch/bleve/v2/index/scorch.(*Scorch).Open in goroutine 3079598
      	/home/couchbase/.cbdepscache/gomodcache/pkg/mod/github.com/blevesearch/bleve/v2@v2.3.11-0.20240215163640-2d3c3ebf7a5b/index/scorch/scorch.go:206 +0x145
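
The slow-query entries in the Node 220 log show the shape of the request the query thread sends: a `match_none` query plus a single `knn` clause on the `embedding` field. A minimal sketch of building such a body is below. The function name `build_knn_query` is hypothetical, and the random values in [0, 1) are an assumption: the log truncates each vector after a few components. The sketch only builds the JSON body and does not send it anywhere.

```python
import json
import random


def build_knn_query(dim: int = 1536, field: str = "embedding", k: int = 10) -> dict:
    """Build a request body shaped like the ones in the slow-query log."""
    # Assumption: components are uniform random floats in [0, 1).
    vector = [random.random() for _ in range(dim)]
    return {
        "explain": False,
        "query": {"match_none": {}},
        "knn": [{"k": k, "field": field, "vector": vector}],
    }


body = build_knn_query(k=41)  # k=41 matches the first logged query
print(json.dumps(body)[:120], "...")
```

Each such request carries 1536 floats, which is why the logged queries are truncated.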
      

      Test

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P args=-i /tmp/magma_temp_job.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.vectorSearch.VectorVolume.Murphy.ClusterOpsVolume,nodes_init=1,graceful=True,skip_cleanup=True,num_items=1000000,num_buckets=1,bucket_names=GleamBook,doc_size=1024,bucket_type=membase,eviction_policy=fullEviction,iterations=2,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,collections=2,maxttl=10,num_indexes=1,pc=20,index_nodes=0,xdcr_collections=10,xdcr_remote_nodes=0,cbas_nodes=0,fts_nodes=6,ops_rate=10000,doc_ops=update,rebl_ops_rate=2000,key_type=RandomKey,mutation_perc=30,replicas=1,clients_per_db=10,skip_cluster_reset=false,skip_setup_cleanup=false,use_https=False,track_failures=False,model=sentence-transformers/paraphrase-MiniLM-L3-v2,fts_index_partition=36,fts_replicas=0,mockVector=true,dim=1536
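
The `-t` argument above packs the test path and all parameters into one comma-separated string, which is hard to scan. A small sketch for pulling out the FTS-related settings follows. `parse_test_params` is a hypothetical helper, not part of the test framework, and the `spec` string is a shortened excerpt of the real argument.

```python
def parse_test_params(spec: str) -> tuple[str, dict[str, str]]:
    """Split 'test.path,key=value,...' into the test path and a params dict."""
    test_path, *pairs = spec.split(",")
    params: dict[str, str] = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        params[key] = value  # a repeated key keeps its last value
    return test_path, params


# Shortened excerpt of the -t argument from the command above.
spec = (
    "aGoodDoctor.vectorSearch.VectorVolume.Murphy.ClusterOpsVolume,"
    "num_items=1000000,collections=2,fts_nodes=6,"
    "fts_index_partition=36,fts_replicas=0,mockVector=true,dim=1536"
)

test_path, params = parse_test_params(spec)
print(test_path)
for key in ("fts_nodes", "fts_index_partition", "fts_replicas", "dim"):
    print(f"{key} = {params[key]}")
```

Keys that appear twice in the full command, such as `rerun` and `skip_cleanup`, would resolve to their last value in this sketch; whether the test framework does the same is not confirmed here.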
      

            People

              ritesh.agarwal Ritesh Agarwal