Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Component: Cypher
- Affects Version: Enterprise Edition 7.7.0 build 1079
- Triage: Untriaged
Description
Dataset:
t1M = {"vector": None, "size": [5, 6, 7, 8, 9, 10], "color": "green", "brand": "Nike", "country": "USA", "category": "Shoes", "type": "Apparel", "avg_review": 1}
t2M = {"vector": None, "size": [6, 7, 8, 9, 10], "color": "green", "brand": "Nike", "country": "USA", "category": "Shoes", "type": "Apparel", "avg_review": 1.5}
t5M = {"vector": None, "size": [7, 8, 9, 10], "color": "red", "brand": "Nike", "country": "USA", "category": "Shoes", "type": "Apparel", "avg_review": 2}
t10M = {"vector": None, "size": [8, 9, 10], "color": "red", "brand": "Adidas", "country": "USA", "category": "Shoes", "type": "Apparel", "avg_review": 2.5}
t20M = {"vector": None, "size": [9, 10], "color": "red", "brand": "Adidas", "country": "Canada", "category": "Shoes", "type": "Apparel", "avg_review": 3}
t50M = {"vector": None, "size": [10], "color": "red", "brand": "Adidas", "country": "Canada", "category": "Jeans", "type": "Apparel", "avg_review": 3.5}
t100M = {"vector": None, "color": "red", "brand": "Adidas", "country": "Canada", "category": "Jeans", "type": "Denim", "avg_review": 4}
t200M = {"vector": None, "color": "red", "brand": "Adidas", "country": "Canada", "category": "Jeans", "type": "Denim", "avg_review": 4.5}
t500M = {"vector": None, "color": "red", "brand": "Adidas", "country": "Canada", "category": "Jeans", "type": "Denim", "avg_review": 5}
t1000M = {"vector": None, "color": "red", "brand": "Adidas", "country": "Canada", "category": "Jeans", "type": "Denim", "avg_review": 10}
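The templates above can be instantiated as shown in the following minimal sketch. It assumes the `None` vector field is filled in at load time with a 128-dimensional embedding (matching the `dimension` setting in the CREATE INDEX statements below); only the first template is spelled out, and the random vector is a stand-in for a real dataset vector.

```python
import random

# Document templates from the dataset above (vector filled in at load time).
# Only t1M is shown; t2M .. t1000M follow the same shape.
TEMPLATES = {
    "t1M": {"vector": None, "size": [5, 6, 7, 8, 9, 10], "color": "green",
            "brand": "Nike", "country": "USA", "category": "Shoes",
            "type": "Apparel", "avg_review": 1},
}

DIMENSION = 128  # matches the "dimension" setting in the CREATE INDEX statements

def make_doc(template_name: str, rng: random.Random) -> dict:
    """Instantiate a template, replacing the None vector with a random
    128-dim embedding (a stand-in for a real dataset vector)."""
    doc = dict(TEMPLATES[template_name])
    doc["vector"] = [rng.random() for _ in range(DIMENSION)]
    return doc

doc = make_doc("t1M", random.Random(42))
print(len(doc["vector"]), doc["color"])  # 128 green
```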
Steps:
- Based on the above templates, started loading the data into the bucket.
- When the load reached around 100M items, created the index below:
CREATE INDEX `bigann2M` ON `_default`(`color`,`embedding` VECTOR) PARTITION BY HASH((META().`id`)) where color="green" WITH { "defer_build":TRUE, "num_partition":8, "dimension":128, "similarity":"L2_SQUARED", "description":"IVF,PQ32x8", "scan_nprobes":3};
- The index was created successfully.
- When the load reached around 250M items, created the index below:
CREATE INDEX `bigann2MSQ8` ON `_default`(`color`,`embedding` VECTOR) PARTITION BY HASH((META().`id`)) where color="green" WITH { "defer_build":TRUE, "num_partition":8, "dimension":128, "similarity":"L2_SQUARED", "description":"IVF,SQ8", "scan_nprobes":3};
- Index creation failed with the following error:
2024-08-16T16:10:39.856-07:00 [Info] Indexer::initateTraining Starting training for vector index with instId: 10405409891669675141, partnId: 4
2024-08-16T16:10:39.861-07:00 [Info] NewCodebookIVFSQ: Initialized codebook with dimension: 128, range: SQ8, nlist: 15263, metric: L2, useCosine: false
2024-08-16T16:10:39.862-07:00 [Error] Indexer::initiateTraining error observed during training phase of codebook for instId: 10405409891669675141, partnId: 4, err: Error in void faiss::Clustering::train_encoded(faiss::idx_t, const uint8_t*, const faiss::Index*, faiss::Index&, const float*) at /home/couchbase/jenkins/workspace/cbdeps-platform-build/faiss/faiss/Clustering.cpp:276: Error: 'nx >= k' failed: Number of training points (6627) should be at least as large as number of clusters (15263)
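The error is FAISS's k-means training precondition: the number of training vectors (nx) must be at least the number of centroids (k, here nlist). The indexer chose nlist = 15263 for this partition, but only 6627 qualifying vectors were available for training, so `nx >= k` failed. A minimal stdlib sketch of that check (the values are taken from the error log; the exact way the indexer samples training vectors and picks nlist is not shown here):

```python
# Sketch of the FAISS clustering precondition that failed
# (faiss/Clustering.cpp: "'nx >= k' failed"): k-means cannot produce
# more centroids than it has training points.

def check_kmeans_training(nx: int, nlist: int) -> None:
    """Raise if there are fewer training points than requested clusters."""
    if nx < nlist:
        raise ValueError(
            f"Number of training points ({nx}) should be at least as "
            f"large as number of clusters ({nlist})"
        )

# Partition 4 in the failing build: 6627 training vectors, nlist = 15263.
try:
    check_kmeans_training(nx=6627, nlist=15263)
except ValueError as e:
    print("training would fail:", e)

# A smaller nlist (or more qualifying documents) satisfies the check.
check_kmeans_training(nx=6627, nlist=4096)
```

Note that satisfying the hard check is only the floor: FAISS also emits a quality warning when there are fewer than a few dozen training points per centroid, so in practice nx should exceed nlist by a large factor.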