Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Goldfish GA
Affects Version/s: Goldfish GA
Component/s: analytics
Labels:
- triaged
- volume-test

Triage:
Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump:
http://supportal.couchbase.com/snapshot/36f2372bb9fc53adc232d24041d7459a::0
Story Points:
0
Is this a Regression?:
Unknown
Sprint:
Analytics Sprint 42, Analytics Sprint 43

Description

Create a 3-node provisioned cluster with a magma bucket and 10 collections, and load 100M items into each collection (1B items in total).
Create a columnar instance with 2 nodes.
Create 10 remote collections, each ingesting 100M items.
Ingestion completed.
Start an upsert KV workload on the provisioned cluster.
Start a query workload.
Scale Columnar from 2 -> 4 -> 8 nodes. All successful.
Scale down from 8 to 4 nodes:
EjectNodes = ['ns_1@svc-da-node-004.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com']. Done in less than 60s
EjectNodes = ['ns_1@svc-da-node-001.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com']. Done in less than 60s
EjectNodes = ['ns_1@svc-da-node-007.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com']. Done in less than 2m but cluster balance state is FALSE
Starting rebalance, KeepNodes = ['ns_1@svc-da-node-002.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-003.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-005.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-006.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-008.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2e0b51811d1a8f23e27034aa1d45ed69 === Completed in <60s but state is still FALSE
Starting rebalance, KeepNodes = ['ns_1@svc-da-node-002.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-003.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-005.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-006.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-008.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = b03342d6121755eaeedd6c19d3fa1bcf === Done in 20s but state is still FALSE
Starting rebalance, KeepNodes = ['ns_1@svc-da-node-002.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-003.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-005.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-006.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-008.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 66529572d8002e2f136f2afe3edcf689 === Completed in 1 hour BUT state is still FALSE
Starting rebalance, KeepNodes = ['ns_1@svc-da-node-002.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-003.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-005.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-006.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com',
'ns_1@svc-da-node-008.oglyir3dvb8ndilj.pl.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 486691285a4cc8773ecf74ab6f7328aa === Rebalance is hung since 2 hours...
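The rebalance entries above can be checked mechanically to tell a real rebalance-out from a noop. The following is a minimal sketch, assuming the log lines keep the `KeepNodes = [...]`, `EjectNodes = [...]` and `Operation Id = ...` layout shown in this ticket. The helper names `parse_rebalance_line` and `is_noop` are hypothetical and not part of any Couchbase tool.

```python
import re


def parse_rebalance_line(line: str) -> dict:
    """Extract KeepNodes, EjectNodes and the operation id from one log entry."""

    def node_list(label: str) -> list:
        # Capture the bracketed list that follows "<label> = ".
        match = re.search(label + r"\s*=\s*\[(.*?)\]", line, re.DOTALL)
        return re.findall(r"'([^']+)'", match.group(1)) if match else []

    op = re.search(r"Operation Id = ([0-9a-f]+)", line)
    return {
        "keep": node_list("KeepNodes"),
        "eject": node_list("EjectNodes"),
        "operation_id": op.group(1) if op else None,
    }


def is_noop(parsed: dict, current_nodes: set) -> bool:
    """Treat a rebalance as a noop when nothing is ejected and
    KeepNodes equals the current cluster membership (an assumption
    made for this sketch)."""
    return not parsed["eject"] and set(parsed["keep"]) == set(current_nodes)
```

Applied to the four `Starting rebalance` entries above with the five remaining nodes as `current_nodes`, `is_noop` would return `True` for each, which matches the reporter's description of them as noop rebalances.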

Cluster state:

"balanced": false,
"servicesNeedRebalance": [
    "code": "service_not_balanced",
    "description": "Service needs rebalance.",
    "services": [
        "cbas"
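A small check like the one below can report which services keep the cluster unbalanced. It is a sketch only: the fragment above is truncated, so the sample payload assumes each `servicesNeedRebalance` entry is an object with `code`, `description` and `services` keys. The function name `services_needing_rebalance` is hypothetical.

```python
import json


def services_needing_rebalance(state: dict) -> list:
    """Return the sorted service names listed under servicesNeedRebalance,
    or an empty list when the cluster reports itself as balanced."""
    if state.get("balanced", False):
        return []
    services = []
    for entry in state.get("servicesNeedRebalance", []):
        services.extend(entry.get("services", []))
    return sorted(set(services))


# Sample payload; the enclosing braces are an assumption, since the
# ticket only shows a truncated fragment of the cluster state.
sample = json.loads("""
{
  "balanced": false,
  "servicesNeedRebalance": [
    {
      "code": "service_not_balanced",
      "description": "Service needs rebalance.",
      "services": ["cbas"]
    }
  ]
}
""")
```

For the sample payload, the function reports `cbas` as the only service that still needs a rebalance.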

QE Test

sudo guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/couchbase_capella_volume_3_new_dummy.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.goldfish.GoldfishVolume.Columnar.test_rebalance,num_items=100000000,num_buckets=1,bucket_names=GleamBook,bucket_type=membase,iterations=2,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=20,gsi_nodes=3,cbas_nodes=3,fts_nodes=3,kv_nodes=3,n1ql_nodes=2,kv_disk=1000,n1ql_disk=50,gsi_disk=500,fts_disk=1000,cbas_disk=1000,kv_compute=c5.4xlarge,gsi_compute=c5.4xlarge,n1ql_compute=c5.4xlarge,fts_compute=c5.4xlarge,cbas_compute=c5.4xlarge,mutation_perc=20,key_type=CircularKey,capella_run=true,services=data,rebl_services=,max_rebl_nodes=27,provider=AWS,region=us-east-1,type=GP3,size=1000,ops_rate=100000,skip_teardown_cleanup=false,wait_timeout=14400,index_timeout=28800,runtype=columnar1,skip_init=false,rebl_ops_rate=10000,collections=10,valType=Hotel,expiry=true,v_scaling=true,h_scaling=true,horizontal_scale=1,clients_per_db=10,track_failures=False,onPremMongo=False,num_clusters=1 -m rest'

Summary of the issues noticed:

Why is the cluster balance state False after the 6th node is properly removed from the cluster during the 8 -> 4 scale-down?
When the CP triggers a noop rebalance with 5 nodes in the cluster, why does the rebalance take an hour when the previous rebalance-out finished in a few seconds? What is analytics doing during a noop rebalance? (Noop means no change to the cluster topology.)
Why is the last 5-node noop rebalance operation hung?
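The retry behaviour described in this ticket (the CP re-triggering a noop rebalance until `balanced` is true) can be sketched as a bounded loop that also flags a hung rebalance. This is an illustration under stated assumptions, not the CP's actual logic: every callable, the attempt limit and the timeout values are hypothetical.

```python
import time
from typing import Callable


def rebalance_until_balanced(
    trigger_rebalance: Callable[[], None],   # hypothetical: starts a noop rebalance
    rebalance_running: Callable[[], bool],   # hypothetical: True while it is in progress
    is_balanced: Callable[[], bool],         # hypothetical: reads the "balanced" flag
    max_attempts: int = 3,
    per_attempt_timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return "balanced", "hung" or "still_unbalanced"."""
    for _ in range(max_attempts):
        trigger_rebalance()
        deadline = clock() + per_attempt_timeout_s
        # Wait for this rebalance to finish, giving up once the deadline passes.
        while rebalance_running():
            if clock() >= deadline:
                return "hung"
            sleep(poll_interval_s)
        if is_balanced():
            return "balanced"
    # Every attempt completed, but the cluster never reported balanced.
    return "still_unbalanced"
```

Bounding both the number of attempts and the time per attempt separates the two symptoms reported here: rebalances that complete while the state stays FALSE, and a rebalance that never completes.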

Attachments

Issue Links

relates to

MB-61999 On rebalance completion, ns_server clears rebalance indicator before refreshing isBalanced checks from services

Open


Activity

People

Assignee: Ritesh Agarwal

Reporter: Ritesh Agarwal

Votes: 0

Watchers: 7

Dates

Created: 16/May/24 12:32 PM

Updated: 22/May/24 10:02 AM

Resolved: 22/May/24 6:08 AM

Gerrit Reviews

There are no open Gerrit changes

There are 2 closed Gerrit changes

MB-61932: force numReplicas to 0 when columnar: Gerrit Review:

MB-61932: on same keepnodes, don't force balanced state to unknown: Gerrit Review:

Cluster state reports "balanced": false after the 6th cbas node is successfully removed while scaling down from 8 to 4 nodes, which leads the CP to trigger noop rebalances with 5 nodes until balanced is true. The rebalance then hangs.
