Couchbase Server / MB-59079

Delete bucket took over 2 minutes as the flusher took a long time to stop.


    Description

      This MB is a clone of MB-59037, where it was noted that a bucket took a long time to delete (for reconfiguration); the same problem occurred on all nodes except 172.23.107.232.

      From the memcached logs, for example on 172.23.107.97, the following grep of memcached.log highlights the slow flusher "stop" (194 s).

      >  grep -e "stop flusher" -e Flusher::wait  memcached.log
      2023-10-09T18:59:53.131262-07:00 INFO (GleamBookUsers0) Attempting to stop flusher:0
      2023-10-09T19:03:07.145242-07:00 INFO (GleamBookUsers0) Flusher::wait: had to wait 194 s for shutdown
      2023-10-09T19:03:07.145254-07:00 INFO (GleamBookUsers0) Attempting to stop flusher:1
      2023-10-09T19:03:07.146782-07:00 INFO (GleamBookUsers0) Attempting to stop flusher:2
      2023-10-09T19:03:07.147955-07:00 INFO (GleamBookUsers0) Attempting to stop flusher:3
      2023-10-09T19:03:07.149108-07:00 INFO (GleamBookUsers0) Attempting to stop flusher:4
      2023-10-09T19:03:07.150247-07:00 INFO (GleamBookUsers0) Attempting to stop flusher:5
      2023-10-09T19:03:09.408951-07:00 INFO (GleamBookUsers0) Flusher::wait: had to wait 2259 ms for shutdown
      2023-10-09T19:03:09.408970-07:00 INFO (GleamBookUsers0) Attempting to stop flusher:6
      2023-10-09T19:03:09.862978-07:00 INFO (GleamBookUsers0) Flusher::wait: had to wait 454 ms for shutdown
      2023-10-09T19:03:09.862993-07:00 INFO (GleamBookUsers0) Attempting to stop flusher:7
      

      The flusher in that instance reported that it had been running for 189 s:

      2023-10-09T19:03:07.141661-07:00 WARNING (GleamBookUsers0) Slow runtime for 'Running a flusher loop: flusher 0' on thread WriterPool746: 189 s
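
      The same waits can be pulled out of memcached.log on the other nodes by pairing each "Attempting to stop flusher" line with the "Flusher::wait" line that follows it. A minimal sketch of such a log scrape (assuming only the log format shown above; the script is an analysis aid, not part of the product or the test):

      #!/usr/bin/env python3
      # Pair each "Attempting to stop flusher" line with the following
      # "Flusher::wait" line and report the shutdown wait in milliseconds.
      # Assumes the memcached.log format shown in the excerpt above.
      import re
      import sys

      STOP_RE = re.compile(r'^(\S+) INFO \((\S+)\) Attempting to stop flusher:(\d+)')
      WAIT_RE = re.compile(r'Flusher::wait: had to wait (\d+) (ms|s) for shutdown')

      def main(path):
          pending = None  # (timestamp, bucket, flusher id) of the last stop attempt
          with open(path) as f:
              for line in f:
                  m = STOP_RE.match(line)
                  if m:
                      pending = m.groups()
                      continue
                  m = WAIT_RE.search(line)
                  if m and pending:
                      value, unit = int(m.group(1)), m.group(2)
                      ms = value * 1000 if unit == 's' else value
                      ts, bucket, fid = pending
                      print(f'{ts} {bucket} flusher:{fid} waited {ms} ms')
                      pending = None

      if __name__ == '__main__':
          main(sys.argv[1] if len(sys.argv) > 1 else 'memcached.log')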
      

      It's unclear where the time was spent, but the suspicion is that it could have been inside magma.

      The original MB description is below; it is unclear whether these steps will always reproduce the slow shutdown.

      Steps To Recreate:

      1. Create a 4-node cluster.
      2. Create a couchstore bucket (replicas=1, ram_quota=10 GiB per node, bucket_eviction_policy=fullEviction, bucket-name=GleamBookUsers0)
      3. Create 50 non-default collections
      4. Load 50000000 docs of size 512 bytes into each of the newly created non-default collections
      5. Change storage mode from couchstore to magma
      6. Start doc:ops(update:read)
      7. Trigger a swap rebalance (one node coming in, one going out)
      8. Swap rebalance was successful
      9. Trigger a graceful failover + full recovery rebalance (failed over node 172.23.107.220) while data loading is going on
      10. Graceful failover + full recovery was successful
      11. Trigger a hard failover + full recovery + rebalance (failed over node 172.23.107.97) while data loading is going on
      12. Trigger a hard failover + full recovery + rebalance (failed over node 172.23.107.232) while data loading is going on
      13. Update maxTTL value of VolumeCollection0 and VolumeCollection1
      14. Delete one collection (VolumeCollection10)
      15. Create a collection (with name VolumeCollection10) with history=true
      16. Stop rebalance and create a new bucket (with historyRetentionCollectionDefault=true -d historyRetentionBytes=8446744073709551615 -d historyRetentionSeconds=3600); see the REST sketch after this list
      17. Update num replicas = 2, durability=majority and bucket priority to high for bucket GleamBookUsers0
      18. Start rebalance
      19. Rebalance exited with reason {pre_rebalance_janitor_run_failed, "GleamBookUsers0", {error,wait_for_memcached_failed,
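
      For reference, the history retention settings in step 16 map onto the bucket REST API's historyRetentionCollectionDefault/historyRetentionBytes/historyRetentionSeconds parameters, which only take effect on magma buckets. A minimal sketch of such a bucket-create call, with the host, credentials, bucket name and quota as placeholders and the remaining parameter names assumed from the bucket REST API:

      #!/usr/bin/env python3
      # Sketch of a bucket create with history retention enabled, roughly
      # matching step 16. Host, credentials, bucket name and ramQuota are
      # placeholders; the history retention values are the ones from step 16.
      import requests

      resp = requests.post(
          'http://172.23.107.97:8091/pools/default/buckets',
          auth=('Administrator', 'password'),      # placeholder credentials
          data={
              'name': 'HistoryBucket0',            # the MB does not name the new bucket
              'bucketType': 'couchbase',
              'storageBackend': 'magma',           # history retention requires magma
              'ramQuota': 1024,                    # MB, placeholder
              'historyRetentionCollectionDefault': 'true',
              'historyRetentionBytes': 8446744073709551615,
              'historyRetentionSeconds': 3600,
          },
      )
      resp.raise_for_status()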

      Rebalance was restarted multiple times but it always failed.

      Rebalance Failure:

      Rebalance exited with reason {pre_rebalance_janitor_run_failed,
      "GleamBookUsers0",
      {error,wait_for_memcached_failed,
      ['ns_1@172.23.107.220',
      'ns_1@172.23.107.239',
      'ns_1@172.23.107.97']}}.
      Rebalance Operation Id = 0d32f3032d8813abee3b8aca0cc87262
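
      The failing stage appears to be ns_server's janitor waiting for memcached to report the bucket ready on those three nodes, which would line up with the slow bucket deletion described above. One way to watch this from outside while reproducing is to poll the bucket's per-node status over REST; a rough sketch, with the host and credentials as placeholders and the nodes/status fields assumed from the /pools/default/buckets/<bucket> response:

      #!/usr/bin/env python3
      # Poll per-node readiness of the bucket to see which nodes are still
      # warming up or not yet serving it. Host and credentials are placeholders.
      import time
      import requests

      CLUSTER = 'http://172.23.107.220:8091'
      AUTH = ('Administrator', 'password')   # placeholder credentials
      BUCKET = 'GleamBookUsers0'

      for _ in range(30):
          details = requests.get(f'{CLUSTER}/pools/default/buckets/{BUCKET}',
                                 auth=AUTH).json()
          not_ready = [n.get('hostname') for n in details.get('nodes', [])
                       if n.get('status') != 'healthy']
          if not not_ready:
              print('bucket reported healthy on all nodes')
              break
          print('still waiting on:', not_ready)
          time.sleep(10)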
      

      QE-TEST:

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/temp_vol.ini bucket_storage=couchstore,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.StorageMigrationTestHappyPath,nodes_init=4,graceful=True,skip_cleanup=True,num_items=50000000,num_buckets=1,bucket_names=GleamBook,doc_size=512,bucket_type=membase,bucket_eviction_policy=valueOnly,iterations=5,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=22,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=0,pc=10,indexer_mem_quota=0,index_nodes=0,cbas_nodes=0,fts_nodes=0,ops_rate=100000,ramQuota=10240,doc_ops=create:update:delete:read,mutation_perc=100,rebl_ops_rate=50000,key_type=RandomKey,revert_migration=True'
      


    People

      Assignee: Jim Walker (jwalker)
      Reporter: Jim Walker (jwalker)