Details
Type: Bug
Resolution: Fixed
Priority: Major
Affects Version/s: 6.5.0
Triage: Untriaged
Is this a Regression?: Yes
Sprint: KV-Engine Mad-Hatter GA, KV Sprint 2020-January
Description
I noticed this in the most recent 9->9 swap rebalance test against 6.5.0-4788: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_650-4788_rebalance_d2af#19046bfb651c6e0796fef70b7b4a833b.
I downloaded some of the logs and looked at mortimer, and indeed across the cluster the ops go to zero from around 5:38:52 to 5:41:52.
(Note that the mortimer timestamps are in GMT, 8 hours ahead of the timestamps in the logs.)
We don't see this in the previous build tested, 6.5.0-4744: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_650-4788_rebalance_d2af&label=%226.5.0-4788%209-%3E9%20swap%20rebalance%22&snapshot=titan_650-4744_rebalance_7aac&label=%226.5.0-4744%209-%3E9%20swap%20rebalance%22#19046bfb651c6e0796fef70b7b4a833b.
This occurs during the early part of the rebalance when we're moving a lot of replicas. With the sharding change we're able to move even more replicas.
At first I thought it might be CPU, but CPU is only high on node .109, which is being rebalanced in (where CPU hits about 50%). Then I thought the drop might be related to compaction, but I didn't see any evidence of compaction during this time.
Then I noticed these messages in the logs from around this time:
2019-11-14T05:36:09.885569-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@172.23.96.100->ns_1@172.23.96.109:bucket-1 - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:52429522, bytesSent:0, maxBytes:52428800
2019-11-14T05:36:12.816088-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@172.23.96.100->ns_1@172.23.96.109:bucket-1 - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:104858323, bytesSent:0, maxBytes:52428800
2019-11-14T05:36:15.658150-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@172.23.96.100->ns_1@172.23.96.109:bucket-1 - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:157287648, bytesSent:0, maxBytes:52428800
...
Basically, every couple of seconds the replication buffer between .100 and .109 fills up, then gets unblocked and replication can continue. These logs start a minute or two before the ops drop to zero and cease right around the time the ops start going again.
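For context, DCP uses byte-based flow control: the producer tracks how many bytes it has sent that the consumer has not yet acknowledged, stops sending once that reaches maxBytes, and is woken when buffer acks bring it back under the limit. The code below is only a minimal sketch of that pause/notify cycle, not the actual kv_engine DcpProducer::BufferLog; the class and member names here are illustrative and the real counter/reset semantics differ.

// Minimal sketch (not the kv_engine implementation) of byte-based DCP flow
// control: the producer pauses when un-acked bytes reach maxBytes and is
// notified once consumer acks bring it back under the limit.
#include <algorithm>
#include <cstdint>
#include <iostream>

class BufferLogSketch {
public:
    explicit BufferLogSketch(uint64_t maxBytes) : maxBytes(maxBytes) {}

    // Producer checks this before sending; when full it pauses the connection.
    bool isFull() const {
        return bytesOutstanding >= maxBytes;
    }

    // Record a message of `bytes` sent to the consumer.
    void recordBytesSent(uint64_t bytes) {
        bytesSent += bytes;
        bytesOutstanding += bytes;
    }

    // Consumer acknowledged `bytes`. Returns true when the log transitions
    // from full to not-full, i.e. a paused producer should be woken
    // (cf. "Notifying paused connection now that ... BufferLog is no longer full").
    bool acknowledge(uint64_t bytes) {
        const bool wasFull = isFull();
        ackedBytes += bytes;
        bytesOutstanding -= std::min(bytes, bytesOutstanding);
        return wasFull && !isFull();
    }

    uint64_t ackedBytes = 0;
    uint64_t bytesSent = 0;

private:
    uint64_t maxBytes;
    uint64_t bytesOutstanding = 0;
};

int main() {
    BufferLogSketch log(52428800); // 50MB, the maxBytes value seen in the logs

    // Producer streams 1MB messages until the flow-control window is full,
    // at which point it would pause the connection.
    while (!log.isFull()) {
        log.recordBytesSent(1024 * 1024);
    }

    // A buffer acknowledgement from the consumer frees space; the full ->
    // not-full transition is what produces the "Notifying paused connection"
    // message and lets replication continue.
    if (log.acknowledge(8 * 1024 * 1024)) {
        std::cout << "notify paused connection; ackedBytes:" << log.ackedBytes
                  << ", bytesSent:" << log.bytesSent
                  << ", maxBytes:" << 52428800 << '\n';
    }
    return 0;
}

In the logs above, that pause/notify cycle is repeating every couple of seconds on the .100 -> .109 stream, which is what the per-minute counts below quantify.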
$ fgrep "Notifying paused connectio" memcached.log | grep -o "2019-11-14T..:..:" | uniq -c
|
54 2019-11-14T05:36:
|
26 2019-11-14T05:37:
|
23 2019-11-14T05:38:
|
79 2019-11-14T05:39:
|
12 2019-11-14T05:40:
|
27 2019-11-14T05:41:
|
11 2019-11-14T05:42:
|
Similar logs occur at the same time on all the nodes, e.g. on node .101:
$ fgrep "Notifying paused connectio" memcached.log | grep -o "2019-11-14T..:..:" |uniq -c
|
56 2019-11-14T05:36:
|
31 2019-11-14T05:37:
|
25 2019-11-14T05:38:
|
79 2019-11-14T05:39:
|
17 2019-11-14T05:40:
|
14 2019-11-14T05:41:
|
However, it turns out that we also see these messages on 4744; there are just slightly fewer of them (210 on 4744 vs 222 on 4788) and they occur over 8 minutes rather than 7.
$ fgrep "Notifying paused connectio" memcached.log | grep -o "2019-11-..T..:..:" | uniq -c
|
56 2019-11-04T15:33:
|
21 2019-11-04T15:34:
|
36 2019-11-04T15:35:
|
23 2019-11-04T15:36:
|
18 2019-11-04T15:37:
|
14 2019-11-04T15:38:
|
28 2019-11-04T15:39:
|
14 2019-11-04T15:40:
|
At any rate, the ops shouldn't go to zero.
6.5.0-4744 job: http://perf.jenkins.couchbase.com/job/titan-reb/927/
6.5.0-4788 job: http://perf.jenkins.couchbase.com/job/titan-reb/947/