I noticed this in the most recent 9->9 swap rebalance test against 6.5.0-4788: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_650-4788_rebalance_d2af#19046bfb651c6e0796fef70b7b4a833b.
I downloaded some of the logs and looked at mortimer; indeed, across the cluster the ops go to zero from around 5:38:52 to 5:41:52:
(Note that the mortimer timestamps are in GMT, which is 8 hours ahead of the timestamps in the logs.)
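To line up a log timestamp with the mortimer graphs, add the 8-hour offset noted above. A minimal sketch (the timestamp value here is made up for illustration; only the offset comes from this report):

```python
from datetime import datetime, timedelta

# Hypothetical log timestamp; the date is illustrative.
log_ts = datetime(2019, 11, 20, 5, 38, 52)   # as printed in the node logs
mortimer_ts = log_ts + timedelta(hours=8)    # what mortimer displays (GMT)
print(mortimer_ts.strftime("%H:%M:%S"))      # 13:38:52
```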
We don't see this in the previous build tested, 6.5.0-4744: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_650-4788_rebalance_d2af&label=%226.5.0-4788%209-%3E9%20swap%20rebalance%22&snapshot=titan_650-4744_rebalance_7aac&label=%226.5.0-4744%209-%3E9%20swap%20rebalance%22#19046bfb651c6e0796fef70b7b4a833b.
This occurs during the early part of the rebalance when we're moving a lot of replicas. With the sharding change we're able to move even more replicas.
At first I thought it might be CPU-related, but CPU is only high on node .109, the node being rebalanced in (and even there it only hits about 50%). Then I thought the drop might be related to compaction, but I didn't see any evidence of compaction during this window.
Then I noticed these messages in the logs approximately during this time:
Basically, every couple of seconds the replication queue between .100 and .109 fills up, then gets unblocked and can continue. These logs begin a minute or two before the ops drop to zero and cease right around the time the ops start flowing again.
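The general mechanism here is ordinary backpressure: when a bounded queue stays full, the producer stalls until the consumer drains it. A minimal sketch of that pattern (this is purely illustrative, not the actual replication code; the queue size and timings are made up):

```python
import queue
import threading
import time

# Illustrative only: a producer feeding a small bounded queue stalls whenever
# the consumer falls behind, mirroring the "queue full ... unblocked" cycle
# seen in the logs.
q = queue.Queue(maxsize=4)
produced = []

def producer():
    for i in range(16):
        q.put(i)           # blocks while the queue is full ("queue full")
        produced.append(i) # resumes once the consumer drains it ("unblocked")

def slow_consumer():
    for _ in range(16):
        time.sleep(0.01)   # simulate a slow downstream replica
        q.get()
        q.task_done()

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=slow_consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(produced))       # 16: everything eventually gets through
```

The point is that the producer makes no forward progress at all while the queue is blocked, which is consistent with front-end ops going to zero if they end up waiting behind the same bottleneck.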
Similar logs appear at the same time on all the nodes, e.g. on node .101:
However, it turns out that we also see these traces on 4744. There are just slightly fewer of them (222 vs 210), and they occur over 8 minutes rather than 7.
At any rate, the ops shouldn't go to zero.
6.5.0-4744 job: http://perf.jenkins.couchbase.com/job/titan-reb/927/
6.5.0-4788 job: http://perf.jenkins.couchbase.com/job/titan-reb/947/