Couchbase Server / MB-36926

Ops drop to zero across cluster for 3 minutes during 9 -> 9 swap rebalance



    • Triage: Untriaged
    • Regression: Yes
    • Sprints: KV-Engine Mad-Hatter GA, KV Sprint 2020-January


      I noticed this in the most recent 9->9 swap rebalance test against 6.5.0-4788: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_650-4788_rebalance_d2af#19046bfb651c6e0796fef70b7b4a833b.

      I downloaded some of the logs and looked at mortimer and indeed across the cluster the ops go to zero from around 5:38:52 to 5:41:52:

      (Note that the mortimer timestamps are in GMT - 8 hours ahead of the timestamps in the logs.)
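To make the offset concrete, the log timestamps carry a -08:00 offset, so a log time of 05:36:09 shows up in mortimer (GMT) as 13:36:09. A quick check:

```python
from datetime import datetime, timezone

# Parse a log timestamp (local time, UTC-8) and convert it to GMT/UTC,
# which is the timezone mortimer displays.
log_ts = datetime.fromisoformat("2019-11-14T05:36:09-08:00")
gmt_ts = log_ts.astimezone(timezone.utc)
print(gmt_ts.isoformat())  # 2019-11-14T13:36:09+00:00
```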

      We don't see this in the previous build tested, 6.5.0-4744: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_650-4788_rebalance_d2af&label=%226.5.0-4788%209-%3E9%20swap%20rebalance%22&snapshot=titan_650-4744_rebalance_7aac&label=%226.5.0-4744%209-%3E9%20swap%20rebalance%22#19046bfb651c6e0796fef70b7b4a833b.

      This occurs during the early part of the rebalance when we're moving a lot of replicas. With the sharding change we're able to move even more replicas.

      At first I thought it might be CPU, but CPU is only high on node .109, which is being rebalanced in (where it hits about 50%). Then I thought the drop might be related to compaction, but I didn't see evidence of compaction during this time.

      Then I noticed these messages in the logs approximately during this time:

      2019-11-14T05:36:09.885569-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@>ns_1@ - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:52429522, bytesSent:0, maxBytes:52428800
      2019-11-14T05:36:12.816088-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@>ns_1@ - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:104858323, bytesSent:0, maxBytes:52428800
      2019-11-14T05:36:15.658150-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@>ns_1@ - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:157287648, bytesSent:0, maxBytes:52428800

      Basically, every couple of seconds the replication stream between .100 and .109 fills its flow-control buffer, then gets unblocked and continues. These log messages begin a minute or two before the ops drop to zero, and cease right around the time the ops start flowing again.
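As a rough sketch of what these log messages reflect (a simplified model, not the actual KV-Engine C++ implementation; the class and method names here are illustrative): the producer tracks bytes sent versus bytes acked by the consumer, pauses when the unacked window reaches maxBytes, and notifies the paused connection once acks free enough space.

```python
# Illustrative model of DCP producer buffer-log flow control. The real
# logic lives in KV-Engine's DcpProducer::BufferLog (C++); names and
# structure here are assumptions for explanation only.
class BufferLog:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.bytes_sent = 0
        self.acked_bytes = 0

    def outstanding(self):
        # Bytes sent to the consumer but not yet acknowledged.
        return self.bytes_sent - self.acked_bytes

    def can_send(self, n):
        # The producer pauses when sending n more bytes would
        # overflow the flow-control window.
        return self.outstanding() + n <= self.max_bytes

    def on_send(self, n):
        self.bytes_sent += n

    def on_ack(self, n):
        # Returns True when a previously-full log frees up, i.e. the
        # "Notifying paused connection now that BufferLog is no longer
        # full" case seen in the logs above.
        was_full = not self.can_send(1)
        self.acked_bytes += n
        return was_full and self.can_send(1)

log = BufferLog(max_bytes=52428800)   # 50 MB, matching maxBytes in the logs
log.on_send(52428800)                 # window now full; producer pauses
print(log.can_send(1))                # False
print(log.on_ack(52428800))           # True -> paused connection notified
```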

      $ fgrep "Notifying paused connectio" memcached.log  | grep -o "2019-11-14T..:..:"  | uniq -c
        54 2019-11-14T05:36:
        26 2019-11-14T05:37:
        23 2019-11-14T05:38:
        79 2019-11-14T05:39:
        12 2019-11-14T05:40:
        27 2019-11-14T05:41:
        11 2019-11-14T05:42:

      Similar logs at the same time occur on all the nodes. E.g. on node .101:

      $ fgrep "Notifying paused connectio" memcached.log  | grep -o "2019-11-14T..:..:"  |uniq -c
        56 2019-11-14T05:36:
        31 2019-11-14T05:37:
        25 2019-11-14T05:38:
        79 2019-11-14T05:39:
        17 2019-11-14T05:40:
        14 2019-11-14T05:41:

      However, it turns out that we also see these messages on 4744; there are just slightly fewer of them (210 vs 222) and they occur over 8 minutes rather than 7.

      $ fgrep "Notifying paused connectio" memcached.log  | grep -o "2019-11-..T..:..:"  | uniq -c
        56 2019-11-04T15:33:
        21 2019-11-04T15:34:
        36 2019-11-04T15:35:
        23 2019-11-04T15:36:
        18 2019-11-04T15:37:
        14 2019-11-04T15:38:
        28 2019-11-04T15:39:
        14 2019-11-04T15:40:

      At any rate, the ops shouldn't go to zero.

      6.5.0-4744 job: http://perf.jenkins.couchbase.com/job/titan-reb/927/
      6.5.0-4788 job: http://perf.jenkins.couchbase.com/job/titan-reb/947/


        1. checkpoint_mem.png
          451 kB
          Dave Rigby
        2. memory.png
          190 kB
          Dave Rigby
        3. replication_high_wat.png
          202 kB
          Dave Rigby
        4. screenshot-1.png
          235 kB
          Dave Finlay
        5. Screen Shot 2019-11-15 at 21.00.43.png
          346 kB
          Dave Rigby
        6. Screen Shot 2019-11-15 at 21.01.01.png
          301 kB
          Dave Rigby
        7. Screen Shot 2019-11-15 at 21.03.23.png
          214 kB
          Dave Rigby
        8. Screenshot 2019-11-20 at 08.48.42.png
          73 kB
          Ben Huddleston
        9. Screenshot 2019-11-20 at 08.48.55.png
          62 kB
          Ben Huddleston
        10. Screenshot 2019-11-20 at 09.26.28.png
          50 kB
          Ben Huddleston
        11. temp_oom.png
          204 kB
          Dave Rigby
        12. test0.svg
          99 kB
          Ben Huddleston
        13. test1.svg
          115 kB
          Ben Huddleston



            Wayne Siu
            Dave Finlay



              Gerrit Reviews: there are no open Gerrit changes.