Couchbase Server: MB-36926

Ops drop to zero across cluster for 3 minutes during 9 -> 9 swap rebalance


Details

    • Triage: Untriaged
    • Is this a Regression?: Yes
    • Sprint: KV-Engine Mad-Hatter GA, KV Sprint 2020-January

    Description

      I noticed this in the most recent 9->9 swap rebalance test against 6.5.0-4788: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_650-4788_rebalance_d2af#19046bfb651c6e0796fef70b7b4a833b.

      I downloaded some of the logs and looked at mortimer, and indeed the ops across the cluster go to zero from around 5:38:52 to 5:41:52:

      (Note that the mortimer timestamps are in GMT - 8 hours ahead of the timestamps in the logs.)

      We don't see this in the previous build that was tested 6.5.0-4744: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_650-4788_rebalance_d2af&label=%226.5.0-4788%209-%3E9%20swap%20rebalance%22&snapshot=titan_650-4744_rebalance_7aac&label=%226.5.0-4744%209-%3E9%20swap%20rebalance%22#19046bfb651c6e0796fef70b7b4a833b.

      This occurs during the early part of the rebalance when we're moving a lot of replicas. With the sharding change we're able to move even more replicas.

      At first I thought it might be CPU, but CPU is only high on node .109, the node being rebalanced in (where CPU hits about 50%). Then I thought the drop might be related to compaction, but I didn't see any evidence of compaction during this time.

      Then I noticed these messages in the logs approximately during this time:

      2019-11-14T05:36:09.885569-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@172.23.96.100->ns_1@172.23.96.109:bucket-1 - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:52429522, bytesSent:0, maxBytes:52428800
      2019-11-14T05:36:12.816088-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@172.23.96.100->ns_1@172.23.96.109:bucket-1 - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:104858323, bytesSent:0, maxBytes:52428800
      2019-11-14T05:36:15.658150-08:00 INFO (bucket-1) DCP (Producer) eq_dcpq:replication:ns_1@172.23.96.100->ns_1@172.23.96.109:bucket-1 - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:157287648, bytesSent:0, maxBytes:52428800
      ...
      

      Basically, every couple of seconds the replication buffer between .100 and .109 fills up, then gets unblocked and continues. These log messages start a minute or two before the ops drop to zero and cease right around the time the ops start flowing again.
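
      For context on the mechanism behind these messages: each DCP producer connection has a flow-control window (the DcpProducer::BufferLog named in the log lines, maxBytes = 52428800 here); the producer pauses the stream once it has that many unacknowledged bytes in flight, and resumes when the consumer's buffer acknowledgements bring it back under the limit, at which point the "Notifying paused connection" line is logged. The sketch below is a minimal standalone model of that behaviour, not the kv_engine implementation; the class name, fields and exact accounting are assumptions for illustration only.

      // Minimal standalone model of the DCP flow-control window ("BufferLog")
      // described above. NOT the kv_engine implementation: names and accounting
      // are illustrative assumptions only.
      #include <algorithm>
      #include <cstdint>
      #include <iostream>

      class BufferLogModel {
      public:
          explicit BufferLogModel(uint64_t max) : maxBytes(max) {}

          // Producer side: account for bytes sent to the consumer.
          // Returns false once the window is full and the stream must pause.
          bool recordBytesSent(uint64_t bytes) {
              bytesSent += bytes;
              return !isFull();
          }

          // Consumer side: a DCP buffer acknowledgement arrived. Returns true if
          // a previously-paused stream should now be notified - the point at
          // which the "Notifying paused connection" message would be logged.
          bool recordAck(uint64_t bytes) {
              const bool wasFull = isFull();
              ackedBytes += bytes;                      // cumulative acked bytes
              bytesSent -= std::min(bytesSent, bytes);  // outstanding (unacked) bytes
              return wasFull && !isFull();
          }

          bool isFull() const { return bytesSent >= maxBytes; }

          uint64_t maxBytes;
          uint64_t bytesSent{0};   // bytes sent but not yet acknowledged
          uint64_t ackedBytes{0};  // lifetime acknowledged bytes
      };

      int main() {
          // 50 MB window, matching the maxBytes value in the log messages above.
          BufferLogModel log(52428800);

          // The producer streams 1 KB items until the window fills and it pauses.
          while (log.recordBytesSent(1024)) {
          }
          std::cout << "paused; outstanding=" << log.bytesSent << "\n";

          // A consumer ack covering the whole window unblocks the stream again.
          if (log.recordAck(log.bytesSent)) {
              std::cout << "notify paused connection; ackedBytes=" << log.ackedBytes
                        << ", maxBytes=" << log.maxBytes << "\n";
          }
          return 0;
      }

      Consistent with this pause/resume-once-per-window behaviour, ackedBytes in the quoted log lines grows by roughly maxBytes (~52 MB) between successive "Notifying paused connection" messages.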

      $ fgrep "Notifying paused connectio" memcached.log  | grep -o "2019-11-14T..:..:"  | uniq -c
        54 2019-11-14T05:36:
        26 2019-11-14T05:37:
        23 2019-11-14T05:38:
        79 2019-11-14T05:39:
        12 2019-11-14T05:40:
        27 2019-11-14T05:41:
        11 2019-11-14T05:42:
      

      Similar logs occur at the same time on all the nodes, e.g. on node .101:

      $ fgrep "Notifying paused connectio" memcached.log  | grep -o "2019-11-14T..:..:"  |uniq -c
        56 2019-11-14T05:36:
        31 2019-11-14T05:37:
        25 2019-11-14T05:38:
        79 2019-11-14T05:39:
        17 2019-11-14T05:40:
        14 2019-11-14T05:41:
      

      However, it turns out that we also see these messages on 4744; there are just slightly fewer of them (210 vs 222) and they occur over 8 minutes rather than 7.

      $ fgrep "Notifying paused connectio" memcached.log  | grep -o "2019-11-..T..:..:"  | uniq -c
        56 2019-11-04T15:33:
        21 2019-11-04T15:34:
        36 2019-11-04T15:35:
        23 2019-11-04T15:36:
        18 2019-11-04T15:37:
        14 2019-11-04T15:38:
        28 2019-11-04T15:39:
        14 2019-11-04T15:40:
      

      At any rate, the ops shouldn't go to zero.

      6.5.0-4744 job: http://perf.jenkins.couchbase.com/job/titan-reb/927/
      6.5.0-4788 job: http://perf.jenkins.couchbase.com/job/titan-reb/947/

      Attachments

        1. checkpoint_mem.png (451 kB)
        2. memory.png (190 kB)
        3. replication_high_wat.png (202 kB)
        4. screenshot-1.png (235 kB)
        5. Screen Shot 2019-11-15 at 21.00.43.png (346 kB)
        6. Screen Shot 2019-11-15 at 21.01.01.png (301 kB)
        7. Screen Shot 2019-11-15 at 21.03.23.png (214 kB)
        8. Screenshot 2019-11-20 at 08.48.42.png (73 kB)
        9. Screenshot 2019-11-20 at 08.48.55.png (62 kB)
        10. Screenshot 2019-11-20 at 09.26.28.png (50 kB)
        11. temp_oom.png (204 kB)
        12. test0.svg (99 kB)
        13. test1.svg (115 kB)
          People

            Assignee: Wayne Siu (wayne)
            Reporter: Dave Finlay (dfinlay)