Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-42606

[Cbbackupmgr] Activity monitor not logging inactivity

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Not a Bug
    • 6.6.1
    • 7.0.0
    • tools
    • 6.6.1 build 9147
    • Untriaged
    • 1
    • Unknown

    Description

      Description:

      The activity monitor does not appear to be logging inactivity.

      Steps to reproduce:

      1. Set CB_DCP_ACTIVITY_MONITOR_FREQUENCY equal to 5 seconds
      2. Start a backup asynchronously.
      3. When ' DCP ' is spotted in the logs repeat the following 5 times:
        1. pause the memcached process for a total of 7 seconds by sending a SIGSTOP.
        2. Unpause memcached by sending a SIGCONT.
      4. Scan the logs for the inactivity message.

      What happens:

      In 6.6.1-9174, the inactivity message is not present in the backup logs. 

      However, in 7.0.0-3583, the inactivity message is present:

      2020-11-09T15:42:35.583+00:00 WARN: (DCP) (default) (vb 153) Stream has been inactive for 5s, last seqno 112 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:247
      2020-11-09T15:42:35.583+00:00 WARN: (DCP) (default) (vb 167) Stream has been inactive for 5s, last seqno 113 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:247
      2020-11-09T15:42:35.583+00:00 WARN: (DCP) (default) (vb 209) Stream has been inactive for 5s, last seqno 112 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:247
      2020-11-09T15:42:35.583+00:00 WARN: (DCP) (default) (vb 223) Stream has been inactive for 5s, last seqno 112 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:247
      

      What I expected to happen:

      I expected the equivalent message to be present in the 6.6.1-9174. backup logs.

       

      Logs for 7.0.0: 

      cbbackupmgr-collectinfo-entbackup_10.112.210.101-2020-11-09T154345.zip

      [^collectinfo-2020-11-09T154308-ns_1@cb.local.zip]

      Logs for 6.6.1: 

      cbbackupmgr-collectinfo-entbackup_10.112.202.101-2020-11-09T160246.zip

      [^collectinfo-2020-11-09T160228-ns_1@cb.local.zip]

       

       

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          james.lee James Lee added a comment - - edited

          Asad Zaidi and I spoke about this before the MB was raised, I'd like to highlight a few things from that conversation:
          1) This was broken in 6.6.0 and should have been fixed in 6.6.1
          2) Asad is using the same test case between MB and CC
          3) There hasn't been any (significant) code change around the DCP client from MH to CC (most of, if not all of the changes are around line wrapping and function signatures)
          4) The activity monitor is slightly more limited that it used to be due to synchronous backfill i.e. a stream is only marked active upon receiving the first mutation. If a stream never receives a mutation (but does receive a snapshot marker) the activity monitor will not be triggered.
          5) We do have unit testing around this (however it may be wrong)

          With that out of the way, the activity monitor seems to working as expected (note that I'm building with the 'mad-hatter' branch, all 6.6.1 builds should be equivalent); I'm using the following steps to test it:
          1) Create a test cluster
          2) Load the travel-sample bucket
          3) Add a 'time.Sleep(5 * time.Second)' just before begin opening DCP streams (obviously this can't be done in QE)
          4) Run 'cbbackupmgr' with 'CB_DCP_ACTIVITY_MONITOR_FREQUENCY=1s'
          5) Whilst 'cbbackupmgr' is sleeping, use the 'tc' command to introduce an artificial delay to the network of 5s
          6) Let the backup run for a little bit
          7) Remove the artificial delay
          8) Inspect the logs

          Looking at the logs from the above steps, I see:

          2020-11-09T17:28:24.278+00:00 TEMP: enable tc filter -- couchbase.(*Source).Data() at source.go:312
          2020-11-09T17:28:29.279+00:00 (DCP) (travel-sample) (vb 0) Creating DCP stream {"uuid":54313299993859,"start_seqno":0,"end_seqno":38,"snap_start":0,"snap_end":0,"retries":0}
          2020-11-09T17:28:38.273+00:00 WARN: (DCP) (travel-sample) (vb 0) Stream has been inactive for 1s, last seqno 12 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:247
          2020-11-09T17:28:39.273+00:00 WARN: (DCP) (travel-sample) (vb 0) Stream is no longer inactive, last seqno 14 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:236
          2020-11-09T17:28:40.273+00:00 WARN: (DCP) (travel-sample) (vb 0) Stream has been inactive for 1s, last seqno 14 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:247
          2020-11-09T17:28:41.658+00:00 (DCP) (travel-sample) (vb 0) Stream closed because all items were streamed | {"uuid":54313299993859,"snap_start":38,"snap_end":38,"snap_complete":true,"last_seqno":38,"retries":0}
          

          We can see the activity monitor pick up that things are going slowly i.e it detects the inactivity, and then the new activity. This may just be a case of this being difficult to test without modifying the code, although, it is odd that Asad Zaidi has been unable to trigger the activity monitor using testrunner.

          james.lee James Lee added a comment - - edited Asad Zaidi and I spoke about this before the MB was raised, I'd like to highlight a few things from that conversation: 1) This was broken in 6.6.0 and should have been fixed in 6.6.1 2) Asad is using the same test case between MB and CC 3) There hasn't been any (significant) code change around the DCP client from MH to CC (most of, if not all of the changes are around line wrapping and function signatures) 4) The activity monitor is slightly more limited that it used to be due to synchronous backfill i.e. a stream is only marked active upon receiving the first mutation. If a stream never receives a mutation (but does receive a snapshot marker) the activity monitor will not be triggered. 5) We do have unit testing around this (however it may be wrong) With that out of the way, the activity monitor seems to working as expected (note that I'm building with the ' mad-hatter ' branch, all 6.6.1 builds should be equivalent); I'm using the following steps to test it: 1) Create a test cluster 2) Load the travel-sample bucket 3) Add a ' time.Sleep(5 * time.Second) ' just before begin opening DCP streams (obviously this can't be done in QE) 4) Run ' cbbackupmgr ' with ' CB_DCP_ACTIVITY_MONITOR_FREQUENCY=1s ' 5) Whilst ' cbbackupmgr ' is sleeping, use the ' tc ' command to introduce an artificial delay to the network of 5s 6) Let the backup run for a little bit 7) Remove the artificial delay 8) Inspect the logs Looking at the logs from the above steps, I see: 2020-11-09T17:28:24.278+00:00 TEMP: enable tc filter -- couchbase.(*Source).Data() at source.go:312 2020-11-09T17:28:29.279+00:00 (DCP) (travel-sample) (vb 0) Creating DCP stream {"uuid":54313299993859,"start_seqno":0,"end_seqno":38,"snap_start":0,"snap_end":0,"retries":0} 2020-11-09T17:28:38.273+00:00 WARN: (DCP) (travel-sample) (vb 0) Stream has been inactive for 1s, last seqno 12 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:247 2020-11-09T17:28:39.273+00:00 WARN: (DCP) (travel-sample) (vb 0) Stream is no longer inactive, last seqno 14 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:236 2020-11-09T17:28:40.273+00:00 WARN: (DCP) (travel-sample) (vb 0) Stream has been inactive for 1s, last seqno 14 -- couchbase.(*DCPAsyncWorker).monitorActivity.func1() at dcp_async_worker.go:247 2020-11-09T17:28:41.658+00:00 (DCP) (travel-sample) (vb 0) Stream closed because all items were streamed | {"uuid":54313299993859,"snap_start":38,"snap_end":38,"snap_complete":true,"last_seqno":38,"retries":0} We can see the activity monitor pick up that things are going slowly i.e it detects the inactivity, and then the new activity. This may just be a case of this being difficult to test without modifying the code, although, it is odd that Asad Zaidi has been unable to trigger the activity monitor using testrunner.

          Thank you James Lee for looking into this,

          Although it's odd that it cannot be reproduced by the same testrunner code in 6.6.1, but can be reproduced in 7.0.0 I think it's fair to say that the nature of the test itself is quite finicky and difficult to test from a qe perspective in a consistent manner. This is because the 'pause' must happen when at least 1 mutation has been sent by an active stream. In addition, when this is coupled with pausing memcached using a SIGSTOP, it may not result in the most reliable of techniques to trigger inactivity.

          On the contrary, using the tc tool to introduce artificial delay is a little dangerous as it may render the testing virtual machines unusable or inaccessible and it requires a sleep to be present before DCP streams are opened in cbbackupmgr which means that this cannot be easily reproduced by qe.

          Which means that perhaps we can removing the test case, I will file a CBQE for this and close this is issue as it's not a bug.

          asad.zaidi Asad Zaidi (Inactive) added a comment - Thank you James Lee for looking into this, Although it's odd that it cannot be reproduced by the same testrunner code in 6.6.1, but can be reproduced in 7.0.0 I think it's fair to say that the nature of the test itself is quite finicky and difficult to test from a qe perspective in a consistent manner. This is because the 'pause' must happen when at least 1 mutation has been sent by an active stream. In addition, when this is coupled with pausing memcached using a SIGSTOP, it may not result in the most reliable of techniques to trigger inactivity. On the contrary, using the tc tool to introduce artificial delay is a little dangerous as it may render the testing virtual machines unusable or inaccessible and it requires a sleep to be present before DCP streams are opened in cbbackupmgr which means that this cannot be easily reproduced by qe. Which means that perhaps we can removing the test case, I will file a CBQE for this and close this is issue as it's not a bug.

          People

            asad.zaidi Asad Zaidi (Inactive)
            asad.zaidi Asad Zaidi (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty