Couchbase Server
MB-47169

[high-bucket] - 30 multi bucket test rebalance failed with buckets_cleanup_failed error


Details

    Description

      Environment: 7.0.0-5295
      Test: 30 bucket test with all the components
      Failed at: Rebalance step
      Error message:

      completionMessage":"Rebalance exited with reason
      Unknown macro:
      Unknown macro: {buckets_cleanup_failed,['ns_1@172.23.96.20']}
      ."}

      Link to the job : http://perf.jenkins.couchbase.com/view/Eventing/job/themis_multibucket/102/ 

      Steps of the test :

      1. Load the buckets with documents 
      2. Create n1ql indexes 
      3. Initialise XDCR (init_only_xdcr())
      4. Creating the eventing functions
      5. Creating FTS indexes 
      6. Creating Analytics dataset
      7. Running rebalance for each phase as follows:
        1. KV rebalance 
          1. Rebalance in with mutations
          2. Rebalance swap 
          3. Rebalance out 
        2. Index rebalance
          1. Rebalance in 
          2. Rebalance swap
          3. Rebalance Out 
        3. Eventing rebalance
          1. Rebalance in 
          2. Rebalance swap
          3. Rebalance Out 
        4. CBAS rebalance 
          1. Rebalance in 
          2. Rebalance swap
          3. Rebalance Out 
      8. Backup
      9. FTS swap rebalance

      The test failed when the Eventing swap rebalance was being executed (marked in red).
      The cluster setup and details are shown in the screenshot attached below.

      Attachments

        1. image-2021-08-19-18-50-23-794.png (295 kB)
        2. image-2021-08-19-19-36-41-983.png (682 kB)
        3. loadavg.png (148 kB)
        4. screenshot-1.png (42 kB)
        5. screenshot-10.png (32 kB)
        6. screenshot-11.png (47 kB)
        7. screenshot-12.png (44 kB)
        8. screenshot-13.png (38 kB)
        9. screenshot-14.png (38 kB)
        10. screenshot-15.png (45 kB)
        11. screenshot-16.png (37 kB)
        12. screenshot-17.png (37 kB)
        13. screenshot-18.png (38 kB)
        14. screenshot-19.png (33 kB)
        15. screenshot-2.png (35 kB)
        16. screenshot-20.png (39 kB)
        17. Screenshot 2021-07-01 at 5.30.16 PM.png (75 kB)
        18. Screenshot 2021-07-01 at 5.31.24 PM.png (33 kB)
        19. Screenshot 2021-12-01 at 21-22-09 Chronicle Node - Grafana.png (92 kB)
        20. screenshot-3.png (49 kB)
        21. screenshot-4.png (77 kB)
        22. screenshot-5.png (39 kB)
        23. screenshot-6.png (88 kB)
        24. screenshot-7.png (67 kB)
        25. screenshot-8.png (31 kB)
        26. screenshot-9.png (29 kB)
        27. slow_fsyncs_1s_rate.png (150 kB)
        28. slow_fsyncs_1s.png (124 kB)
        29. slow_fsyncs_5s.png (114 kB)


          Activity

            Aliaksey Artamonau (Inactive) added a comment -

            I'll update the ticket later with more details on what I think the source of the problem is. Meanwhile, would it be possible to rerun the test with some modifications to the IO scheduler parameters?

            From what I could tell, the cluster is set up to use the 'deadline' IO scheduler:

            /sys/block/sda/queue/scheduler:noop [deadline] cfq 
            

            I would like to try adjusting the write_expire parameter down from 5000 milliseconds to 1000 milliseconds:

            /sys/block/sda/queue/iosched/write_expire:5000
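
            For reference, a minimal shell sketch of the requested change (run as root on each node; the sda device name matches this cluster and is an assumption for other setups):

            # Lower the deadline scheduler's write_expire from 5000 ms to 1000 ms
            echo 1000 > /sys/block/sda/queue/iosched/write_expire
            # Verify the new value
            cat /sys/block/sda/queue/iosched/write_expire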
            


            Aliaksey Artamonau (Inactive) added a comment -

            In fact, I would also like to request a run with the scheduler set to 'cfq' (with default settings).


            Aliaksey Artamonau (Inactive) added a comment -

            Having thought about it a bit more, I don't anticipate that tuning the deadline scheduler will be very fruitful. So I'd like to see results with cfq first.

            Jyotsna Nayak added a comment - edited

            Hi, I have scheduled the test (it is in the queue) and have set the scheduler to cfq as discussed.
            Link to the Job: http://perf.jenkins.couchbase.com/view/Eventing/job/themis_multibucket/104/

            Link to the Gerrit review regarding the Multibucket test changes:  http://review.couchbase.org/c/perfrunner/+/156490

             

            Jyotsna Nayak added a comment - edited

            Aliaksey Artamonau, the test failed after creating the Analytics datasets.
            The scheduler is set on all the nodes and the following messages are printed in the log:

            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:33 2021-07-08T09:15:34 [INFO] Running: echo cfq > /sys/block/sda/queue/scheduler
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]
            21:45:34 2021-07-08T09:15:35 [INFO] The output of the command is noop deadline [cfq]

            Link to the job: http://perf.jenkins.couchbase.com/view/Eventing/job/themis_multibucket/104/
            Can you please let me know what additional information this parameter provides? And does the test (link above) contain the information required for further analysis?


            Aliaksey Artamonau (Inactive) added a comment - edited

            The logs from the run with the cfq IO scheduler show some mild improvement, but it's likely that the issues would still have reoccurred had the test run to completion.

            Here is the information I gathered from the original logs. Two nodes are affected: 172.23.96.20 (FTS) and 172.23.96.23 (Analytics). The FTS node is affected to a much greater degree. In addition to the rebalance failures, that node also times out during tombstone purging. In both situations the timeouts occur on chronicle operations.

            Chronicle is very sensitive to disk latency, or more specifically, to fsync latency. On the two affected nodes there are hundreds of fsyncs that take much longer than we would want.

            Number of fsyncs that take longer than 1 second over time:

            On the FTS node it's even worse: a significant number of fsyncs take over 5 seconds:

            Looking at the FTS and Analytics logs, the periods in which fsync latency spikes appear to coincide with indexing and compaction.

            There's a definite correlation between the system loadavg (mostly attributable to IO) and the rate of slow fsyncs.

            Load average:

            The rate of slow fsyncs:

            I've made some changes to chronicle that will hopefully make it a bit more resilient to spikes in fsync latency. Jyotsna Nayak, could you please rerun the test with the latest changes in chronicle (with the noop IO scheduler, as in the original run)?

            At the same time, fsyncs taking over 5 seconds will still be a huge problem, so we need the FTS and Analytics teams to look into how we can share access to the disk more fairly. Note that the kv-engine team has run into similar problems before; some details on how they were dealt with are available in MB-25509, although their approach may not be easily applicable.


            Jyotsna Nayak added a comment -

            Aliaksey Artamonau, can you please confirm whether the changes are present in build 7.0.1-5878?


            Aliaksey Artamonau (Inactive) added a comment -

            Jyotsna Nayak, no, the changes I'm interested in are not part of that build.


            Jyotsna Nayak added a comment -

            Aliaksey Artamonau, can you please suggest a build that contains the changes you are interested in?


            Aliaksey Artamonau (Inactive) added a comment -

            7.0.1-5942 is the earliest build that has the changes.


            Jyotsna Nayak added a comment -

            Link to the job scheduled on build 7.0.1-5942: http://perf.jenkins.couchbase.com/view/Eventing/job/themis_multibucket/105/console

            Using patch 9 from this Gerrit change: http://review.couchbase.org/c/perfrunner/+/156490/9


            Jyotsna Nayak added a comment -

            The test failed at step 7.2.1, i.e. when it was trying to rebalance in an index node.
            The error message I am seeing is as follows:

            "completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n badmatch,\n {error,\n {bad_nodes,cbas,get_agent,\n [\{'ns_1@172.23.96.23',{error,timeout}}]},\n [

            Unknown macro: {service_rebalancer,wait_for_agents,1,n [{file,"src/service_rebalancer.erl"}

            ,\n {line,80}]},\n {service_rebalancer,run_rebalance,1,\n [

            Unknown macro: {file,"src/service_rebalancer.erl"}

            ,\n {line,59}]},\n

            Unknown macro: {proc_lib,init_p,3,n [{file,"proc_lib.erl"}

            ,{line,234}]}]}}."}

            Links to the logs : 
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.96.15.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.96.19.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.96.23.zip (analytics node) 

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.97.177.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.99.157.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.99.158.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.99.159.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.99.160.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-105/172.23.99.161.zip

            Dave Finlay added a comment -

            Aliaksei is out. I will pick up this ticket.


            Jyotsna Nayak added a comment -

            Dave Finlay, please let me know if any more tests need to be run for further debugging. Thank you.

            Dave Finlay added a comment -

            Thanks Jyotsna.

            It's the same problem as before. Chronicle experiences a big spike in disk latencies, which causes cascading timeouts in ns_server on node .23. This impacts the CBAS service agent, which gets killed as a result of the general chaos inside ns_server, and the rebalance then exits because it can't proceed with the Analytics rebalance.

            A quick review of the timeline is as follows. Rebalance starts:

            2021-07-26T14:55:03.980-07:00, ns_orchestrator:0:info:message(ns_1@172.23.99.157) - Starting rebalance, KeepNodes = ['ns_1@172.23.99.157','ns_1@172.23.99.158',
                                             'ns_1@172.23.99.159','ns_1@172.23.96.19',
                                             'ns_1@172.23.96.15','ns_1@172.23.97.177',
                                             'ns_1@172.23.96.23','ns_1@172.23.96.20',
                                             'ns_1@172.23.99.161'], EjectNodes = ['ns_1@172.23.99.160'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 0f154250e44b5d23778e153c3c55263c
            

            The orchestrator reaches the point of beginning the Analytics rebalance and waits for a minute for the CBAS agent to be available. When there's no sign of it, the rebalance terminates:

            [ns_server:debug,2021-07-26T15:06:55.935-07:00,ns_1@172.23.99.157:service_rebalancer-cbas<0.1103.228>:service_agent:wait_for_agents:68]Waiting for the service agents for service cbas to come up on nodes: (1 minute timeout)
            ['ns_1@172.23.96.23']
            ...
            [ns_server:error,2021-07-26T15:07:56.239-07:00,ns_1@172.23.99.157:service_rebalancer-cbas<0.1103.228>:service_agent:process_bad_results:854]Service call get_agent (service cbas) failed on some nodes:
            [{'ns_1@172.23.96.23',{error,timeout}}]
            2021-07-26T15:07:56.241-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.99.157) - Rebalance exited with reason {service_rebalance_failed,cbas,
                                          {{badmatch,
                                            {error,
                                             {bad_nodes,cbas,get_agent,
                                              [{'ns_1@172.23.96.23',{error,timeout}}]}}},
            

            The reason that the CBAS agent isn't available is that it got shut down when a whole collection of processes inside ns_server died as a result of a cascade of timeouts caused by chronicle getting stuck trying to write a snapshot to disk. Here's where the CBAS agent dies:

            [error_logger:info,2021-07-26T11:52:20.181-07:00,ns_1@172.23.96.23:service_agent_children_sup<0.6331.0>:ale_error_logger_handler:do_log:101]
            =========================PROGRESS REPORT=========================
                supervisor: {local,service_agent_children_sup}
                started: [{pid,<0.6970.0>},
                          {id,{service_agent,cbas}},
                          {mfargs,{service_agent,start_link,[cbas]}},
                          {restart_type,permanent},
                          {shutdown,1000},
                          {child_type,worker}]
            ...
            [ns_server:debug,2021-07-26T15:06:16.194-07:00,ns_1@172.23.96.23:<0.6972.0>:ns_pubsub:do_subscribe_link_continue:152]Parent process of subscription {ns_config_events,<0.6970.0>} exited with reason shutdown
            

            And here's where Chronicle dies trying to take a snapshot:

            [chronicle:info,2021-07-26T15:06:00.952-07:00,ns_1@172.23.96.23:chronicle_agent<0.224.0>:chronicle_agent:initiate_snapshot:2655]Taking snapshot at seqno 15425.
            Config:
            {log_entry,<<"78107037ce794323d06e2e7fd3cd9570">>,
                       {5,'ns_1@172.23.99.157'},
                       8150,
                       {config,{<<"a4e97106a698a8398cddd8dd240fa6c4">>,1,23},
                               0,<<"c225a353020bc03f6a1349d868a9aefe">>,
                               #{'ns_1@172.23.96.15' =>
            ...
            [error_logger:error,2021-07-26T15:06:15.953-07:00,ns_1@172.23.96.23:chronicle_config_rsm<0.6006.0>:ale_error_logger_handler:do_log:101]
            =========================ERROR REPORT=========================
            ** State machine chronicle_config_rsm terminating
            ** Last event = {cast,{event,{take_snapshot,15425}}}
            ** When server state  = {{follower,'ns_1@172.23.99.157',
                                         <<"78107037ce794323d06e2e7fd3cd9570">>,
                                         {5,'ns_1@172.23.99.157'}},
            ...
            ** Reason for termination = exit:{timeout,
                                              {gen_server,call,
                                               [chronicle_snapshot_mgr,
                                                {get_snapshot_saver,chronicle_config_rsm,
                                                 <0.6006.0>,15425},
                                                15000]}}
            ** Callback modules = [chronicle_rsm]
            ** Callback mode = handle_event_function
            ** Stacktrace =
            **  [{gen_server,call,3,[{file,"gen_server.erl"},{line,223}]},
                 {chronicle_snapshot_mgr,save_snapshot,3,
                                         [{file,"src/chronicle_snapshot_mgr.erl"},
                                          {line,69}]},
                 {chronicle_rsm,handle_take_snapshot,3,
                                [{file,"src/chronicle_rsm.erl"},{line,1428}]},
                 {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1120}]},
                 {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]
            ** Time-outs: {1,[{state_timeout,maybe_send_noop}]}
            

            The problem looks to be that it takes Chronicle about 20 s to create (and fsync) a directory.

            Chronicle fsyncs spike around this time. Before the problems start there are 0 fsyncs in excess of 1 s, in the minute beginning 15:05:06 there are 11 fsyncs in excess of 1 s, and in the period after that it jumps to 23 fsyncs in excess of 1 s.

            The average fsync time rises to above 2.5s:

            And the number of fsyncs per second performed by Chronicle drops to essentially zero - it's making no progress.

            Strongly correlated with the spike in Chronicle fsync time we see an increase in the number of bytes written by Analytics:

            It's hard to tell precisely when these bytes were written, but in the worst case it could have been the case that something around 4.6 GB was written to disk in a short interval - maybe 10-20 s. (Taking the gap before and after.)

            Note that in addition we see something of the order of 150 allocation stalls (direct reclaim) occur at about this time:

            This indicates there's a lot of pressure on the virtual memory system - processes are forced to reclaim memory themselves by writing dirty pages to disk. In the past we've seen timeouts in ns_server correlated with a spike in this metric.

            Till Westmann: can I ask 2 things:

            1. Could you take a look at the Analytics side of this issue - e.g. what's the reason behind the I/O, can you get a sense of the max rate at which you are writing bytes to disk and the sizes of the chunks that are getting written?
            2. Is there a tunable in Analytics that we should set that would limit the max rate at which Analytics writes to disk such that we can run the test again?

            Murtadha Hubail added a comment -

            Dave Finlay,

            The Analytics operations that result in writing a lot of data are:

            1. Flush Operations: persisting in-memory btree data to disk. There is a configurable parameter (storageMaxConcurrentFlushesPerPartition=2 by default) that limits the number of concurrent flushes per partition in the node.
            2. Merge Operations: merging (compacting) multiple btrees into a single btree. There is a configurable parameter (storageMaxConcurrentMergesPerPartition=2 by default) that limits the number of concurrent merges per partition in the node.

            In both flush and merge operations, Analytics writes consist of pages of size 128KB. Each operation forces to disk every time it writes 16MB worth of pages (128 pages).
            Doing the math in this case for node .23 which has 16 partitions configured:
            Max concurrent flushes = 2 * 16 = 32
            Max concurrent merges = 2 * 16 = 32

            Therefore, we can end up with:
            (32+32) * 16MB worth of data pending force to disk = 1GB
            There are other sources of writes but these should be the major ones.

            If you would like to reduce the number of concurrent writes produced by Analytics, you can do the following and run the test again:

            1. Reduce the number of partitions configured in the node.
            2. Configure storageMaxConcurrentFlushesPerPartition to 1.
            3. Configure storageMaxConcurrentMergesPerPartition to 1.

            The force after writing 16MB is currently not exposed as a configurable parameter in our service config API, but I can submit a patch to make it so if needed. We also have a configurable parameter (storageWriteRateLimit=0 by default) which makes any write operation wait after writing a certain number of bytes per second. You can configure it with a value > 0 bytes/s, but I recommend leaving this as a last resort.
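
            If it does come to that, a sketch of setting it through the same Analytics service config API (the ~100 MB/s value here is only an example, not a recommendation):

            curl -v -u Administrator:password -X PUT -d storageWriteRateLimit=104857600 http://localhost:8095/analytics/config/service
            curl -v -u Administrator:password -X POST http://localhost:8095/analytics/cluster/restart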

            Please let me know if more information is needed from Analytics side.

            Dave Finlay added a comment - edited

            Thanks Murtadha Hubail, this is awesome detail.

            Indeed some follow up questions:

            • Do you fsync after each operation causes a write to disk? Or do you accumulate writes and at the end of some period only then fsync?
            • I take from your analysis of what's happening on .23 that at any point in time there could be 64 concurrent operations writing a total of 1 GB to disk. I presume that once these operations finish, new operations can commence, so essentially it is possible that there are 64 operations continually writing to disk. Is that true?
            • Is there any logging in Analytics from T15:04 to T15:10 that would tell us how many operations ran and when they started and finished?

            I would like to start with the configurables that you mention. Could you provide instructions as to how to configure:

            • Set the number of partitions to 8.
            • Configure storageMaxConcurrentFlushesPerPartition to 1.
            • Configure storageMaxConcurrentMergesPerPartition to 1.

            Once we get this information, I'll ask Jyotsna to run again. Thanks!


            Murtadha Hubail added a comment -

            Dave Finlay,

            • Do you fsync after each operation causes a write to disk? Or do you accumulate writes and at the end of some period only then fsync?
              We don't fsync after each page write (128KB). Each operation performs an fsync after writing 16MB worth of pages. Once that fsync completes, the operation continues writing the next 16MB followed by another fsync and so on.
            • I take from your analysis as to what's happening on .23 that at any point in time there could be 64 concurrent operations writing a total of 1 GB to disk. I presume that once these operations finish, new operations can commence and essentially it is possible that there are 64 operations continually writing to disk. Is that true?
              True.
            • Is there any logging in Analytics from T15:04 to T15:10 that would tell us how many operations ran and when they started and finished?
              During that time window, our periodic checkpoint, which runs once every 1 hour by default, was triggered. It results in scheduling async flush operations on all partitions of each dataset in the node. Since this test has 60 datasets, it is expected to have a large volume of data to write during that time window.

              2021-07-26T15:04:29.203-07:00 INFO CBAS.context.DatasetLifecycleManager [Checkpoint Thread (121)] Async flushing Dataset (139), Partition (0)
              2021-07-26T15:04:29.203-07:00 INFO CBAS.context.DatasetLifecycleManager [Checkpoint Thread (121)] Async flushing Dataset (139), Partition (2)
              .....
              2021-07-26T15:10:29.235-07:00 INFO CBAS.context.DatasetLifecycleManager [Checkpoint Thread (121)] Async flushing Dataset (127), Partition (15)
              ....
              2021-07-26T15:14:29.243-07:00 INFO CBAS.context.DatasetLifecycleManager [Checkpoint Thread (121)] Async flushing Dataset (104), Partition (15)
              

              In analytics_debug.log, we log when each flush/merge operation is scheduled, started, and finished. The logs have the following pattern:

              2021-07-26T15:04:29.296-07:00 DEBU CBAS.impls.LSMHarness [Executor-2:6f869fc5a5d9e893196043df02a7a0f6] Flush is being scheduled on {index name}
              ...
              2021-07-26T15:04:31.498-07:00 DEBU CBAS.impls.LSMHarness [Executor-1539:6f869fc5a5d9e893196043df02a7a0f6] Started a flush operation for index: {index name}
              ...
              2021-07-26T15:04:35.400-07:00 DEBU CBAS.impls.LSMHarness [Executor-1539:6f869fc5a5d9e893196043df02a7a0f6] Finished the flush operation for index: {index name}
              

            • Set the number of partitions to 8.
              Setting the number of Analytics partitions is done during node init by specifying as many --node-init-analytics-path entries as partitions needed and providing a unique disk path to each. QE might already have the test framework to do that. If only a single path is specified, we automatically configure the number of partitions to be the same as the number of vCores in the node, which is what happened in node .93.
            • Configure storageMaxConcurrentFlushesPerPartition to 1.
            • Configure storageMaxConcurrentMergesPerPartition to 1.
              For storageMaxConcurrentFlushesPerPartition and storageMaxConcurrentMergesPerPartition, the Analytics Service Configuration API can be used to change these values once the service is up. For example:

              curl -v -u Administrator:password -X PUT -d storageMaxConcurrentFlushesPerPartition=1 -d storageMaxConcurrentMergesPerPartition=1 http://localhost:8095/analytics/config/service
              

              After that, the Analytics service restart API needs to be called for the changes to take effect:

              curl -v -u Administrator:password -X POST http://localhost:8095/analytics/cluster/restart
              

            Dave Finlay added a comment - edited

            Thanks Murtadha.

            It sounds like the behavior varies quite significantly if the user chooses to only provide one disk path to Analytics. Question: you mention that if only one path is provided to Analytics, you scale with the number of cores; however, the number of cores on this machine is 48 and the number of partitions of each dataset is 16. Do you cap at 16?

            Would it be the case that if you scheduled a periodic flush more frequently, you would write less each time? (Intuitively it seems like it should, but I should ask.) If so, would you be open to flushing more frequently so that it's not such a huge amount of data when it does happen?

            Jyotsna Nayak:

            Murtadha says:

            Setting the number of Analytics partitions is done during node init by specifying as many --node-init-analytics-path as partitions needed and providing a unique disk path to each. QE might already have the test framework to do that.

            Do you have the capability to start Analytics with 8 unique disk paths so we can test with that number of partitions?

            Could you also configure the test to specify the max concurrent flushes and merges to 1 as Murtadha indicates?

            If so, could you run a test with these settings? Also, please retain the current cfq scheduler setting.


            Murtadha Hubail added a comment -

            Yes, we cap the auto-config at 16 partitions. We also consider the Analytics RAM and require 1GB for each auto-configured partition. More information about the auto-config can be found here.

            To clarify a bit on the periodic checkpoint, the idea behind it is to persist datasets which haven't been flushed for a while. The checkpoint itself runs every 2 minutes and schedules a flush only on datasets which have not been flushed in the last 60 minutes or more. In a production system, incoming data will cause a flush on one or more datasets once a certain amount of memory is consumed, but it looks like in this test either 1) many datasets stopped receiving any data for a while or 2) we never consumed enough memory to trigger a flush.

            The 60-minute interval can be changed using the configuration parameter (txnDatasetCheckpointInterval=3600 by default). However, if many datasets in this test stop receiving data around the same time, which seems to be the case since the datasets appear to be copies of each other, then changing this parameter to a lower value won't change the amount of data to be written, since we will still have the same amount of non-flushed data in memory.
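
            Should we still want to experiment with it despite that caveat, a sketch assuming txnDatasetCheckpointInterval is settable through the same /analytics/config/service endpoint (the 1800 s value is only illustrative):

            curl -v -u Administrator:password -X PUT -d txnDatasetCheckpointInterval=1800 http://localhost:8095/analytics/config/service
            curl -v -u Administrator:password -X POST http://localhost:8095/analytics/cluster/restart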

            Dave Finlay added a comment -

            Thanks Murtadha; this is helpful. It's possible that this test is set up in a way that produces unusual behavior, i.e. Analytics on the same partition as the ns_server configuration data, a single disk path provided and lots of cores resulting in 16 partitions, and bursty data ingestion that stays below the memory threshold and then causes large flushes to happen all at once. Of course, it's also possible that this is the kind of thing we would see in production scenarios too.

            OK, for now, let's see what Jyotsna says in terms of getting the test to be run with the adjusted settings and then go from there. Thanks!

            Dave Finlay added a comment -

            Hi Jyotsna Nayak:

            Any updates? Can I get this test run again as follows:

            1. Do you have the capability to start Analytics with 8 unique disk paths so we can test with that number of partitions?
            2. Could you also configure the test to specify the max concurrent flushes and merges to 1 as Murtadha indicates in this comment.
            3. Please retain the current cfq scheduler setting.

            Thanks!


            Jyotsna Nayak added a comment -

            Dave Finlay, hi. I was checking the Analytics node to see whether there are enough partitions.
            This is the output of the command "df -h" on the .23 node:

            I will check all the nodes to see which one has more than one partition that can be used for the test. 

            I have also added the curl commands to the test.
            Murtadha Hubail, can you please confirm that the port is 8095? I am getting the following error message:

            Can you also please tell me how to confirm that the parameter is set once I run the above 2 curl commands?

             


            murtadha.hubail Murtadha Hubail added a comment -

            Jyotsna Nayak,

            You don't have to have unique storage volumes to specify unique paths. All the paths can use the same physical device, but you need to specify a unique directory path for each partition. For example, you can specify the following to create 8 partitions:

            /dev/sda2/analytics/data1
            /dev/sda2/analytics/data2
            /dev/sda2/analytics/data3
            /dev/sda2/analytics/data4
            /dev/sda2/analytics/data5
            /dev/sda2/analytics/data6
            /dev/sda2/analytics/data7
            /dev/sda2/analytics/data8

            If you are using the perfrunner, I believe this should be as simple as specifying the num_io_devices for Analytics in the test config. See this for example.
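
            For illustration, in a perfrunner test config this would likely be a one-line setting; the section name below is an assumption on my part, only num_io_devices is named above:

            [analytics]
            num_io_devices = 8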

            As for the port, 8095 is the default Analytics service port. The URL in your screenshot is wrong though: you need to replace analytic1 with analytics. If you are connecting to a node that has the Analytics service running, then that should work.

            To confirm that changes were applied, after restarting the Analytics service, call the same API using a GET:

            curl -v -u Administrator:password http://localhost:8095/analytics/config/service
            

            That will return the currently set values.
            Again, if you are using perfrunner, the infra to change the Analytics config is already there. For example, see this.

            Please let me know if more information is needed. 
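
            For illustration, the full sequence to change and then verify these service-level settings would look roughly like the following (a sketch only; the exact HTTP verbs for the update and restart calls are assumptions on my part, while the GET verification call is the one shown above):

            # Set the storage concurrency parameters (service-level config)
            curl -v -u Administrator:password -X PUT http://localhost:8095/analytics/config/service \
                 -d storageMaxConcurrentFlushesPerPartition=1 \
                 -d storageMaxConcurrentMergesPerPartition=1
            # Restart the Analytics cluster so the new values take effect
            curl -v -u Administrator:password -X POST http://localhost:8095/analytics/cluster/restart
            # Verify the currently active values
            curl -v -u Administrator:password http://localhost:8095/analytics/config/service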

            dfinlay Dave Finlay added a comment -

            Hi Jyotsna Nayak: do you have an estimate of when you might get a chance to re-run this test? Thanks.


            jyotsna.nayak Jyotsna Nayak added a comment -

            Dave Finlay,

            Sorry about the delay. I have scheduled the test as described above; it is currently running.

            Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket_jobs/8/consoleFull
            Will post my observations after the test completes.

            dfinlay Dave Finlay added a comment -

            Thanks Jyotsna. I see the test was aborted - was there a problem?


            jyotsna.nayak Jyotsna Nayak added a comment -

            Dave, the test didn't progress after rebalance-in completed for KV. I am triaging this test on a smaller scale. Will have more updates on this soon.


            jyotsna.nayak Jyotsna Nayak added a comment -

            Hi, I have changed a part of the code so that the test should no longer pause for a really long time after rebalance-in of the KV service.
            Patch: http://review.couchbase.org/c/perfrunner/+/156490/23..27
            The job is scheduled; will post more updates soon.
            Link to job: http://perf.jenkins.couchbase.com/job/themis_multibucket_jobs/9/

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            The test has completed the run and failed at step 8 (backup) due to the absence of parameters for this step in the original test file (I am looking into this).
            The rebalance values are as follows:

            Component | Type of rebalance   | Rebalance time (mins)
            KV        | In (with mutations) | 6.908
            KV        | Swap                | 21.86
            KV        | Out                 | 10.3
            Index     | In                  | 7.15
            Index     | Swap                | 19.73
            Index     | Out                 | 7.51
            Eventing  | In                  | 0.91
            Eventing  | Swap                | 10.91
            Eventing  | Out                 | 0.65
            CBAS      | In                  | 7.27
            CBAS      | Swap                | 17.73
            CBAS      | Out                 | 17.28

            Link  to  job: http://perf.jenkins.couchbase.com/job/themis_multibucket_jobs/9/
            The following parameters were set:
            The scheduler: 

            Running: echo cfq > /sys/block/sda/queue/scheduler
            The output of the command is noop deadline [cfq] 

            The analytics path: 

            Configuring analytics path on 172.23.96.23: ['/analytics/dev0', '/analytics/dev1', '/analytics/dev2', '/analytics/dev3', '/analytics/dev4', '/analytics/dev5', '/analytics/dev6', '/analytics/dev7']

            The extra parameters set for CBAS:

            Setting the parameter for storageMaxConcurrentFlushesPerPartition = 1
            Setting the parameter for storageMaxConcurrentMergesPerPartition = 1

            Message printed when values are set:

            About to connect() to localhost port 8095 (#0)
            Trying ::1...
            Connected to localhost (::1) port 8095 (#0)
            Server auth using Basic with user 'Administrator' > POST /analytics/cluster/restart HTTP/1.1 > Authorization: Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA== > User-Agent: curl/7.29.0 > Host: localhost:8095 > Accept: / > < HTTP/1.1 202 Accepted < connection: keep-alive < content-type: application/json; charset=UTF-8 < content-length: 602
            < Connection #0 to host localhost left intact

            Message printed when restart command is run: 

            {"cluster":{"metadata_node":"dc469a84839fefa847fa99d56b079f1a","ncs":[{"node_id":"dc469a84839fefa847fa99d56b079f1a","partitions":[

            Unknown macro: {"active"}

            ,{"active":true,"partition_id":"partition_1"},{"active":true,"partition_id":"partition_2"},{"active":true,"partition_id":"partition_3"},{"active":true,"partition_id":"partition_4"},{"active":true,"partition_id":"partition_5"},{"active":true,"partition_id":"partition_6"},{"active":true,"partition_id":"partition_7"}],"pid":5509,"state":"ACTIVE"}],"state":"ACTIVE"},"date":"Thu Sep 02 00:26:27 PDT 2021","status":"SHUTTING_DOWN"}
             

            dfinlay Dave Finlay added a comment -

            Thanks Jyotsna Nayak! Will take a look at the logs to see how things went in general and especially on the Analytics and FTS nodes.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-1237 contains chronicle commit 69a503e with commit message:
            MB-47169 Prepare snapshot directory in chronicle_agent.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.2-6642 contains chronicle commit 69a503e with commit message:
            MB-47169 Prepare snapshot directory in chronicle_agent.

            dfinlay Dave Finlay added a comment -

            So, the summary from looking at the logs, is essentially:

            • As far as I can tell, the changes to the Analytics configuration in terms of the number of disk paths (and thus partitions) and the concurrency look to have been applied correctly, though I'm not expert enough to confirm this.
            • At the macro level there's no observable bad behavior (except for the backup failure, which I didn't look at and don't believe is related to the Chronicle issues). By this I mean that rebalance didn't fail.
            • The situation on the Analytics node with respect to Chronicle is still not good. Mostly we got lucky and KV rebalance didn't overlap with heavy Analytics writes; however, on the one occasion where there was a short period of overlap, fsync latency spiked, Chronicle latencies jumped and the backlog of replicated log entries to be appended backed up to 300. It wasn't enough to cause processes to start timing out inside ns_server and negatively affect the rebalance, but it probably wasn't that far off. If the period of overlap had been greater, I think we would have seen timeouts inside ns_server.

            Let's first look at the overlap of KV rebalance and writes by Analytics to disk on .23.

            Note that I say "KV rebalance" as movement of vbuckets causes a higher number of writes to get pushed through Chronicle as vbucket maps get updated throughout the rebalance. Rebalances that don't involve the data service currently cause many fewer changes in Chronicle managed metadata (though this will change in future releases as more metadata moves to Chronicle).

            In any case, the only occasion in which there's some overlap between KV rebalance and Analytics writing to disk is this rebalance:

            2021-09-02T03:32:58.103-07:00, ns_orchestrator:0:info:message(ns_1@172.23.99.157) - Starting rebalance, KeepNodes = ['ns_1@172.23.99.157','ns_1@172.23.99.158',
                                             'ns_1@172.23.99.159','ns_1@172.23.96.19',
                                             'ns_1@172.23.96.15','ns_1@172.23.97.177',
                                             'ns_1@172.23.96.23','ns_1@172.23.96.20',
                                             'ns_1@172.23.99.161'], EjectNodes = ['ns_1@172.23.99.160'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = db29f91162572456a611d20009d879a2
            ...
            2021-09-02T03:44:36.540-07:00, ns_orchestrator:0:info:message(ns_1@172.23.99.157) - Rebalance completed successfully.
            

            Zooming in, we see the overlap runs for about 4 minutes from 3:40:40 to 3:44:36 approximately.

            Chronicle disk operations taking more than 1 second jump:

            Average fsync latency jumps to over 800 ms (recall in the earlier test it was 2.5s, so it's an improvement but still not great):

            And Chronicle gets backlogged writing entries to disk, with the backlog rising to 300 right around the time the rebalance comes to an end:

            It seems probable that if the Analytics writes had happened early in the rebalance, it would have been interrupted.

            I took a look at things from the Analytics side also. Looking at the graphs in this test, Analytics appears to be pushing about 19 MB / s down to the device when Chronicle started to show some increased fsync latencies. In the earlier test, at the peak the rate was a good bit higher, probably close to 50 MB / s. As there are reads happening too, this is certainly going to keep the device busy, but I was wondering if perhaps the device is not set up optimally for heavy read / write workloads.

            Filesystem
            df -ha
            ==============================================================================
            Filesystem               Size  Used Avail Use% Mounted on
            ...
            /dev/mapper/centos-root  440G   23G  418G   6% /
            ...
            /dev/sdb1                477G   33M  477G   1% /data
            

            Using lsblk, we can see that /dev/centos/root is set up as a logical volume. By contrast, the data directory is mounted on a partition built straight on top of the physical device.

            # lsblk 
            NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
            sda               8:0    0   477G  0 disk 
            ├─sda1            8:1    0     1G  0 part /boot
            └─sda2            8:2    0 471.5G  0 part 
              ├─centos-root 253:0    0   440G  0 lvm  /
              └─centos-swap 253:1    0  31.4G  0 lvm  [SWAP]
            sdb               8:16   0   477G  0 disk 
            └─sdb1            8:17   0   477G  0 part /data
            sr0              11:0    1  1024M  0 rom  
            

            I'd be interested in comments from folks that might know more about good ways to mount filesystems on devices. I'm still concerned about the Chronicle timeouts, but I first want to eliminate the possibility that we're testing on a setup that's not really designed for this Analytics workload.
            CC: Dave Rigby, Bo-Chun Wang, Korrigan Clark, Patrick Varley

            Murtadha Hubail: Related, what recommendations do the Analytics team have for customers in terms of how they should set up their disks and filesystems for good performance?

            drigby Dave Rigby added a comment -

            I'd be interested in comments from folks that might know more about good ways to mount filesystems on devices. I'm still concerned about the Chronicle timeouts, but I first want to eliminate the possibility that we're testing on a setup that's not really designed for this Analytics workload.

            These days Linux LVM vs direct disk partition has virtually zero overhead; certainly relative to a disk which can only push 50MB/s.

            I didn't see mention of the filesystem, but from previous experience I've generally seen xfs give better write latencies when there's contention on the disk - i.e. it has better QoS / sharing compared to ext4.

            Note that a disk which can only do 50MB/s is pretty anaemic in 2021 - what's the underlying storage medium here?
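
            For what it's worth, two quick non-destructive ways to check the underlying medium from the shell (standard Linux tooling, nothing Couchbase-specific):

            # 0 = non-rotational (SSD/NVMe), 1 = rotational (HDD)
            cat /sys/block/sda/queue/rotational
            # Model, size and rotational flag for each disk
            lsblk -d -o NAME,MODEL,SIZE,ROTA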

            dfinlay Dave Finlay added a comment -

            Thanks Dave - filesystem is XFS. Perf team - can you give some details on the sda2 device?


            murtadha.hubail Murtadha Hubail added a comment -

            Dave Finlay,
            The general recommendation we give is all about the number of partitions per node:
            1. Limit your total number of partitions to the node's cores.
            2. If you have multiple storage volumes, distribute your partitions among them evenly.
            3. Disk speed makes a bigger difference for Analytics performance, so always prefer SSDs/NVMes.
            4. If you are using slow storage (e.g. HDD), limit the number of partitions to two per storage volume.

            Number of nodes recommendations:
            1. Add more Analytics nodes if the Analytics ingestion speed cannot catch up with the mutations in the Data Service.
            2. Add more Analytics nodes if you are trying to reduce Analytics query execution time.

            We have never had to ask a customer to change any configuration other than the number of partitions and the disk type. Not even the number of concurrent flushes/merges.

            I can confirm that the Analytics configuration changes have been applied on the last test run. However, I echo Dave Rigby's concern about the disk speed. We wouldn't recommend using a single Analytics node with 8 partitions at this scale (60 Analytics collections) with such disk speed.

            dfinlay Dave Finlay added a comment -

            Jyotsna Nayak, Bo-Chun Wang - can I get some details on the sda2 physical device?

            I imagine you probably have tests that you run to check byte and IOPS throughput (possibly with associated latencies) on the disks on machines in the perf lab. Could you run such a test on this device? Ideally the test would be something like:

            • sequential writes (66%) and reads (33%)
            • with periodic fsyncs on the writes ideally every 16 MB

            I'm not very familiar with fio, but it seems like it could be used to create such a test. It would be good if we could get some insight into the point at which we start to see write saturation.
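
            A possible fio invocation approximating that workload (a sketch only; the target file, block size, queue depth and run time are assumptions, not values anyone specified here). With 64 KiB blocks, --fsync=256 issues an fsync after every 16 MiB of writes, and --rwmixread=33 gives roughly 33% reads / 66% writes:

            fio --name=seqrw --filename=/data/fio-seqrw.bin --size=4G \
                --rw=rw --rwmixread=33 --bs=64k --fsync=256 \
                --ioengine=libaio --direct=1 --iodepth=16 \
                --runtime=300 --time_based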


            pvarley Patrick Varley added a comment -

            Disk SDA is at SCSI ID 0:0:0:0, which is a Samsung SSD 850.

            scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 6
            

            SDB is the same model of disk.

            The Samsung SSD 850 was released in 2014; unfortunately I cannot tell when SDA was manufactured. Recently the Samsung SSDs in the Leto performance clusters had to be replaced as their performance was dropping over time.
            It would not surprise me if that is the case here and we are coming to the end of life of these disks.

            Running fio to test the performance of the disk is a good idea.

            Doing a quick smartctl test might also give us an indication of the health of the disk:

            # Execute a quick test
            sudo smartctl -t short -a /dev/sdb
            # Test takes 2mins to run to see the results:
            sudo smartctl -a /dev/sdb
            


            It would be much easier to get hardware information if lshw was installed. Jyotsna Nayak, is this something you can look into?

            lshw -json || lshw
            ==============================================================================
            /bin/sh: lshw: command not found
            /bin/sh: lshw: command not found
            ==============================================================================
            


            jyotsna.nayak Jyotsna Nayak added a comment -

            Patrick Varley, thank you for helping me out with this; I have installed lshw on node 172.23.96.23 (the analytics node).
            I ran the smartctl command on it:
            Output:

            [root@cnt-s122 ~]# sudo smartctl -t short -a /dev/sdb

            smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1062.el7.x86_64] (local build)

            Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

            === START OF INFORMATION SECTION ===

            Model Family:     Samsung based SSDs

            Device Model:     Samsung SSD 850 PRO 512GB

            Serial Number:    S250NX0H610728W

            LU WWN Device Id: 5 002538 84021cd8d

            Firmware Version: EXM02B6Q

            User Capacity:    512,110,190,592 bytes [512 GB]

            Sector Size:      512 bytes logical/physical

            Rotation Rate:    Solid State Device

            Device is:        In smartctl database [for details use: -P show]

            ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c

            SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

            Local Time is:    Tue Sep 14 03:36:39 2021 PDT

            SMART support is: Available - device has SMART capability.

            SMART support is: Enabled

            === START OF READ SMART DATA SECTION ===

            SMART overall-health self-assessment test result: PASSED

            General SMART Values:

            Offline data collection status:  (0x00) Offline data collection activity

            was never started.

            Auto Offline Data Collection: Disabled.

            Self-test execution status:      (   0) The previous self-test routine completed

            without error or no self-test has ever 

            been run.

            Total time to complete Offline 

            data collection: (    0) seconds.

            Offline data collection

            capabilities:  (0x53) SMART execute Offline immediate.

            Auto Offline data collection on/off support.

            Suspend Offline collection upon new

            command.

            No Offline surface scan supported.

            Self-test supported.

            No Conveyance Self-test supported.

            Selective Self-test supported.

            SMART capabilities:            (0x0003) Saves SMART data before entering

            power-saving mode.

            Supports SMART auto save timer.

            Error logging capability:        (0x01) Error logging supported.

            General Purpose Logging supported.

            Short self-test routine 

            recommended polling time:  (   2) minutes.

            Extended self-test routine

            recommended polling time:  ( 272) minutes.

            SCT capabilities:        (0x003d) SCT Status supported.

            SCT Error Recovery Control supported.

            SCT Feature Control supported.

            SCT Data Table supported.

            SMART Attributes Data Structure revision number: 1

            Vendor Specific SMART Attributes with Thresholds:

            ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

              5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0

              9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       80802

             12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       20

            177 Wear_Leveling_Count     0x0013   087   087   000    Pre-fail  Always       -       760

            179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0

            181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0

            182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0

            183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0

            187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0

            190 Airflow_Temperature_Cel 0x0032   073   059   000    Old_age   Always       -       27

            195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0

            199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0

            235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       11

            241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       64095824300

            SMART Error Log Version: 1

            No Errors Logged

            SMART Self-test log structure revision number 1

            Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

            # 1  Short offline       Completed without error       00%     15266         -

            SMART Selective self-test log data structure revision number 1

             SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

                1        0        0  Not_testing

                2        0        0  Not_testing

                3        0        0  Not_testing

                4        0        0  Not_testing

                5        0        0  Not_testing

            Selective self-test flags (0x0):

              After scanning selected spans, do NOT read-scan remainder of disk.

            If Selective self-test is pending on power-up, resume after 0 minute delay.

            === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

            Sending command: "Execute SMART Short self-test routine immediately in off-line mode".

            Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.

            Testing has begun.

            Please wait 2 minutes for test to complete.

            Test will complete after Tue Sep 14 03:38:39 2021

            Use smartctl -X to abort test.

             

            And the output of the lshw command: here

            pvarley Patrick Varley added a comment - - edited

            I logged in and ran the S.M.A.R.T test. Here are more details on the drive:

            === START OF INFORMATION SECTION ===           
            Model Family:     Samsung based SSDs                                                                                                        
            Device Model:     Samsung SSD 850 PRO 512GB
            Serial Number:    S250NX0H610728W                             
            LU WWN Device Id: 5 002538 84021cd8d        
            Firmware Version: EXM02B6Q          
            User Capacity:    512,110,190,592 bytes [512 GB]
            Sector Size:      512 bytes logical/physical
            Rotation Rate:    Solid State Device
            Device is:        In smartctl database [for details use: -P show]
            ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
            SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)            
            Local Time is:    Tue Sep 14 03:20:09 2021 PDT                                                                                              
            SMART support is: Available - device has SMART capability.
            SMART support is: Enabled                               
            

            The vendor stats give us an idea of the current state of the drive:

            SMART Attributes Data Structure revision number: 1
            Vendor Specific SMART Attributes with Thresholds:
            ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
              5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
            # 9 years  the drive has been powered on
              9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       80802
             12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       20
            # Wear leveling looks ok. Zero is bad. 
            177 Wear_Leveling_Count     0x0013   087   087   000    Pre-fail  Always       -       760
            179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
            181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
            182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
            183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
            187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
            190 Airflow_Temperature_Cel 0x0032   073   059   000    Old_age   Always       -       27
            195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
            199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
            235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       11
            # Around 32TB of data has been written.
            241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       64095824300
            

            The S.M.A.R.T. test passed. Looking at the stats, the disk looks to be in good health. Samsung gives a 10-year warranty and the total bytes written before failure is meant to be 150TB.

            For more information on the SSD, storagereview has a good write-up.

            I cannot see any evidence of trim being run weekly or of the filesystems being set up to support trim (aka the discard option).

            It might be worth doing the FIO test, then running fstrim, then running the FIO test again, to see if it's a performance issue and whether it's related to trim.
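
            A rough sequence for that check might look like the following (a sketch; the mount point and target file are assumptions, reusing the mixed read/write fio parameters already discussed in this thread):

            # Baseline mixed read/write run against a file on the filesystem
            fio --name=trimcheck --filename=/data/fio-trimcheck.bin --size=500MB \
                --rw=rw --rwmixread=33 --bs=4k --fsync=4096 \
                --ioengine=libaio --direct=1 --iodepth=64
            # Discard unused blocks on the filesystem
            sudo fstrim -v /data
            # Repeat the identical run and compare throughput and latency
            fio --name=trimcheck --filename=/data/fio-trimcheck.bin --size=500MB \
                --rw=rw --rwmixread=33 --bs=4k --fsync=4096 \
                --ioengine=libaio --direct=1 --iodepth=64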


            pvarley Patrick Varley added a comment -

            I have installed lshw on node : 172.23.96.23 (analytics node )

            That is great. What about all of the rest of the test machines? Can testrunner be changed to install this by default?

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            Right now, I have installed it manually on all the nodes; I will add the package to the main code of perfrunner.

             

            dfinlay Dave Finlay added a comment -

            Patrick Varley: thanks a lot for the information on the disk!

            Jyotsna Nayak, Bo-Chun Wang: can we get an fio test run as requested?


            jyotsna.nayak Jyotsna Nayak added a comment -

            Can someone please confirm whether this is the correct fio test:

             sudo fio --filename=/dev/sdb1  --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_write.fio  --bs=4k --iodepth=64 --size=500MB --readwrite=rw --rwmixread=33 --fsync=4096

            I got this error message when I ran the test:

            fio: /dev/sdb1 appears mounted, and 'allow_mounted_write' isn't set. Aborting. 

            drigby Dave Rigby added a comment -

            fio: /dev/sdb1 appears mounted, and 'allow_mounted_write' isn't set. Aborting.

            Did you follow the directions of that error message (either unmount it before running or specify the allow_mounted_write option?)


            jyotsna.nayak Jyotsna Nayak added a comment -

            I have tried with the allow_mounted_write option on my personal laptop, and I had to reset the laptop.
            Bo has suggested trying with a file rather than a drive; I am looking into this method.

             


            pvarley Patrick Varley added a comment -

            Bo is correct; it should be a path on the filesystem. The command above writes directly to the block device, which is why the laptop had to be reset.

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            Hi,
            This is the test I ran on 172.23.96.20 (the FTS node).
            Command run:

            fio --filename=random_write.fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --bs=4k --iodepth=64 --size=500MB --readwrite=rw --rwmixread=33 --fsync=4096
            

            Output : 

            test: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
            fio-3.7
            Starting 1 process
            test: Laying out IO file (1 file / 500MiB)
            Jobs: 1 (f=1): [M(1)][100.0%][r=17.8MiB/s,w=36.5MiB/s][r=4568,w=9351 IOPS][eta 00m:00s]
            test: (groupid=0, jobs=1): err= 0: pid=29666: Tue Sep 21 11:06:03 2021
               read: IOPS=4102, BW=16.0MiB/s (16.8MB/s)(166MiB/10336msec)
               bw (  KiB/s): min=10776, max=21632, per=100.00%, avg=16451.20, stdev=2823.43, samples=20
               iops        : min= 2694, max= 5408, avg=4112.70, stdev=705.89, samples=20
              write: IOPS=8281, BW=32.3MiB/s (33.9MB/s)(334MiB/10336msec)
               bw (  KiB/s): min=21256, max=42704, per=100.00%, avg=33236.35, stdev=5691.48, samples=20
               iops        : min= 5314, max=10676, avg=8309.05, stdev=1422.86, samples=20
              cpu          : usr=3.44%, sys=13.57%, ctx=15218, majf=0, minf=27
              IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
                 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
                 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
                 issued rwts: total=42404,85596,0,39 short=0,0,0,0 dropped=0,0,0,0
                 latency   : target=0, window=0, percentile=100.00%, depth=64
             
            Run status group 0 (all jobs):
               READ: bw=16.0MiB/s (16.8MB/s), 16.0MiB/s-16.0MiB/s (16.8MB/s-16.8MB/s), io=166MiB (174MB), run=10336-10336msec
              WRITE: bw=32.3MiB/s (33.9MB/s), 32.3MiB/s-32.3MiB/s (33.9MB/s-33.9MB/s), io=334MiB (351MB), run=10336-10336msec
             
            Disk stats (read/write):
                dm-0: ios=42176/85222, merge=0/0, ticks=173724/341791, in_queue=516010, util=99.03%, aggrios=27600/51806, aggrmerge=14804/33868, aggrticks=102964/184858, aggrin_queue=287796, aggrutil=98.87%
              sda: ios=27600/51806, merge=14804/33868, ticks=102964/184858, in_queue=287796, util=98.87%
            

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            This is the test run on 172.23.96.23 (the CBAS node).
            Command run:

            fio --filename=random_write.fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --bs=4k --iodepth=64 --size=500MB --readwrite=rw --rwmixread=33 --fsync=4096
            

            Output :

            test: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
            fio-3.7
            Starting 1 process
            test: Laying out IO file (1 file / 500MiB)
            Jobs: 1 (f=1): [M(1)][100.0%][r=17.6MiB/s,w=35.9MiB/s][r=4506,w=9201 IOPS][eta 00m:00s]
            test: (groupid=0, jobs=1): err= 0: pid=23484: Tue Sep 21 11:15:11 2021
               read: IOPS=4113, BW=16.1MiB/s (16.8MB/s)(166MiB/10308msec)
               bw (  KiB/s): min= 6682, max=23528, per=99.36%, avg=16349.30, stdev=4045.38, samples=20
               iops        : min= 1670, max= 5882, avg=4087.25, stdev=1011.36, samples=20
              write: IOPS=8303, BW=32.4MiB/s (34.0MB/s)(334MiB/10308msec)
               bw (  KiB/s): min=13612, max=47816, per=99.43%, avg=33025.85, stdev=8289.42, samples=20
               iops        : min= 3403, max=11954, avg=8256.45, stdev=2072.26, samples=20
              cpu          : usr=3.38%, sys=13.96%, ctx=14568, majf=0, minf=27
              IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
                 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
                 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
                 issued rwts: total=42404,85596,0,39 short=0,0,0,0 dropped=0,0,0,0
                 latency   : target=0, window=0, percentile=100.00%, depth=64
             
            Run status group 0 (all jobs):
               READ: bw=16.1MiB/s (16.8MB/s), 16.1MiB/s-16.1MiB/s (16.8MB/s-16.8MB/s), io=166MiB (174MB), run=10308-10308msec
              WRITE: bw=32.4MiB/s (34.0MB/s), 32.4MiB/s-32.4MiB/s (34.0MB/s-34.0MB/s), io=334MiB (351MB), run=10308-10308msec
             
            Disk stats (read/write):
                dm-0: ios=41412/83702, merge=0/0, ticks=172325/335448, in_queue=508177, util=98.99%, aggrios=27657/52120, aggrmerge=14748/33558, aggrticks=100531/187111, aggrin_queue=287606, aggrutil=98.78%
              sda: ios=27657/52120, merge=14748/33558, ticks=100531/187111, in_queue=287606, util=98.78%
            


ben.huddleston Ben Huddleston added a comment -

Dave Finlay asked me to run this on the local (Manchester) Linux dev server.

            ben@mancouch:~$ fio --filename=/ssd-ext4/random_write.fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --bs=4k --iodepth=64 --size=500MB --readwrite=rw --rwmixread=33 --fsync=4096
            test: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
            fio-3.1
            Starting 1 process
            Jobs: 1 (f=0)
            test: (groupid=0, jobs=1): err= 0: pid=13297: Wed Sep 22 16:44:30 2021
               read: IOPS=23.1k, BW=90.3MiB/s (94.7MB/s)(166MiB/1835msec)
               bw (  KiB/s): min=86192, max=98616, per=98.26%, avg=90826.67, stdev=6786.23, samples=3
               iops        : min=21548, max=24654, avg=22706.67, stdev=1696.56, samples=3
              write: IOPS=46.6k, BW=182MiB/s (191MB/s)(334MiB/1835msec)
               bw (  KiB/s): min=175008, max=197192, per=98.39%, avg=183586.67, stdev=11915.66, samples=3
               iops        : min=43752, max=49298, avg=45896.67, stdev=2978.92, samples=3
              cpu          : usr=14.94%, sys=54.25%, ctx=9911, majf=0, minf=11
              IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
                 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
                 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
                 issued rwt: total=42404,85596,0, short=0,0,0, dropped=0,0,0
                 latency   : target=0, window=0, percentile=100.00%, depth=64
             
            Run status group 0 (all jobs):
               READ: bw=90.3MiB/s (94.7MB/s), 90.3MiB/s-90.3MiB/s (94.7MB/s-94.7MB/s), io=166MiB (174MB), run=1835-1835msec
              WRITE: bw=182MiB/s (191MB/s), 182MiB/s-182MiB/s (191MB/s-191MB/s), io=334MiB (351MB), run=1835-1835msec
             
            Disk stats (read/write):
              sda: ios=20875/30969, merge=16320/44204, ticks=16296/23380, in_queue=39664, util=94.34%
            

            Looks to be a Samsung PM863

            ben@mancouch:~$ sudo hdparm -I /dev/sda
             
            /dev/sda:
             
            ATA device, with non-removable media
            	Model Number:       SAMSUNG MZ7LM960HCHP-0E003
            

            dfinlay Dave Finlay added a comment - - edited

            Thanks Jyotsna Nayak, Ben Huddleston, Patrick Varley.

            Thoughts so far

            This test:

            fio --filename=random_write.fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --bs=4k --iodepth=64 --size=500MB --readwrite=rw --rwmixread=33 --fsync=4096
            

seems like a reasonable test to try to push the disk in terms of max throughput in a fashion similar to how Analytics does it. Based on my understanding of fio, the test setup is as follows:

• a mix of sequential writes (2/3) and reads (1/3)
            • 4k blocks are written for a total of 500 MB of IO
            • max of 64 parallel I/O writers
            • direct I/O (note that buffered I/O doesn't support async writers)
            • fsync every 4096 blocks (i.e. 16 MB)
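
For reference, the same test expressed as a fio job file (purely a transcription of the command-line flags above, not a new configuration) would look like this:

[test]
filename=random_write.fio
randrepeat=1
ioengine=libaio
direct=1
gtod_reduce=1
bs=4k
iodepth=64
size=500MB
rw=rw
rwmixread=33
fsync=4096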

            Observations are that:

            • the CBAS and FTS drives perform similarly. Headlines are:
              • about 16 MiB/s of read bandwidth, 32 MiB/s of write bandwidth
              • about 4k read iops, 8k write iops
              • total runtime of about 10 s

This test comes close to maxing out the devices (utilization at ~98.7%).

            By contrast, the server in Manchester has much more impressive numbers:

            • 90 MiB/s read b/w, 180 MiB/s write b/w
            • 22k read iops, 45k write iops
            • runtime of 1.8 s; disk at 94% utilization

            Latencies

            I found this article on testing disks for use with etcd, which is quite interesting: https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd. It references the etcd recommendation that the p99 fsync time should be < 10ms: https://etcd.io/docs/v3.4/faq/#what-does-the-etcd-warning-failed-to-send-out-heartbeat-on-time-mean and constructs an fio test to check this:

            fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
            

            (Note you first need to create the test-data directory.) This test does 10k sequential writes to the disk and measures the latency percentiles.
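
A minimal way to run it and pull just the fsync percentiles out of the fairly long output might be the following (a sketch, assuming GNU grep):

mkdir -p test-data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest | grep -A 6 'sync percentiles'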

            I took the liberty of running this test a couple of times on 172.23.96.23 when the system appeared to be quiescent. The best result I got was a p99 of 13 ms:

            [root@cnt-s122 ~]# fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
            mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
            fio-3.7
            Starting 1 process
            Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=494KiB/s][r=0,w=220 IOPS][eta 00m:00s]
            mytest: (groupid=0, jobs=1): err= 0: pid=21967: Thu Sep 23 16:41:34 2021
              write: IOPS=230, BW=517KiB/s (530kB/s)(21.0MiB/43531msec)
                clat (usec): min=3, max=1979, avg=92.04, stdev=92.37
                 lat (usec): min=3, max=1980, avg=92.53, stdev=92.38
                clat percentiles (usec):
                 |  1.00th=[    8],  5.00th=[   10], 10.00th=[   11], 20.00th=[   12],
                 | 30.00th=[   14], 40.00th=[   18], 50.00th=[  131], 60.00th=[  137],
                 | 70.00th=[  145], 80.00th=[  151], 90.00th=[  161], 95.00th=[  169],
                 | 99.00th=[  229], 99.50th=[  355], 99.90th=[ 1237], 99.95th=[ 1352],
                 | 99.99th=[ 1614]
               bw (  KiB/s): min=  431, max=  637, per=100.00%, avg=517.07, stdev=40.88, samples=87
               iops        : min=  192, max=  284, avg=230.36, stdev=18.25, samples=87
              lat (usec)   : 4=0.03%, 10=5.70%, 20=36.97%, 50=1.14%, 250=55.34%
              lat (usec)   : 500=0.44%, 750=0.12%, 1000=0.07%
              lat (msec)   : 2=0.19%
              fsync/fdatasync/sync_file_range:
                sync (usec): min=2816, max=38202, avg=4242.99, stdev=1660.62
                sync percentiles (usec):
                 |  1.00th=[ 3064],  5.00th=[ 3228], 10.00th=[ 3326], 20.00th=[ 3490],
                 | 30.00th=[ 3621], 40.00th=[ 3752], 50.00th=[ 3884], 60.00th=[ 4015],
                 | 70.00th=[ 4178], 80.00th=[ 4424], 90.00th=[ 4883], 95.00th=[ 6128],
                 | 99.00th=[13042], 99.50th=[13698], 99.90th=[14615], 99.95th=[16909],
                 | 99.99th=[34866]
              cpu          : usr=0.27%, sys=1.50%, ctx=25702, majf=0, minf=34
              IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
                 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
                 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
                 issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
                 latency   : target=0, window=0, percentile=100.00%, depth=1
             
            Run status group 0 (all jobs):
              WRITE: bw=517KiB/s (530kB/s), 517KiB/s-517KiB/s (530kB/s-530kB/s), io=21.0MiB (23.1MB), run=43531-43531msec
             
            Disk stats (read/write):
                dm-0: ios=5613/20051, merge=0/0, ticks=722/42099, in_queue=42823, util=98.15%, aggrios=5632/20116, aggrmerge=0/7, aggrticks=707/42224, aggrin_queue=42917, aggrutil=98.02%
              sda: ios=5632/20116, merge=0/7, ticks=707/42224, in_queue=42917, util=98.02%
            

            So, at least by this measurement this disk isn't fast enough for etcd.

            This got me to thinking about latencies so I tried ioping.

            # ioping -c 10 -s 1K -S 1K -D .
            1 KiB <<< . (xfs /dev/dm-0): request=1 time=156.6 us (warmup)
            1 KiB <<< . (xfs /dev/dm-0): request=2 time=1.97 ms
            1 KiB <<< . (xfs /dev/dm-0): request=3 time=1.95 ms
            1 KiB <<< . (xfs /dev/dm-0): request=4 time=1.94 ms
            1 KiB <<< . (xfs /dev/dm-0): request=5 time=1.97 ms
            1 KiB <<< . (xfs /dev/dm-0): request=6 time=1.94 ms
            1 KiB <<< . (xfs /dev/dm-0): request=7 time=1.96 ms
            1 KiB <<< . (xfs /dev/dm-0): request=8 time=1.94 ms
            1 KiB <<< . (xfs /dev/dm-0): request=9 time=1.93 ms (fast)
            1 KiB <<< . (xfs /dev/dm-0): request=10 time=1.94 ms
            

            This indicates it takes about 2 ms to fsync one 4k block - which seems pretty slow. Certainly my laptop is a lot faster - I get something of the order of 300 us.

            Further tests

            This got me to thinking about running some further tests with fio. Here's what I'd like to do.

First, run a test that simulates Chronicle writes. To do so, create a fio test file chronicle.ini as follows:

            [chronicle]
            rw=write
            ioengine=sync
            iodepth=1
            direct=0
            fsync=1
            directory=chron-fio
            size=22m
            bs=2300
            gtod_reduce=0
            

            This configuration is similar to the etcd test, except it fsyncs rather than fdatasyncs.

            And run:

            mkdir chron-fio
            fio chronicle.ini
            

Second, run a test that simulates Chronicle writes with some amount of background analytics writes. Create a test file, mixed.ini, as follows:

            [global]
            rw=rw
            ioengine=libaio
            iodepth=6
            direct=1
            rwmixread=33
            fsync=4096
            randrepeat=1
            bs=4k
            size=200MB
            gtod_reduce=1
             
            [analytics]
            numjobs=${NUMJOBS}
             
            [chronicle]
            rw=write
            ioengine=sync
            iodepth=1
            direct=0
            fsync=1
            directory=chron-fio
            size=22m
            bs=2300
            gtod_reduce=0
            

            Then run as follows:

            NUMJOBS=1 fio mixed.ini
            NUMJOBS=5 fio mixed.ini
            NUMJOBS=10 fio mixed.ini
            

            This will allow us to see the p99 fsync latency under increasing amounts of competing I/O traffic.
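
To keep the three runs consistent, they could be scripted along these lines (a sketch; it assumes chronicle.ini and mixed.ini from above sit in the current directory):

mkdir -p chron-fio
for n in 1 5 10; do
  echo "=== analytics numjobs=$n ==="
  NUMJOBS=$n fio mixed.ini
done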

            Trim

            Lastly, I believe that TRIM is enabled on the disks.

            # hdparm  -I /dev/sda2 | grep -i trim
            	   *	Data Set Management TRIM supported (limit 8 blocks)
            

            But I'm not sure if the disk has a trim job running frequently to keep things in good shape.
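
As a sketch (assuming a systemd-based distro with util-linux installed), the periodic trim job and a one-off manual trim could be checked and run like this:

# Is a periodic TRIM job scheduled?
systemctl status fstrim.timer
# One-off manual trim of a mounted filesystem (prints how much was trimmed);
# repeat for /data or other mounts as applicable:
fstrim -v /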

            Jyotsna Nayak:

            • Can you run the tests that I've described in the "Further tests" section?
            • Can you trim the disks and then rerun?

            Thanks!


jyotsna.nayak Jyotsna Nayak added a comment -

Dave Finlay, these are the outputs of the first part of the tests on nodes 172.23.96.20 and 172.23.96.23: here
Will run the second set (a rerun of the tests after a trim) once the weekly is completed for Eventing.

            dfinlay Dave Finlay added a comment -

            Jyotsna Nayak: what directory did you run these in? Also: can you attach a verbatim copy of the fio test configuration files to the Google doc?


jyotsna.nayak Jyotsna Nayak added a comment -

Dave Finlay, I ran the tests from the /data directory.

Regarding the verbatim copy of the fio test configuration files: I will update it shortly.

            dfinlay Dave Finlay added a comment -

Thanks Jyotsna Nayak. I thought that might be the case. Can you run the tests in some directory under the root partition so that the targeted disk is sda2? To be honest, it's not likely to make a lot of difference, but this is the partition on which Analytics and Chronicle contend.

            # lsblk 
            NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
            sda               8:0    0   477G  0 disk 
            ├─sda1            8:1    0     1G  0 part /boot
            └─sda2            8:2    0 471.5G  0 part 
              ├─centos-root 253:0    0   440G  0 lvm  /
              └─centos-swap 253:1    0  31.4G  0 lvm  [SWAP]
            sdb               8:16   0   477G  0 disk 
            └─sdb1            8:17   0   477G  0 part /data
            sr0              11:0    1  1024M  0 rom  
            


jyotsna.nayak Jyotsna Nayak added a comment -

Sure, Dave Finlay. I will run the tests under the root directory as well and update the Google doc with the outputs under the heading "Tests under Root Directory".


jyotsna.nayak Jyotsna Nayak added a comment -

Dave Finlay, the document has been updated: here.
Will run the tests with the trim when the cluster is free.


jyotsna.nayak Jyotsna Nayak added a comment -

Hi,
I was unable to run trim on the nodes used for this test, as the weekly is running (at the time of posting this comment). I have filed a CBIT (CBIT-28467) to check whether the nodes support trim.


jyotsna.nayak Jyotsna Nayak added a comment -

Hi Dave Finlay,
The disks of the FTS and Analytics nodes needed to be changed as they did not support trim.
I reran the test: http://perf.jenkins.couchbase.com/job/themis_multibucket_jobs/11/consoleFull
I have also run the fio tests before and after running trim: here

I have also rerun the test after the trim was run: http://perf.jenkins.couchbase.com/job/themis_multibucket_jobs/12/


jyotsna.nayak Jyotsna Nayak added a comment -

Dave Finlay,

The disks of the two nodes (FTS and Analytics) had to be changed as they didn't support trim earlier.
Now the FTS and Analytics nodes have disks that support trim.

I have rerun the job after a weekly was run: http://perf.jenkins.couchbase.com/job/themis_multibucket_jobs/11/consoleFull
And the test after the trim command was run: http://perf.jenkins.couchbase.com/job/themis_multibucket_jobs/12/

I have also run the fio tests before and after running trim: here

Can you please confirm whether you are still seeing the disk issue that was observed before during KV rebalance?
Please let me know if there are any additional tests that need to be run.


Aliaksey Artamonau Aliaksey Artamonau (Inactive) added a comment -

I glanced at the logs from the post-trim run. The analytics node still experiences spikes in disk latency:


             


jyotsna.nayak Jyotsna Nayak added a comment -

Aliaksey Artamonau, are there any additional tests that I should run to help gather more information about this issue?

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            The test was scheduled with the following settings :
            1. storageMaxConcurrentFlushesPerPartition = 1
            2. storageMaxConcurrentMergesPerPartition = 1
            3. echo cfq > /sys/block/sda/queue/scheduler => output:  noop deadline [cfq] 
            4. Analytics paths :  ['/analytics/dev0', '/analytics/dev1', '/analytics/dev2', '/analytics/dev3', '/analytics/dev4', '/analytics/dev5', '/analytics/dev6', '/analytics/dev7']

The following steps were completed (the time taken by each step is given in parentheses wherever it was recorded by the test):

            1. Load the buckets with documents 
            2. Create n1ql indexes (index build time : 8.11 mins) 
            3. Initialise XDCR (init_only_xdcr() - cbmonitor_for_source , cbmonitor_for_destination)
            4. Creating the eventing functions (only one deployed : 20.1s)
            5. Creating FTS indexes 
            6. Creating Analytics dataset
            7. Running rebalance for each phase as follows :
              1. KV rebalance - cbmonitor_src , cbmonitor_dest
                1. Rebalance in with mutations (6.9 mins)
                2. Rebalance swap (22.23 mins)
                3. Rebalance out (9.13 min) 
              2. Index rebalance cbmonitor_src , cbmonitor_dest
                1. Rebalance in (5.49 min)
                2. Rebalance swap (17.77 min)
                3. Rebalance Out (8.52 min)
              3. Eventing rebalance cbmonitor_src , cbmonitor_dest
                1. Rebalance in (0.83 min)
                2. Rebalance swap (10.83 min)
                3. Rebalance Out (0.55 min)
              4. CBAS rebalance cbmonitor_src, cbmonitor_dest
                1. Rebalance in (8.03 min)
                2. Rebalance swap (17.83 min)
                3. Rebalance Out (16.5 min)
  5. FTS rebalance cbmonitor_src, cbmonitor_dest (http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=destination_bucket_710-1831_run_fts_rebalance_4fc3)
                1. Rebalance in (11.47 min)
                2. Rebalance swap (22.24 min)
                3. Rebalance Out (9.91 min)
            8. Backup (23.39 min , cbmonitor_src, cbmonitor_dest)

The whole test ran and completed with no error messages in the test logs.
            Link to the job : http://perf.jenkins.couchbase.com/job/themis_multibucket/113/

            Logs :
            172.23.96.15_indexAndn1ql
            172.23.96.19_indexAndn1ql
            172.23.96.20_fts
            172.23.96.23_cbas
            172.23.97.177_eventing
            172.23.99.158_kv
            172.23.99.159_kv
            172.23.99.160_used_in_rebal
            172.23.99.161_used_in_rebal
            backup


jyotsna.nayak Jyotsna Nayak added a comment -

Aliaksey Artamonau, Dave Finlay, are there any other tests I should run to help analyse the issue seen on the analytics node with the spikes in disk latency?

            dfinlay Dave Finlay added a comment -

            Jyotsna Nayak: thanks for running these tests. I haven't had a chance to look at the results in recent weeks. Will try to do so soon.

            dfinlay Dave Finlay added a comment -

The main observation is that, though in this case all the rebalances succeeded, little seems to have changed.

            The disk performance seems the same as it always was.

            There's a spike in chronicle fsyncs in excess of 1 second around the time of heavy writes by Analytics:

            Average chronicle fsync latency jumps to over 1.5 s:

            And the number of appends batched by chronicle when writing to disk jumps to over 500:

            We're lucky that storageMaxConcurrentFlushesPerPartition and storageMaxConcurrentMergesPerPartition have been set to lower values and that we are running with 8 partitions instead of the default of 16. I still think we have an underlying issue here. I'll take up the discussion with Till Westmann.


jyotsna.nayak Jyotsna Nayak added a comment -

Dave Finlay, I understand that there is an underlying issue. I am planning on scaling this test to 50 buckets. Can you please tell me whether it would also help to run this test with a lower number of buckets (e.g. 15) to find the smallest number of buckets at which the issue shows up?


build-team Couchbase Build Team added a comment -

Build couchbase-server-7.1.0-2226 contains cbas-core commit c44a2fd with commit message:
MB-47169: Make disk force bytes configurable

            dfinlay Dave Finlay added a comment -

            Hi Jyotsna Nayak:

            I spoke with Murtadha & Till from the Analytics team and they've changed the default chunk size of bytes that are written to disk when Analytics is flushing and merging from 16 MB to 1 MB. We believe this is likely to be a more neighborly amount of data to write to disk at one time and we expect that we'll see less impact on Chronicle on the Analytics nodes as a result. (Separately the Analytics team will check to see the degree to which there's a negative effect on ingestion rate; there is likely to be some impact but we're not sure if it'll be minor or if it'll be problematic.)

Could you rerun this test in its original configuration? I.e.

            1. Just specify one disk path rather than 8 - this results in 16 partitions.
            2. Don't change the OOTB values of max concurrent flushes and merges (they are 2 by default)
            3. Use the default noop I/O scheduler
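
A quick way to verify and, if needed, reset the scheduler for this run (a sketch assuming the data disk is sda, as in the earlier comments; the active scheduler is the one shown in brackets):

cat /sys/block/sda/queue/scheduler          # e.g. [noop] deadline cfq
echo noop > /sys/block/sda/queue/scheduler  # not persistent across reboots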

            Let me know if you have questions. Thanks!


murtadha.hubail Murtadha Hubail added a comment -

Just to give an update on the impact of changing the chunk size to 1MB on Analytics ingestion: Bo-Chun Wang ran our ingestion regression perf jobs. While the change had no impact on the ingestion rate when using SSDs, there was a 40% regression on HDDs.
We will wait and see the impact of the change on this test before we decide on the next step.

Another thing to consider here is that if the issue is caused by the bandwidth of the disks on which this high-bucket test is being run, and is not likely to be a production environment issue, we should consider setting up the Data path on a different disk than the Analytics disks if possible. That way Analytics writes won't interfere with Chronicle's time-sensitive disk operations.


jyotsna.nayak Jyotsna Nayak added a comment -

Dave Finlay,
I want to confirm the parameters for the rerun:
1. Partitions for the analytics = 1
2. No overwriting of the values for max concurrent flushes and merges
3. We previously ran the following command for the scheduler:
echo cfq > /sys/block/sda/queue/scheduler
the output of which is: noop deadline [cfq]
For this rerun we do not overwrite the scheduler.
If yes, is there a command to confirm that the default scheduler is being used?

            dfinlay Dave Finlay added a comment -

            This looks good. Thanks, Jyotsna.

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            Test scheduled on build 7.1.0-2226
            link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/118/

             

jyotsna.nayak Jyotsna Nayak added a comment -

The test completed the run. Here are the logs for the test run with the following settings:
1. Partitions for the analytics = 1
2. No overwriting of the values for max concurrent flushes and merges
3. Running the test with the default scheduler

Build used: 7.1.0-2226
Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/118/

Logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.96.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.96.20.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.96.23.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.97.177.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.99.157.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.99.158.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.99.159.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.99.160.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/172.23.99.161.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-118/tools.zip
            dfinlay Dave Finlay added a comment -

            Thanks Jyotsna.

            Summary

            In this latest test with a 1 MB Analytics chunk size, Chronicle latencies still spike but the situation is better - and the rebalance doesn't stop as a result of chronicle timing out as it did in this test which I analyzed in this comment (which is a very similar test).

In addition, at least in this test (on SSDs) we don't see any noticeable change in the Analytics write workload. A very similar amount of data is written to the disk in a very similar amount of time. I wouldn't say that this "confirms" what you saw in the Analytics test, but it appears to be in line with what I'm sure were the more carefully measured Analytics tests.

            Till Westmann, Murtadha Hubail: if you guys are OK with the change to 1 MB overall, then I think that's great. If you're not OK, then perhaps we should talk further and maybe we can change the OOTB default # of partitions?

            Details

            In this case we see that there were 75 chronicle fsyncs in excess of 1 s (with a handful of deletes, delete_snapshots and prepare_snapshots taking more than 1 s) during the rebalance that started at 13:35. (Note the overlay shows the time rebalance was running.)

            The time when chronicle disk ops were slow again corresponds to the approximately 12 minute period when Analytics wrote approx 11.5 GB.

            And the Chronicle average latency was in the region of 800 ms at this time.

            In the previous case, Chronicle only hit 25 disk ops in excess of 1 s before it crashed:

            And the average fsync latency jumped to 2.5 s.

The Analytics writes were about the same - about 11.5 GB in approx 11 minutes:


murtadha.hubail Murtadha Hubail added a comment -

Dave Finlay,

Please check the numbers in the comments in MB-50856. Bo-Chun Wang ran with different values, and with 1MB the ingestion rate regression is high even with some kinds of SSDs.

            dfinlay Dave Finlay added a comment -

            Murtadha Hubail, Till Westmann:

            Can I get your thoughts on what we should do here? I think the summary is something like:

            With OOTB settings and users configuring Analytics in the simplest possible way (e.g. using a single path on the root partition), Analytics can saturate the disk with negative effects on other actors in the system, such as Chronicle. This can result in things like rebalance failures.

            If this does happen in production, I would consider this to be an Analytics issue. You could ask the customer to configure Analytics to run on a separate disk, or tune down the amount of write concurrency or tune down the chunk size.

            Or, we could change the default settings so that in the event the user configures one disk path, you configure a number of partitions smaller than 16, so that we're less likely to run into the issue.

            I'm interested in which way you decide to go here.

            If you do decide to revert the change to write in 1 MB chunks, then I have an additional ask for you guys to let the Perf team know how they should configure Analytics in this particular test. Obviously we don't want to be investigating rebalance failures due to this issue constantly – and the OOTB settings don't work for this test.

            drigby Dave Rigby added a comment -

            Analytics folks - out of interest what syscall do you use for syncing every block size?

            Couchstore currently uses fsync, but that's actually unnecessarily expensive (apart from the final sync) as we don't actually require a full disk barrier to avoid impacting other threads; simply pushing the blocks into the device queue is sufficient - and to that end sync_file_range could be used instead - see MB-50468.


till Till Westmann added a comment -

Currently we're using fsync through the JDK. AFAIK the JDK does not expose sync_file_range at this time. There would be a way to use other syscalls via JNI, but we're not looking at this option at this time in the release.

            drigby Dave Rigby added a comment -

            Thanks Till Westmann.


murtadha.hubail Murtadha Hubail added a comment -

Here is a summary of the performance runs comparing fsync after 1MB and 16MB:

Metric                     Environment               16MB      1MB       Change %
Rebalance-in time (min)    HDD on oceanus cluster    9.4       31        229.7% increase
Rebalance-out time (min)   HDD on oceanus cluster    12.4      44.9      262% increase
Ingestion rate (items/s)   HDD on oceanus cluster    411,228   229,232   44.27% decrease
Ingestion rate (items/s)   SSD on some cluster       287,168   126,550   55.9317% decrease

            I'm in favor of reverting back to 16MB or a close value for the following reasons:

            • It isn't a good idea to lose over 40% of the system throughput under normal operations just because there is a chance that a rebalance might fail due to a possible write latency by chronicle.
• Most of Analytics' large, long-running write operations (merges/compactions) aren't scheduled all the time, so if a rebalance attempt fails while one of those operations is running, a retry of that rebalance will most likely succeed.
            • I think we can all agree that the disk used in the Analytics node in this test has pretty low bandwidth by any standard (34.0MB/s writes), and so I don't think we should be tuning our system based on such environment.
            • Based on Dave Finlay's analysis, even with the 1MB change, chronicle's fsyncs still had spikes. So, losing over 40% of the system throughput doesn't even eliminate the problem completely.
• Most of Analytics' file operations are append-only and the files are never modified once fully written. It might not be a good idea to issue an expensive fsync, as Dave Rigby pointed out, every 1MB on such files.
• There is a reason we document the recommendation that Analytics shouldn't be deployed with any other service. Analytics actually gets criticized when it isn't pushing the available resources. The reason for the current simple OOTB configuration is that we received many comments from customers and prospects along the lines of: "You claim to be a parallel database engine and you are not even utilizing the machine's cores when running a query". Similarly, customers expect full-scan query execution times to be close to the disk read bandwidth.

            I understand that Analytics has to play well with the cluster manager (ns_server/chronicle), but I don't think we should judge that based on the environment of this test. However, if this issue is ever encountered in a production environment (e.g. on Capella), it's good to know that we have the runtime configurable parameters to workaround it and only then we should consider changing our defaults.

            Dave Finlay,
            I'm not sure what kind of rebalance failure message/logs we are getting right now when a rebalance fails due to chronicle write latency, but it would be great if we can make it as clear as possible that the rebalance failed for that reason. If the issue happens in a production environment, it will save us all some debugging time.

            Jyotsna Nayak,
            I recommend you do one of the following for this test:

1. Starting from build 7.1.0-2316, we have a configurable parameter called 'storageWriteRateLimit' which allows specifying the maximum bytes/second written by each Analytics partition (iodevice) on each node. You can set that parameter value in bytes based on the disk bandwidth and the number of partitions configured on Analytics nodes. For example, if you have 16 Analytics partitions on each node and you want to make sure Analytics will never write more than 25MB/s, then set its value to 25MB/16 = 1572864 bytes (see the sizing sketch after this list). Bo-Chun Wang ran an experiment using this parameter in MB-50856 and it showed that it's effective in throttling Analytics writes. Note that this will slow down all Analytics write operations including ingestion, rebalance, and merges.
            2. As mentioned before, separate Data Path and Analytics Path disks.
            3. Configure Analytics partitions to a very low number.
            4. Maybe this isn't the right environment for a 30 multi-bucket test and a more capable cluster should be used.
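
            As an editorial aside (not part of the original comment), here is a minimal sketch of the per-partition arithmetic described in item 1 above; the helper name and the binary-megabyte convention are assumptions for illustration only:

            # Illustrative sketch: derive a per-partition storageWriteRateLimit value
            # (bytes/second per Analytics partition) from a node-wide write budget.
            MIB = 1024 * 1024  # assume binary megabytes; use 1_000_000 for decimal MB

            def per_partition_limit(node_budget_mb: float, partitions: int) -> int:
                """Bytes/second each Analytics partition (iodevice) may write so that
                the node as a whole stays under node_budget_mb MB/s."""
                return int(node_budget_mb * MIB / partitions)

            # Item 1's example: cap Analytics at 25 MB/s on a node with 16 partitions.
            print(per_partition_limit(25, 16))  # 1638400 (~1.56 MiB/s per partition;
                                                # the comment above rounds this down to 1572864)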

            I believe this is how we should proceed and if you agree, I will go ahead and revert the default Analytics fsync chunk size to 16MB. Let me know what you think.

            dfinlay Dave Finlay added a comment - - edited

            Thanks Murtadha. I'm fine with your approach.

            On this:

            I'm not sure what kind of rebalance failure message/logs we are getting right now when a rebalance fails due to chronicle write latency, but it would be great if we can make it as clear as possible that the rebalance failed for that reason. If the issue happens in a production environment, it will save us all some debugging time.

            This is trickier than it sounds. The timeouts occur at a low level inside Chronicle and end up causing the process that writes Chronicle logs to disk to crash (i.e. stop and restart). This causes higher-level Chronicle processes (such as the per-node agent) to crash, which in turn causes all the processes listening for chronicle events (which is a lot of processes) to crash and restart. Because all of these processes are down and unresponsive, rebalance fails. Connecting the dots is in theory possible but not at all easy. In the event this happens in production, at least currently, ns_server will likely have to diagnose it; for now, this is what it is.


            murtadha.hubail Murtadha Hubail added a comment -

            Jyotsna Nayak,
            I reverted the default Analytics fsync chunk size back to 16MB in build 7.1.0-2333.
            Please verify that the issue doesn't happen when limiting the Analytics writes using storageWriteRateLimit. Please let me know if you need any help applying any of the recommended workarounds on the Analytics side.
            Since we have had a couple of passing runs of this test when reducing the Analytics writes, I'm marking this issue as resolved for now; feel free to reopen it if needed.
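
            (Editorial aside: a minimal sketch of how the parameter might be applied, assuming the Analytics service-level configuration REST endpoint /analytics/config/service on port 8095 accepts it as a form parameter and that an Analytics node restart is needed for it to take effect; the endpoint path, host, credentials, and value below are assumptions to verify against the documentation for the build in use.)

            # Illustrative only: apply a storageWriteRateLimit value to the Analytics
            # service. Endpoint path, host, credentials, and value are assumptions.
            import requests

            ANALYTICS_NODE = "http://172.23.96.23:8095"   # Analytics admin port (assumed)
            AUTH = ("Administrator", "password")          # placeholder credentials

            def set_storage_write_rate_limit(bytes_per_second: int) -> None:
                resp = requests.put(
                    f"{ANALYTICS_NODE}/analytics/config/service",
                    data={"storageWriteRateLimit": bytes_per_second},
                    auth=AUTH,
                )
                resp.raise_for_status()

            set_storage_write_rate_limit(1572864)  # e.g. the 25MB/s-per-node example above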


            jyotsna.nayak Jyotsna Nayak added a comment -

            Murtadha Hubail,
            From this ticket: https://issues.couchbase.com/browse/MB-50856
            I understand that the calculation for storageWriteRateLimit is as follows:

            Set storageWriteRateLimit in bytes to a value equal to:
             (80% of the disk write bandwidth on each node) / (number of partitions on each node)

            I found that the analytics node we are using in this test (.23) is using an SSD (Samsung PM863 500GB) and has a bandwidth of approximately 500MB/s:
            (0.8 * 500)/8 = 50MB
            Is this calculation correct?
            And shall I keep the number of partitions for analytics = 8?
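
            (Editorial aside, not part of the original comment: since storageWriteRateLimit is specified in bytes per second per partition, the 50MB figure above corresponds to roughly 50 × 1,048,576 ≈ 52,428,800 bytes with binary megabytes, or 50,000,000 bytes with decimal megabytes; which convention applies is an assumption to confirm.)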

             


            murtadha.hubail Murtadha Hubail added a comment -

            Jyotsna Nayak,

            The calculation in general looks fine, but I believe the disk write speed tests performed earlier showed that the disk write bandwidth is much less than 500MB/s, so you should take that into account.

            It is up to you whether to keep the partitions at 8 or at the default of 16. Just remember to reflect that in your calculation, along with the actual disk write bandwidth. You can always lower the percentage utilized by Analytics too.

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            Murtadha Hubail,

            From the FIO tests I have run, the write bandwidth for the analytics node is:
            link to the document: here
            value used: 6933kB/s (a comment is left on the document to point to the value used for the calculation)

            Bandwidth: 6933kB/s = 6.933 MB/s
            Keeping the partitions = 16,
            the parameter would then have the value: (0.8 * 6.933)/16 = 0.34665 MB = 34665 Bytes

            Can I use this value?
            Also, how do I lower the percentage utilised by Analytics?

             


            murtadha.hubail Murtadha Hubail added a comment -

            Jyotsna Nayak,

            The bandwidth of this SSD seems extremely low to me. You can use this value, but you should expect the Analytics operations in the test (ingestion/rebalances) to be extremely slow.

            You can always lower the percentage by lowering the 0.8 factor in the calculation.


            jyotsna.nayak Jyotsna Nayak added a comment -

            I have scheduled a test with the 0.8 factor; here
            For a factor of 0.5 and 16 partitions, I observed that the parameter would have the value 21666 Bytes.
            I have reduced the number of partitions from 16 to 4; the calculation is as follows:
            (0.8 * 6.933)/4 = 1.3866 MB = 138660 Bytes

            I have reduced the number of partitions from 16 to 4 and the test is in the queue: here
            Will post my findings after the tests complete their runs.

            jyotsna.nayak Jyotsna Nayak added a comment -

            Murtadha Hubail,
            Both the tests scheduled in the above comment did not complete. The run gets stuck while rebalancing in the cbas node, at 44% for a long time.
            I was able to collect logs for the case where I reduced the partitions. Here they are:

            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.96.15.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.96.19.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.96.20.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.96.23.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.97.177.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.99.157.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.99.158.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.99.159.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/multibucket_issue/collectinfo-2022-03-11T162107-ns_1%40172.23.99.160.zip

            jyotsna.nayak Jyotsna Nayak added a comment -

            Seeing the following in the ns_server_info.log:
            [ns_server:error,2022-03-11T08:23:45.493-08:00,ns_1@172.23.96.23:ns_heart_slow_status_updater<0.7870.0>:ns_heart:grab_one_service_status:409]
            Failed to grab service cbas status: {exit,
                {timeout,
                 {gen_server,call,
                  ['service_agent-cbas',get_status,2000]}},
                [{gen_server,call,3,[{file,"gen_server.erl"},{line,247}]},
                 {ns_heart,grab_one_service_status,1,[{file,"src/ns_heart.erl"},{line,406}]},
                 {ns_heart,'-grab_service_statuses/0-lc$^1/1-1-',1,[{file,"src/ns_heart.erl"},{line,402}]},
                 {ns_heart,current_status_slow_inner,0,[{file,"src/ns_heart.erl"},{line,276}]},
                 {ns_heart,current_status_slow,1,[{file,"src/ns_heart.erl"},{line,235}]},
                 {ns_heart,slow_updater_loop,0,[{file,"src/ns_heart.erl"},{line,229}]}]}

            Murtadha Hubail, can you please point me to the log file I should look at? I am seeing some errors in:
            1. ns_server_analytics_info.log
            2. ns_server_analytics_error.log
            3. ns_server_info.log (as shown above)


            murtadha.hubail Murtadha Hubail added a comment -

            Jyotsna Nayak,
            The analytics_info.log in the Analytics master node (.23) has the information.
            The rebalance started here:

            2022-03-10T16:45:49.398-08:00 WARN CBAS.rebalance.Rebalance [Executor-107:ClusterController] enter Rebalance f6895a60bffad14085b63dea6af729ec: {"id":"f6895a60bffad14085b63dea6af729ec","currentTopologyRev":null,"type":"topology-change-rebalance","keepNodes":[{"nodeInfo":{"nodeId":"1bb12910832b03d7760d874b89acbd35","priority":1970329131942144,"opaque":{"cbas-version":"7.1.0-2333","cc-http-port":"9111","controller-id":"0","host":"172.23.96.23","ns-server-port":"8091","num-iodevices":"4","starting-partition-id":"0","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"c80577861232f3d3d8edfddde5789c7d","priority":1970329131941888,"opaque":{"cbas-version":"7.1.0-2333","cc-http-port":"9111","controller-id":"1","host":"172.23.99.160","ns-server-port":"8091","num-iodevices":"16","starting-partition-id":"4","svc-http-port":"8095"}},"recoveryType":"recovery-full"}],"ejectNodes":[]}
            

            And we were asked to cancel the rebalance here:

            2022-03-11T08:19:06.948-08:00 WARN CBAS.rebalance.Rebalance [HttpExecutor(port:9111)-15] Rebalance (f6895a60bffad14085b63dea6af729ec) cancellation requested
            

            Between the rebalance start and the cancellation, Analytics was trying to disconnect the DCP streams, which involves persisting any in-memory data to disk; that persistence is what was running the entire time.
            As I mentioned earlier, if you limit Analytics writes to 1.3MB/s and you have 60 Analytics collections in this test, things are expected to take this long. I suggest you increase the write rate limit and possibly reduce the number of Analytics collections. I still don't know what type of SSD is capable of only 6.933 MB/s; earlier tests seemed to suggest that the disk on node .23 is capable of ~36MB/s.
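
            (Editorial aside to make the "expected to take this long" point concrete; the amounts of un-persisted data below are hypothetical, and only the arithmetic is meant to illustrate the effect of the write cap.)

            # Rough illustration: time to persist N GB of in-memory Analytics data
            # when writes are capped per partition. All figures are hypothetical.
            def flush_hours(gb_to_persist: float, per_partition_mb_s: float, partitions: int) -> float:
                total_mb_s = per_partition_mb_s * partitions
                return gb_to_persist * 1024 / total_mb_s / 3600

            for gb in (10, 100, 300):
                print(gb, "GB ->", round(flush_hours(gb, 1.39, 4), 1), "hours")
            # ~0.5, ~5.1 and ~15.3 hours respectively at ~5.6 MB/s aggregate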


            jyotsna.nayak Jyotsna Nayak added a comment -

            After the change in disk, rescheduled the test with:
             1. a factor of 0.8 and 16 partitions; the parameter would have the value 2173600 Bytes.

            Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/121/

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            Murtadha Hubail, I have run the test as mentioned in the comment above, with the parameter set to 2173600 Bytes.
            The test is failing after rebalancing all the components, with the following error:
            The cluster is not balanced
            Upon checking the rebalance logs, this is the message printed:

            {"stageInfo":{"analytics":{"totalProgress":2.484999999999952e-11,"perNodeProgress":

            {"ns_1@172.23.99.160":2.484999999999952e-13,"ns_1@172.23.96.23":2.484999999999952e-13}

            ,"startTime":"2022-03-30T18:04:08.826-07:00","completedTime":false,"timeTaken":2554572},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.96.20":1}

            ,"startTime":"2022-03-30T18:04:05.482-07:00","completedTime":"2022-03-30T18:04:05.936-07:00","timeTaken":453},"index":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1}

            ,"startTime":"2022-03-30T18:04:05.936-07:00","completedTime":"2022-03-30T18:04:08.826-07:00","timeTaken":2890},"data":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1}

            ,"startTime":"2022-03-30T18:03:55.918-07:00","completedTime":"2022-03-30T18:04:05.482-07:00","timeTaken":9565},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"9d7d027beca1eaf5d1746604e115a43f","nodesInfo":

            {"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]}

            ,"masterNode":"ns_1@172.23.99.157","startTime":"2022-03-30T18:03:55.913-07:00","completedTime":"2022-03-30T18:46:43.398-07:00","timeTaken":2567486,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n                              {worker_died,\n                               {'EXIT',<0.25460.435>,\n                                {rebalance_failed,\n                                

            {service_error,\n                                  <<\"Rebalance 5692dee195b5f22cd3fb646ea3a742a8 failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 1738. User action: Try again later\">>}

            }}}}."}

            Link to the job :  http://perf.jenkins.couchbase.com/job/themis_multibucket/121/

            logs:

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.15.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.19.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.20.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.23.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.97.177.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.157.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.158.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.159.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.160.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.161.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/tools.zip


            murtadha.hubail Murtadha Hubail added a comment -

            Jyotsna Nayak,

            This is expected when the DCP stream is disconnected from Analytics ungracefully (e.g. as a result of a KV topology change). As the rebalance failure message suggests, some data partitions are 1738 mutations behind. It usually takes less than a minute for all partitions to catch up to the same DCP state. If you retry the rebalance after a minute or so, it should proceed.
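
            (Editorial aside: a minimal sketch of the "try again later" guidance using ns_server's REST API; the host, credentials, node list, and wait times are assumptions, and the rebalance result should still be checked in the UI or via the REST API.)

            # Illustrative only: wait for the Analytics partitions to reach the same
            # DCP state, then re-trigger the rebalance and poll until it finishes.
            import time
            import requests

            HOST = "http://172.23.99.157:8091"    # orchestrator node (assumed)
            AUTH = ("Administrator", "password")  # placeholder credentials
            KNOWN_NODES = ",".join([
                "ns_1@172.23.99.157", "ns_1@172.23.96.23", "ns_1@172.23.99.160",
                # ...remaining cluster nodes
            ])

            time.sleep(120)  # give the partitions a minute or two to catch up

            resp = requests.post(f"{HOST}/controller/rebalance",
                                 data={"knownNodes": KNOWN_NODES, "ejectedNodes": ""},
                                 auth=AUTH)
            resp.raise_for_status()

            # Poll the tasks endpoint until no rebalance task is reported as running.
            while any(t.get("type") == "rebalance" and t.get("status") == "running"
                      for t in requests.get(f"{HOST}/pools/default/tasks", auth=AUTH).json()):
                time.sleep(30)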


            jyotsna.nayak Jyotsna Nayak added a comment -

            This test has run from end to end, and the cluster appeared balanced on the UI a few minutes after the test completed the run.
            Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/121/
            Will rerun the test after increasing the sleep time between the rebalances.
            The issue due to which this bug was initially filed is no longer observed.

            wayne Wayne Siu added a comment -

            Jyotsna Nayak, Murtadha Hubail,
            I'm closing this ticket based on the latest updates (the originally reported issue is no longer observed).
            Please open a new ticket should there be a new issue from the re-run with the new sleep time. Thanks.

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            Analysis after increasing the amount of sleep:

            I have rerun the test after increasing the sleep between the rebalances from 1 hour to:
            1. 2 hours (test failed due to 6 cbas mutations left to catch up; link to the job: here)
            Error message:

            {"stageInfo":{"analytics":{"totalProgress":5.700000000000002e-13,"perNodeProgress":

            {"ns_1@172.23.99.160":5.700000000000002e-15,"ns_1@172.23.96.23":5.700000000000002e-15}

            ,"startTime":"2022-04-12T20:26:36.001-07:00","completedTime":false,"timeTaken":56506},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.96.20":1}

            ,"startTime":"2022-04-12T20:26:32.355-07:00","completedTime":"2022-04-12T20:26:32.869-07:00","timeTaken":514},"index":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1}

            ,"startTime":"2022-04-12T20:26:32.869-07:00","completedTime":"2022-04-12T20:26:36.001-07:00","timeTaken":3132},"data":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1}

            ,"startTime":"2022-04-12T20:26:22.889-07:00","completedTime":"2022-04-12T20:26:32.355-07:00","timeTaken":9466},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"fef9a523cd142ca550b5671cb67f02ec","nodesInfo":

            {"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]}

            ,"masterNode":"ns_1@172.23.99.157","startTime":"2022-04-12T20:26:22.880-07:00","completedTime":"2022-04-12T20:27:32.508-07:00","timeTaken":69628,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n                              {worker_died,\n                               {'EXIT',<0.23164.614>,\n                                {rebalance_failed,\n                                 {service_error,\n                                  <<\"Rebalance cf90e012469a96b7555ad9eb9a0902cc failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 6. User action: Try again later\">>}}}}}."}

            2. 3 hours (test failed due to 1 cbas mutation left to catch up; link to the job: here)
            Error message:
            {"stageInfo":{"analytics":{"totalProgress":5.729979539608404,"perNodeProgress":

            {"ns_1@172.23.99.160":0.05729979539608404,"ns_1@172.23.96.23":0.05729979539608404}

            ,"startTime":"2022-04-21T22:45:58.519-07:00","completedTime":false,"timeTaken":481388},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.96.20":1}

            ,"startTime":"2022-04-21T22:45:54.745-07:00","completedTime":"2022-04-21T22:45:55.266-07:00","timeTaken":520},"index":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1}

            ,"startTime":"2022-04-21T22:45:55.266-07:00","completedTime":"2022-04-21T22:45:58.519-07:00","timeTaken":3253},"data":{"totalProgress":100,"perNodeProgress":

            {"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1}

            ,"startTime":"2022-04-21T22:45:45.579-07:00","completedTime":"2022-04-21T22:45:54.745-07:00","timeTaken":9166},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"a484886399b811651e3c3a8386bdb95c","nodesInfo":

            {"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]}

            ,"masterNode":"ns_1@172.23.99.157","startTime":"2022-04-21T22:45:45.574-07:00","completedTime":"2022-04-21T22:53:59.906-07:00","timeTaken":494332,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n                              {worker_died,\n                               {'EXIT',<0.17599.784>,\n                                {rebalance_failed,\n                                 {service_error,\n                                  <<\"Rebalance 861ea35e761c76836acfa59ee14411da failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 1. User action: Try again later\">>}}}}}."} 


            People

              Assignee: jyotsna.nayak Jyotsna Nayak
              Reporter: jyotsna.nayak Jyotsna Nayak
              Votes: 0
              Watchers: 13
