MB-33875 (Couchbase Server)

Auto failover time increased by 500ms/600ms in MH

Details

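    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.5.0
    • Fix Version/s: 6.5.0
    • Component/s: ns_server
    • Environment: build : 6.5.0-2864
      hestia cluster, centos 7
    • Triage: Untriaged
    • Is this a Regression?: Yes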

    Description

      After the following change in ns_server, the failover time increased from 110ms to 200+ ms:

      https://github.com/couchbase/ns_server/commit/bed1e6ea9959c99e9cf4cb6942a236be4c7702d1

      See logs attached.

      There was another failover time increase between builds 2864 and 2943, so the total regression is about 500-600ms. I'll update the ticket as soon as I track that one down as well.

      http://showfast.sc.couchbase.com/#/timeline/Linux/reb/failover/all

       

          Activity

            oleksandr.gyryk Alex Gyryk (Inactive) created issue -
            wayne Wayne Siu made changes -
            Field Original Value New Value
            Is this a Regression? Unknown [ 10452 ] Yes [ 10450 ]
            wayne Wayne Siu added a comment -

            Ajit Yagaty [X]
            Just checking. Can you let us know if you have any estimate when you may look at this ticket? Thanks.

            wayne Wayne Siu made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            ajit.yagaty Ajit Yagaty [X] (Inactive) made changes -
            Assignee Ajit Yagaty [ ajit.yagaty ] Abhijeeth Nuthan [ abhijeeth.nuthan ]

            ajit.yagaty Ajit Yagaty [X] (Inactive) added a comment -

            Abhijeeth Nuthan - Can you please take a look at this?
            Abhijeeth.Nuthan Abhijeeth Nuthan made changes -
            Link This issue depends on MB-34378 [ MB-34378 ]

            Abhijeeth.Nuthan Abhijeeth Nuthan added a comment -

            Apart from fixing MB-34378, we can try to pipeline the set_vbucket requests as an optimization.
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Summary
            Original Value: 2x failover time increase on 6.5.0-2864
            New Value: 5x Auto failover time increase on 6.5.0-2864
            wayne Wayne Siu made changes -
            Summary
            Original Value: 5x Auto failover time increase on 6.5.0-2864
            New Value: Auto failover time increased by 500ms in MH on 6.5.0-2864
            wayne Wayne Siu made changes -
            Summary
            Original Value: Auto failover time increased by 500ms in MH on 6.5.0-2864
            New Value: Auto failover time increased by 500ms in MH
            wayne Wayne Siu made changes -
            Environment
            Original Value: hestia cluster, centos 7
            New Value: build : 6.5.0-2864
            hestia cluster, centos 7
            wayne Wayne Siu made changes -
            Description
            Original Value:
            After following change in ns_server the failover time increased from 110ms to 200+ ms:

            [https://github.com/couchbase/ns_server/commit/bed1e6ea9959c99e9cf4cb6942a236be4c7702d1]

            See logs attached.

            There was another failover time increase between 2864 and 2943. So total regression is about 5x. I'll update the ticket as soon as I track that one as well.

            [http://showfast.sc.couchbase.com/#/timeline/Linux/reb/failover/all]

            New Value:
            After following change in ns_server the failover time increased from 110ms to 200+ ms:

            [https://github.com/couchbase/ns_server/commit/bed1e6ea9959c99e9cf4cb6942a236be4c7702d1]

            See logs attached.

            There was another failover time increase between 2864 and 2943. So total regression is about 500/600ms. I'll update the ticket as soon as I track that one as well.

            [http://showfast.sc.couchbase.com/#/timeline/Linux/reb/failover/all]
            wayne Wayne Siu made changes -
            Summary
            Original Value: Auto failover time increased by 500ms in MH
            New Value: Auto failover time increased by 500ms/600ms in MH
            dfinlay Dave Finlay made changes -
            Assignee Abhijeeth Nuthan [ abhijeeth.nuthan ] Artem Stemkovski [ artem ]
            dfinlay Dave Finlay made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]
            Fix Version/s Mad-Hatter [ 15037 ]

            artem Artem Stemkovski added a comment -

            After http://review.couchbase.org/#/c/117246/ performance improved. Now we see failover time around 200ms (vs. 135ms before the durability-related changes).

            All set_vbucket calls together take around 66ms, so we decided that trying to improve this number with pipelining is not worth the additional code. Therefore closing the bug and abandoning the pipelining-related commits.
            artem Artem Stemkovski made changes -
            Fix Version/s Mad-Hatter [ 15037 ]
            Fix Version/s Cheshire-Cat [ 15915 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            wayne Wayne Siu made changes -
            Assignee Artem Stemkovski [ artem ] Korrigan Clark [ korrigan.clark ]
            korrigan.clark Korrigan Clark made changes -

            korrigan.clark Korrigan Clark added a comment -

            Dave Finlay, is this an acceptable regression? Failover time is up to 200ms from 120ms, a 66% increase.
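
The 66% figure in the comment above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `percent_increase` is hypothetical, and the 120ms and 200ms inputs are the approximate failover times quoted in this ticket, not new measurements.

```python
def percent_increase(before_ms: float, after_ms: float) -> float:
    """Return the relative increase from before_ms to after_ms, in percent."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (after_ms - before_ms) / before_ms * 100.0


# Approximate failover times quoted in the ticket (assumed inputs).
baseline_ms = 120.0
current_ms = 200.0

increase = percent_increase(baseline_ms, current_ms)
print(f"{increase:.1f}% increase")  # prints "66.7% increase"
```

Applied to the 135ms baseline quoted in the earlier comment, the same helper gives roughly 48%.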
            dfinlay Dave Finlay added a comment - - edited

            Hi Korry - yes, given the extra work that's happening this is fine - and the fact that this is a relatively small percentage change in the overall failover time. Thanks.

            korrigan.clark Korrigan Clark made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              Assignee: korrigan.clark Korrigan Clark
              Reporter: oleksandr.gyryk Alex Gyryk (Inactive)