Couchbase Server / MB-27454

KV failover/recovery and rebalance-in hangs when an eventing node is processing mutations

Details

    • Untriaged
    • CentOS 64-bit
    • No

    Description

      Steps to Repro:

      ./testrunner -i /tmp/testexec.26743.ini get-cbcollect-info=True -t eventing.eventing_rebalance.EventingRebalance.test_kv_failover_and_recovery_rebalance_with_eventing_node,nodes_init=6,services_init=kv-kv-kv-eventing-eventing-index:n1ql,dataset=default,groups=simple,reset_services=True,doc-per-day=10,skip_cleanup=True,failover_type=hard,recovery_type=full
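
The `-t` argument above packs the test path and its parameters into one comma-separated string. As a reading aid, the sketch below splits it into its parts. The helper name `parse_test_spec` is hypothetical and not part of testrunner. The interpretation of `services_init` (one `-`-separated entry per node, `:` joining services on the same node) is an assumption based on the value lining up with `nodes_init=6`.

```python
def parse_test_spec(spec):
    """Split a testrunner -t value into (test_path, params).

    Hypothetical helper for illustration only; not a testrunner API.
    """
    test_path, *raw_params = spec.split(",")
    params = {}
    for item in raw_params:
        key, _, value = item.partition("=")
        params[key] = value
    return test_path, params


# The -t value copied from the repro command.
spec = (
    "eventing.eventing_rebalance.EventingRebalance."
    "test_kv_failover_and_recovery_rebalance_with_eventing_node,"
    "nodes_init=6,services_init=kv-kv-kv-eventing-eventing-index:n1ql,"
    "dataset=default,groups=simple,reset_services=True,doc-per-day=10,"
    "skip_cleanup=True,failover_type=hard,recovery_type=full"
)

test_path, params = parse_test_spec(spec)

# Assumption: one "-"-separated entry per node in the initial cluster.
services = params["services_init"].split("-")

print(test_path)
print(params["failover_type"], params["recovery_type"])
print(services)
```

Read this way, the initial cluster has three KV nodes, two eventing nodes and one index/n1ql node, and the test performs a hard failover followed by a full recovery.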
      

      Logs attached.

      Attachments

        1. eventing.log 2.zip
          1.28 MB
        2. test_16.zip
          38.71 MB

        Activity

          Tried simulating this issue by running the existing testrunner tests with the patch applied; see run log [1]. Observations from one of the failures:

          02:02:45.192-07:00 - 02:08:02.883-07:00 : During this period, GoCB returned "no access" error
          02:08:05.126-07:00 - 02:08:06.383-07:00 : During this period, GoCB returned "operation has timed out"
          02:08:07.633-07:00 - 02:08:27.308-07:00 : During this period, GoCB returned "temporary failure occurred, try again later"
          02:08:28.308-07:00 - 02:08:51.810-07:00 : And finally it again started throwing "operation has timed out"

          It seems GoCB ends up in a bad state in some cases when a KV node is recovered after failover. If I observe it again, I will file a JIRA ticket against the SDK.

          [1]http://qa.sc.couchbase.com/job/temp_rebalance_even/226/consoleFull

          asingh Abhishek Singh (Inactive) added a comment
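
To make the relative length of the four error windows in the comment above easier to compare, the sketch below computes each duration from the quoted timestamps. The timestamps and error strings are copied from the comment. The `-07:00` offset is dropped because it is identical on every line, and the helper name `window_seconds` is only illustrative.

```python
from datetime import datetime

# (start, end, GoCB error) as quoted in the comment; all times are -07:00.
WINDOWS = [
    ("02:02:45.192", "02:08:02.883", "no access"),
    ("02:08:05.126", "02:08:06.383", "operation has timed out"),
    ("02:08:07.633", "02:08:27.308", "temporary failure occurred, try again later"),
    ("02:08:28.308", "02:08:51.810", "operation has timed out"),
]


def window_seconds(start, end):
    """Return the length of a same-day time window in seconds."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return round(delta.total_seconds(), 3)


durations = [(error, window_seconds(start, end)) for start, end, error in WINDOWS]

for error, seconds in durations:
    print(f"{seconds:>8.3f} s  {error}")
```

The "no access" window accounts for about 318 of the roughly 367 seconds covered by the observation; the other three windows together last under 45 seconds.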

          http://review.couchbase.org/92824 is on unstable. Based on the error seen in the cbcollect shared previously, it should address this issue. FYI, as mentioned in the previous comment, MB-29147 shows up randomly in about one of the 16 or so tests around this scenario and causes a rebalance hang.

          asingh Abhishek Singh (Inactive) added a comment (edited)

          Build couchbase-server-5.5.0-2542 contains eventing commit ef68e1f131b3732b04c9ebc814c3b4d80ceec0a0 with commit message:
          MB-27454 Retry STREAMREQ on NOT_MY_VBUCKET
          https://github.com/couchbase/eventing/commit/ef68e1f131b3732b04c9ebc814c3b4d80ceec0a0

          build-team Couchbase Build Team added a comment
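
The commit message describes the fix as retrying STREAMREQ on NOT_MY_VBUCKET. The sketch below illustrates that general retry pattern only; it is not the eventing code, which is written in Go. Every name in it (`open_stream_with_retry`, `lookup_owner`, `send_stream_request`, the `SUCCESS` label, the node names) is hypothetical. The assumption is that a node which no longer owns a vbucket rejects the request, and that the caller re-resolves the owner before trying again.

```python
import time

NOT_MY_VBUCKET = "NOT_MY_VBUCKET"  # status name taken from the commit message
SUCCESS = "SUCCESS"                # hypothetical label for a successful open


def open_stream_with_retry(vbucket, lookup_owner, send_stream_request,
                           max_attempts=5, backoff_seconds=0.0):
    """Retry a stream request while the contacted node rejects the vbucket.

    lookup_owner(vbucket) -> node name (hypothetical cluster-map lookup)
    send_stream_request(node, vbucket) -> status string (hypothetical transport)
    Returns (node, status, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        node = lookup_owner(vbucket)  # re-resolve the owner on every attempt
        status = send_stream_request(node, vbucket)
        if status != NOT_MY_VBUCKET:
            return node, status, attempt
        time.sleep(backoff_seconds * attempt)  # linear backoff; 0 disables the wait
    raise RuntimeError(
        f"vbucket {vbucket}: still {NOT_MY_VBUCKET} after {max_attempts} attempts"
    )


# Simulated cluster: the map points at the stale owner for two lookups,
# then reflects the owner after recovery and rebalance.
owners = iter(["kv-old", "kv-old", "kv-new"])
result = open_stream_with_retry(
    vbucket=42,
    lookup_owner=lambda vb: next(owners),
    send_stream_request=lambda node, vb: SUCCESS if node == "kv-new" else NOT_MY_VBUCKET,
)
print(result)
```

Bounding the number of attempts is a design choice in this sketch: it keeps a permanently stale map from turning the retry loop itself into a hang.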

          Build couchbase-server-6.0.0-1038 contains eventing commit ef68e1f131b3732b04c9ebc814c3b4d80ceec0a0 with commit message:
          MB-27454 Retry STREAMREQ on NOT_MY_VBUCKET
          https://github.com/couchbase/eventing/commit/ef68e1f131b3732b04c9ebc814c3b4d80ceec0a0

          build-team Couchbase Build Team added a comment

          Validated this on 5.5.0-2637. I did not notice this hang anymore. Will reopen the bug if it happens again.

          Balakumaran.Gopal Balakumaran Gopal added a comment

          People

            asingh Abhishek Singh (Inactive)
            Balakumaran.Gopal Balakumaran Gopal