Couchbase Server / MB-10660

Stuck upr takeover may cause janitor_agent to get stuck, with no chance of recovery

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 3.0
    • 3.0
    • ns_server
    • Security Level: Public
    • None
    • Untriaged
    • No

    Description

      I got an upr takeover to get stuck (I showed it to Mike, since it looked like an ep-engine bug).

      What is notable is that rebalance stop did not work. I grabbed a diag and found that the ns_vbucket_mover process was blocked in a janitor_agent:bulk_set_vbucket_state call, so the rebalance exit was never handled. This part is easy to fix.

      The part that is not easy to fix is clearly visible after I manually killed ns_vbucket_mover. In the second /diag, grabbed just after that, you can see that one node's janitor_agent is still stuck in a call to the replication manager, which in turn is stuck on the upr takeover.

      We'll need to come up with a nice way of fixing the latter issue (janitor_agent getting stuck). The former (ns_vbucket_mover not reacting to stop) is easy to address. No need to rush: we have time to think about this issue and discuss it.
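
The first half of the problem (a process that cannot react to a stop request while it waits on a synchronous call) can be illustrated with a small sketch. This is a Python analogy and not ns_server code: `mover`, `run`, `poll_interval`, and `join_timeout` are hypothetical names, a thread stands in for ns_vbucket_mover, and an empty queue stands in for the reply that never arrives from the stuck callee. It only shows why an unbounded wait hides the stop request, while a bounded wait lets the caller notice it.

```python
import queue
import threading


def mover(replies, stop, poll_interval=None):
    """Wait for a reply; return "done" or "stopped".

    With poll_interval=None the wait is unbounded, so a stop request is
    never observed while the reply is missing.
    """
    while True:
        try:
            replies.get(timeout=poll_interval)
            return "done"
        except queue.Empty:
            # Only reachable when the wait is bounded.
            if stop.is_set():
                return "stopped"


def run(poll_interval, join_timeout=0.5):
    replies = queue.Queue()   # never filled: the callee is stuck
    stop = threading.Event()
    result = []
    worker = threading.Thread(
        target=lambda: result.append(mover(replies, stop, poll_interval)),
        daemon=True,          # lets the program exit even if the worker hangs
    )
    worker.start()
    stop.set()                # the "rebalance stop" request
    worker.join(join_timeout)
    return result[0] if result else "still blocked"
```

Calling `run(None)` models the reported behaviour (the worker never notices the stop), while `run(0.05)` models a caller that waits in bounded slices. The sketch does not address the harder issue, where the callee itself stays stuck after its caller is gone.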

          People

            andreibaranouski Andrei Baranouski
            alkondratenko Aleksey Kondratenko (Inactive)
            Votes: 0
            Watchers: 3
