  1. Couchbase Server
  2. MB-5110

race condition clearing config during node ejection (was: ns_config:clear may silently fail if config saver is running at call time leading to ejected node having part old config)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.8.0
    • Fix Version/s: 2.0-beta
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      This seems to be quite an old bug, but it is still a bug and quite an embarrassing one. Our change to update the config during rebalance seemingly made it much more probable.

      What happens is that ns_config:clear clears the config, waits for the saver, and then reloads the config. The problem is that the wait covers only the currently running save, and a new saver can be spawned if changes were made since the running saver started. That is exactly what happens when the config is cleared while a saver is running, so the config reload races with the newly spawned saver. I have seemingly observed this a few times myself already.
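
The interleaving described above can be reproduced with a small deterministic toy model. This is only a sketch in Python, not the ns_server code (which is Erlang and is not shown in this report): the `ToyConfig` class, its fields, and the `clear_buggy` / `clear_drained` helpers are all hypothetical names, and the saver is modeled as explicit steps rather than a real process. `clear_drained` illustrates one conceivable mitigation (waiting until no saver is pending) and is not claimed to be the fix that was applied.

```python
class ToyConfig:
    """Toy model of an in-memory config with an asynchronous saver (hypothetical)."""

    def __init__(self, initial):
        self.memory = dict(initial)   # live, in-memory config
        self.disk = dict(initial)     # last config persisted to "disk"
        self.saver_snapshot = None    # snapshot held by the running saver, if any
        self.dirty = False            # True if config changed since the saver started

    def start_save(self):
        # The saver works on a snapshot taken when it starts.
        self.saver_snapshot = dict(self.memory)
        self.dirty = False

    def update(self, new_config):
        self.memory = dict(new_config)
        if self.saver_snapshot is not None:
            self.dirty = True         # a saver is running: remember to save again later
        else:
            self.start_save()

    def finish_save(self):
        """Complete the running save; return True if a follow-up saver was spawned."""
        self.disk = self.saver_snapshot
        self.saver_snapshot = None
        if self.dirty:
            self.start_save()
            return True
        return False

    def reload(self):
        self.memory = dict(self.disk)


def clear_buggy(cfg):
    """Clear, wait only for the currently running save, then reload."""
    cfg.update({})
    if cfg.saver_snapshot is not None:
        cfg.finish_save()             # may spawn a new saver that is NOT waited for
    cfg.reload()                      # can read the old config back from disk


def clear_drained(cfg):
    """Clear, wait until no saver is pending at all, then reload (assumed mitigation)."""
    cfg.update({})
    while cfg.saver_snapshot is not None:
        cfg.finish_save()
    cfg.reload()


def demo(clear):
    cfg = ToyConfig({"nodes": ["a", "b"]})
    cfg.update({"nodes": ["a", "b", "c"]})  # starts a saver that is still running
    clear(cfg)                              # clear arrives while that saver runs
    return cfg.memory
```

In this model, `demo(clear_buggy)` ends with the old node list back in memory, because the reload happens before the follow-up saver has written the cleared config, while `demo(clear_drained)` ends with an empty config.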

      UPDATE: When I filed this bug I was thinking about a particular race between ns_config:clear and async config saving. But the bug was actually filed because people (including me) were seeing a weird condition where an ejected node couldn't be added back and thought it was still part of the cluster. As can be seen in the comments below, we traced this down to a race between shutting down the config merger process and clearing the config.

        Attachments


          Activity

            People

            • Assignee:
              alkondratenko Aleksey Kondratenko (Inactive)
              Reporter:
              alkondratenko Aleksey Kondratenko (Inactive)
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved: