  1. Couchbase Server
  2. MB-7182

[RN 2.0.1]ns_server experiences random timeouts supposedly due to lack of async io threads causing rebalance to fail and other potential badness

    Details

    • Flagged:
      Release Note

      Description

      SUBJ.

      In many diags we're seeing occasional timeouts here and there. Sometimes, and perhaps most of the time, they don't affect correct operation of the product. After all, Erlang is famous for its fault resiliency.

      But sometimes it causes rebalance to fail. For example, see MB-7166, where mb_master, which supervised ns_orchestrator, which in turn supervised the rebalance, died due to a timeout. Following Erlang's normal error-handling behavior, this caused it to be restarted. But part of the restart was shutting down its child processes, including, obviously, the rebalancer.
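      The cascade described above can be sketched as a toy model. This is not ns_server code: the Proc class, the permanent flag and crash_and_restart are hypothetical names made up for illustration, and the sketch assumes the rebalancer is a one-off job that is not started again automatically when its parents come back.

```python
# Toy model of an Erlang-style supervision tree (illustration only).
class Proc:
    def __init__(self, name, children=(), permanent=True):
        self.name = name
        self.children = list(children)
        self.permanent = permanent  # assumption: one-off jobs are not restarted
        self.alive = True

    def stop(self, log):
        # Children are shut down before the parent itself goes away.
        for child in reversed(self.children):
            child.stop(log)
        if self.alive:
            self.alive = False
            log.append(f"stopped {self.name}")

    def start(self, log):
        self.alive = True
        log.append(f"started {self.name}")
        # Only permanent children come back with the parent.
        for child in self.children:
            if child.permanent:
                child.start(log)


def crash_and_restart(proc, log):
    """Simulate proc dying on a timeout and being restarted."""
    log.append(f"{proc.name} died: timeout")
    proc.stop(log)
    proc.start(log)


rebalancer = Proc("rebalancer", permanent=False)
ns_orchestrator = Proc("ns_orchestrator", [rebalancer])
mb_master = Proc("mb_master", [ns_orchestrator])

log = []
crash_and_restart(mb_master, log)
for line in log:
    print(line)
```

      In this model mb_master and ns_orchestrator end up running again, while the rebalancer stays stopped, which is the outcome the description reports.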

      In my personal experience this is quite easy to hit on physical hardware with spinning disks. But apparently we're now getting it on Xen with SSDs as well, and potentially (MB-7152) on physical hardware with SSDs.
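      The suspected mechanism from the title (too few async I/O threads) can be illustrated with a generic thread pool. This is only a sketch of pool starvation in Python, not of the Erlang VM's async thread pool: POOL_SIZE, the 0.05 s timeout and the function name are assumptions chosen for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

POOL_SIZE = 2  # hypothetical; stands in for a small async I/O thread pool


def starved_call_times_out(pool_size=POOL_SIZE, timeout=0.05):
    """Return (timed_out, reply) for a quick call queued behind slow I/O."""
    disk_done = threading.Event()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Occupy every worker with a "slow disk" job that blocks until released.
        for _ in range(pool_size):
            pool.submit(disk_done.wait)
        # A cheap request now has to wait in the queue behind them.
        quick = pool.submit(lambda: "pong")
        try:
            quick.result(timeout=timeout)
            timed_out = False
        except TimeoutError:
            timed_out = True
        # Once the slow I/O completes, the queued request is served normally.
        disk_done.set()
        reply = quick.result(timeout=5)
    return timed_out, reply


timed_out, reply = starved_call_times_out()
print(timed_out, reply)
```

      The quick call itself is never slow; it times out only because no worker is free to run it, which is consistent with timeouts appearing "here and there" under disk load.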


        Activity

        thuan Thuan Nguyen added a comment -

        I hit this bug again in build 2.0.0-1952 on Windows 2008 R2 64-bit in EC2.
        When adding 2 nodes to a cluster of 6 nodes, rebalance failed after running for more than one hour with the error
        "Resetting rebalance status since it's not really running".

        Collect info for all nodes: https://s3.amazonaws.com/bugdb/jira/MB-7182/8nodes-ci-1952-reset-reb-20121119121657.tgz
        Manifest file for this build: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1952-rel.setup.exe.manifest.xml

        steve Steve Yen added a comment -

        an important 2.0.1 bug

        dipti Dipti Borkar added a comment -

        This may need an Erlang upgrade, moving back to async, or both.

        Given that it's a big change, it is too risky for 2.0.1.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We believe the workaround we merged to couchdb's 2.0.1 branch fixes the problem.

        kzeller kzeller added a comment -

        <class id="cluster"/>

        <issue type="cb" ref="MB-7182"/>

        <rntext>

        <para>
        The server experienced random timeouts, possibly due to a lack of asynchronous I/O threads. This
        caused rebalance to fail. This has been fixed.
        </para>


          People

          • Assignee:
            kzeller kzeller
            Reporter:
            alkondratenko Aleksey Kondratenko (Inactive)
