  1. Couchbase Server
  2. MB-7995

[system test] rebalance failed due to memcached on added node crashed

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Environment:
      physical windows server 2008 R2 64bit

      Description

      Install Couchbase Server 2.0.1-185 on 4 physical servers with 2 separate disks
      Create a cluster with 3 nodes
      10.2.1.61
      10.2.1.62
      10.2.1.63
      Create 2 buckets: default (14GB) and sasl (10GB)
      No views or XDCR created
      Load 20+ million items into both buckets until the resident ratio of each bucket is around 90%
      Access the cluster for 3 hours with the spec on this page http://hub.internal.couchbase.com/confluence/pages/viewpage.action?pageId=6785119
      Add node 10.2.1.64 to the cluster and rebalance.
      Rebalance failed with error

      Rebalance exited with reason {unexpected_exit,
      {'EXIT',<0.28651.49>,
      {{badmatch,[{<20943.32282.9>,noproc}]},
      [{misc,sync_shutdown_many_i_am_trapping_exits,1},
      {misc,try_with_maybe_ignorant_after,2},
      {gen_server,terminate,6},
      {proc_lib,init_p_do_apply,3}]}}}
      ns_orchestrator002 ns_1@10.2.1.61 18:16:19 - Fri Mar 29, 2013
      <0.28617.49> exited with {unexpected_exit,
      {'EXIT',<0.28651.49>,
      {{badmatch,[{<20943.32282.9>,noproc}]},
      [{misc,sync_shutdown_many_i_am_trapping_exits,1},
      {misc,try_with_maybe_ignorant_after,2},
      {gen_server,terminate,6},
      {proc_lib,init_p_do_apply,3}]}}}
      ns_vbucket_mover000 ns_1@10.2.1.61 18:16:18 - Fri Mar 29, 2013
      Control connection to memcached on 'ns_1@10.2.1.64' disconnected: {badmatch,{error,closed}}
      ns_memcached004 ns_1@10.2.1.64 18:16:14 - Fri Mar 29, 2013
      Port server memcached on node 'ns_1@10.2.1.64' exited with status 255. Restarting. Messages:
      Fri Mar 29 18:16:10.684162 Pacific Daylight Time 3: Fatal error in persisting DELETE ``5921A42BFC18C32EC6E223A3'' on vb 98!!! Requeue it...
      Fri Mar 29 18:16:10.684162 Pacific Daylight Time 3: Fatal error in persisting DELETE ``5AFE76C1A82ABEF54B14E9FA'' on vb 98!!! Requeue it...
      Fri Mar 29 18:16:10.684162 Pacific Daylight Time 3: Fatal error in persisting DELETE ``5DC7BEB033392DD5BECAE57C'' on vb 98!!! Requeue it...
      Fri Mar 29 18:16:10.684162 Pacific Daylight Time 3: Fatal error in persisting DELETE ``6156CF15B5109B7AB88829EF'' on vb 98!!! Requeue it...
      Fri Mar 29 18:16:10.684162 Pacific Daylight Time 3: Fatal error in persisti

      Link to manifest file of this build http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.setup.exe.manifest.xml

      Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/2_0_1/201304/4phy-win-201_185-reb-failed-disk-write-failed-20130401-114233.tgz
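
For triage, the memcached messages quoted above can be summarized per vbucket before digging into the full collect_info archive. The following is a minimal sketch, assuming the log lines keep exactly the "Fatal error in persisting ..." shape shown in this report; the function name `summarize_persist_failures` and the regular expression are illustrative and not part of any Couchbase tool.

```python
import re
from collections import Counter

# Matches the persistence-failure lines quoted in this report, for example:
#   ... Fatal error in persisting DELETE ``5921A42BFC18C32EC6E223A3'' on vb 98!!! Requeue it...
# The pattern is an assumption based only on those quoted lines.
FATAL_RE = re.compile(
    r"Fatal error in persisting (?P<op>[A-Z]+) ``(?P<key>[^']*)'' on vb (?P<vb>\d+)!!!"
)


def summarize_persist_failures(lines):
    """Count persistence failures per (vbucket id, operation) pair."""
    counts = Counter()
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            counts[(int(match.group("vb")), match.group("op"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Fri Mar 29 18:16:10.684162 Pacific Daylight Time 3: Fatal error in persisting "
        "DELETE ``5921A42BFC18C32EC6E223A3'' on vb 98!!! Requeue it...",
        "Fri Mar 29 18:15:44.413716 Pacific Daylight Time 3: Fatal error in persisting "
        "SET ``0211845B3991EF6F6C6A9654'' on vb 98!!! Requeue it...",
        # Truncated lines, like the last one in the description, are skipped.
        "Fri Mar 29 18:16:10.684162 Pacific Daylight Time 3: Fatal error in persisti",
    ]
    for (vb, op), n in sorted(summarize_persist_failures(sample).items()):
        print(f"vb {vb} {op}: {n}")
```

If every failure lands on a single vbucket, as in the excerpt here (vb 98), that points at one database file rather than a disk-wide write problem.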


        Activity

        trond Trond Norbye added a comment -

        Wild guess... could it be related to: http://review.couchbase.org/#/c/25411/ ?

        maria Maria McDuff (Inactive) added a comment -

        bumping up to critical because it is related to memcached crashing.

        maria Maria McDuff (Inactive) added a comment -

        moving to 2.0.2

        kzeller kzeller added a comment -

        Confirmed with Abhinav: no RN, internal only 4/16/2013

        mikew Mike Wiederhold added a comment - - edited

        Trond,

        It is not related to that change. This is an issue caused by a race condition between ep-engine and the compactor. The ep-engine team will take a look when we have some time. Below are the important log messages that show why this happened. I'm not sure exactly why we crash though.

        Fri Mar 29 18:15:44.413716 Pacific Daylight Time 3: Warning: failed to save docs to database, numDocs = 145 error=error reading file [errno = 0: `No error', WINAPI error = 2: `The system cannot find the file specified.']
        Fri Mar 29 18:15:44.413716 Pacific Daylight Time 3: Warning: commit failed, cannot save CouchDB docs for vbucket = 98 rev = 5
        Fri Mar 29 18:15:44.413716 Pacific Daylight Time 3: Fatal error in persisting SET ``0211845B3991EF6F6C6A9654'' on vb 98!!! Requeue it...
        Fri Mar 29 18:15:44.413716 Pacific Daylight Time 3: Fatal error in persisting SET ``02E43A02C36F90B69C97A0D8'' on vb 98!!! Requeue it...

        Crashes with status 255 unfortunately.

        Also, if this crash is seen again, please let me know so we can look at the state of the machine, and please stop load to the server as soon as the crash is seen.
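
As a rough illustration of the kind of race described in this comment (not the actual ep-engine or compactor code), the sketch below simulates a flusher that remembers a database file revision while a compactor replaces that file with the next revision. The `<vbucket>.couch.<rev>` naming, the helper names, and the retry-on-miss handling are all assumptions made for the example; the ticket does not say how the real issue was fixed.

```python
import os
import tempfile


def db_path(data_dir, vb, rev):
    # Assumed file naming "<vbucket>.couch.<rev>"; hypothetical helper.
    return os.path.join(data_dir, f"{vb}.couch.{rev}")


def compact(data_dir, vb, rev):
    """Simulated compactor: copy the data to rev + 1, then delete the old file."""
    old, new = db_path(data_dir, vb, rev), db_path(data_dir, vb, rev + 1)
    with open(old, "rb") as src, open(new, "wb") as dst:
        dst.write(src.read())
    os.remove(old)
    return rev + 1


def latest_rev(data_dir, vb):
    """Find the highest revision currently on disk for a vbucket."""
    prefix = f"{vb}.couch."
    return max(
        int(name[len(prefix):])
        for name in os.listdir(data_dir)
        if name.startswith(prefix)
    )


def save_docs(data_dir, vb, cached_rev, payload):
    """Simulated flusher: append to the cached revision, re-resolving once on a miss."""
    try:
        with open(db_path(data_dir, vb, cached_rev), "r+b") as f:
            f.seek(0, os.SEEK_END)
            f.write(payload)
        return cached_rev
    except FileNotFoundError:
        # The file vanished between caching the revision and opening it,
        # which is the "cannot find the file specified" situation in the log.
        rev = latest_rev(data_dir, vb)
        with open(db_path(data_dir, vb, rev), "r+b") as f:
            f.seek(0, os.SEEK_END)
            f.write(payload)
        return rev


if __name__ == "__main__":
    data_dir = tempfile.mkdtemp()
    open(db_path(data_dir, 98, 5), "wb").close()  # vbucket 98, rev 5 as in the log
    compact(data_dir, 98, 5)                       # compactor wins the race
    print("saved to rev", save_docs(data_dir, 98, 5, b"doc"))
```

Without the `except` branch, the stale revision would surface as a file-not-found error on every requeued item, which is consistent with the repeated "Requeue it..." messages, though the log alone does not confirm this is the exact sequence.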

        maria Maria McDuff (Inactive) added a comment -

        MB-7996

          People

          • Assignee:
            mikew Mike Wiederhold
            Reporter:
            thuan Thuan Nguyen
