Couchbase Server MB-7007: Killing myself errors after upgrade to 1.8.1

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.1
    • Fix Version/s: None
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      We are running Couchbase Server 1.8.1-937 on 10 Linux nodes with a bucket of about 200M items.

      About a month and a half ago we upgraded the servers from 1.7 to 1.8.1 and started to see the following error messages in the log:

      killing myself due to unexpected upstream sender exit with reason:
      {{badmatch,{error,closed}},
       [{ebucketmigrator_srv,upstream_sender_loop,1}]}
      ebucketmigrator_srv000   ns_1@172.19.4.41   03:52:33 - Thu Sep 20, 2012
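
      For what it is worth, this error shape is what an Erlang process produces when it asserts that a socket write succeeded but the remote side has already closed the connection. A minimal sketch of that pattern (illustrative only, not the actual ebucketmigrator_srv code; the module name and message format are made up):

          -module(sender_sketch).              %% hypothetical module, illustration only
          -export([upstream_sender_loop/1]).

          %% The `ok = ...` match asserts that every send succeeds. If the peer
          %% closes the socket, gen_tcp:send/2 returns {error, closed}, the match
          %% fails, and the process exits with {{badmatch,{error,closed}}, ...},
          %% the same reason shown in the log line above.
          upstream_sender_loop(Socket) ->
              receive
                  {send_batch, Data} ->
                      ok = gen_tcp:send(Socket, Data),
                      upstream_sender_loop(Socket)
              end.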

      In many cases this error comes from the same node, but sometimes from different ones.

      Since the upgrade we have experienced one node failure: the node became unresponsive after emitting multiple errors of this kind. In other cases (a single error appearing about once a day), there seems to be no impact.

      A Google search gave me only the following references for this error:

      http://www.couchbase.com/issues/browse/MB-5171
      http://www.couchbase.com/issues/browse/MB-5467
      http://www.couchbase.com/issues/secure/ReleaseNote.jspa?projectId=10010&version=10295

      According to the last reference, it looks like this error was supposed to be fixed in 1.8.1.

      I also can't find clear information on whether this is a harmful error, or on how to fix or prevent it.

      Attachments

      1. cbcollect_info_20121025-063018.aa
        9.54 MB
        Dmytro Kovalov
      2. cbcollect_info_20121025-063018.ab
        9.54 MB
        Dmytro Kovalov
      3. cbcollect_info_20121025-063018.ac
        9.54 MB
        Dmytro Kovalov
      4. cbcollect_info_20121025-063018.ad
        4.82 MB
        Dmytro Kovalov

        Activity

        Aleksey Kondratenko (Inactive) added a comment -

        I cannot explain what I see. The debug logs have rotated past the interesting moments, but the info logs have a lot of useful detail.

        One such moment is around Sep 19 18:30 and onwards. There are lots of ebucketmigrator restarts, caused mostly by the source node (.41) closing its tap connection. They are rapidly restarted by their supervisor but quickly fail again, which causes the supervisor to give up and fail itself. The parent supervisor restarts the replication supervisor, and the janitor sets up replications again in about 10 seconds, only to start this 'loop' again.
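
        (For readers less familiar with OTP: the "give up and fail" behaviour is the standard supervisor restart-intensity mechanism. A hedged sketch of the kind of supervisor spec involved; the module name, the child start arguments, and the 3-restarts-in-10-seconds budget are illustrative, not ns_server's actual values.)

            -module(replication_sup_sketch).   %% illustrative, not the real ns_server module
            -behaviour(supervisor).
            -export([start_link/0, init/1]).

            start_link() ->
                supervisor:start_link(?MODULE, []).

            %% Allow at most 3 child restarts within 10 seconds (numbers made up).
            %% If the ebucketmigrator children crash faster than that (for example
            %% because the source node keeps closing the tap connection), the
            %% supervisor terminates with reached_max_restart_intensity, its own
            %% parent restarts it, and the janitor re-creates the replications:
            %% the "loop" described above.
            init([]) ->
                Child = {ebucketmigrator_srv,
                         {ebucketmigrator_srv, start_link, []},   %% real start args omitted
                         permanent, 5000, worker, [ebucketmigrator_srv]},
                {ok, {{one_for_one, 3, 10}, [Child]}}.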

        It appears that eventually the replicas fall far enough behind that tap can no longer reuse checkpoints (a rough sketch of that decision follows the log excerpt below). It seems to start at around this:

        [rebalance:info] [2012-09-19 20:56:53] [ns_1@172.19.4.41:<0.26205.227>:ebucketmigrator_srv:init:252] Starting tap stream:
        [{vbuckets,[17,320,321,322,323,932,933,934,935,936,937,938]},
         {checkpoints,[{17,32722},
                       {320,32508},
                       {321,32523},
                       {322,32566},
                       {323,32393},
                       {932,32455},
                       {933,32368},
                       {934,32570},
                       {935,32578},
                       {936,32505},
                       {937,32580},
                       {938,32547}]},
         {name,"replication_ns_1@172.19.4.44"},
         {takeover,false}]
        {{"172.19.4.41",11209},
         {"172.19.4.44",11209},
         [{username,"default"},
          {password,[]},
          {vbuckets,[17,320,321,322,323,932,933,934,935,936,937,938]},
          {takeover,false},
          {suffix,"ns_1@172.19.4.44"}]}
        [rebalance:info] [2012-09-19 20:56:53] [ns_1@172.19.4.41:<0.26205.227>:ebucketmigrator_srv:process_upstream:465] Initial stream for vbucket 320

        Soon after that we see backfills for many more vbuckets, on this and other tap connections too.
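
        (To illustrate the checkpoint point above: if the replica's last known checkpoint has already been dropped on the source, the tap stream cannot resume from it and has to backfill the vbucket from scratch. A simplified sketch of that decision; this is not ep-engine code, and the function and variable names are invented.)

            -module(checkpoint_sketch).        %% invented name, illustration only
            -export([stream_mode/2]).

            %% "Initial stream for vbucket 320" in the log above corresponds to
            %% the full_backfill branch here.
            stream_mode(ReplicaCheckpoint, OldestCheckpointOnSource)
              when ReplicaCheckpoint >= OldestCheckpointOnSource ->
                {resume_from_checkpoint, ReplicaCheckpoint};
            stream_mode(_ReplicaCheckpoint, _OldestCheckpointOnSource) ->
                full_backfill.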

        The last reached_max_restart_intensity I see is at 22:33 on Oct 19. After that there is a period of massive stats-archiver timeouts, perhaps indicating a lot of disk contention.

        At around 3 AM the next night (Oct 20) the box was apparently restarted (or just Couchbase Server; it doesn't matter which).

        Aleksey Kondratenko (Inactive) added a comment -

        Farshid, you gatekept all the bugs and may know more. Feel free to comment/help, or assign to Chiyoung for further analysis.

        Farshid Ghods (Inactive) added a comment -

        Hi Dmytro,

        Would you please confirm whether you have applied this patch to the 1.8.1 cluster:

        http://www.couchbase.com/issues/browse/MB-5343

        Aleksey Kondratenko (Inactive) added a comment -

        I don't know if MB-5343 is relevant at all, but this condition cannot happen with 2.1.0 and 2.0.1.


          People

          • Assignee: Unassigned
          • Reporter: Dmytro Kovalov
          • Votes: 0
          • Watchers: 1


              Gerrit Reviews

              There are no open Gerrit changes