MB-7059

[system test] beam.smp is running on node 43 but all other nodes see this node as down

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      centos 6.2 64bit build 2.0.0-1908

      Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 32 GB RAM and a 400 GB SSD disk.
      • 24.8 GB RAM is allocated to Couchbase Server on each node.
      • SSD disk formatted ext4, mounted on /data.
      • Each server has its own SSD drive; no disk is shared with another server.
      • Create a cluster of 6 nodes running Couchbase Server 2.0.0-1908.
      • Link to manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1908-rel.rpm.manifest.xml
      • The cluster has 2 buckets: default (12 GB, 2 replicas) and saslbucket (12 GB, 1 replica).
      • Each bucket has one design doc with 2 views (default: d1, saslbucket: d11).

      10.6.2.37
      10.6.2.38
      10.6.2.44
      10.6.2.45
      10.6.2.42
      10.6.2.43

      • Load 16 million items into the default bucket and 20 million items into saslbucket. Each key has a size from 512 to 1024 bytes.
      • After loading is done, wait for initial indexing. Disable view compaction.
      • After initial indexing is done, mutate all items with sizes from 1024 to 1512 bytes.
      • Query all 4 views from the 2 design docs.
      • Do a swap rebalance: remove nodes 39 and 40, and add nodes 44 and 45.
      • At the end of rebalancing saslbucket, the rebalance exited with a timeout on node 43.
      • Then saw many reset connections to mccouch. Updated bug MB-7046.
      • Killed all loads pointing to this cluster. Node 43 did not come back to a stable state.
      • beam.smp is running, but node 43 is still seen as down.
      • Killed beam.smp with SIGUSR1 to create an Erlang crash dump (see the sketch just below).
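
      A minimal sketch of that last step (sending SIGUSR1 to beam.smp so the Erlang VM writes a crash dump), in Python using only the standard library. The process match pattern is an assumption for illustration, not taken from this report; the VM writes erl_crash.dump to the location given by ERL_CRASH_DUMP (or its working directory) and terminates afterwards.

      #!/usr/bin/env python
      # Hypothetical sketch: trigger an Erlang crash dump by sending SIGUSR1 to beam.smp.
      # Assumes a single beam.smp process on the node.
      import os
      import signal
      import subprocess

      def beam_pid():
          # pgrep prints one PID per matching process; take the first beam.smp found.
          out = subprocess.check_output(["pgrep", "-f", "beam.smp"])
          return int(out.split()[0])

      if __name__ == "__main__":
          pid = beam_pid()
          print("sending SIGUSR1 to beam.smp pid %d" % pid)
          os.kill(pid, signal.SIGUSR1)
          # The VM then writes its crash dump and exits; the write can take a while,
          # so wait for the file to stop growing before collecting it.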

      Link to collect_info from all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201210/orange-ci-1908-node43-down-erl-hang-20121030.tgz


        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        So I screwed up this one a bit.

        As pointed out, we saw node .43 sitting there doing nothing.

        I ssh-ed to this machine and found Erlang eating 100% CPU and about 8 GB of RAM.

        Then I sent SIGUSR1 to that process, but Erlang seemed to ignore it, so I concluded it was stuck worse than expected.

        I attached gdb to that process and found all threads but one idle. The one busy thread seemed to be stuck in some bignum code in Erlang: I ran the 'finish' command to wait for it to step out of a call, and a few seconds later concluded it was stuck there, apparently looping.

        Then I decided to capture the process's state in a core dump, and this was my mistake number 1. I did 'call abort()' in gdb, causing the process to abort and dump core. But because I did that on the thread that was most valuable, its backtrace ended up with no useful information: instead it contained the stack frames gdb set up for the abort() call. Stupid move.

        So information was lost there.

        Also, by examining backtraces in the core dump, I found that Erlang had actually started writing a crash dump, and I found the partially written crash dump on disk. Unfortunately its incompleteness doesn't allow me to draw any conclusions. So mistake (and lesson) #2: I should have waited longer for Erlang to finish writing the crash dump.

        Sorry, folks. It appears we'll have to wait until this happens again.
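
        For reference, a non-destructive way to get the same core dump is to let gdb write it with generate-core-file (gcore) instead of calling abort() in the busy thread; that snapshots every thread's stack as-is and then detaches. A minimal sketch, assuming gdb is installed on the node and the beam.smp PID is known; the output path is illustrative.

        #!/usr/bin/env python
        # Hypothetical sketch: capture a core of a running beam.smp without aborting it.
        # gdb's generate-core-file snapshots the process and then detaches, so the busy
        # thread's backtrace is preserved rather than overwritten by frames for abort().
        import subprocess
        import sys

        def dump_core(pid, out_path="/tmp/beam.core"):
            subprocess.check_call([
                "gdb", "-p", str(pid), "-batch",
                "-ex", "set pagination off",
                "-ex", "thread apply all bt",              # log all backtraces first
                "-ex", "generate-core-file %s" % out_path, # write the core, then detach
            ])
            return out_path

        if __name__ == "__main__":
            print(dump_core(int(sys.argv[1])))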

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        See above. Please reassign to me when this happens next time.

        Ideally we'd need both the crash dump and the core dump. So if I'm around the next time this happens, please let me know and I'll gather both pieces of information.
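
        One possible collection order next time, so that neither artifact destroys the other: take the core dump first (non-destructive), then send SIGUSR1 and wait for the Erlang crash dump to finish being written. A sketch under the assumption that gcore is available on the node; both output paths are illustrative, not from this ticket.

        #!/usr/bin/env python
        # Hypothetical sketch: gather both a core dump and an Erlang crash dump.
        import os
        import signal
        import subprocess
        import time

        def collect(pid, core_prefix="/tmp/beam", crash_dump="/tmp/erl_crash.dump"):
            # 1. Snapshot the live process without killing it (writes <core_prefix>.<pid>).
            subprocess.check_call(["gcore", "-o", core_prefix, str(pid)])
            # 2. Ask the VM for a crash dump; this terminates beam.smp.
            os.kill(pid, signal.SIGUSR1)
            # 3. Wait until the crash dump stops growing before copying it off the node,
            #    to avoid collecting a truncated file again.
            last_size = -1
            while True:
                size = os.path.getsize(crash_dump) if os.path.exists(crash_dump) else 0
                if size > 0 and size == last_size:
                    break
                last_size = size
                time.sleep(5)
            return "%s.%d" % (core_prefix, pid), crash_dump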

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Logs from the crashed nodes only have lots of timeouts left and right. Perhaps they are related to this condition in the Erlang VM, or just to MB-6595, or maybe a side effect of the lack of async IO threads.

        thuan Thuan Nguyen added a comment -

        I will let you know when I hit this bug again.

        steve Steve Yen added a comment -

        Reviewed in bug-scrub meeting... waiting for a possible repro.

        steve Steve Yen added a comment -

        Per bug-scrub meeting: if we see this again, please reopen / recreate.


          People

          • Assignee: Thuan Nguyen
          • Reporter: Thuan Nguyen
          • Votes: 0
          • Watchers: 2


              Gerrit Reviews

              There are no open Gerrit changes