Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-5075

after one node is auto failovered, another node CPU goes very high and becomes unresponsive

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • 2.0-beta
    • 1.8.0
    • ns_server
    • Security Level: Public
    • EC2, 4-node cluster, m2.xlarge instance type, Ubuntu 10.04

    Description

      One node is failed over automatically, due to what looks like some kernel or driver bug. That's OK. But then right after that, a 2nd node in the cluster which otherwise looks healthy has CPU spike very high (seen in the AWS administration console), becomes unresponsive, and has to be rebooted.

      This happened more than once. On a previous time, a process logger caught beam.smp at 100% CPU, AND memcached nearly 100%.

      Logs will be provided in a private comment.

      On the latest one, the first failing node has dmesg ouput like this:

      [302338.512230] INFO: task xfssyncd:8805 blocked for more than 120 seconds.
      [302338.512241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [302338.512246] xfssyncd D ffffffff80547400 0 8805 2 0x00000000
      [302338.512252] ffff880444055be0 0000000000000246 ffff880443d1bb60 ffff880444055b60
      [302338.512259] ffff8804423677e8 ffff880444055ba8 000000000000a380 ffff880443becb28
      [302338.512265] 000000000000a380 ffff880444055b90 000000000000a380 000000000000a380
      [302338.512271] Call Trace:
      [302338.512280] [<ffffffff80535a45>] schedule_timeout+0x175/0x1d0
      [302338.512286] [<ffffffff8020b865>] ? __bio_add_page+0x205/0x270
      [302338.512291] [<ffffffff80534c59>] wait_for_common+0xf9/0x210
      [302338.512297] [<ffffffff80130180>] ? default_wake_function+0x0/0x10
      [302338.512301] [<ffffffff80534e08>] wait_for_completion+0x18/0x20
      [302338.512306] [<ffffffff80348a89>] xfs_reclaim_inode+0x139/0x1a0
      [302338.512310] [<ffffffff803499a3>] xfs_reclaim_inode_now+0x93/0xa0
      [302338.512314] [<ffffffff80349910>] ? xfs_reclaim_inode_now+0x0/0xa0
      [302338.512318] [<ffffffff80349502>] xfs_inode_ag_walk+0x72/0xd0
      [302338.512322] [<ffffffff80349910>] ? xfs_reclaim_inode_now+0x0/0xa0
      [302338.512326] [<ffffffff803495c7>] xfs_inode_ag_iterator+0x67/0xa0
      [302338.512330] [<ffffffff80349614>] xfs_reclaim_inodes+0x14/0x20
      [302338.512334] [<ffffffff80349679>] xfs_sync_worker+0x59/0x90
      [302338.512337] [<ffffffff80348d07>] xfssyncd+0x177/0x220
      [302338.512341] [<ffffffff80348b90>] ? xfssyncd+0x0/0x220
      [302338.512347] [<ffffffff80154ca6>] kthread+0xa6/0xb0
      [302338.512353] [<ffffffff8010a57a>] child_rip+0xa/0x20
      [302338.512358] [<ffffffff80154c00>] ? kthread+0x0/0xb0
      [302338.512362] [<ffffffff8010a570>] ? child_rip+0x0/0x20
      [302338.512368] INFO: task beam.smp:9594 blocked for more than 120 seconds.
      [302338.512372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [302338.512376] beam.smp D ffffffff801a4ab0 0 9594 1 0x00000000
      [302338.512381] ffff88044389fc78 0000000000000286 0000000000000001 ffff88044389fbf8
      [302338.512388] ffff88044458e790 ffff88044389fc40 000000000000a380 ffff88044389c3e8
      [302338.512394] 000000000000a380 ffff88044389fc28 000000000000a380 000000000000a380
      [302338.512399] Call Trace:
      [302338.512409] [<ffffffffa006a616>] ? dm_get_table+0x46/0x60 [dm_mod]
      [302338.512417] [<ffffffffa006a79b>] ? dm_unplug_all+0x4b/0x70 [dm_mod]
      [302338.512422] [<ffffffff801a4ab0>] ? sync_page+0x0/0x50
      [302338.512426] [<ffffffff805357d8>] io_schedule+0x28/0x40
      [302338.512430] [<ffffffff801a4aed>] sync_page+0x3d/0x50
      [302338.512433] [<ffffffff80535c97>] __wait_on_bit+0x57/0x80
      [302338.512437] [<ffffffff801a4c5e>] wait_on_page_bit+0x6e/0x80
      [302338.512441] [<ffffffff80155290>] ? wake_bit_function+0x0/0x40
      [302338.512446] [<ffffffff801ae770>] ? pagevec_lookup_tag+0x20/0x30
      [302338.512450] [<ffffffff801a5055>] wait_on_page_writeback_range+0xf5/0x190
      [302338.512454] [<ffffffff801a5117>] filemap_fdatawait+0x27/0x30
      [302338.512460] [<ffffffff803417cd>] xfs_wait_on_pages+0x2d/0x40
      [302338.512465] [<ffffffff80339b76>] xfs_fsync+0x46/0x1b0
      [302338.512468] [<ffffffff801a5173>] ? __filemap_fdatawrite_range+0x53/0x60
      [302338.512473] [<ffffffff8034160c>] xfs_file_fsync+0x6c/0x80
      [302338.512479] [<ffffffff80204a66>] vfs_fsync+0x86/0xf0
      [302338.512482] [<ffffffff80204b09>] do_fsync+0x39/0x60
      [302338.512487] [<ffffffff80204b5b>] sys_fsync+0xb/0x10
      [302338.512491] [<ffffffff801099a8>] system_call_fastpath+0x16/0x1b
      [302338.512496] [<ffffffff80109940>] ? system_call+0x0/0x52
      [302338.512502] INFO: task memcached:9869 blocked for more than 120 seconds.
      [302338.512505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [302338.512509] memcached D ffffffff80547400 0 9869 9600 0x00000000
      [302338.512515] ffff88044431bc78 0000000000000282 0000000000000001 ffff88044431bbf8
      [302338.512521] ffff880443f41df0 ffff88044431bc40 000000000000a380 ffff880443de07e8
      [302338.512528] 000000000000a380 ffff88044431bc28 000000000000a380 000000000000a380
      [302338.512534] Call Trace:
      [302338.512540] [<ffffffffa006a616>] ? dm_get_table+0x46/0x60 [dm_mod]
      [302338.512548] [<ffffffffa006a79b>] ? dm_unplug_all+0x4b/0x70 [dm_mod]
      [302338.512552] [<ffffffff801a4ab0>] ? sync_page+0x0/0x50
      [302338.512556] [<ffffffff805357d8>] io_schedule+0x28/0x40
      [302338.512560] [<ffffffff801a4aed>] sync_page+0x3d/0x50
      [302338.512563] [<ffffffff80535c97>] __wait_on_bit+0x57/0x80
      [302338.512568] [<ffffffff801a4c5e>] wait_on_page_bit+0x6e/0x80
      [302338.512573] [<ffffffff80155290>] ? wake_bit_function+0x0/0x40
      [302338.512577] [<ffffffff801ae770>] ? pagevec_lookup_tag+0x20/0x30
      [302338.512581] [<ffffffff801a5055>] wait_on_page_writeback_range+0xf5/0x190
      [302338.512585] [<ffffffff801a5117>] filemap_fdatawait+0x27/0x30
      [302338.512590] [<ffffffff803417cd>] xfs_wait_on_pages+0x2d/0x40
      [302338.512594] [<ffffffff80339b76>] xfs_fsync+0x46/0x1b0
      [302338.512597] [<ffffffff801a5173>] ? __filemap_fdatawrite_range+0x53/0x60
      [302338.512603] [<ffffffff8034160c>] xfs_file_fsync+0x6c/0x80
      [302338.512607] [<ffffffff80204a66>] vfs_fsync+0x86/0xf0
      [302338.512610] [<ffffffff80204b09>] do_fsync+0x39/0x60
      [302338.512614] [<ffffffff80204b3e>] sys_fdatasync+0xe/0x20
      [302338.512618] [<ffffffff801099a8>] system_call_fastpath+0x16/0x1b
      [302338.512622] [<ffffffff80109940>] ? system_call+0x0/0x52

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            alkondratenko Aleksey Kondratenko (Inactive)
            TimSmith Tim Smith (Inactive)
            Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty