  1. Couchbase Server
  2. MB-6992

rebalance hangs after failing over disconnected node

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: couchbase-bucket, ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      build 10.3.3.59

      Description

      One node went down while loading data on a 22-node cluster (possibly related to the Xen hypervisor, as the node could not ping its gateway and the network interface needed to be restarted).
      While the node was down, I tried to fail it over and rebalance.
      However, the rebalance never completes, and there appears to be no rebalance activity occurring on TAP.

      Some activity seen in logs at time of node down:

      10.3.3.59 sees the .60 nodedown:

      [user:warn,2012-10-22T11:06:38.896,ns_1@10.3.3.59:ns_node_disco:ns_node_disco:handle_info:168]Node 'ns_1@10.3.3.59' saw that node 'ns_1@10.3.3.60' went down.

      Around the same time, node .60 shows:

      [ns_server:error,2012-10-22T11:06:00.350,ns_1@10.3.3.60:<0.12281.36>:ns_janitor:cleanup_with_states:84]Bucket "default" not yet ready on ['ns_1@10.3.2.84','ns_1@10.3.2.85',
      'ns_1@10.3.2.110','ns_1@10.3.2.111',
      'ns_1@10.3.2.112','ns_1@10.3.2.113',
      'ns_1@10.3.2.114','ns_1@10.3.2.115',
      'ns_1@10.3.3.59','ns_1@10.3.3.62',
      'ns_1@10.3.3.65','ns_1@10.3.3.66',
      'ns_1@10.3.3.69','ns_1@10.3.3.70',
      'ns_1@10.3.121.90','ns_1@10.3.121.91',
      'ns_1@10.3.2.107','ns_1@10.3.2.108',
      'ns_1@10.3.2.109']
      [ns_server:debug,2012-10-22T11:06:07.388,ns_1@10.3.3.60:<0.12508.36>:janitor_agent:new_style_query_vbucket_states_loop:116]Exception from query_vbucket_states of "default":'ns_1@10.3.2.85'
      {'EXIT',{{nodedown,'ns_1@10.3.2.85'},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.3.2.85'},
      query_vbucket_states,infinity]}}}
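
When lining up entries from different nodes, it can help to parse the log headers rather than compare them by eye. The following is a minimal Python sketch, not part of ns_server. The header layout `[component:level,timestamp,node:process:module:function:line]message` is an assumption inferred from the excerpts above, and the names `HEADER` and `parse_log_entry` are hypothetical. The second sample entry has its node list shortened to `[...]`.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the excerpts in this report:
# [component:level,timestamp,node:process:module:function:line]message
HEADER = re.compile(
    r"^\[(?P<component>[^:]+):(?P<level>[^,]+),(?P<ts>[^,]+),"
    r"(?P<origin>[^\]]*)\](?P<message>.*)$",
    re.DOTALL,
)


def parse_log_entry(entry):
    """Parse one log entry into a dict; raise ValueError if the header does not match."""
    match = HEADER.match(entry.strip())
    if match is None:
        raise ValueError("unrecognized log header")
    # The origin field has five colon-separated parts in the samples seen here.
    node, process, module, function, line = match.group("origin").split(":")
    return {
        "component": match.group("component"),
        "level": match.group("level"),
        "timestamp": datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f"),
        "node": node,
        "process": process,
        "module": module,
        "function": function,
        "line": int(line),
        "message": match.group("message"),
    }


# Entry copied from the report.
first = parse_log_entry(
    "[user:warn,2012-10-22T11:06:38.896,ns_1@10.3.3.59:ns_node_disco:"
    "ns_node_disco:handle_info:168]Node 'ns_1@10.3.3.59' saw that node "
    "'ns_1@10.3.3.60' went down."
)
# Entry copied from the report, with the node list shortened to [...].
second = parse_log_entry(
    "[ns_server:error,2012-10-22T11:06:00.350,ns_1@10.3.3.60:<0.12281.36>:"
    'ns_janitor:cleanup_with_states:84]Bucket "default" not yet ready on [...]'
)

# Gap between the janitor error on .60 and the nodedown seen by .59.
gap = (first["timestamp"] - second["timestamp"]).total_seconds()
print(f"{second['node']} -> {first['node']}: {gap:.3f} s apart")
```

The difference is only meaningful if the clocks on the two nodes are synchronized, which the logs alone do not confirm.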

      1. 10.3.3.59.debug.tar.gz
        3.89 MB
        Tommie McAfee
      2. 10.3.3.60.debug.tar.gz
        830 kB
        Tommie McAfee

        Activity

        chiyoung Chiyoung Seo added a comment -

        I dumped the connection stats from the memcached layer on 10.3.3.65 and found one connection that is still in the ewouldblock state and corresponds to the vb_checkpoint_persistence command:

        STAT conn 0x58cf080
        STAT socket 83
        STAT protocol binary
        STAT transport TCP
        STAT nevents 20
        STAT sasl_conn 0xcbaf5390
        STAT state conn_nread
        STAT substate bin_reading_packet
        STAT registered_in_libevent 0
        STAT ev_flags 12
        STAT which 2
        STAT rbuf 0x58d0000
        STAT rcurr 0x58d0020
        STAT rsize 2048
        STAT rbytes 0
        STAT wbuf 0x58e7800
        STAT wcurr 0x58fb000
        STAT wsize 2048
        STAT wbytes 24
        STAT write_and_go 0x4104f0
        STAT write_and_free (nil)
        STAT ritem 0x58d0020
        STAT rlbytes 0
        STAT item (nil)
        STAT store_op 0
        STAT sbytes 0
        STAT iov 0x58e0800
        STAT iovsize 400
        STAT iovused 0
        STAT msglist 0x58d5440
        STAT msgsize 10
        STAT msgused 1
        STAT msgcurr 0
        STAT msgbytes 0
        STAT ilist 0x58d9100
        STAT isize 200
        STAT icurr 0x58d9100
        STAT ileft 0
        STAT suffixlist 0x58539a0
        STAT suffixsize 20
        STAT suffixcurr 0x58539a0
        STAT suffixleft 0
        STAT noreply 0
        STAT refcount 1
        STAT dynamic_buffer.buffer (nil)
        STAT dynamic_buffer.size 2048
        STAT dynamic_buffer.offset 24
        STAT engine_storage 0xcd6fb0a0
        STAT cas 0
        STAT cmd 177
        STAT opaque 0
        STAT keylen 0
        STAT list_state 0
        STAT next (nil)
        STAT thread 0x10c55f0
        STAT aiostat 0
        STAT ewouldblock 1
        STAT tap_iterator (nil)

        I'm debugging it further now.
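
For anyone repeating this check on a node with many connections, a dump like the one above can be filtered mechanically. The following is a minimal Python sketch, not an ep-engine or memcached tool. It assumes the dump is plain `STAT key value` lines and that each connection's block starts with a `STAT conn` line. The function names are hypothetical, and the second connection in the sample is made up purely for contrast.

```python
def parse_conn_stats(text):
    """Split a 'STAT key value' dump into one dict per connection.

    Assumption: each connection's block begins with a 'STAT conn' line.
    """
    conns = []
    for raw in text.splitlines():
        parts = raw.split(None, 2)
        if len(parts) < 3 or parts[0] != "STAT":
            continue  # skip blank lines and anything that is not a full STAT line
        key, value = parts[1], parts[2].strip()
        if key == "conn":
            conns.append({})
        if conns:
            conns[-1][key] = value
    return conns


def blocked_connections(conns):
    """Return the connections whose ewouldblock flag is set."""
    return [c for c in conns if c.get("ewouldblock") == "1"]


# First block: a subset of the fields from the dump above.
# Second block: hypothetical values, included only to show the filter.
sample = """
STAT conn 0x58cf080
STAT socket 83
STAT state conn_nread
STAT cmd 177
STAT ewouldblock 1
STAT conn 0x0000001
STAT socket 84
STAT state conn_waiting
STAT cmd 0
STAT ewouldblock 0
"""

conns = parse_conn_stats(sample)
for conn in blocked_connections(conns):
    # cmd is printed in decimal; binary protocol opcodes are usually quoted in hex.
    opcode = hex(int(conn["cmd"]))
    print(f"socket {conn['socket']} blocked in {conn['state']}, cmd {opcode}")
```

Converting `cmd 177` to hex gives `0xb1`, which is easier to compare against opcode tables when identifying the blocked command.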

        chiyoung Chiyoung Seo added a comment -

        Tommie,

        As we saw, rebalancing out 10.3.3.65 was successful. While debugging this issue, I ran into problems on .65 when doing GETs for non-resident items, which put my connection into the ewouldblock state in the memcached layer. I will continue to debug this.

        Please update the bug if you see the same rebalance hang again.

        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #451 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/451/)
        MB-6992 Add more informative logs to checkpoint prioritization (Revision 3c719d47ca41285bbcbc61817f719180448f1042)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • src/ep.cc
        chiyoung Chiyoung Seo added a comment -

        http://review.couchbase.org/#/c/22022/

        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #452 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/452/)
        MB-6992 Control the flusher execution by the transaction size (Revision b327be09a1f971145fda5c249b4fa7a8304b1920)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • tests/ep_testsuite.cc
        • src/ep.cc
        • src/ep.hh
        • src/flusher.hh
        • src/flusher.cc

          People

          • Assignee:
            chiyoung Chiyoung Seo
            Reporter:
            tommie Tommie McAfee
