Details

    • Type: Technical task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 2.0
    • Fix Version/s: 3.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Environment:
      CentOS 6.2 64-bit, build 2.0.0-1931

Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 32 GB RAM and a 400 GB SSD disk.
      • 24.8 GB RAM allocated to Couchbase Server on each node
      • SSD formatted as ext4 and mounted on /data
      • Each server has its own SSD drive; no disk is shared with another server.
      • Created a cluster of 6 nodes running Couchbase Server 2.0.0-1931
      • Link to manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1931-rel.rpm.manifest.xml
      • The cluster has 2 buckets, default and saslbucket (12 GB each, 1 replica), each configured with 64 vbuckets.
      • Each bucket has one design doc with 2 views (d1 on default, d11 on saslbucket); see the setup sketch after the node list.

      10.6.2.37
      10.6.2.38
      10.6.2.44
      10.6.2.45
      10.6.2.42
      10.6.2.43
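
      For reference, below is a minimal sketch of how buckets and design docs like the ones above could be created. It assumes the stock couchbase-cli under /opt/couchbase/bin and the view REST API on port 8092; the admin credentials, view names, and map functions are placeholders rather than the ones used in this test, and the 64-vbucket override is configured separately at server start and is not shown here.

        # Create a 12 GB Couchbase bucket with 1 replica.
        /opt/couchbase/bin/couchbase-cli bucket-create -c 10.6.2.37:8091 \
          -u Administrator -p password \
          --bucket=default --bucket-type=couchbase \
          --bucket-ramsize=12288 --bucket-replica=1

        # Publish a design doc (d1) with two views; names and map functions are placeholders.
        curl -X PUT -H 'Content-Type: application/json' \
          http://10.6.2.37:8092/default/_design/d1 \
          -d '{"views":{"v1":{"map":"function(doc, meta){emit(meta.id, null);}"},
                        "v2":{"map":"function(doc, meta){emit(doc.type, null);}"}}}'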

      • Load 20 million items into each bucket; each item is 1024 bytes.
      • After loading completes, wait for the initial indexing.
      • After the initial indexing is done, mutate all items, growing them from 1024 to 1512 bytes.
      • Query all 4 views from the 2 design docs.
      • Add node 44 and rebalance. Passed.
      • Add node 45 and rebalance. Passed.
      • Check that auto-failover is enabled on the cluster.
      • Turn on the firewall on node 40:
        iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
        iptables -A OUTPUT -p tcp -o eth0 --sport 1000:60000 -j REJECT
      • Node 40 went down as expected.
      • Auto-failover kicked in after one minute.
      • Disable the firewall on node 40 (see the sketch after this list). The cluster saw node 40 come back up.
      • Add node 40 back to the cluster and rebalance. Within a few seconds, rebalance failed with the error "Failed to wait deletion of some buckets on some nodes." Filed bug MB-7110.
      • Wait about an hour and a half, then rebalance again. Rebalance failed with the error "wait_checkpoint_persisted_failed".
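
      Disabling the firewall in the step above amounts to deleting the two REJECT rules that were added (or flushing the chains). A minimal sketch, assuming node 40 has no other iptables rules that need to be preserved:

        # Delete the REJECT rules added earlier (same rule specs, -A replaced by -D).
        iptables -D INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
        iptables -D OUTPUT -p tcp -o eth0 --sport 1000:60000 -j REJECT

        # Optionally verify that nothing is left blocking the cluster ports.
        iptables -L -n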

      [ns_server:info,2012-11-06T5:42:13.901,ns_1@10.6.2.37:janitor_agent-default<0.30140.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
      [error_logger:error,2012-11-06T5:42:13.901,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.11943.2727>
      registered_name: []
      exception exit: {unexpected_exit,
      {'EXIT',<0.12020.2727>,
      {{wait_checkpoint_persisted_failed,"default",50,3131,
      [{'ns_1@10.6.2.40',
      {'EXIT',
      {{badmatch,{error,timeout,
      [{mc_client_binary,cmd_binary_vocal_recv,5},
      {mc_client_binary,select_bucket,2},
      {ns_memcached,ensure_bucket,2},
      {ns_memcached,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      {gen_server,call,
      ['ns_memcached-default',
      {wait_for_checkpoint_persistence,37,2959},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.6.2.40'},
      {if_rebalance,<0.32081.2694>,
      {wait_checkpoint_persisted,50,3131}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.32081.2694>,<0.18896.2646>]
      messages: [{'EXIT',<0.32081.2694>,
      {unexpected_exit,
      {'EXIT',<0.20985.2736>,
      {{wait_checkpoint_persisted_failed,"default",37,2959,
      [{'ns_1@10.6.2.40',
      {'EXIT',
      {{badmatch,{error,timeout,
      [{mc_client_binary,cmd_binary_vocal_recv,5},
      {mc_client_binary,select_bucket,2},
      {ns_memcached,ensure_bucket,2},
      {ns_memcached,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      {gen_server,call,
      ['ns_memcached-default',
      {wait_for_checkpoint_persistence,37,2959},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.6.2.40'},
      {if_rebalance,<0.32081.2694>,
      {wait_checkpoint_persisted,37,2959}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}]
      links: [<0.32081.2694>,<0.17284.2744>]
      dictionary: [{cleanup_list,[<0.11946.2727>,<0.12020.2727>]}]
      trap_exit: true
      status: running
      heap_size: 6765
      stack_size: 24
      reductions: 12015
      neighbours:

      [user:info,2012-11-06T5:42:13.903,ns_1@10.6.2.37:<0.14641.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {unexpected_exit,
      {'EXIT',<0.20985.2736>,
      {{wait_checkpoint_persisted_failed,"default",
      37,2959,
      [{'ns_1@10.6.2.40',
      {'EXIT',
      {{badmatch,{error,timeout,
      [{mc_client_binary,cmd_binary_vocal_recv,5},
      {mc_client_binary,select_bucket,2},
      {ns_memcached,ensure_bucket,2},
      {ns_memcached,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      {gen_server,call,
      ['ns_memcached-default',
      {wait_for_checkpoint_persistence,37, 2959},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.6.2.40'},
      {if_rebalance,<0.32081.2694>,
      {wait_checkpoint_persisted,37,2959}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}}

      [error_logger:error,2012-11-06T5:42:13.902,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.32081.2694> terminating
      ** Last message in was {'EXIT',<0.20927.2736>,
      {unexpected_exit,
      {'EXIT',<0.20985.2736>,
      {{wait_checkpoint_persisted_failed,"default",37,
      2959,
      [{'ns_1@10.6.2.40',
      {'EXIT',
      {{badmatch,{error,timeout,
      [{mc_client_binary,cmd_binary_vocal_recv,5},
      {mc_client_binary,select_bucket,2},
      {ns_memcached,ensure_bucket,2},
      {ns_memcached,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      {gen_server,call,
      ['ns_memcached-default',
      {wait_for_checkpoint_persistence,37,2959},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.6.2.40'},
      {if_rebalance,<0.32081.2694>,
      {wait_checkpoint_persisted,37,2959}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover,'-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}

      ** When Server state == {state,"default",<0.32082.2694>,
      {dict,8,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[['ns_1@10.6.2.40'|8]],
      [],
      [['ns_1@10.6.2.42'|3]],
      [['ns_1@10.6.2.43'|3]],

      I will upload the cbcollect_info output later.


        Activity

        Steve Yen added a comment -

        Moved to 2.0.1 per bug-scrub meeting.

        Another new bug will be filed by Farshid to bump the timeouts higher again for 2.0.

        Mike Wiederhold added a comment -

        Xiaoqin,

        This issue is likely the same issue that can cause our unit test for checkpoint persistence to fail periodically. When you have time please take a look at this failing unit test.

        Running [0014/0015]: checkpoint: wait for persistence (couchstore)...tests/ep_testsuite.cc:4595 Test failed: `Expected CHECKPOINT_PERSISTENCE_TIMEOUT was adjusted to be greater than 10 secs' (get_int_stat(hp->h, hp->h1, "ep_chk_persistence_timeout") > 10)
        DIED

        Xiaoqin Ma (Inactive) added a comment -

        For the failing unit test, it is a timing issue. If it doesn't happen often, we don't need to fix it.

        Maria McDuff (Inactive) added a comment -

        Bug scrub: Ketaki, are we still seeing this issue? What's the frequency? Can you please update this bug? Thanks.

        Mike Wiederhold added a comment - edited

        I have filed MB-8002 to track these kinds of issues. We may also need help from QE to reproduce this issue later.


          People

          • Assignee: Ketaki Gangal
          • Reporter: Thuan Nguyen
          • Votes: 0
          • Watchers: 5

          Dates

          • Created:
          • Updated:
          • Resolved:

          Gerrit Reviews

          There are no open Gerrit changes