Details

    • Type: Technical task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0.1
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      CentOS 6.2 64-bit, build 2.0.0-1931

      Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers with 4-core CPUs
      • Each server has 32 GB RAM and a 400 GB SSD disk.
      • 24.8 GB RAM allocated to Couchbase Server on each node
      • SSD disk formatted ext4 and mounted on /data
      • Each server has its own SSD drive; no disk sharing with other servers.
      • Created a cluster of 6 nodes with Couchbase Server 2.0.0-1931 installed
      • Link to manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1931-rel.rpm.manifest.xml
      • Cluster has 2 buckets, default and saslbucket (12 GB each with 1 replica), set up with 64 vbuckets.
      • Each bucket has one design doc with 2 views (default: d1, saslbucket: d11)

      10.6.2.37
      10.6.2.38
      10.6.2.44
      10.6.2.45
      10.6.2.42
      10.6.2.43

      • Load 20 million items into each bucket. Each key has a size of 1024 bytes.
      • After loading is done, wait for initial indexing.
      • After initial indexing is done, mutate all items, growing them from 1024 to 1512 bytes.
      • Query all 4 views from the 2 design docs.
      • Add node 44 and rebalance. Passed.
      • Add node 45 and rebalance. Passed.
      • Check that auto-failover is enabled on the cluster.
      • Turn on the firewall on node 40:
        iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
        iptables -A OUTPUT -p tcp -o eth0 --sport 1000:60000 -j REJECT
      • Node 40 was down as expected.
      • Auto failover kicked in after one minute.
      • Disable the firewall on node 40 (a sketch for reverting the rules appears after the description). The cluster saw node 40 come back up.
      • Add node 40 back to the cluster and rebalance. Within a few seconds, rebalance failed with this error:

      [rebalance:error,2012-11-06T0:41:48.498,ns_1@10.6.2.37:<0.4077.2612>:ns_rebalancer:do_wait_buckets_shutdown:204]
      Failed to wait deletion of some buckets on some nodes:
      [{'ns_1@10.6.2.40',
        {'EXIT',{old_buckets_shutdown_wait_failed,["default"]}}}]

      [user:info,2012-11-06T0:41:48.500,ns_1@10.6.2.37:<0.14641.0>:ns_orchestrator:handle_info:319]
      Rebalance exited with reason {buckets_shutdown_wait_failed,
        [{'ns_1@10.6.2.40',
          {'EXIT',{old_buckets_shutdown_wait_failed,["default"]}}}]}

      Link to collect_info from all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201211/8nodes-ci-1931-reb-failed-undelete-old-bucket-20121106-121536.tgz
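
      For reference, a minimal sketch of how the firewall rules above can be reverted and how the auto-failover setting can be checked. The admin credentials below are placeholders, not values from this ticket:

        # remove the REJECT rules by deleting the same rule specs that were added
        iptables -D INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
        iptables -D OUTPUT -p tcp -o eth0 --sport 1000:60000 -j REJECT

        # query the cluster's auto-failover settings over the REST API on port 8091
        curl -u Administrator:password http://10.6.2.37:8091/settings/autoFailover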


        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        It's a reasonably rare race that happened here.

        After the firewall was disabled, the node quickly discovered that it had actually been failed over. When this happens there are two concurrent things racing each other:

        • we send a die! signal to memcached so that it exits quickly
        • and we start bucket deletions

        In this particular case memcached died rather quickly and we quickly started a fresh instance (without any buckets set up yet).

        Then the death of the original memcached caused ns_memcached to die, and be restarted before we started bucket deletion.

        So that restarted ns_memcached actually re-created the bucket, only to be asked a few milliseconds later to delete it.

        There's a known problem in ep-engine that it won't stop a bucket while warmup is happening. And because we restarted memcached and re-created the buckets, that is exactly what happens here.

        It will have to complete warmup, and then we'll be able to complete deletion of the old bucket.

        After that, rebalance will work.

        So probably not a blocker.

        If it is, then I can do something about this, but note that there would still be a small race in ep-engine where, for example, a memcached crash just prior to bucket deletion would cause the same issue. So I believe it's best to ignore this race in ns_server and instead make ep-engine's bucket deletion work during warmup.
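
        A rough way to watch the warmup described above on the re-added node before retrying rebalance (a sketch only, assuming a default install path and the default bucket on port 11210; stat names can vary by build):

        # poll ep-engine warmup stats for the re-created bucket on the affected node
        /opt/couchbase/bin/cbstats localhost:11210 all | grep ep_warmup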

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Nothing changed since 1.8.1.

        We added that 'die!' behavior exactly for the reasons outlined above by Farshid.

        farshid Farshid Ghods (Inactive) added a comment -

        Please update the ticket after reproducing the issue.

        andreibaranouski Andrei Baranouski added a comment - edited

        Reproduced with smaller data and a smaller cluster.

        Steps:

        1. Cluster of 4 nodes, 1 default and 1 sasl bucket with 1500 MB of RAM allocated:
        10.3.121.112, 10.3.121.113, 10.3.121.114, 10.3.121.115
        2. Load ~1.6M items into each bucket.
        3. Rebalance in 10.3.121.116.
        4. Add one ddoc and 2 views in each bucket.
        5. Start updating existing data in each bucket.
        6. Start performing queries on all 4 views from the 2 ddocs.
        7. Rebalance in 10.3.121.117.
        8. Check that auto-failover is enabled on the cluster.
        9. Turn on the firewall on node 10.3.121.113:
        [root@localhost ~]# iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
        [root@localhost ~]# iptables -A OUTPUT -p tcp -o eth0 --sport 1000:60000 -j REJECT
        10. Auto failover kicked in after 30-60 sec.
        11. Disable the firewall on node 10.3.121.113. The cluster saw node 10.3.121.113 come back up (a way to confirm this is sketched at the end of this comment).
        12. Add node 10.3.121.113 back to the cluster and rebalance. Within a few seconds, rebalance failed.

        Rebalance fails with the same error:

        Rebalance exited with reason {buckets_shutdown_wait_failed,
          [{'ns_1@10.3.121.113',
            {'EXIT',{old_buckets_shutdown_wait_failed,["sasl"]}}}]}
        ns_orchestrator002   ns_1@10.3.121.112   16:18:27 - Fri Nov 16, 2012

        Failed to wait deletion of some buckets on some nodes:
        [{'ns_1@10.3.121.113',
          {'EXIT',{old_buckets_shutdown_wait_failed,["sasl"]}}}]
        ns_rebalancer000   ns_1@10.3.121.112   16:18:27 - Fri Nov 16, 2012

        CentOS release 5.7, 4 GB RAM, 4 cores
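
        A sketch of how the node's health can be confirmed from another cluster node before re-adding it (step 11); the admin credentials are placeholders:

        # list node statuses as seen by the cluster; the failed-over node should report "healthy" again
        curl -s -u Administrator:password http://10.3.121.112:8091/pools/default | grep -o '"status":"[a-z]*"'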

        andreibaranouski Andrei Baranouski added a comment -

        Logs:
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.112-11162012-552-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.113-11162012-555-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.114-11162012-557-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.115-11162012-61-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.116-11162012-65-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.117-11162012-68-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.112-8091-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.113-8091-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.114-8091-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.115-8091-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.116-8091-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-7110/c8ca51a1/10.3.121.117-8091-diag.txt.gz
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        AFAIR the question was mostly not whether we can reproduce it at all, but how often. Not sure we have an answer.

        jin Jin Lim (Inactive) added a comment -

        This is a duplicate of MB-7272.

        jin Jin Lim (Inactive) added a comment -

        This is a duplicate of MB-7272, and the fix has been merged (build 1974).


          People

          • Assignee:
            jin Jin Lim (Inactive)
            Reporter:
            thuan Thuan Nguyen
          • Votes: 0
            Watchers: 1

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes