Couchbase Server / MB-6550

[longevity] Rebalance hangs after failover and node removal because of a memory leak on a couple of nodes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta
    • Fix Version/s: 2.0-beta
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
    • Environment:
      centos 6.2 64bit

      Description

      Cluster information:

      • 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 10 GB RAM and 150 GB disk.
      • 8 GB RAM for Couchbase Server on each node (80% of total system memory)
      • Disk format is ext3 on both data and root
      • Each server has its own drive; no disk is shared with other servers.
      • Load 9 million items to both buckets
      • Cluster has 2 buckets, default (3GB) and saslbucket (3GB)
      • Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11)
      • Add one more doc d2 with 2 views to default bucket
      • Start the cluster with 10 nodes running Couchbase Server 2.0.0-1663
        10.3.121.13
        10.3.121.14
        10.3.121.15
        10.3.121.16
        10.3.121.17
        10.3.121.20
        10.3.121.22
        10.3.121.24
        10.3.121.25
        10.3.121.23
      • Data path /data
      • View path /data
      • In the last run, I did a swap rebalance, removing node 13 and adding node 26.
      • Node 26 then failed due to a physical failure. I failed over node 26 and rebalanced.
      • Rebalance failed with known issue MB-6497 at the end of rebalancing saslbucket.
      • Node 22 went down because it ran out of disk space. Failed over node 22.
      • Removed node 13. Started rebalance at 19:26:35 - Wed Sep 5, 2012

      Bucket "default" rebalance does not seem to be swap rebalance ns_vbucket_mover000 ns_1@10.3.121.14 19:26:35 - Wed Sep 5, 2012

      Rebalance has been hanging since then, up to now: Thu Sep 6 19:25:29 PDT 2012

      CPU and beam stats

      10.3.121.15
      Vm: 2796m Rm: 613m CPU: 13.7 beam.smp
      Vm: 6091m Rm: 4.2g CPU: 9.8 memcached
      10.3.121.13
      Vm: 1845m Rm: 338m CPU: 9.9 beam.smp
      Vm: 1230m Rm: 1.0g CPU: 2.0 memcached
      10.3.121.23
      Vm: 2443m Rm: 652m CPU: 9.8 beam.smp
      Vm: 4969m Rm: 3.4g CPU: 7.9 memcached
      10.3.121.24
      Vm: 3304m Rm: 907m CPU: 19.4 beam.smp
      Vm: 5440m Rm: 4.0g CPU: 3.9 memcached
      10.3.121.14
      Vm: 3462m Rm: 665m CPU: 30.7 beam.smp
      Vm: 6329m Rm: 4.1g CPU: 5.1 memcached
      10.3.121.16
      Vm: 2702m Rm: 642m CPU: 13.2 beam.smp
      Vm: 4845m Rm: 3.5g CPU: 5.0 memcached
      10.3.121.17
      Vm: 4498m Rm: 1.4g CPU: 91.2 beam.smp
      Vm: 5359m Rm: 3.6g CPU: 1.7 memcached
      10.3.121.20
      Vm: 3793m Rm: 1.0g CPU: 11.7 beam.smp
      Vm: 5356m Rm: 3.7g CPU: 1.7 memcached

      Swap stats in MB
      Total Used Free
      10.3.121.15
      Swap: 5199 1815 3384
      10.3.121.13
      Swap: 5199 10 5189
      10.3.121.22
      Swap: 5199 15 5184
      10.3.121.14
      Swap: 5199 2503 2696
      10.3.121.23
      Swap: 5199 1037 4162
      10.3.121.24
      Swap: 5199 1543 3656
      10.3.121.17
      Swap: 5199 2156 3043
      10.3.121.16
      Swap: 5199 1156 4043
      10.3.121.20
      Swap: 5199 1949 3250
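
To make the swap figures above easier to compare, here is a minimal Python sketch that ranks the nodes by swap used. The numbers are copied from the list above; the names `SWAP_MB`, `swap_used_pct`, and `ranked` are hypothetical and only serve the illustration.

```python
# Swap stats in MB, copied from the report above: node -> (total, used).
SWAP_MB = {
    "10.3.121.15": (5199, 1815),
    "10.3.121.13": (5199, 10),
    "10.3.121.22": (5199, 15),
    "10.3.121.14": (5199, 2503),
    "10.3.121.23": (5199, 1037),
    "10.3.121.24": (5199, 1543),
    "10.3.121.17": (5199, 2156),
    "10.3.121.16": (5199, 1156),
    "10.3.121.20": (5199, 1949),
}


def swap_used_pct(total_mb, used_mb):
    """Return swap used as a percentage of total swap."""
    return 100.0 * used_mb / total_mb


# Nodes ordered from most to least swap used.
ranked = sorted(SWAP_MB, key=lambda node: SWAP_MB[node][1], reverse=True)

if __name__ == "__main__":
    for node in ranked:
        total, used = SWAP_MB[node]
        print(f"{node}: {used} MB used ({swap_used_pct(total, used):.1f}%)")
```

Under these figures, node 14 uses the most swap, while nodes 13 and 22 use almost none.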

      Link to diags of all nodes
      https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/9nodes-1663-reb-hang-20120906.tgz


        Activity

        kzeller kzeller added a comment -

        Beta RN: Fixed rebalance failure. Rebalance had stalled
        after performing failover and removing a node, due to a memory leak on
        cluster nodes.

        farshid Farshid Ghods (Inactive) added a comment -

        Is this a system test blocker? If so, please add the sblocker label.

        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #426 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/426/)
        MB-6550 Free bg-fetched items if the TAP connection is invalid. (Revision 25f4791191a3c3aca670781357b61559191a7f65)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • src/tapconnmap.cc
        chiyoung Chiyoung Seo added a comment - http://review.couchbase.org/#/c/20632/
        chiyoung Chiyoung Seo added a comment - - edited

        The memory usage on 10.3.121.14 and 10.3.121.15 is above 90% of their bucket quota even after most of the active and replica items were ejected. This is why the rebalance got stuck:

        Chiyoung-MacBook:ep-engine chiyoung$ ./management/cbstats 10.3.121.14:11210 raw memory
        ep_kv_size: 2436606624
        ep_max_data_size: 3145728000
        ep_mem_high_wat: 2359296000
        ep_mem_low_wat: 1887436800
        ep_mem_tracker_enabled: true
        ep_oom_errors: 0
        ep_overhead: 221345920
        ep_tmp_oom_errors: 0
        ep_value_size: 2214922031
        mem_used: 2831961568
        tcmalloc_current_thread_cache_bytes: 2281472
        tcmalloc_max_thread_cache_bytes: 4194304
        tcmalloc_unmapped_bytes: 7356416
        total_allocated_bytes: 5440249488
        total_fragmentation_bytes: 919716208
        total_free_bytes: 2457600
        total_heap_bytes: 6362423296

        Chiyoung-MacBook:ep-engine chiyoung$ ./management/cbstats 10.3.121.14:11210 all | grep resident
        ep_num_non_resident: 2427780
        vb_active_num_non_resident: 1005950
        vb_active_perc_mem_resident: 0
        vb_pending_num_non_resident: 0
        vb_pending_perc_mem_resident: 0
        vb_replica_num_non_resident: 1421830
        vb_replica_perc_mem_resident: 0

        It seems to me that there is a serious memory leak on 14 and 15. In particular, ep_value_size (2214922031) indicates that most Blob value instances are not freed even after we ejected them. Those Blob values are referenced in many places (hash table, flusher, TAP replicator, etc.).
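
The arithmetic behind the "above 90% of quota" observation can be checked with a minimal Python sketch. It parses a subset of the `raw memory` stats quoted above and compares `mem_used` with the quota and the high watermark. The helpers `parse_stats` and `memory_report` are hypothetical names for this illustration and are not part of cbstats.

```python
# Subset of the `cbstats 10.3.121.14:11210 raw memory` output quoted above.
RAW_MEMORY = """\
ep_kv_size: 2436606624
ep_max_data_size: 3145728000
ep_mem_high_wat: 2359296000
ep_mem_low_wat: 1887436800
ep_value_size: 2214922031
mem_used: 2831961568
"""


def parse_stats(text):
    """Parse 'name: value' lines into a dict of ints (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def memory_report(stats):
    """Relate mem_used to the bucket quota and the high watermark."""
    quota = stats["ep_max_data_size"]
    used = stats["mem_used"]
    return {
        "used_pct_of_quota": 100.0 * used / quota,
        "above_high_wat": used > stats["ep_mem_high_wat"],
        "value_pct_of_used": 100.0 * stats["ep_value_size"] / used,
    }


if __name__ == "__main__":
    report = memory_report(parse_stats(RAW_MEMORY))
    for key, val in report.items():
        print(key, val)
```

With the figures from node 14, `mem_used` is just over 90% of `ep_max_data_size` and well above `ep_mem_high_wat`, and `ep_value_size` accounts for roughly 78% of `mem_used`, even though the resident percentages are reported as 0.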


          People

          • Assignee:
            chiyoung Chiyoung Seo
            Reporter:
            thuan Thuan Nguyen