Couchbase Server / MB-6649

beam.smp memory usage grows to 2 GB when the XDCR feature is enabled and rebalancing is in progress

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0-beta-2
    • Fix Version/s: 2.0-beta-2
    • Component/s: XDCR
    • Security Level: Public
    • Environment:
      2.0.0-1721-rel
      Centos 4G RAM 64-bit machines
      1024 vbuckets

      Description

      • Created default buckets on a 2:2 cluster
        [10.1.3.235, 10.1.3.236] : [10.1.3.237, 10.1.3.238]
      • Set up bidirectional replication for the bucket and ran load on both buckets.
      • Swap rebalanced a node on both clusters
        [10.1.3.235, 10.3.2.54] : [10.1.3.237, 10.3.2.55]
      • Upon completion of rebalance, stopped load on default buckets.
      • Created standard buckets on both the clusters.
      • Set up unidirectional replication for the standard bucket from cluster 1 to cluster 2, ran load on cluster 1.
      • Stopped load after a point.
      • With replication still in progress, rebalanced the removed nodes back into each cluster (to make it 3:3)
        [10.1.3.235, 10.3.2.54, 10.1.3.236] : [10.1.3.237, 10.3.2.55, 10.1.3.238]
      • During rebalance, no load was running on either cluster; however, replication was still in progress.
      • Heavy swapping on the orchestrators of both clusters
      • Erlang (beam.smp) using a lot of memory (> 2.5 GB)
      • Rebalance gradually completed on cluster 2.
      • Rebalance fails on cluster 1:
      • - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
        Rebalance exited with reason {{badmatch,{error,timeout}},
                                     {gen_server,call,
                                      [{'ns_memcached-bucket','ns_1@10.1.3.235'},
                                       {get_vbucket,835},
                                       60000]}}
      • - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      • If rebalance is retried, it fails again:
      • - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
        Rebalance exited with reason {not_all_nodes_are_ready_yet,['ns_1@10.1.3.235']}
      • - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      • This is probably because 10.1.3.235 is in the "PEND" state on the UI.
      • Uploaded the grabbed diags to S3:

      https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.235-8091-diag.txt.gz
      https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.236-8091-diag.txt.gz
      https://s3.amazonaws.com/bugdb/MB-6649/10.3.2.54-8091-diag.txt.gz
      https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.237-8091-diag.txt.gz
      https://s3.amazonaws.com/bugdb/MB-6649/10.1.3.238-8091-diag.txt.gz
      https://s3.amazonaws.com/bugdb/MB-6649/10.3.2.55-8091-diag.txt.gz
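      For context on the first failure: the rebalancer issues a synchronous gen_server call to the per-bucket ns_memcached process with a 60,000 ms timeout, and when the node is swapping heavily the call does not return in time, so the caller's pattern match on a success value crashes with the badmatch seen above. A minimal sketch of the call shape (module and function names here are illustrative, not the actual ns_server code):

      ```erlang
      %% Illustrative sketch only -- the module and function names are
      %% hypothetical, not the real ns_server code. It mirrors the call
      %% shape in the error trace above.
      -module(vbucket_probe).
      -export([get_vbucket_state/2]).

      get_vbucket_state(Node, VBucket) ->
          %% Synchronous call with a 60,000 ms timeout. If the remote
          %% 'ns_memcached-bucket' process is starved (e.g. the node is
          %% swapping heavily), the call cannot complete in time; a caller
          %% that pattern-matches on a success tuple then crashes with a
          %% {{badmatch,{error,timeout}}, ...} reason like the one reported.
          gen_server:call({'ns_memcached-bucket', Node},
                          {get_vbucket, VBucket},
                          60000).
      ```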


        Activity

        abhinav Abhinav Dangeti created issue -
        abhinav Abhinav Dangeti made changes -
        Field Original Value New Value
        Description [edited; full text duplicates the description above]
        abhinav Abhinav Dangeti made changes -
        Description [edited; full text duplicates the description above]
        abhinav Abhinav Dangeti made changes -
        Attachment Screen Shot 2012-09-13 at 2.21.56 PM.png [ 15014 ]
        Attachment Screen Shot 2012-09-13 at 2.22.14 PM.png [ 15015 ]
        Description [edited; full text duplicates the description above]
        abhinav Abhinav Dangeti made changes -
        Description [edited; full text duplicates the description above]
        junyi Junyi Xie (Inactive) made changes -
        Priority Blocker [ 1 ] Critical [ 2 ]
        farshid Farshid Ghods (Inactive) made changes -
        Summary Heavy usage of memory in XDCR: beam.smp using up a great deal of memory (over 2G) -> beam.smp memory usage grows to 2 GB when xdcr feature is enabled and rebalancing is in progress
        farshid Farshid Ghods (Inactive) made changes -
        Labels 2.0-beta-release-notes
        junyi Junyi Xie (Inactive) made changes -
        Assignee Junyi Xie [ junyi ] Abhinav Dangeti [ abhinav ]
        abhinav Abhinav Dangeti made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Cannot Reproduce [ 5 ]
        farshid Farshid Ghods (Inactive) made changes -
        Sprint Status Current Sprint
        farshid Farshid Ghods (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            abhinav Abhinav Dangeti
            Reporter:
            abhinav Abhinav Dangeti
          • Votes: 0
            Watchers: 2

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes