Couchbase Server / MB-48825

AWS rebalance performance tests hard out of memory error


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 7.1.0
    • Fix Version/s: 7.1.0
    • Component/s: couchbase-bucket
    • Environment: OS: Amazon Linux 2
      ARM instance: m6g.large
      x86 instance: m5.large

      Both machines use 2 vCPU, 8 GB memory, and a 40 GB EBS disk.
    • Triage: Untriaged
    • 1
    • Is this a Regression?: Unknown


        Activity

          sean.corrigan Sean Corrigan added a comment -

          *The same test on a comparable instance runs successfully.
          bryan.mccoid Bryan McCoid added a comment -

          OK, so after digging through the logs and stats, it's a little unclear why this is happening, and I have trouble seeing much of a connection between the two.

          One thing I want to note: it can be a little challenging to match up the logs with the Jenkins test output when all the clocks are apparently set to different random times that are hard to align. Luckily we are looking for more obvious events such as rebalance, so hopefully the time frame I isolated is the correct one; it seems like it must be.

          The m6 (ARM):

          1.) Rebalance begins.

          2.) DCP streams are added/started.

          3.) Add stream for partition 745 on ec2-44-195.

          4.) Add stream (other side) on ec2-3-223.

          5.) (Some time goes by) and then "still waiting for backfill on connection":

          [rebalance:debug,2021-10-07T22:35:41.445Z,ns_1@ec2-44-195-22-82.compute-1.amazonaws.com:<0.1108.3>:dcp_replicator:wait_for_data_move_on_one_node:192]Still waiting for backfill on connection "replication:ns_1@ec2-44-195-22-82.compute-1.amazonaws.com->ns_1@ec2-3-223-6-164.compute-1.amazonaws.com:bucket-1" bucket "bucket-1", partition 745, last estimate {0,0, <<"calculating-item-count">>}
          

          6.) This happens 115 times.

          7.) The response from memcached keeps returning <<"calculating-item-count">> with no estimate (see the sketch after this list).

          8.) I think this is something the memcached team can look at, to determine why we can't estimate the item count.

          NOTE: During this time there are a lot of attempts and CPU usage goes very, very high. There must be a bug somewhere, and it might be worth looking into whether it's an ARM-specific issue. This also happens for partition 399, so we are stuck spinning at this phase and never progress.
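          For anyone less familiar with this phase of rebalance, steps 5-7 describe a poll loop: the replicator asks memcached for a docs estimate for the partition and cannot progress while the status stays <<"calculating-item-count">>. The sketch below is an illustrative Erlang reconstruction of that loop, not the actual ns_server code; the module name, the get_docs_estimate/3 stub, the retry interval and the retry bound are all placeholders.

          %% Illustrative sketch only (not the ns_server implementation) of the
          %% wait loop behind the "Still waiting for backfill" lines above.
          -module(backfill_wait_sketch).
          -export([wait_for_backfill/4]).

          -define(RETRY_INTERVAL_MS, 1000). %% placeholder poll interval

          wait_for_backfill(_Bucket, Partition, _Connection, 0) ->
              %% Give up after a bounded number of polls. In the hung rebalances in
              %% this ticket, partitions 745/399 (m6) and 767 (m5) never left this state.
              {error, {backfill_estimate_not_ready, Partition}};
          wait_for_backfill(Bucket, Partition, Connection, RetriesLeft) ->
              case get_docs_estimate(Bucket, Partition, Connection) of
                  {ok, {_Count, _ChkItems, <<"calculating-item-count">>} = Estimate} ->
                      %% No usable estimate from memcached yet: log, sleep, poll again.
                      %% This is the line that repeats ~115 times in the m6 logs.
                      io:format("Still waiting for backfill on connection ~s, "
                                "partition ~p, last estimate ~p~n",
                                [Connection, Partition, Estimate]),
                      timer:sleep(?RETRY_INTERVAL_MS),
                      wait_for_backfill(Bucket, Partition, Connection, RetriesLeft - 1);
                  {ok, Estimate} ->
                      %% memcached returned a real estimate; the data move can proceed.
                      {ok, Estimate}
              end.

          %% Placeholder for the real stats call to memcached, stubbed out so the
          %% sketch is self-contained. Always answering "calculating-item-count"
          %% reproduces the spin described above until the retry bound is hit.
          get_docs_estimate(_Bucket, _Partition, _Connection) ->
              {ok, {0, 0, <<"calculating-item-count">>}}.

          With the stub as written, backfill_wait_sketch:wait_for_backfill("bucket-1", 745, "replication:nodeA->nodeB:bucket-1", 115) logs the same "Still waiting for backfill" line on every poll and then gives up, which mirrors what the m6 logs show (except that the real rebalance never gave up).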

           

          The m5:

          This one is trickier. In fact, it sort of seems like it completes:

          ns_server:info,2021-10-08T10:18:59.189-04:00,ns_1@ec2-44-198-61-230.compute-1.amazonaws.com:ns_memcached-bucket-1<0.5404.0>:ns_memcached:handle_call:286]Enabling traffic to bucket "bucket-1"
          ns_server:info,2021-10-08T10:18:59.189-04:00,ns_1@ec2-44-198-61-230.compute-1.amazonaws.com:ns_memcached-bucket-1<0.5404.0>:ns_memcached:handle_call:290]Bucket "bucket-1" marked as warmed in 1 seconds 
          

          Honestly, I might just need a little more time to look at it, but it's unclear to me why this appears to "hang". I did notice that this one was killed much more quickly than the other, though it does seem to be taking a bit longer than you would expect. All that said, I didn't see it stuck in the same place as the last one.

          TL;DR: We should have the memcached team look at the m6 one, at the very least. I can look at the m5 one again if need be, but ideally I would like another set of logs that let it run longer.


          sean.corrigan Sean Corrigan added a comment -

          Hi Bryan McCoid, thanks for looking into this. For the m6 issue I will contact the memcached team to investigate.

           

          In terms of the m5, I have run a further test and let it run longer after the rebalance hang. What possibly went wrong with the previous logs is that I saw the rebalance hang for a while, assumed it would be the same story as the m6, and terminated the job. Hopefully these logs provide more information:

          https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-15T114904-ns_1%40ec2-100-26-1-78.compute-1.amazonaws.com.zip
          https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-15T114904-ns_1%40ec2-34-200-246-184.compute-1.amazonaws.com.zip
          https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2021-10-15T114904-ns_1%40ec2-35-170-70-9.compute-1.amazonaws.com.zip

          Also, when collecting these logs I kept an eye on the rebalance status in the Couchbase UI, which was hung as well.

          bryan.mccoid Bryan McCoid added a comment - edited

          After looking at this next set of logs, it becomes clear that it's the exact same issue as the m6: we are failing to get the item count from memcached for the DCP connection:

          [rebalance:debug,2021-10-15T07:41:11.982-04:00,ns_1@ec2-35-170-70-9.compute-1.amazonaws.com:<0.7862.2>:dcp_replicator:wait_for_data_move_on_one_node:192]Still waiting for backfill on connection "replication:ns_1@ec2-35-170-70-9.compute-1.amazonaws.com->ns_1@ec2-34-200-246-184.compute-1.amazonaws.com:bucket-1", bucket "bucket-1", partition 767, last estimate {0, 0, <<"calculating-item-count">>}
          

          So this is the same issue. It happens over and over for the same partition over the course of ~36 minutes, waiting for a response from memcached.
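          To confirm how long a partition sits in this state, one option (a hypothetical helper, not part of the product or the collectinfo tooling) is to scan the ns_server debug log from the collected zip for the repeated wait lines. The escript below counts the "Still waiting for backfill" lines that mention a given partition and prints the first and last of them, assuming the log format shown above; the script name and file paths are just examples.

          #!/usr/bin/env escript
          %% Hypothetical helper: count the "Still waiting for backfill" lines for
          %% one partition in an ns_server debug log and show the first and last
          %% occurrence, so the duration of the stall (e.g. the ~36 minutes above)
          %% can be read off the timestamps.
          main([LogFile, Partition]) ->
              {ok, Bin} = file:read_file(LogFile),
              Lines = string:split(binary_to_list(Bin), "\n", all),
              Needle = "Still waiting for backfill",
              PartTag = "partition " ++ Partition ++ ",",
              Matches = [L || L <- Lines,
                              string:find(L, Needle) =/= nomatch,
                              string:find(L, PartTag) =/= nomatch],
              case Matches of
                  [] ->
                      io:format("no matching lines for partition ~s~n", [Partition]);
                  _ ->
                      io:format("~b matching lines~nfirst: ~s~nlast:  ~s~n",
                                [length(Matches), hd(Matches), lists:last(Matches)])
              end;
          main(_) ->
              io:format("usage: wait_lines.escript <ns_server.debug.log> <partition>~n").

          Running it against the ns_server.debug.log from one of the collectinfo zips (e.g. escript wait_lines.escript ns_server.debug.log 767) should give first and last timestamps that bracket the ~36 minute window mentioned above.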

          Hope that helps!

          Sean Corrigan, you can take it from here and either pass it to the KV team or close it, because I saw that other tickets were already filed for the KV issues.

          bryan.mccoid Bryan McCoid added a comment -

          This is probably a duplicate of an issue already reported by Sean Corrigan, but I wanted to let you decide whether you agree. It should be a duplicate, though.

          Thanks!


          sean.corrigan Sean Corrigan added a comment -

          Bryan McCoid Hi, thanks for looking into this. Yes, both issues are being looked into now, so I will close this ticket.

          owend Daniel Owen added a comment -

          As Sean Corrigan requested, closing as a duplicate of MB-49037.


          ritam.sharma Ritam Sharma added a comment -

          Closing out duplicates.


          People

            Assignee: owend Daniel Owen
            Reporter: sean.corrigan Sean Corrigan
            Votes: 0
            Watchers: 5
