Couchbase Server / MB-30011

Rebalance % during GSI swap rebalance doesn't always increase linearly


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.5.0
    • Fix Version/s: Mad-Hatter
    • Component/s: secondary-index
    • Labels:
    • Environment:
      5.5.0-2873-enterprise

      Description

      Script to Repro

      ./testrunner -i /tmp/upgrade1.ini -p get-cbcollect-info=True -t 2i.gsi_rebalance_2i.SecondaryIndexingRebalanceTests.test_kv_and_gsi_rebalance_with_high_ops,doc-per-day=1000,dataset=array,nodes_init=6,services_init=kv-kv-kv-index-index-n1ql,groups=simple,reset_services=True,skip_cleanup=True,kv_node_out=1,index_node_out=3,num_docs=7000000,skip_cleanup=True
      

      This script performs a KV and GSI swap rebalance under a high ops rate (around 70k ops/sec). What I noticed is that the reported rebalance percentage increases, decreases, and then increases again. This happened a couple of times during the run, and it could be quite confusing to an end customer.

      Logs attached.

      Logs from the automation run that confirm this: index_rebalance_percentage

        Attachments


          Activity

          Deepkaran Salooja added a comment -

          There are 2 instances of the rebalance progress jumping abruptly and then coming back to normal within a few seconds:

          node 172.23.98.11

          2018-06-07T04:06:20.509-07:00 [Info] Rebalancer::computeProgress 0.47175086900620955
          2018-06-07T04:06:23.025-07:00 [Info] Rebalancer::computeProgress 0.47175086900620955
          2018-06-07T04:06:25.926-07:00 [Info] Rebalancer::computeProgress 0.47175086900620955
          2018-06-07T04:06:27.405-07:00 [Info] Rebalancer::computeProgress 0.7230820951762144
          2018-06-07T04:06:28.860-07:00 [Info] Rebalancer::computeProgress 0.7230820951762144
          2018-06-07T04:06:30.478-07:00 [Info] Rebalancer::computeProgress 0.7230820951762144
          2018-06-07T04:06:31.482-07:00 [Info] Rebalancer::computeProgress 0.7273953338548017
          2018-06-07T04:06:32.216-07:00 [Info] Rebalancer::computeProgress 0.7273953338548017
          2018-06-07T04:06:33.310-07:00 [Info] Rebalancer::computeProgress 0.7273953338548017
          2018-06-07T04:06:34.283-07:00 [Info] Rebalancer::computeProgress 0.7273953338548017
          2018-06-07T04:06:35.162-07:00 [Info] Rebalancer::computeProgress 0.7273953338548017
          2018-06-07T04:06:36.354-07:00 [Info] Rebalancer::computeProgress 0.7273953338548017
          2018-06-07T04:06:37.862-07:00 [Info] Rebalancer::computeProgress 0.48928646254942576
          2018-06-07T04:06:39.031-07:00 [Info] Rebalancer::computeProgress 0.48928646254942576
          2018-06-07T04:06:40.066-07:00 [Info] Rebalancer::computeProgress 0.48928646254942576
          

          and

          2018-06-07T04:07:56.749-07:00 [Info] Rebalancer::computeProgress 0.5045701157087751
          2018-06-07T04:07:59.778-07:00 [Info] Rebalancer::computeProgress 0.5045701157087751
          2018-06-07T04:08:01.730-07:00 [Info] Rebalancer::computeProgress 0.5045701157087751
          2018-06-07T04:08:02.768-07:00 [Info] Rebalancer::computeProgress 0.8224282555088576
          2018-06-07T04:08:03.633-07:00 [Info] Rebalancer::computeProgress 0.8224282555088576
          2018-06-07T04:08:05.083-07:00 [Info] Rebalancer::computeProgress 0.8224282555088576
          2018-06-07T04:08:05.976-07:00 [Info] Rebalancer::computeProgress 0.8224282555088576
          2018-06-07T04:08:06.479-07:00 [Info] Rebalancer::computeProgress 0.8224282555088576
          2018-06-07T04:08:07.989-07:00 [Info] Rebalancer::computeProgress 0.8289998710255895
          2018-06-07T04:08:09.219-07:00 [Info] Rebalancer::computeProgress 0.8289998710255895
          2018-06-07T04:08:10.093-07:00 [Info] Rebalancer::computeProgress 0.8289998710255895
          2018-06-07T04:08:10.598-07:00 [Info] Rebalancer::computeProgress 0.8289998710255895
          2018-06-07T04:08:11.464-07:00 [Info] Rebalancer::computeProgress 0.8289998710255895
          2018-06-07T04:08:12.598-07:00 [Info] Rebalancer::computeProgress 0.8289998710255895
          2018-06-07T04:08:13.802-07:00 [Info] Rebalancer::computeProgress 0.5551756583019596
          2018-06-07T04:08:14.700-07:00 [Info] Rebalancer::computeProgress 0.5551756583019596
          2018-06-07T04:08:15.703-07:00 [Info] Rebalancer::computeProgress 0.5551756583019596
          

          Both of these correspond to memcached getting killed due to OOM:

          172.23.106.64

          2018-06-07T04:06:16.833-07:00, ns_log:0:info:message(ns_1@172.23.106.64) - Service 'memcached' exited with status 137. 
          

          172.23.106.76

          2018-06-07T04:07:46.035-07:00, ns_log:0:info:message(ns_1@172.23.106.76) - Service 'memcached' exited with status 137.
          

          The indexer has a retry mechanism and some validations to make sure a partial list of seqnos is not used for the stat calculation, but it looks like those are not sufficient.
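
          For illustration only, here is a minimal sketch of the kind of retry-plus-validation loop being described. The helper names (getKVSeqnos, completeSeqnos) and the overall structure are assumptions made for this example, not the actual indexer code:

          // Illustrative sketch, not the real indexer code: a retry loop that keeps
          // asking KV for current seqnos and only accepts a snapshot that looks complete.
          package main

          import (
              "errors"
              "fmt"
              "time"
          )

          const (
              numVBuckets = 1024
              maxRetries  = 10
          )

          // getKVSeqnos stands in for fetching current seqnos from memcached (port 11210).
          // While memcached is restarting after an OOM kill, this fails with errors like
          // "connection refused", as seen in the warnings below.
          func getKVSeqnos(node string) (map[int]uint64, error) {
              return nil, errors.New("dial tcp " + node + ":11210: connection refused")
          }

          // completeSeqnos accepts a snapshot only if every vbucket reports a plausible
          // seqno. A check like this is what should keep a partial (or all-zero)
          // snapshot out of the progress/stat calculation.
          func completeSeqnos(seqnos map[int]uint64) bool {
              if len(seqnos) != numVBuckets {
                  return false
              }
              for _, s := range seqnos {
                  if s == 0 { // a zero seqno right after a reconnect is suspicious mid-rebalance
                      return false
                  }
              }
              return true
          }

          func currentKVTsWithRetry(node string) (map[int]uint64, error) {
              var lastErr error
              for attempt := 1; attempt <= maxRetries; attempt++ {
                  seqnos, err := getKVSeqnos(node)
                  if err == nil && completeSeqnos(seqnos) {
                      return seqnos, nil
                  }
                  if err == nil {
                      err = errors.New("incomplete seqno snapshot")
                  }
                  lastErr = err
                  fmt.Printf("getCurrentKVTs error=%v Retrying (%d)\n", err, attempt)
                  time.Sleep(200 * time.Millisecond)
              }
              return nil, lastErr
          }

          func main() {
              if _, err := currentKVTsWithRetry("172.23.106.76"); err != nil {
                  fmt.Println("giving up:", err)
              }
          }

          The warnings below show these retries firing while memcached on 172.23.106.76 was restarting after the OOM kill: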

          2018-06-07T04:07:45.016-07:00 [Warn] Indexer::getCurrentKVTs error=read tcp 172.23.98.11:37688->172.23.106.76:11210: read: connection reset by peer Retrying (1)
          2018-06-07T04:07:45.860-07:00 [Warn] Indexer::getCurrentKVTs error=dial tcp 172.23.106.76:11210: getsockopt: connection refused Retrying (2)
          2018-06-07T04:07:47.261-07:00 [Warn] Indexer::getCurrentKVTs error=dial tcp 172.23.106.76:11210: getsockopt: connection refused Retrying (3)
          2018-06-07T04:07:48.871-07:00 [Warn] Indexer::getCurrentKVTs error=dial tcp 172.23.106.76:11210: getsockopt: connection refused Retrying (4)
          2018-06-07T04:07:52.212-07:00 [Warn] Indexer::getCurrentKVTs error=dial tcp 172.23.106.76:11210: getsockopt: connection refused Retrying (5)
          2018-06-07T04:07:52.756-07:00 [Warn] Indexer::getCurrentKVTs error=dial tcp 172.23.106.76:11210: getsockopt: connection refused Retrying (6)
          2018-06-07T04:07:53.522-07:00 [Warn] Indexer::getCurrentKVTs error=dial tcp 172.23.106.76:11210: getsockopt: connection refused Retrying (7)
          2018-06-07T04:07:55.072-07:00 [Warn] Indexer::getCurrentKVTs error=MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg:  Retrying (8)
          2018-06-07T04:07:55.425-07:00 [Warn] Indexer::getCurrentKVTs error=MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg:  Retrying (9)
          2018-06-07T04:07:56.351-07:00 [Warn] Indexer::getCurrentKVTs error=MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg:  Retrying (10)
          2018-06-07T04:07:56.718-07:00 [Error] Indexer::getCurrentKVTs Error Connecting to KV Cluster MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg: 
          2018-06-07T04:07:59.675-07:00 [Warn] Indexer::getCurrentKVTs error=MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg:  Retrying (1)
          2018-06-07T04:08:00.603-07:00 [Warn] Indexer::getCurrentKVTs error=MCResponse status=KEY_ENOENT, opcode=0x89, opaque=0, msg:  Retrying (2)
          

          For the duration of the retry, the progress computation is correct. But once the retry succeeds, the progress gets computed incorrectly.

          2018-06-07T04:07:56.749-07:00 [Info] Rebalancer::computeProgress 0.5045701157087751
          2018-06-07T04:07:59.778-07:00 [Info] Rebalancer::computeProgress 0.5045701157087751
          2018-06-07T04:08:01.730-07:00 [Info] Rebalancer::computeProgress 0.5045701157087751
          2018-06-07T04:08:02.768-07:00 [Info] Rebalancer::computeProgress 0.8224282555088576
          2018-06-07T04:08:03.633-07:00 [Info] Rebalancer::computeProgress 0.8224282555088576
          2018-06-07T04:08:05.083-07:00 [Info] Rebalancer::computeProgress 0.8224282555088576
          

          There may be a race condition between when the connection becomes successful and when we actually have the latest seqnos: somehow we end up using seqno 0 for the set of vbuckets on the node where memcached died, which causes the progress to go up. Once we get the latest seqnos, the progress is back to normal.
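
          To make the suspected failure mode concrete, here is a toy model of the progress arithmetic. The processed / (processed + pending) shape and the per-vbucket pending calculation are assumptions inferred from the log values, not the actual Rebalancer::computeProgress implementation. If the current KV seqnos for the vbuckets on the restarted node momentarily come back as 0, those vbuckets stop contributing to pending and the fraction jumps up:

          // Toy model only, not the real computeProgress: progress is assumed to be
          // processed / (processed + pending), with pending derived from per-vbucket
          // current KV seqnos.
          package main

          import "fmt"

          func computeProgress(indexedSeqnos, kvSeqnos []uint64) float64 {
              var processed, pending uint64
              for vb := range kvSeqnos {
                  processed += indexedSeqnos[vb]
                  if kvSeqnos[vb] > indexedSeqnos[vb] {
                      pending += kvSeqnos[vb] - indexedSeqnos[vb]
                  }
                  // A vbucket whose KV seqno is reported as 0 (stale snapshot right
                  // after the memcached restart) contributes nothing to pending.
              }
              if processed+pending == 0 {
                  return 0
              }
              return float64(processed) / float64(processed+pending)
          }

          func main() {
              // Two vbuckets, each with 1000 of 2000 mutations indexed.
              indexed := []uint64{1000, 1000}
              healthy := []uint64{2000, 2000} // correct KV seqnos
              stale := []uint64{2000, 0}      // vb 1 reported as 0 after the OOM restart

              fmt.Printf("correct seqnos: %.2f\n", computeProgress(indexed, healthy)) // 0.50
              fmt.Printf("zero seqnos:    %.2f\n", computeProgress(indexed, stale))   // 0.67, progress jumps up
          }

          Once genuine seqnos are fetched again, pending grows back and the reported progress drops to its real value, which would match the jump from 0.83 back to 0.56 seen in the logs above.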

          Deepkaran Salooja added a comment - edited

          I think it should be okay to address this in the next release. The seqno cache is a sensitive code path, and this bug doesn't seem severe enough to warrant a change at this point in the release.

          Definitely a good bug catch.


            People

            • Assignee: Deepkaran Salooja
            • Reporter: Balakumaran Gopal
            • Votes: 0
            • Watchers: 3


                Gerrit Reviews

                There are no open Gerrit changes
