  1. Couchbase Server
  2. MB-34911

[high-bucket] FTS: Rebalance failed after failover in high bucket density test


Details

    • Untriaged
    • Unknown


        Activity


          abhinav Abhinav Dangeti added a comment - 10 seconds

          Here's an update on the change: http://review.couchbase.org/#/c/117921/2

          This change errors out early when a request timeout is seen, rather than sending more requests (which could also time out), in which case we would be updating the moving partitions count incorrectly anyway.

          Note that with this change, rebalance would likely still fail in the scenario above. Instead of StartTopologyChange timing out because ns_server didn't hear back from FTS within 60s, it would show an error saying that the preparation phase of StartTopologyChange failed because of a request timeout from FTS to ns_server. A retry of the rebalance will pick up where we last left off and could succeed iff ns_server responds to FTS requests before they're cancelled due to timeout (10s).

          2019-11-14T03:45:12.175-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-15, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 

          However, a retest is advised even before merging the change, using the latest mad-hatter build that includes http://review.couchbase.org/#/c/117634 and http://review.couchbase.org/#/c/118041/. These address the issue where FTS sends numerous incorrect requests to ns_server, which results in log messages like this one:

          2019-11-14T03:46:35.882-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-1, sourceParams: "{}", err: invalid character '<' looking for beginning of value
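
          The trailing error text in that log line matches what Go's encoding/json reports when the input it is asked to decode begins with '<'. The thread does not say what the response body actually was, so treating it as an HTML or XML error page is an assumption. A minimal sketch (decodeBody is a hypothetical helper, and both payloads are made up for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeBody is a hypothetical helper that tries to parse a response body as JSON.
func decodeBody(body []byte) error {
	var v interface{}
	return json.Unmarshal(body, &v)
}

func main() {
	// A well-formed JSON body decodes without error.
	fmt.Println(decodeBody([]byte(`{"name":"bucket-1"}`)))

	// A body that starts with '<' (for example an HTML error page) is rejected
	// at the first character.
	fmt.Println(decodeBody([]byte(`<html><body>Not found</body></html>`)))
}
```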

          Ping Mahesh Mandhare.


          keshav Keshav Murthy added a comment

          Jyotsna Nayak: can you please run the high bucket density test (with 30 buckets) again to see if this is reproducible? Thanks.

          jyotsna.nayak Jyotsna Nayak added a comment - edited

          Raju Suravarjjala, sorry for the delay. I ran the test two weeks back but got a mismatch error which was not captured in the cbcollect logs or the logs on the UI. I ran the test on 2 buckets for builds 7.0.2 and 7.1.0. The test passed for both builds with no issues.

          I have scheduled the test for 30 buckets (It is in the queue).
          Link : here


          jyotsna.nayak Jyotsna Nayak added a comment

          Closing this issue as we are not seeing this on CC build: 7.0.0-4554.
          Swap rebalance completed. Time taken for swap rebalance: 468.9639 min (~7.81 hrs)
          Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/63/consoleFull

          People

            jyotsna.nayak Jyotsna Nayak
            mahesh.mandhare Mahesh Mandhare (Inactive)