Couchbase Server / MB-51682

[BP to 7.0.4] - [System test upgrade] - Index build stuck during rebalance due to large number of pending items

Details

    • Untriaged
    • 1
    • Unknown

    Description

      In the system upgrade tests, index rebalance is stuck. The cause appears to be a very large number of pending documents.

      For example, the instance idx4_8C6JX is stuck in the Moving state:

       

         "name": "idx4_8C6JX",
         "bucket": "bucket6",
         "scope": "scope_0",
         "collection": "coll_2",
         "secExprs": [
          "`price`",
          "`city`",
          "`name`"
         ],
         "indexType": "plasma",
         "status": "Moving",
         "definition": "CREATE INDEX `idx4_8C6JX` ON `bucket6`.`scope_0`.`coll_2`(`price`,`city`,`name`) PARTITION BY hash((meta().`id`)) WITH { \"nodes\":[ \"172.23.120.77:18091\",\"172.23.123.26:18091\",\"172.23.123.33:18091\",\"172.23.97.105:18091\",\"172.23.97.148:18091\",\"172.23.97.149:18091\" ], \"num_replica\":2, \"num_partition\":5 }",
         "hosts": [
          "172.23.120.77:18091",
          "172.23.123.33:18091",
          "172.23.97.105:18091",
          "172.23.97.148:18091"
         ],
         "completion": 100,
         "progress": 100,
         "scheduled": false,
      

      Taking node 172.23.120.77 as an example: the indexer has actually moved the index instance to the ACTIVE state, but because of the huge number of pending documents, the rebalance of this index is not considered done.

      2022-02-24T01:23:37.682-08:00 [Info] Rebalancer::waitForIndexBuild Index: bucket6:scope_0:coll_2:idx4_8C6JX State: INDEX_STATE_ACTIVE Pending: 2.7228046e+07 EstTime: 79 Partitions: [5] Destination: 127.0.0.1:9102
      2022-02-24T01:23:40.698-08:00 [Info] Rebalancer::waitForIndexBuild Index: bucket6:scope_0:coll_2:idx4_8C6JX State: INDEX_STATE_ACTIVE Pending: 2.7228046e+07 EstTime: 79 Partitions: [5] Destination: 127.0.0.1:9102
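
      The Pending value in these log lines is printed in scientific notation, which makes the size of the backlog easy to misread. A minimal sketch for extracting it as an integer follows; the helper name pendingItems and the regular expression are illustrative assumptions, not part of the indexer code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// pendingRe captures the token that follows "Pending: " in a
// Rebalancer::waitForIndexBuild log line.
var pendingRe = regexp.MustCompile(`Pending: (\S+)`)

// pendingItems is a hypothetical helper: it returns the pending item count
// from a log line, converting scientific notation such as 2.7228046e+07.
func pendingItems(line string) (int64, error) {
	m := pendingRe.FindStringSubmatch(line)
	if m == nil {
		return 0, fmt.Errorf("no Pending field in %q", line)
	}
	f, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, err
	}
	return int64(f), nil
}

func main() {
	line := "2022-02-24T01:23:37.682-08:00 [Info] Rebalancer::waitForIndexBuild " +
		"Index: bucket6:scope_0:coll_2:idx4_8C6JX State: INDEX_STATE_ACTIVE " +
		"Pending: 2.7228046e+07 EstTime: 79 Partitions: [5] Destination: 127.0.0.1:9102"
	n, err := pendingItems(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 27228046
}
```

      Both log lines above yield 27228046 pending items, so the count did not move in the roughly three seconds between them.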
      

      Logs:
      Supportal
      http://supportal.couchbase.com/snapshot/a791810e6f6b73c2457c225d9e24a115::7

      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.106.134.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.106.137.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.106.138.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.120.58.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.120.73.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.120.74.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.120.75.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.120.77.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.120.81.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.120.86.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.121.118.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.121.77.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.123.25.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.123.26.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.123.32.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.123.33.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.96.122.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.96.14.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.96.243.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.96.48.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.97.105.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.97.110.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.97.112.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.97.148.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.97.149.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.97.150.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.97.74.zip
      https://cb-engineering.s3.amazonaws.com/MB-51159/collectinfo-2022-02-24T092200-ns_1%40172.23.96.254.zip

      Attachments

        1. before.pdf
          129 kB
        2. file_after_scan.pdf
          138 kB
        3. new.pdf
          101 kB
        4. projector_mem.pprof
          38 kB
        5. script.go
          0.3 kB

          Activity

            abhinandan.singla Abhinandan Singla added a comment - edited

            Steps to Reproduce:

            1. Start the projector in TLS mode.
            2. Query the projector's “/stats” endpoint 100K times, using the attached script.go.
            3. Capture the memory profile.

            Share of memory attributed to the TLS handshake before the fix: 76.16%
            Share of memory attributed to the TLS handshake after the fix: 10.55%

            before.pdf new.pdf
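
            The contents of script.go are not shown in this ticket, so the following is only a sketch of what such a load loop could look like. The names queryStats and runAgainstStandIn are hypothetical, and a local httptest TLS server stands in for the projector so the example runs without a cluster. Keep-alives are disabled on the assumption that each request should open a fresh TLS connection and perform a new handshake.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// queryStats issues n GET requests to url with HTTP basic auth and returns
// how many of them came back with status 200.
func queryStats(client *http.Client, url, user, pass string, n int) (int, error) {
	succeeded := 0
	for i := 0; i < n; i++ {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			return succeeded, err
		}
		req.SetBasicAuth(user, pass)
		resp, err := client.Do(req)
		if err != nil {
			return succeeded, err
		}
		// Drain and close the body so the connection is released.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			succeeded++
		}
	}
	return succeeded, nil
}

// runAgainstStandIn starts a local TLS server that answers like a trivial
// /stats endpoint (an assumption, not the real projector) and queries it n times.
func runAgainstStandIn(n int) (int, error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "{}")
	}))
	defer srv.Close()

	client := srv.Client() // trusts the test server's certificate
	if tr, ok := client.Transport.(*http.Transport); ok {
		tr.DisableKeepAlives = true // force a new TLS handshake per request
	}
	return queryStats(client, srv.URL+"/stats", "Administrator", "password", n)
}

func main() {
	succeeded, err := runAgainstStandIn(100)
	if err != nil {
		panic(err)
	}
	fmt.Println("successful requests:", succeeded)
}
```

            To target a real projector instead, queryStats could be called with the node's “/stats” URL and a client configured for that cluster's certificates.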

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.0.4-7243 contains indexing commit ba1b023 with commit message:
            MB-51682 [BP to 7.0.4] Prevent connection object leak by deleting closed connections
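
            The ticket does not show the change itself, so the following is not the indexing code. It is a generic sketch of the pattern the commit message names: a registry of connections that must drop an entry once the connection closes, or the closed connection objects stay reachable and their memory is never reclaimed. The type connTracker and the function simulate are hypothetical; the callback has the signature of the standard http.Server ConnState hook.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"sync"
)

// connTracker is a hypothetical registry of live connections.
type connTracker struct {
	mu    sync.Mutex
	conns map[net.Conn]struct{}
}

func newConnTracker() *connTracker {
	return &connTracker{conns: make(map[net.Conn]struct{})}
}

// onState has the signature expected by http.Server.ConnState.
func (t *connTracker) onState(c net.Conn, s http.ConnState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	switch s {
	case http.StateNew:
		t.conns[c] = struct{}{}
	case http.StateClosed, http.StateHijacked:
		// Without this delete, every closed connection would remain
		// referenced by the map and could not be garbage collected.
		delete(t.conns, c)
	}
}

// tracked returns the number of connections currently registered.
func (t *connTracker) tracked() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.conns)
}

// simulate registers n in-memory connections, closes each one, and returns
// how many are still tracked afterwards.
func simulate(n int) int {
	t := newConnTracker()
	for i := 0; i < n; i++ {
		client, server := net.Pipe()
		t.onState(server, http.StateNew)
		client.Close()
		server.Close()
		t.onState(server, http.StateClosed)
	}
	return t.tracked()
}

func main() {
	fmt.Println("tracked after close:", simulate(1000)) // prints 0
}
```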

            hemant.rajput Hemant Rajput added a comment

            I tried the scenario Abhinandan described above, but I could not draw any conclusion from the captured memory profile.

            Steps to validate:

            • Create a 3-node cluster with the kv-index-n1ql services.
            • Enable TLS in strict mode:

              ./couchbase-cli setting-autofailover -c localhost:8091 -u Administrator -p password --enable-auto-failover=0
              ./couchbase-cli node-to-node-encryption -c localhost:8091 -u Administrator -p password --enable
              ./couchbase-cli setting-security -c localhost:8091 -u Administrator -p password --set --cluster-encryption-level all
              ./couchbase-cli setting-autofailover -c localhost:8091 -u Administrator -p password --enable-auto-failover=1 --auto-failover-timeout=120 --max-failovers=1
              ./couchbase-cli setting-security -c https://localhost:18091 -u Administrator -p password --set --cluster-encryption-level strict --no-ssl-verify

            • Load the travel-sample bucket.
            • Enable projector memory profile capture:

              curl --location --request POST 'https://10.112.205.102:19102/settings' \
              --header 'Authorization: Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==' \
              --header 'Content-Type: text/plain' \
              --data-raw '{"projector.memProfDir" : "", "projector.memProfile" : true}'

            • Run the command below to query the projector stats on the KV node:

              for i in {0..100000}; do echo `curl --location --request GET -k 'https://10.112.205.101:9999/stats' --header 'Authorization: Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA=='`;done

            • Generate Graph with the projector_

              go tool pprof --pdf /opt/couchbase/bin/indexer projector_mem.pprof > file.pdf

            I talked to Abhinandan, but the graph generated from my run does not match the one he attached.

            Varun Velamuri, can you confirm whether this test is OK?

            Cbcollect logs, the memory profile, and the graph are attached.

            varun.velamuri Varun Velamuri added a comment

            Hemant Rajput, you should capture the projector memory profile after doing the curl requests. Currently, you are capturing the profile before doing the requests, so it does not capture the overhead of the requests.
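
            The ordering point can be illustrated with Go's standard runtime/pprof package. This is a generic sketch, not the projector's own profile-capture path (which the ticket drives through the projector.memProfile setting); the function name profileAfter and the temp-file naming are assumptions. The workload runs first, and the heap profile is written only afterwards, so the allocations made by the workload are included.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// retained keeps the workload's allocations alive so they show up as
// in-use memory in the heap profile.
var retained [][]byte

// profileAfter runs workload, then writes a heap profile to a temp file and
// returns the file's path and size in bytes.
func profileAfter(workload func()) (string, int64, error) {
	workload() // the requests (or any other load) must happen first

	f, err := os.CreateTemp("", "heap-*.pprof")
	if err != nil {
		return "", 0, err
	}
	defer f.Close()

	// A GC before the dump makes the profile reflect current statistics.
	runtime.GC()
	if err := pprof.WriteHeapProfile(f); err != nil {
		return "", 0, err
	}
	info, err := f.Stat()
	if err != nil {
		return "", 0, err
	}
	return f.Name(), info.Size(), nil
}

func main() {
	path, size, err := profileAfter(func() {
		for i := 0; i < 1000; i++ {
			retained = append(retained, make([]byte, 1024))
		}
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes to %s\n", size, path)
}
```

            The resulting file can be rendered with go tool pprof in the same way as projector_mem.pprof.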

            hemant.rajput Hemant Rajput added a comment

            Validated on 7.0.4-7261.

            Steps to validate:

            1. Create a 3-node kv-index-n1ql cluster and enable TLS in strict mode.
            2. Load the travel-sample bucket.
            3. Fetch the projector stats.
            4. Dump the projector memory profile and check that no memory is allocated to TLS connections. Before the fix, memory was being allocated to TLS connections.

            projector_mem.pprof file_after_scan.pdf

            People

              hemant.rajput Hemant Rajput
              varun.velamuri Varun Velamuri