MB-36964 (Couchbase Server)

Intermittent failure in create partitioned index with num_replica

Details

    Description

      Build : 6.5.0-4821

      Sometimes, creating a partitioned index with replicas fails with the following error:

      [
        {
          "code": 5000,
          "msg": "GSI CreateIndex() - cause: Fail to create index due to rebalancing, another concurrent request, network partition, or node failed. The operation may have succeed. If not, please retry the operation at later time.",
          "query": "CREATE INDEX idx3 on `beer-sample`(name,city) partition by hash(name) with {\"num_partition\":12,\"num_replica\":2}"
        }
      ]

      The failure is intermittent, but it can be easily reproduced.
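Because the failure is intermittent, a test harness may want to tell this specific error apart from other failures. The following is a minimal sketch, assuming the error array arrives as a JSON string shaped like the one quoted above; the function name and the marker substring are illustrative choices, not part of any Couchbase API.

```python
import json

# Substring copied from the error text quoted in this ticket (assumed stable).
TRANSIENT_MARKER = "Fail to create index due to rebalancing"


def is_transient_create_index_error(errors_json: str) -> bool:
    """Return True if any entry is the code-5000 CreateIndex failure."""
    errors = json.loads(errors_json)
    return any(
        err.get("code") == 5000 and TRANSIENT_MARKER in err.get("msg", "")
        for err in errors
    )


# Sample input modelled on the error reported in this ticket.
sample = json.dumps([{
    "code": 5000,
    "msg": "GSI CreateIndex() - cause: Fail to create index due to rebalancing, "
           "another concurrent request, network partition, or node failed.",
}])
print(is_transient_create_index_error(sample))
```

Matching on both the code and a message substring is a deliberate choice here: code 5000 alone may cover unrelated failures.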

      Steps:
      1. 4 node cluster with indexing service on all 4 nodes
      2. Install beer sample bucket
      3. Optional step: on node 0, run curl localhost:9102/settings/planner?excludeNode=in -u Administrator:password (the issue can be seen without this step as well)
      4. Create index:
      CREATE INDEX idx3 on `beer-sample`(name,city) partition by hash(name) with {"num_partition":12,"num_replica":2}

      This statement sometimes fails with the error shown above.
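The repro statement and a cautious retry can be sketched as below. The error text says the operation may already have succeeded, so the sketch checks for the index before retrying. `run_statement` and `index_exists` are hypothetical callables supplied by the caller (for example, wrappers around whatever query client the test framework uses); they are not real Couchbase SDK functions.

```python
import time
from typing import Callable


def build_create_index(name: str, num_partition: int, num_replica: int) -> str:
    """Build the partitioned CREATE INDEX statement from the repro steps."""
    return (
        f"CREATE INDEX {name} on `beer-sample`(name,city) "
        f"partition by hash(name) "
        f'with {{"num_partition":{num_partition},"num_replica":{num_replica}}}'
    )


def create_with_retry(
    statement: str,
    run_statement: Callable[[str], bool],  # hypothetical: returns True on success
    index_exists: Callable[[], bool],      # hypothetical: True if index is present
    attempts: int = 3,
    delay_s: float = 0.0,
) -> bool:
    """Run the statement, retrying only when the index is confirmed absent."""
    for _ in range(attempts):
        if run_statement(statement):
            return True
        # The failed request may still have created the index, so check first.
        if index_exists():
            return True
        time.sleep(delay_s)
    return False
```

A caller would pass `build_create_index("idx3", 12, 2)` together with its own executor and existence check; a non-zero `delay_s` gives the cluster time to settle between attempts.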

      We have seen several failures related to this bug in the functional regression runs since 6.5.0-4783. The last known good build was 6.5.0-4744.

      Some planner-related changes went in between 6.5.0-4744 and 6.5.0-4783 and might have caused this regression:
      http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-4744&tob=6.5.0-4783

          Activity

            mihir.kamdar Mihir Kamdar (Inactive) added a comment - Hi Amit Kulkarni, I am running the same set of tests against both builds, 6.5.0-4879 and 6.5.0-4894. I will post my observations once done.

            mihir.kamdar Mihir Kamdar (Inactive) added a comment - Amit Kulkarni, the changes look good. I am not seeing any failures in the partitioned index jobs run on 6.5.0-4879 and 6.5.0-4894.

            amit.kulkarni Amit Kulkarni added a comment - QE testing: Look out for error messages like the following in the query logs.

            Encounter planner error.  Use round robin strategy for planning

            If the message is seen, the planner error mentioned in previous comments might have been reproduced. Please open a separate MB for it.

            If, after this log message, the create index is attempted using round robin and succeeds, then this MB is verified and can be "closed".
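The log check described in this comment can be automated with a simple scan. This is a minimal sketch: the marker string is taken from the message quoted above, while the function name and the way log lines are obtained are assumptions.

```python
from typing import Iterable, List

# Prefix of the message QE is asked to look for (quoted in this ticket).
FALLBACK_MARKER = "Encounter planner error"


def find_planner_fallbacks(lines: Iterable[str]) -> List[str]:
    """Return every log line that reports the round robin fallback."""
    return [line.rstrip("\n") for line in lines if FALLBACK_MARKER in line]


# Sample lines: the first mirrors the message in this ticket, the second is filler.
sample_lines = [
    "2019-12-12T14:56:24.756-08:00 [Error] Encounter planner error. "
    "Use round robin strategy for planning. Error:\n",
    "2019-12-12T14:56:25.000-08:00 [Info] unrelated line\n",
]
for hit in find_planner_fallbacks(sample_lines):
    print(hit)
```

An open file object can be passed directly as `lines`, so the scan works on a collected log without loading it into memory at once.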

            girish.benakappa Girish Benakappa added a comment - After creating multiple partitioned indexes with replicas (around 20), I was able to see this error message:

            2019-12-12T14:56:24.756-08:00 [Error] Encounter planner error. Use round robin strategy for planning. Error:
            MemoryQuota: 536870912
            CpuQuota: 4
            --- Violations for index <idx20 5 (replica 1), beer-sample> (mem 0, cpu 0) at node 172.23.121.67:8091
            Cannot move to 172.23.109.181:8091: ReplicaViolation (free mem 1.67772e+07T, free cpu 4)
            Can move to 172.23.121.66:8091: NoViolation (free mem 1.67772e+07T, free cpu 4)
            --- Violations for index <idx20 6 (replica 1), beer-sample> (mem 0, cpu 0) at node 172.23.121.67:8091
            Cannot move to 172.23.109.181:8091: ReplicaViolation (free mem 1.67772e+07T, free cpu 4)
            Can move to 172.23.121.66:8091: NoViolation (free mem 1.67772e+07T, free cpu 4)
            --- Violations for index <idx20 5 (replica 2), beer-sample> (mem 0, cpu 0) at node 172.23.121.67:8091
            Cannot move to 172.23.109.181:8091: ReplicaViolation (free mem 1.67772e+07T, free cpu 4)
            Can move to 172.23.121.66:8091: NoViolation (free mem 1.67772e+07T, free cpu 4)
            --- Violations for index <idx20 6 (replica 2), beer-sample> (mem 0, cpu 0) at node 172.23.121.67:8091
            Cannot move to 172.23.109.181:8091: ReplicaViolation (free mem 1.67772e+07T, free cpu 4)
            Can move to 172.23.121.66:8091: NoViolation (free mem 1.67772e+07T, free cpu 4)

            After this, we saw that idx20 had been created successfully.

            Logs:

            https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1576193583/collectinfo-2019-12-12T233304-ns_1%40172.23.109.180.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1576193583/collectinfo-2019-12-12T233304-ns_1%40172.23.121.66.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1576193583/collectinfo-2019-12-12T233304-ns_1%40172.23.121.67.zip
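To make a planner error like the one above easier to triage, the violation entries can be pulled into a structured form. This is a rough sketch whose regular expressions are inferred only from the log excerpt in this comment; the real planner output may vary between builds, and the function and field names are illustrative.

```python
import re
from typing import Dict, List

# Patterns inferred from the log excerpt in this ticket (an assumption).
VIOLATION_RE = re.compile(
    r"Violations for index <(?P<index>\S+) (?P<partition>\d+) "
    r"\(replica (?P<replica>\d+)\), (?P<bucket>[^>]+)> .* at node (?P<node>\S+)"
)
MOVE_RE = re.compile(
    r"(?P<verdict>Cannot|Can) move to (?P<target>\S+): (?P<reason>\w+)"
)


def parse_violations(log_text: str) -> List[Dict]:
    """Group each violation header with the candidate moves listed under it."""
    entries: List[Dict] = []
    for line in log_text.splitlines():
        header = VIOLATION_RE.search(line)
        if header:
            entries.append({
                "index": header["index"],
                "partition": int(header["partition"]),
                "replica": int(header["replica"]),
                "bucket": header["bucket"],
                "node": header["node"],
                "moves": [],
            })
            continue
        move = MOVE_RE.search(line)
        if move and entries:
            # Tuple of (target node, whether the move is allowed, planner reason).
            entries[-1]["moves"].append(
                (move["target"], move["verdict"] == "Can", move["reason"])
            )
    return entries
```

Applied to the excerpt above, each of the four entries would list 172.23.109.181:8091 as rejected with ReplicaViolation and 172.23.121.66:8091 as allowed. For reference, the MemoryQuota value of 536870912 bytes is 512 MiB.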

            girish.benakappa Girish Benakappa added a comment - Closing based on above testing done with 6.5.0-4947.

            People

              girish.benakappa Girish Benakappa
              mihir.kamdar Mihir Kamdar (Inactive)