Couchbase Java Client / JCBC-1007

N1QL Queries not Load Balancing between Query Nodes

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.4
    • Component/s: N1QL
    • Labels: None
    • Environment: Couchbase Server 4.5.0

    Description

      Problem

      I am seeing N1QL queries that are not load balancing between the available query nodes. This causes excessive CPU usage on the heavily hit node.

      Description

      Using a 4-node 4.5.0 Vagrant cluster, 2 nodes are Data-only and 2 nodes (103 and 104) are Query/Index nodes. When issuing a simple N1QL query in a loop, one node receives more than twice the query load of the other. Sample code and a screenshot are attached.

      I used the beer-sample bucket and deleted the primary index, then created a covering index:

      CREATE INDEX name_idx ON `beer-sample`(name);
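
      The repro loop is roughly of the following shape (a sketch modeled on the attached App.java and the SDK 2.x API; the bootstrap address, query text, and iteration count are placeholders, not the exact attachment contents):

      import com.couchbase.client.java.Bucket;
      import com.couchbase.client.java.Cluster;
      import com.couchbase.client.java.CouchbaseCluster;
      import com.couchbase.client.java.query.N1qlQuery;
      import com.couchbase.client.java.query.N1qlQueryResult;

      public class App {
          public static void main(String[] args) {
              // Placeholder bootstrap node; the SDK discovers the rest of the cluster.
              Cluster cluster = CouchbaseCluster.create("10.142.150.101");
              Bucket bucket = cluster.openBucket("beer-sample");

              // Fire the same covered query repeatedly and watch (e.g. in the
              // per-node UI stats) which query node serves each request.
              for (int i = 0; i < 1000; i++) {
                  N1qlQueryResult result = bucket.query(
                          N1qlQuery.simple("SELECT name FROM `beer-sample` WHERE name IS NOT MISSING"));
                  result.finalSuccess(); // drain the result
              }

              cluster.disconnect();
          }
      }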

      ASK

      Please determine why the query requests are not more evenly distributed.

      Attachments

        1. App.java
          2 kB
        2. Couchbase-Java-Client-2.3.4-distrib-vf1.zip
          15.81 MB
        3. Couchbase-Java-Client-2.3.4-SNAPSHOT-distrib-vf2.zip
          15.27 MB
        4. Example.java
          2 kB
        5. fine_error_log_query_load_balancing.log
          188 kB
        6. LoadBalanceTestSDK2.2.3.jar
          5.88 MB
        7. LoadBalanceTestSDK2.2.4.jar
          6.00 MB
        8. LoadBalanceTestSDK2.3.3.jar
          6.56 MB
        9. QueryTest.png
          112 kB

        Activity

          daschl Michael Nitschinger added a comment - Yeah, indeed this is an issue. Here is what's happening: we have a round-robin counter that wraps at the total number of nodes, but that total also includes nodes where the query service is not enabled. So if you have 4 nodes, 2 data and 2 query (101 and 102 data, 103 and 104 query), this is what you see:

          n3/10.142.150.103
          n3/10.142.150.103
          n3/10.142.150.103
          n4/10.142.150.104
          n3/10.142.150.103
          n3/10.142.150.103
          n3/10.142.150.103
          n4/10.142.150.104
          n3/10.142.150.103
          n3/10.142.150.103
          n3/10.142.150.103
          n4/10.142.150.104
          n3/10.142.150.103
          n3/10.142.150.103
          n3/10.142.150.103
          n4/10.142.150.104
          n3/10.142.150.103
          n3/10.142.150.103
          n3/10.142.150.103
          n4/10.142.150.104
          n3/10.142.150.103
          n3/10.142.150.103
          n3/10.142.150.103
          n4/10.142.150.104
          n3/10.142.150.103
          n3/10.142.150.103
          n3/10.142.150.103
          n4/10.142.150.104
          

          So it selects 103 as the substitute for both 101 and 102, since we correctly recognize that the query service isn't enabled there and then pick the next node, which leads to uneven distribution under MDS (multi-dimensional scaling). I'll come up with a fix for this.
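
          To make the skew concrete, here is a small self-contained simulation of the selection scheme described above (the node names, service flags, and walk-forward rule are assumptions taken from this comment, not the actual core-io code):

          import java.util.Arrays;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          public class RoundRobinSkewDemo {
              public static void main(String[] args) {
                  // 101 and 102 are data-only; 103 and 104 run the query service.
                  List<String> nodes = Arrays.asList("101", "102", "103", "104");
                  boolean[] hasQueryService = {false, false, true, true};

                  Map<String, Integer> hits = new HashMap<>();
                  for (int counter = 0; counter < 1000; counter++) {
                      // Buggy scheme: the round-robin counter wraps at the TOTAL
                      // node count, not the number of query-capable nodes.
                      int offset = counter % nodes.size();
                      // A data-only pick falls through to the next query node,
                      // so 103 absorbs both 101's and 102's share.
                      while (!hasQueryService[offset]) {
                          offset = (offset + 1) % nodes.size();
                      }
                      hits.merge(nodes.get(offset), 1, Integer::sum);
                  }
                  System.out.println(hits); // counts ~{103=750, 104=250}: the observed 3:1 skew
              }
          }

          Taking the modulo over only the query-capable nodes instead (offset = counter % queryNodes.size()) restores an even split, which is presumably the direction of the patchset below.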

          daschl Michael Nitschinger added a comment - Patchset up for review: http://review.couchbase.org/#/c/68148

          daschl Michael Nitschinger added a comment - Uploaded a pre-release of 2.3.4 which includes the load-balancing change and acts as a verification fix (VF) for the issue.

          subhashni Subhashni Balakrishnan (Inactive) added a comment - Jack Harper The fix looks good except for a small issue: when the offset becomes negative after Integer.MAX_VALUE iterations, it will cause an array-index-out-of-bounds error.
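
          To illustrate the overflow pitfall being pointed out, a standalone sketch (the counter field and node count are hypothetical stand-ins for the patch's actual code):

          import java.util.concurrent.atomic.AtomicInteger;

          public class OffsetOverflowDemo {
              public static void main(String[] args) {
                  int nodeCount = 3;
                  // Simulate the counter after ~2^31 queries: getAndIncrement()
                  // has wrapped past Integer.MAX_VALUE into negative values.
                  AtomicInteger counter = new AtomicInteger(Integer.MIN_VALUE);

                  // Buggy: Java's % keeps the sign of the dividend, so the offset
                  // is negative and nodes.get(offset) would throw
                  // IndexOutOfBoundsException.
                  System.out.println(counter.get() % nodeCount);                       // -2

                  // Safe: mask off the sign bit (or use Math.floorMod) so the
                  // index always lands in [0, nodeCount).
                  System.out.println((counter.get() & Integer.MAX_VALUE) % nodeCount); // 0
              }
          }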

          ingenthr Matt Ingenthron added a comment - I looked at the updated VF. Looks good to me. Thanks Subhashni Balakrishnan!

          People

            Assignee: subhashni Subhashni Balakrishnan (Inactive)
            Reporter: jdillon Jeff Dillon (Inactive)
            Votes: 0
            Watchers: 7

