Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-27028

[ANSI JOIN][RQG] Some query results are showing as off by 1

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.0
    • Fix Version/s: 5.5.0
    • Component/s: query
    • Labels:
    • Environment:
      5.1.0-1461
      VM - 172.23.105.209
    • Triage:
      Untriaged
    • Is this a Regression?:
      Yes

      Description

      So this bug is tough, because when the queries that are marked as failed are rerun inside the exact environment in which they failed, they pass. Im thinking this is an issue that occurs when multiple join queries are run at the same time. Basically the query results are off by 1 for the n1ql vs the sql when the failure is observed. I have left an open environment at this VM 172.23.105.209, but running this command should reproduce the error consistently:

      ./testrunner -i /root/tuqvm.ini -p gsi_type=plasma -t rqg.test_rqg.RQGTests.test_rqg_concurrent,test_file_path=b/resources/rqg/multiple_table_db/query_test_using_templates/queries_joins_50000.txt.zip,database=multiple_table_db,reset_database=True,concurreny_count=10,index_quota_percent=30,record_failure=False,password=password,use_mysql=True,replicas=0,total_queries=200,failure_record_path=/tmp,skip_cleanup=false,subquery=False,ansi_joins=True,create_secondary_meta_indexes=True,create_secondary_indexes=True

      Here is an example of a failing query (that passes when run by itself on the same environment):

      SELECT t_2.int_field1 , t_2.decimal_field1 , t_1.primary_key_id , t_1.varchar_field1 , t_1.char_field1 , t_4.bool_field1 FROM multiple_table_db_1815_simple_table_10 t_4 LEFT JOIN multiple_table_db_1815_simple_table_3 t_5 ON ( t_4.primary_key_id = t_5.primary_key_id ) LEFT JOIN multiple_table_db_1815_simple_table_2 t_1 ON ( t_5.primary_key_id = t_1.primary_key_id ) INNER JOIN multiple_table_db_1815_simple_table_1 t_3 ON ( t_1.primary_key_id = t_3.primary_key_id ) LEFT JOIN multiple_table_db_1815_simple_table_4 t_2 ON ( t_3.primary_key_id = t_2.primary_key_id ) INNER JOIN multiple_table_db_1815_simple_table_4 t_2BodvZKCwzI ON ( t_1.primary_key_id = t_2BodvZKCwzI.primary_key_id ) INNER JOIN multiple_table_db_1815_simple_table_4 t_2OVdKJPpnph ON ( t_2.primary_key_id = t_2OVdKJPpnph.primary_key_id ) LEFT JOIN multiple_table_db_1815_simple_table_4 t_2cJFiCeQfRo ON ( t_5.primary_key_id = t_2cJFiCeQfRo.primary_key_id ) WHERE ((NOT (t_5.bool_field1) OR t_2.decimal_field1 > 4942)) OR (NOT (t_2.bool_field1))

      The jenkins job claims that this query returns 872 results instead of 873 results, but when run through the UI it returns 873 results. I will attach the logs, but if you could see if this is a bug that would be great.

      I don't think its an rqg issue because this issue is only seen with ANSI JOINS runs, the normal JOINS runs do not have this issue, and the current ansi job is just the normal job with ON instead of ON KEYS.

       

        Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

          Hide
          bingjie.miao Bingjie Miao added a comment -

          Ajay Bhullar has the RQG changes been merged in? How do I run the ANSI JOIN cases in order to observe this issue?

          Show
          bingjie.miao Bingjie Miao added a comment - Ajay Bhullar has the RQG changes been merged in? How do I run the ANSI JOIN cases in order to observe this issue?
          Hide
          ajay.bhullar Ajay Bhullar added a comment -

          Bingjie Miao I am currently debugging an issue that makes running the most RQG tests useless, some bad check in. I will comment here once the change is isolated and reverted

          Show
          ajay.bhullar Ajay Bhullar added a comment - Bingjie Miao I am currently debugging an issue that makes running the most RQG tests useless, some bad check in. I will comment here once the change is isolated and reverted
          Hide
          keshav Keshav Murthy added a comment -

          I loaded additional "route" documents to travel sample and ran the tests.

          I see a difference in result count between NL and HJ.

          Execute this few times to get the number of routing documents to be high (~400k). insert into `travel-sample` (key UUID(), value v) select v from `travel-sample` v where v.type = 'route';

          I haven't tested to see at what point it triggers the wrong result.

           

          About 400K documents for route and 5K documents for airline.
          NL: 140 seconds, HJ: 5 seconds ☺
          ----------
          select count(1)
          from `travel-sample` r inner join `travel-sample` a 
                   on (r.airlineid = meta(a).id)
          where r.type = 'route' ;
           
          select count(1)
          from `travel-sample` r inner join `travel-sample` a USE HASH (build)
                   on (r.airlineid = meta(a).id)
          where r.type = 'route' ;
           
           
          bq>    >    >    > {
              "requestID": "53158faf-5661-4338-b479-8f979ad87efa",
              "signature": {
                  "$1": "number"
              },
              "results": [
                  {
                      "$1": 263822
                  }
              ],
              "status": "success",
              "metrics": {
                  "elapsedTime": "2m19.73173472s",
                  "executionTime": "2m19.731708359s",
                  "resultCount": 1,
                  "resultSize": 36
              }
          }
          cbq> cbq>    >    >    > {
              "requestID": "6ced5b99-afc3-461b-a9b1-e58d63e0fd6c",
              "signature": {
                  "$1": "number"
              },
              "results": [
                  {
                      "$1": 263823
                  }
              ],
          

           

              "status": "success",

              "metrics": {

                  "elapsedTime": "5.155698777s",

                  "executionTime": "5.155677752s",

                  "resultCount": 1,

                  "resultSize": 36

              }

          }

           

           

          Show
          keshav Keshav Murthy added a comment - I loaded additional "route" documents to travel sample and ran the tests. I see a difference in result count between NL and HJ. Execute this few times to get the number of routing documents to be high (~400k). insert into `travel-sample` (key UUID(), value v) select v from `travel-sample` v where v.type = 'route'; I haven't tested to see at what point it triggers the wrong result.   About 400K documents for route and 5K documents for airline. NL: 140 seconds, HJ: 5 seconds ☺ ---------- select count( 1 ) from `travel-sample` r inner join `travel-sample` a           on (r.airlineid = meta(a).id) where r.type = 'route' ;   select count( 1 ) from `travel-sample` r inner join `travel-sample` a USE HASH (build)          on (r.airlineid = meta(a).id) where r.type = 'route' ;     bq>    >    >    > {     "requestID" : "53158faf-5661-4338-b479-8f979ad87efa" ,     "signature" : {         "$1" : "number"     },     "results" : [         {             "$1" : 263822         }     ],     "status" : "success" ,     "metrics" : {         "elapsedTime" : "2m19.73173472s" ,         "executionTime" : "2m19.731708359s" ,         "resultCount" : 1 ,         "resultSize" : 36     } } cbq> cbq>    >    >    > {     "requestID" : "6ced5b99-afc3-461b-a9b1-e58d63e0fd6c" ,     "signature" : {         "$1" : "number"     },     "results" : [         {             "$1" : 263823         }     ],       "status": "success",     "metrics": {         "elapsedTime": "5.155698777s",         "executionTime": "5.155677752s",         "resultCount": 1,         "resultSize": 36     } }    
          Hide
          build-team Couchbase Build Team added a comment -

          Build couchbase-server-5.5.0-1913 contains query commit e1dcfd59cc4095c4e525e5ae5c17a5aeace5a3a5 with commit message:
          MB-27028 ANSI JOIN query may miss rows in rare cases
          https://github.com/couchbase/query/commit/e1dcfd59cc4095c4e525e5ae5c17a5aeace5a3a5

          Show
          build-team Couchbase Build Team added a comment - Build couchbase-server-5.5.0-1913 contains query commit e1dcfd59cc4095c4e525e5ae5c17a5aeace5a3a5 with commit message: MB-27028 ANSI JOIN query may miss rows in rare cases https://github.com/couchbase/query/commit/e1dcfd59cc4095c4e525e5ae5c17a5aeace5a3a5
          Hide
          ajay.bhullar Ajay Bhullar added a comment -

          5.5.0-2520 is showing green results for the job that was reporting this off by one error. 

          Show
          ajay.bhullar Ajay Bhullar added a comment - 5.5.0-2520 is showing green results for the job that was reporting this off by one error. 

            People

            • Assignee:
              ajay.bhullar Ajay Bhullar
              Reporter:
              ajay.bhullar Ajay Bhullar
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Gerrit Reviews

                There are no open Gerrit changes

                  PagerDuty

                  Error rendering 'com.pagerduty.jira-server-plugin:PagerDuty'. Please contact your Jira administrators.