Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-27028

[ANSI JOIN][RQG] Some query results are showing as off by 1

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 5.5.0
    • 5.5.0
    • query
    • 5.1.0-1461
      VM - 172.23.105.209
    • Untriaged
    • Yes

    Description

      So this bug is tough, because when the queries that are marked as failed are rerun inside the exact environment in which they failed, they pass. Im thinking this is an issue that occurs when multiple join queries are run at the same time. Basically the query results are off by 1 for the n1ql vs the sql when the failure is observed. I have left an open environment at this VM 172.23.105.209, but running this command should reproduce the error consistently:

      ./testrunner -i /root/tuqvm.ini -p gsi_type=plasma -t rqg.test_rqg.RQGTests.test_rqg_concurrent,test_file_path=b/resources/rqg/multiple_table_db/query_test_using_templates/queries_joins_50000.txt.zip,database=multiple_table_db,reset_database=True,concurreny_count=10,index_quota_percent=30,record_failure=False,password=password,use_mysql=True,replicas=0,total_queries=200,failure_record_path=/tmp,skip_cleanup=false,subquery=False,ansi_joins=True,create_secondary_meta_indexes=True,create_secondary_indexes=True

      Here is an example of a failing query (that passes when run by itself on the same environment):

      SELECT t_2.int_field1 , t_2.decimal_field1 , t_1.primary_key_id , t_1.varchar_field1 , t_1.char_field1 , t_4.bool_field1 FROM multiple_table_db_1815_simple_table_10 t_4 LEFT JOIN multiple_table_db_1815_simple_table_3 t_5 ON ( t_4.primary_key_id = t_5.primary_key_id ) LEFT JOIN multiple_table_db_1815_simple_table_2 t_1 ON ( t_5.primary_key_id = t_1.primary_key_id ) INNER JOIN multiple_table_db_1815_simple_table_1 t_3 ON ( t_1.primary_key_id = t_3.primary_key_id ) LEFT JOIN multiple_table_db_1815_simple_table_4 t_2 ON ( t_3.primary_key_id = t_2.primary_key_id ) INNER JOIN multiple_table_db_1815_simple_table_4 t_2BodvZKCwzI ON ( t_1.primary_key_id = t_2BodvZKCwzI.primary_key_id ) INNER JOIN multiple_table_db_1815_simple_table_4 t_2OVdKJPpnph ON ( t_2.primary_key_id = t_2OVdKJPpnph.primary_key_id ) LEFT JOIN multiple_table_db_1815_simple_table_4 t_2cJFiCeQfRo ON ( t_5.primary_key_id = t_2cJFiCeQfRo.primary_key_id ) WHERE ((NOT (t_5.bool_field1) OR t_2.decimal_field1 > 4942)) OR (NOT (t_2.bool_field1))

      The jenkins job claims that this query returns 872 results instead of 873 results, but when run through the UI it returns 873 results. I will attach the logs, but if you could see if this is a bug that would be great.

      I don't think its an rqg issue because this issue is only seen with ANSI JOINS runs, the normal JOINS runs do not have this issue, and the current ansi job is just the normal job with ON instead of ON KEYS.

       

      Attachments

        For Gerrit Dashboard: MB-27028
        # Subject Branch Project Status CR V

        Activity

          bingjie.miao Bingjie Miao added a comment -

          Ajay Bhullar has the RQG changes been merged in? How do I run the ANSI JOIN cases in order to observe this issue?

          bingjie.miao Bingjie Miao added a comment - Ajay Bhullar has the RQG changes been merged in? How do I run the ANSI JOIN cases in order to observe this issue?
          ajay.bhullar Ajay Bhullar added a comment -

          Bingjie Miao I am currently debugging an issue that makes running the most RQG tests useless, some bad check in. I will comment here once the change is isolated and reverted

          ajay.bhullar Ajay Bhullar added a comment - Bingjie Miao I am currently debugging an issue that makes running the most RQG tests useless, some bad check in. I will comment here once the change is isolated and reverted

          I loaded additional "route" documents to travel sample and ran the tests.

          I see a difference in result count between NL and HJ.

          Execute this few times to get the number of routing documents to be high (~400k). insert into `travel-sample` (key UUID(), value v) select v from `travel-sample` v where v.type = 'route';

          I haven't tested to see at what point it triggers the wrong result.

           

          About 400K documents for route and 5K documents for airline.
          NL: 140 seconds, HJ: 5 seconds ☺
          ----------
          select count(1)
          from `travel-sample` r inner join `travel-sample` a 
                   on (r.airlineid = meta(a).id)
          where r.type = 'route' ;
           
          select count(1)
          from `travel-sample` r inner join `travel-sample` a USE HASH (build)
                   on (r.airlineid = meta(a).id)
          where r.type = 'route' ;
           
           
          bq>    >    >    > {
              "requestID": "53158faf-5661-4338-b479-8f979ad87efa",
              "signature": {
                  "$1": "number"
              },
              "results": [
                  {
                      "$1": 263822
                  }
              ],
              "status": "success",
              "metrics": {
                  "elapsedTime": "2m19.73173472s",
                  "executionTime": "2m19.731708359s",
                  "resultCount": 1,
                  "resultSize": 36
              }
          }
          cbq> cbq>    >    >    > {
              "requestID": "6ced5b99-afc3-461b-a9b1-e58d63e0fd6c",
              "signature": {
                  "$1": "number"
              },
              "results": [
                  {
                      "$1": 263823
                  }
              ],
          

           

              "status": "success",

              "metrics": {

                  "elapsedTime": "5.155698777s",

                  "executionTime": "5.155677752s",

                  "resultCount": 1,

                  "resultSize": 36

              }

          }

           

           

          keshav Keshav Murthy added a comment - I loaded additional "route" documents to travel sample and ran the tests. I see a difference in result count between NL and HJ. Execute this few times to get the number of routing documents to be high (~400k). insert into `travel-sample` (key UUID(), value v) select v from `travel-sample` v where v.type = 'route'; I haven't tested to see at what point it triggers the wrong result.   About 400K documents for route and 5K documents for airline. NL: 140 seconds, HJ: 5 seconds ☺ ---------- select count( 1 ) from `travel-sample` r inner join `travel-sample` a           on (r.airlineid = meta(a).id) where r.type = 'route' ;   select count( 1 ) from `travel-sample` r inner join `travel-sample` a USE HASH (build)          on (r.airlineid = meta(a).id) where r.type = 'route' ;     bq>    >    >    > {     "requestID" : "53158faf-5661-4338-b479-8f979ad87efa" ,     "signature" : {         "$1" : "number"     },     "results" : [         {             "$1" : 263822         }     ],     "status" : "success" ,     "metrics" : {         "elapsedTime" : "2m19.73173472s" ,         "executionTime" : "2m19.731708359s" ,         "resultCount" : 1 ,         "resultSize" : 36     } } cbq> cbq>    >    >    > {     "requestID" : "6ced5b99-afc3-461b-a9b1-e58d63e0fd6c" ,     "signature" : {         "$1" : "number"     },     "results" : [         {             "$1" : 263823         }     ],       "status": "success",     "metrics": {         "elapsedTime": "5.155698777s",         "executionTime": "5.155677752s",         "resultCount": 1,         "resultSize": 36     } }    

          Build couchbase-server-5.5.0-1913 contains query commit e1dcfd59cc4095c4e525e5ae5c17a5aeace5a3a5 with commit message:
          MB-27028 ANSI JOIN query may miss rows in rare cases
          https://github.com/couchbase/query/commit/e1dcfd59cc4095c4e525e5ae5c17a5aeace5a3a5

          build-team Couchbase Build Team added a comment - Build couchbase-server-5.5.0-1913 contains query commit e1dcfd59cc4095c4e525e5ae5c17a5aeace5a3a5 with commit message: MB-27028 ANSI JOIN query may miss rows in rare cases https://github.com/couchbase/query/commit/e1dcfd59cc4095c4e525e5ae5c17a5aeace5a3a5
          ajay.bhullar Ajay Bhullar added a comment -

          5.5.0-2520 is showing green results for the job that was reporting this off by one error. 

          ajay.bhullar Ajay Bhullar added a comment - 5.5.0-2520 is showing green results for the job that was reporting this off by one error. 

          People

            ajay.bhullar Ajay Bhullar
            ajay.bhullar Ajay Bhullar
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty