Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-57625

Embed projection into operators

    XMLWordPrintable

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • Morpheus
    • Morpheus
    • analytics
    • 0

    Description

      Currently, we inject PROJECT operators greedily between operators to reduce the size of intermediate results; however, this might still not efficient enough. To illustrate, let's consider the following query:

      SELECT MAX(r.temp)
      FROM ExperDatasetRow e, e.readings r 

      Plan:

      distribute result [$$48] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
      -- DISTRIBUTE_RESULT  |UNPARTITIONED|
        exchange [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
        -- ONE_TO_ONE_EXCHANGE  |UNPARTITIONED|
          project ([$$48]) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
          -- STREAM_PROJECT  |UNPARTITIONED|
            assign [$$48] <- [{\"$1\": $$50}] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
            -- ASSIGN  |UNPARTITIONED|
              aggregate [$$50] <- [agg-global-sql-max($$53)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
              -- AGGREGATE  |UNPARTITIONED|
                exchange [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                -- RANDOM_MERGE_EXCHANGE  |PARTITIONED|
                  aggregate [$$53] <- [agg-local-sql-max($$46)] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                  -- AGGREGATE  |PARTITIONED|
                    project ([$$46]) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                    -- STREAM_PROJECT  |PARTITIONED|
                      assign [$$46] <- [$$r.getField(\"temp\")] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                      -- ASSIGN  |PARTITIONED|
                        project ([$$r]) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                        -- STREAM_PROJECT  |PARTITIONED|
                          unnest $$r <- scan-collection($$51) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                          -- UNNEST  |PARTITIONED|
                            project ([$$51]) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                            -- STREAM_PROJECT  |PARTITIONED|
                              assign [$$51] <- [$$e.getField(\"readings\")] [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                              -- ASSIGN  |PARTITIONED|
                                project ([$$e]) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                                -- STREAM_PROJECT  |PARTITIONED|
                                  exchange [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                    data-scan []<-[$$49, $$e] <- FullDataverse.ExperDatasetRow [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                                    -- DATASOURCE_SCAN  |PARTITIONED|
                                      exchange [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                        empty-tuple-source [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                                        -- EMPTY_TUPLE_SOURCE  |PARTITIONED|

       

      We see that the UNNEST operator is followed by a PROJECT operator:

                        project ([$$r]) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                        -- STREAM_PROJECT  |PARTITIONED|
                          unnest $$r <- scan-collection($$51) [cardinality: 0.0, op-cost: 0.0, total-cost: 0.0]
                          -- UNNEST  |PARTITIONED|

      Let's say the frame size is 32KB (the current default) and the size of the array "readings" is close to the frame size itself (i.e., a little bit less than 32KB). Then, UNNEST would flush output frames rapidly as the frame size can only accommodate the array + a one or two elements of the unnested elements. Now, the array itself is not needed at all as it is projected out by project ([$$r]) and only the unnested elements are needed after UNNEST. This is wasteful and is one of the main reasons that unnesting large arrays is too slow.

       

      If the unnest itself does not output the array (i.e., only produces the unnested elements), those unnested elements will have the full 32KB for themselves – reducing the number of frames produced by the UNNEST operator.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            wail.alkowaileet Wail Alkowaileet
            wail.alkowaileet Wail Alkowaileet
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty