
Details


    Description

      The primary goal is to partition large data read queries so that they make optimal use of Spark workers.

       

      From the field

      1. High-level description of what we need: the possibility of partitioning reads when a SQL query is executed in Spark. The main idea is to run analytics queries in the same way as the Couchbase Analytics Service, but using Spark together with the Couchbase Query/Index services instead (no need to deploy the Couchbase Analytics Service).
      2. Objective of the feature: read large amounts of data through the Spark connector via SQL++. The load must be split across multiple Spark executors, since a single executor does not have enough capacity, and memory backpressure alone is not enough to meet the business requirements. One possible solution is to add a feature to the Couchbase Query service that allows the same query to be executed concurrently from multiple Spark executors. An alternative is to support this in the Spark connector driver by splitting the query into multiple parallel chunks, something like the JDBC parameters mentioned in the description.
      3. Success Criteria: the ability to read large amounts of data in parallel across multiple Spark executors when a SQL++ query is executed.
      4. Assumptions: BBVA nodes are limited to 64 GB of RAM, on both the Couchbase nodes and the Spark executors.
      5. Milestones: support 100M+ documents in the current project, and probably 1000+ in future projects.
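      The JDBC-style alternative in point 2 can be illustrated with a small sketch. Spark's JDBC source splits one query into N range scans using partitionColumn/lowerBound/upperBound/numPartitions; an equivalent approach for SQL++ would rewrite a single query into N predicated queries, one per executor. The helper below is purely illustrative, not part of the Spark connector API, and the numeric partition column is an assumption:

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Generate JDBC-style range predicates that split the interval
    [lower, upper) over `column` into num_partitions chunks.
    Mirrors Spark's JDBC behavior: the first partition is unbounded
    below and the last is unbounded above, so no rows are lost."""
    preds = []
    for i in range(num_partitions):
        lo = lower + (upper - lower) * i // num_partitions
        hi = lower + (upper - lower) * (i + 1) // num_partitions
        if i == 0:
            preds.append(f"{column} < {hi}")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
    return preds


def partitioned_queries(base_query, column, lower, upper, num_partitions):
    """Rewrite one SQL++ query into one range-predicated query per
    partition; each would be submitted from a different executor."""
    return [f"{base_query} WHERE {p}"
            for p in partition_predicates(column, lower, upper, num_partitions)]
```

      For example, `partition_predicates("rank", 0, 100, 4)` yields four disjoint predicates (`rank < 25`, `rank >= 25 AND rank < 50`, `rank >= 50 AND rank < 75`, `rank >= 75`), so four executors can scan the same dataset concurrently without overlap. `rank` and the bounds are hypothetical; in practice the partition column would need an index in the Couchbase Index service for the range scans to be efficient.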


            People

              graham.pople Graham Pople
              priya.rajagopal Priya Rajagopal

