Details
- Type: New Feature
- Resolution: Fixed
- Priority: Major
Description
The primary goal is to partition large data read queries so that they make optimal use of Spark workers.
From field
- High-level description of what we need: the possibility of partitioning reads when a SQL query is executed in Spark. The main idea is to run analytics queries the same way the Couchbase Analytics service does, but using Spark and the Couchbase Query/Index services instead (no need to deploy the Couchbase Analytics service).
- Objective of the feature: read large amounts of data through the Spark connector via SQL++. The load must be split across multiple Spark executors, since a single executor does not have enough capacity, and memory backpressure alone is not enough to meet the business requirements. One possible solution is to add a feature to the Couchbase Query service that allows the same query to be executed concurrently from multiple Spark executors. An alternative is to support this in the Spark connector driver by splitting the query into multiple parallel chunks, similar to the JDBC partitioning parameters mentioned in the description (see the sketch after this list).
- Success Criteria: the ability to read large amounts of data in parallel across multiple Spark executors when a SQL++ query is executed.
- Assumptions: BBVA nodes are limited to 64 GB of RAM on both the Couchbase nodes and the Spark executors.
- Milestones: support 100M+ documents in the current project, and probably 1000+ in subsequent projects.
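For reference, below is a minimal sketch of the second alternative (chunking in the connector driver), modeled on the partitionColumn/lowerBound/upperBound/numPartitions options that Spark's JDBC source already provides. The `couchbase.query` source name, the `query` option key, and the `t.seq` partitioning field are assumptions for illustration only; the predicate-generation logic is the portable part.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionedSqlppRead {

  // Mirror of the JDBC partitionColumn/lowerBound/upperBound/numPartitions
  // logic: split [lower, upper) into numPartitions non-overlapping predicates.
  def partitionPredicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    val stride = math.max(1L, (upper - lower) / numPartitions)
    (0 until numPartitions).map { i =>
      val lo = lower + i * stride
      if (i == numPartitions - 1) s"$column >= $lo" // last chunk stays open-ended
      else s"$column >= $lo AND $column < ${lo + stride}"
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-sqlpp-read").getOrCreate()

    // Base SQL++ statement; each generated predicate narrows it to one chunk.
    val base = "SELECT t.* FROM `travel-sample` t WHERE t.type = 'route'"

    // ASSUMPTION: "couchbase.query" and the "query" option stand in for
    // whatever mechanism the connector exposes for raw SQL++ statements.
    val chunks: Seq[DataFrame] =
      partitionPredicates("t.seq", 0L, 100000000L, 16).map { pred =>
        spark.read
          .format("couchbase.query")
          .option("query", s"$base AND $pred")
          .load()
      }

    // The union lets the chunks be scanned by different executors in parallel.
    val all = chunks.reduce(_ union _)
    println(all.count())
    spark.stop()
  }
}
```

Each chunk compiles to an independent SQL++ statement, so the Query service sees several smaller scans instead of one large one, and Spark can schedule the resulting partitions on different executors.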
Issue Links
- relates to: SPARKC-193 Pathfind partitioned Spark query reads (Closed)