Details
-
Task
-
Resolution: Unresolved
-
Critical
-
Cheshire-Cat
-
CX Sprint 182, CX Sprint 183, CX Sprint 184, CX Sprint 185, CX Sprint 186, CX Sprint 187, CX Sprint 188, CX Sprint 189, CX Sprint 190, CX Sprint 191, CX Sprint 192, CX Sprint 193, CX Sprint 194, CX Sprint 195, CX Sprint 196, CX Sprint 197, CX Sprint 198, CX Sprint 199, CX Sprint 200, CX Sprint 201, CX Sprint 202, CX Sprint 203, CX Sprint 204
Description
Currently, data is hash-partitioned across all partitions of a dataset.
While this distributes the data evenly, it is not ideal for rebalancing (too much data gets moved when the set of partitions changes) or for reingestion from KV (too much data needs to be re-read).
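To illustrate the rebalancing cost, here is a minimal sketch that assumes records are assigned by a plain "hash modulo partition count" rule. The actual hash function and assignment rule used by the system are not stated in this ticket, so `partition_of`, `moved_fraction`, and the SHA-256 hash are hypothetical stand-ins.

```python
import hashlib


def partition_of(key: str, num_partitions: int) -> int:
    """Assign a key to a partition by hashing it and taking the remainder.

    Assumption: SHA-256 stands in for whatever hash the system really uses.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose partition changes when the count goes old_n -> new_n."""
    moved = sum(1 for k in keys if partition_of(k, old_n) != partition_of(k, new_n))
    return moved / len(keys)


# Hypothetical workload: 10,000 synthetic primary keys.
keys = [f"doc-{i}" for i in range(10_000)]

# Grow the dataset from 4 to 5 partitions and measure how much data relocates.
fraction = moved_fraction(keys, 4, 5)
print(f"{fraction:.0%} of keys change partition when going from 4 to 5 partitions")
```

Under this assumed rule, a key keeps its partition only if its hash has the same remainder modulo 4 and modulo 5, which holds for about one key in five. Roughly 80% of the data therefore moves, although filling a fifth partition evenly requires moving only about 20%.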
We need to investigate which changes are required to distribute the data of a dataset in a way that
- avoids moving or re-reading too much data, and
- allows different distribution strategies for different datasets.