Hello,
I have quite slow sort operations on some reflections, most of the time is spent at this stage.
It’s well multithreaded but a few threads do most of the work. Spill time on these threads is really high.
Eg.
This reflection sources data on an S3 bucket of ~90Go of parquet data (files of ~750Mb, snappy compressed, 1 row group). We use rook ceph block volumes attached to each 3 executors (not distributed storage). Network I/O high >500Mb/sec. Executors memory 60G 8CPU. 15 data string/boolean/float/timestamp/date fields with 1 sorted field and 1 partitionned date. Cloud cash ok on the data.
Can you tell why the workload is not balanced accross threads ?
Would a distributed storage setup improve on the spilling operations ?
Thanks for your help !