External_sort few threads working


I have quite slow sort operations on some reflections, most of the time is spent at this stage.
It’s well multithreaded but a few threads do most of the work. Spill time on these threads is really high.

This reflection sources data on an S3 bucket of ~90Go of parquet data (files of ~750Mb, snappy compressed, 1 row group). We use rook ceph block volumes attached to each 3 executors (not distributed storage). Network I/O high >500Mb/sec. Executors memory 60G 8CPU. 15 data string/boolean/float/timestamp/date fields with 1 sorted field and 1 partitionned date. Cloud cash ok on the data.

Can you tell why the workload is not balanced accross threads ?
Would a distributed storage setup improve on the spilling operations ?

Thanks for your help !

@allCag This is due to the default reflection setting “minimize files”, you can change it to “Minimize refresh time” and it will get evenly distributed but the query that uses this reflection might get slower

Edit reflection-advanced-click on the tiny cog wheel inside the layout and you will find it there, see screenshot below

1 Like

@balaji.ramaswamy, this is working, reflections are much faster !
Thanks for your feedback.

Thanks for the update @allCag, good to hear that