External_sort few threads working

allCag · February 18, 2022, 4:08pm

Hello,

Sort operations on some of my reflections are quite slow, and most of the refresh time is spent in this stage.
The sort is multithreaded, but a few threads do most of the work, and the spill time on those threads is really high.

E.g.:
This reflection sources ~90 GB of Parquet data from an S3 bucket (files of ~750 MB, Snappy-compressed, 1 row group). We use Rook Ceph block volumes attached to each of the 3 executors (not distributed storage). Network I/O is high (>500Mb/sec). Each executor has 60 GB of memory and 8 CPUs. The data has 15 string/boolean/float/timestamp/date fields, with 1 sorted field and 1 partitioned date field. The cloud cache is working fine on the data.

Can you tell why the workload is not balanced across threads?
Would a distributed storage setup improve the spilling operations?

Thanks for your help!

balaji.ramaswamy · March 2, 2022, 4:29pm

@allCag This is due to the default reflection setting “minimize files”. You can change it to “Minimize refresh time” and the work will be evenly distributed, but queries that use this reflection might get slower.

Edit the reflection, go to Advanced, and click the tiny cog wheel inside the layout; you will find the setting there.
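
As a rough illustration of why the two settings can balance work so differently, here is a toy model in Python. It is not Dremio's code. It assumes that minimizing the number of files routes all rows of one partition value to a single writer thread, while minimizing refresh time spreads rows across all threads regardless of partition value. The function names, row counts, and dates are hypothetical.

```python
import zlib
from collections import Counter

def consolidated(partition_values, n_threads):
    """Assumed 'minimize files' behaviour: every row of a given
    partition value goes to the same thread (one writer per partition)."""
    load = Counter()
    for value in partition_values:
        load[zlib.crc32(value.encode()) % n_threads] += 1
    return [load[t] for t in range(n_threads)]

def striped(partition_values, n_threads):
    """Assumed 'minimize refresh time' behaviour: rows are spread
    round-robin over all threads, ignoring the partition value."""
    load = Counter()
    for i, _ in enumerate(partition_values):
        load[i % n_threads] += 1
    return [load[t] for t in range(n_threads)]

# Hypothetical input: three date partitions, one of them much larger.
rows = ["2022-01-01"] * 9000 + ["2022-01-02"] * 600 + ["2022-01-03"] * 400
N_THREADS = 8

consolidated_load = consolidated(rows, N_THREADS)
striped_load = striped(rows, N_THREADS)

print("rows per thread, one writer per partition:", consolidated_load)
print("rows per thread, round-robin:             ", striped_load)
```

In this model, at most three threads receive any rows when each partition has a single writer, and one of them handles at least 9000 of the 10000 rows, which is the kind of thread that would sort and spill the most. The round-robin variant gives every thread the same share, at the cost of each partition being written by several threads (more, smaller files).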

allCag · March 3, 2022, 1:27pm

@balaji.ramaswamy, this is working, the reflections are much faster!
Thanks for your feedback.

balaji.ramaswamy · March 4, 2022, 3:19am

Thanks for the update @allCag, good to hear that.
