S3 writing performance issue

We are trying to extract data from one S3 bucket and write it into another.
The original dataset has 1 billion rows, and we are trying to extract 10 million of them and write them into the target S3 bucket using a CTAS query.
On analyzing the profile, we found that the Parquet writer step takes nearly 90% of the total query time.
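For context, the query we are running is shaped roughly like the sketch below; the table and column names are placeholders for illustration, not our actual schema.

```sql
-- Rough shape of our CTAS (placeholder names, not the real schema).
-- Reads from the source S3 dataset and writes Parquet into the target bucket.
CREATE TABLE target_space.extracted_sample AS
SELECT *
FROM source_space.big_dataset
LIMIT 10000000;
```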
Additional information:
Initially we used a setup with 2 executor nodes and 1 master node. Only one executor reached about 50% CPU, while the other was almost idle.
We tried increasing the number of executor nodes, but we could still see only one node being used effectively while the others stayed idle.

We need help improving the Parquet write performance.

I have attached the profile dump for reference.
c6916e68-374f-4769-b70a-9ac727af7fed.zip (26.6 KB)

@rnkarthick

Are you able to provide the profile with 2 executors?

I have already attached the profile with 2 executors in my original post, or am I missing something? :frowning:

@rnkarthick

Apologies, my bad. If you look at the profile, the phase 1 operators are multi-threaded, but phase 0 is always single-threaded because of the SCREEN operator, and the PARQUET_WRITER is planned in phase 0. I am wondering if this is due to the LIMIT clause. Can you please run the query without the LIMIT? You do not have to run it to completion: once planning completes and the query starts executing (spinning wheel on the Jobs page), cancel the query and send us the profile.
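To be concrete, something along these lines is what I mean by the diagnostic run (placeholder names again, adapt to your actual query):

```sql
-- Same CTAS, but without the LIMIT, purely to compare how the writer is planned.
-- No need to let it finish: cancel once execution starts and grab the profile.
CREATE TABLE target_space.extracted_sample_no_limit AS
SELECT *
FROM source_space.big_dataset;
```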

Thanks a lot for the reply. Yes, we did figure out that LIMIT was the issue, and your suggestion makes us believe we are headed in the right direction. Is there an alternative to LIMIT (in fact we tested with FETCH FIRST … but the result was the same)? Your support is much appreciated.
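For reference, the FETCH FIRST variant we tried looked roughly like this (placeholder names); it behaved the same as the LIMIT version in the profile:

```sql
-- FETCH FIRST variant we tested; same single-threaded writer behaviour as LIMIT.
CREATE TABLE target_space.extracted_sample AS
SELECT *
FROM source_space.big_dataset
FETCH FIRST 10000000 ROWS ONLY;
```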