We are trying to extract data from one S3 bucket and write it into another.
The original dataset has about 1 billion rows. We are trying to extract 10 million of those rows and write them into the target S3 bucket, using a CTAS query to write to S3.
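For context, the statement is shaped roughly like the following (table and path names below are placeholders, not the actual ones):

```sql
-- Hypothetical sketch of the CTAS we are running; source/target names are placeholders.
CREATE TABLE s3_target.output_bucket.sample_10m AS
SELECT *
FROM s3_source.input_bucket.big_table
LIMIT 10000000;  -- take 10 million of the ~1 billion rows
```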
On analyzing the profile, we found that the Parquet writer step takes nearly 90% of the total query time.
Additional information:
Initially we used a setup with 2 executor nodes and 1 master node. Only one executor reached about 50% CPU utilization while the other stayed almost idle.
We tried increasing the number of executor nodes, but still only one node was used effectively while the others stayed idle.
We need help improving the Parquet write performance.
I have attached the profile dump for reference.
c6916e68-374f-4769-b70a-9ac727af7fed.zip (26.6 KB)
@rnkarthick
Are you able to provide the profile with 2 executors?
I have already attached the profile with 2 executors in my original post. Or am I missing something?
@rnkarthick
Apologies, my bad. If you look at the profile, the phase 1 operators are multi-threaded, but phase 0 is always single-threaded because of the SCREEN operator, and the PARQUET WRITER is planned in phase 0. I am wondering if that is due to the LIMIT clause. Can you please run the query without the LIMIT? You do not have to run it to completion: once planning completes and the query starts executing (spinning wheel on the Jobs page), cancel the query and send us the profile.
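For example, the same hypothetical CTAS with the LIMIT removed (placeholder names as before):

```sql
-- Same sketch with the LIMIT clause removed; cancel the query once execution starts
-- (spinning wheel on the Jobs page) and capture the profile at that point.
CREATE TABLE s3_target.output_bucket.sample_10m AS
SELECT *
FROM s3_source.input_bucket.big_table;
```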
Thanks a lot for the reply. Yes, indeed we figured out that the LIMIT was the issue, and your suggestion makes us believe we are on the right track. Is there an alternative to LIMIT? (In fact, we tested with FETCH FIRST… but the result was the same.) Your support is much appreciated. The FETCH FIRST variant we tried is sketched below.
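For reference, the FETCH FIRST variant (again with placeholder names) was along these lines, and it showed the same single-threaded write behavior:

```sql
-- FETCH FIRST is the ANSI equivalent of LIMIT, so it appears to be planned the same way,
-- with the PARQUET WRITER still landing in single-threaded phase 0.
CREATE TABLE s3_target.output_bucket.sample_10m AS
SELECT *
FROM s3_source.input_bucket.big_table
FETCH FIRST 10000000 ROWS ONLY;
```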