We are working on converting a large CSV file stored in AWS S3 (around 50 to 100 GB) to Parquet format. For this we are using Dremio Community Edition on AWS.
While processing the CSV file, Dremio uses only a single thread for the entire job, so the job runs for a long time
(around 25 minutes to process 25% of the source CSV data). Adding executor nodes did not solve the issue, as only one thread was used and the others were idle.
- Is there any option or workaround to make Dremio use multiple threads/nodes? Please help us bring down the overall execution time.
I have attached the job profile for reference (the job was cancelled after 25 minutes of execution).
Coordinator nodes: 1
Executor nodes: 2 (m5d.2xlarge)
9678aa8c-0476-4d53-a785-d98f5acd0a79.zip (21.1 KB)
The dataset on which you are trying to create a reflection has only one split. Is it a single file?
Yes Balaji, we are using a single CSV file of around 50 GB as the source.
Please note, we are trying to convert that single CSV file to Parquet files using the “CREATE TABLE … AS” SQL command.
Please share any suggestions to help us improve the performance. Also, let us know if there are any alternative or recommended approaches for converting a single large CSV file to Parquet files.
Is there any chance you can split the CSV file into smaller files? That way we can get some parallelism.
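For anyone following along, a pre-split step outside Dremio could look roughly like the sketch below. It is only a minimal illustration in Python: the function name `split_csv` and the `rows_per_part` parameter are made up for this example, and reading the object from S3 and uploading the parts is left out on purpose. Each part repeats the header row so it can be read as a standalone CSV file.

```python
import csv
import io
from typing import Iterator, TextIO


def split_csv(source: TextIO, rows_per_part: int) -> Iterator[str]:
    """Yield CSV text chunks, each starting with the original header row.

    `split_csv` and `rows_per_part` are hypothetical names for this sketch.
    """
    if rows_per_part < 1:
        raise ValueError("rows_per_part must be at least 1")

    reader = csv.reader(source)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to split

    def new_part():
        # Start a fresh in-memory part that begins with the header.
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        return buffer, writer

    buffer, writer = new_part()
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
        if count == rows_per_part:
            yield buffer.getvalue()
            buffer, writer = new_part()
            count = 0

    if count:
        # Emit the final, partially filled part.
        yield buffer.getvalue()
```

Each yielded chunk would then be written as a separate object under one S3 folder, so that the folder can be used as a multi-file dataset. Because the rows go through the `csv` module, quoted fields containing line breaks stay within one part.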
Balaji, as per the requirement, we receive a single large CSV file for processing.
To split the CSV file on S3, we would have to introduce an additional step in our data processing logic.
Please let us know if there is any way to split the CSV file on S3 using Dremio. That would let us handle the file-split logic in Dremio itself.
You would need to use SQL to split it into multiple VDSs, then use CTAS to create Parquet from each one (using Dremio), then merge the outputs back and promote them as a single dataset inside Dremio.
Promoting Entities · Dremio
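As a rough illustration of the SQL split described above, the statements could be generated with a small Python helper like the one below. This is a sketch under assumptions, not a tested recipe: the helper name `build_ctas_statements`, the dataset paths, and the `key_column` are all hypothetical, and it assumes the CSV has an integer-typed column (cast it first if Dremio reads it as text) that can be bucketed with `MOD`. For brevity the filter is placed directly in each CTAS instead of in a separate VDS. Each statement still has to scan the single source file, so any gain would have to be measured.

```python
from typing import List


def build_ctas_statements(
    source: str, target_prefix: str, key_column: str, parts: int
) -> List[str]:
    """Build one CTAS statement per bucket of `key_column`.

    All names passed in are placeholders chosen by the caller; this helper
    only formats SQL text and does not talk to Dremio.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")

    statements = []
    for bucket in range(parts):
        # Rows are assigned to a bucket by the remainder of the key column.
        statements.append(
            f"CREATE TABLE {target_prefix}_{bucket} AS "
            f"SELECT * FROM {source} "
            f"WHERE MOD({key_column}, {parts}) = {bucket}"
        )
    return statements
```

For example, `build_ctas_statements('src."big.csv"', 'out.part', 'id', 4)` returns four statements that could be submitted as separate jobs. The resulting Parquet outputs would then need to sit in one folder to be promoted as a single dataset, as the reply above suggests.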