Single Large CSV file to Parquet

srikanth_n · January 30, 2021, 1:50pm

Hi,

We are converting a large CSV file stored in AWS S3 (around 50 to 100 GB) to Parquet format, using AWS Dremio Community Edition.
While processing the CSV file, Dremio uses only a single thread for the entire job, so the job runs for a long time (around 25 minutes to process 25% of the source CSV data). Adding executor nodes did not solve the issue: only one thread was used and the others were idle.

Is there any option or workaround to make Dremio use multiple threads/nodes? Please help us bring down the overall execution time.

Reference:

Job profile attached (I cancelled the job after 25 minutes of execution).

Configuration:

Coordinator nodes: 1
Executor nodes: 2 (m5d.2xlarge)

9678aa8c-0476-4d53-a785-d98f5acd0a79.zip (21.1 KB)

Thanks,
Srikanth

balaji.ramaswamy · January 31, 2021, 5:38am

@srikanth_n

The dataset on which you are trying to create a reflection has only one split. Is it a single file?

splits=[1])

srikanth_n · January 31, 2021, 10:00am

Yes Balaji, we are using a single CSV file of around 50 GB as the source.

Please note that we are trying to convert that single CSV file to Parquet files using the “CREATE TABLE … AS” SQL command.

Please suggest how we can improve the performance. Also, let us know if there are any alternative or recommended approaches for converting a single large CSV file to Parquet files.

Thanks,
Srikanth

balaji.ramaswamy · February 1, 2021, 8:47am

@srikanth_n

Is there any chance you can split the CSV file into smaller files, so that we can get some parallelism?
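
For illustration only, here is a minimal sketch of what such a split could look like as a pre-processing step outside Dremio. It assumes the CSV has a header row and has been downloaded somewhere a script can read it. The function name `split_csv`, the `part-00000.csv` naming, and the `rows_per_chunk` default are all hypothetical and not something Dremio requires. It uses Python's `csv` module rather than plain line splitting so that quoted fields containing newlines stay intact.

```python
import csv
from pathlib import Path


def split_csv(src_path, out_dir, rows_per_chunk=1_000_000):
    """Split src_path into numbered CSV files, repeating the header in each.

    Hypothetical helper: names and chunk size are illustrative assumptions.
    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return chunk_paths  # empty input: nothing to split
        out_file = None
        writer = None
        rows_in_chunk = 0
        for row in reader:
            # Open a new chunk on the first row or when the current one is full.
            if writer is None or rows_in_chunk >= rows_per_chunk:
                if out_file is not None:
                    out_file.close()
                path = out_dir / f"part-{len(chunk_paths):05d}.csv"
                out_file = open(path, "w", newline="")
                writer = csv.writer(out_file)
                writer.writerow(header)  # every chunk carries the header
                chunk_paths.append(path)
                rows_in_chunk = 0
            writer.writerow(row)
            rows_in_chunk += 1
        if out_file is not None:
            out_file.close()
    return chunk_paths
```

The chunks would then go into one S3 folder, so that the folder as a whole can be treated as a dataset with several splits instead of one.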

srikanth_n · February 1, 2021, 11:51am

Balaji, as per our requirement we receive a single large CSV file for processing.

To split the CSV file on S3, we would have to introduce an additional step in our data processing logic.

Please let us know if there is any way to split the CSV file on S3 using Dremio. That would let us handle the file split logic in Dremio itself.

Thanks,
Srikanth

balaji.ramaswamy · February 2, 2021, 7:52am

@srikanth_n

You would need to use SQL to split it into multiple VDSs, then use CTAS to create Parquet from each (using Dremio), and then merge the results and promote them as a single dataset inside Dremio.

http://docs.dremio.com/sql-reference/sql-commands/tables.html

Promoting Entities · Dremio

Thanks
Bali
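
As a rough sketch of the split / CTAS / merge steps described above, the statements could be generated along these lines. Everything specific here is an assumption for illustration: the helper `build_split_statements`, the dataset names, the existence of a numeric column to split on with `MOD`, and the use of a `UNION ALL` view as a stand-in for the merge-and-promote step. It also assumes the CTAS target writes Parquet. Note that each partial VDS would presumably still scan the single CSV file, so any gain would come from running the CTAS statements concurrently. The syntax should be checked against the SQL reference linked above.

```python
def build_split_statements(source, key_column, n_parts, vds_space, target_space):
    """Return SQL strings that split `source` into n_parts and CTAS each part.

    Hypothetical helper: all dataset names and the MOD-based split are
    illustrative assumptions, not a documented Dremio recipe.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    statements = []
    part_tables = []
    for i in range(n_parts):
        vds = f"{vds_space}.csv_part_{i}"
        table = f"{target_space}.parquet_part_{i}"
        # One VDS per slice of the key space (assumes a numeric key column).
        statements.append(
            f"CREATE VDS {vds} AS SELECT * FROM {source} "
            f"WHERE MOD({key_column}, {n_parts}) = {i}"
        )
        # CTAS materializes the slice as a table.
        statements.append(f"CREATE TABLE {table} AS SELECT * FROM {vds}")
        part_tables.append(table)
    # Stand-in for the merge step: a view that unions the parts back together.
    union = " UNION ALL ".join(f"SELECT * FROM {t}" for t in part_tables)
    statements.append(f"CREATE VDS {vds_space}.all_parts AS {union}")
    return statements


if __name__ == "__main__":
    for sql in build_split_statements("src.big", "id", 4, "myspace", "$scratch"):
        print(sql + ";")
```

The generated statements are only printed, not executed, so they can be reviewed and adjusted before being run in Dremio.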
