Reflection Creation/Refresh taking too long

Hi,

I have been trying to create a raw reflection on datasets that contain over 60 million records. The PDS is created from a CSV file, but the reflection refresh takes over 36 minutes.

I am working on an application that requires the user to start applying aggregations on such huge datasets very quickly. Reflections seem like a solution to this, but over half an hour of wait time for an end user every time a new file is processed would be difficult to accept.

I tried referring to multiple posts on the Dremio community, but could not find the right solution. I have also tried applying “Minimum refresh time” while creating the reflection, still with no luck.
Could you please suggest what I can do to reduce the reflection creation/refresh time?

Hi @reshma.cs, I am afraid you have to use some kind of incremental/differential/partitioning update strategy here. You are talking about tens of millions of rows.
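
Just to illustrate what I mean by incremental/differential, outside of Dremio itself: a staging job that only touches files it has not seen before, so each run deals with the delta rather than the full 60M+ rows. This is only a sketch; the paths and the convert() callback are placeholders, not anything specific to your setup.

```python
# Illustrative only: a tiny "process the delta" staging job. It tracks which
# CSV files were already handled in a manifest so each run only converts the
# new uploads instead of re-reading everything.
import json
from pathlib import Path

MANIFEST = Path("processed_files.json")      # hypothetical state file
INCOMING = Path("/data/incoming")            # hypothetical upload directory

def load_manifest() -> set:
    return set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()

def refresh_delta(convert) -> None:
    done = load_manifest()
    for path in sorted(INCOMING.glob("*.csv")):
        if path.name in done:
            continue                          # already staged on a previous run
        convert(path)                         # e.g. convert/append into the staged dataset
        done.add(path.name)
    MANIFEST.write_text(json.dumps(sorted(done)))
```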
Best

@fetanchaud: Thank you for replying… I read in Dremio’s documentation that incremental refresh is not possible for CSV files.
The data and columns are dynamic, and each file uploaded by the user will have a different schema. Hence it is difficult to figure out which column we could partition on, but I can look further into a partitioning strategy.
Are there any other options to accelerate the refresh operation for a PDS built from CSV files that are dynamic in nature?

Did you try to play with the reflection partition option? Do you need all the columns? Hoping it helps.
IMHO end users should not load such huge files; they should be preprocessed beforehand in a staging step, which is a job for a data engineer.
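
For example, here is a minimal sketch of that kind of staging step, assuming a Python/pyarrow pipeline (file paths are placeholders): convert each uploaded CSV to Parquet before promoting it in Dremio, so refreshes read a compressed, columnar format instead of raw text.

```python
# Minimal sketch of a staging step: convert an uploaded CSV to Parquet before
# it is promoted/queried in Dremio. Paths are placeholders.
import pyarrow.csv as pv
import pyarrow.parquet as pq

def stage_csv_as_parquet(csv_path: str, parquet_path: str) -> None:
    table = pv.read_csv(csv_path)   # schema is inferred, so dynamic columns are fine
    pq.write_table(table, parquet_path, compression="snappy")

stage_csv_as_parquet("/uploads/new_upload.csv", "/staging/new_upload.parquet")
```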

Hi @fetanchaud,

I tried applying horizontal partitioning but couldn’t do it successfully since the column names are dynamic. Each uploaded dataset differs from the previous ones, so the column names are different every time.

Also, the application is data-heavy, so users rely on it specifically for processing huge files. I was hoping for a more generic solution.
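
For context, the kind of heuristic we would have to automate looks roughly like the sketch below: sample each upload and pick a low-cardinality column as a partition candidate, since no column names are known up front. Purely illustrative; the sample size and distinct-count threshold are arbitrary assumptions.

```python
# Illustrative heuristic: pick a partition-candidate column from a CSV whose
# schema is not known in advance, preferring the column with the fewest
# distinct values in a sample. Threshold and sample size are arbitrary.
import pyarrow.csv as pv

def pick_partition_column(csv_path: str, max_distinct: int = 200):
    sample = pv.read_csv(csv_path).slice(0, 100_000)   # only look at the first 100k rows
    best_name, best_count = None, None
    for name in sample.column_names:
        distinct = len(sample.column(name).unique())
        if distinct <= max_distinct and (best_count is None or distinct < best_count):
            best_name, best_count = name, distinct
    return best_name                                    # None if nothing looks partitionable
```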

@reshma.cs

Kindly send us the job profile of the original 36-minute reflection creation job, and we can see what you have to tune.

Thanks
Bali

@balaji.ramaswamy

We are facing a similar issue and have already applied horizontal partitioning on the CSV files.

The source data is just 6 GB right now, but it is expected to grow to 240 TB. The expected response time is less than 3 seconds for queries where Dremio needs to return fewer than 1M records.

Reflection creation has been running for the past 54 minutes on 6 GB of data, with no partitioning or sorting applied in the reflection settings.

We are running 1 coordinator and 1 executor on Docker, with 25 GB of RAM and access to 8 cores of the machine, using the default memory settings of Dremio (latest version).

Profile Attached
13426342-144f-47b1-923d-32db7951c597.zip (25.9 KB)

Also, I noticed that there is a lot of wait time in the TEXT_SUB_SCAN operation. What is that for?

@hemant.gupta Each thread scans close to 26 million rows. There are 3578 files, and Dremio is scanning every one of them. All the time is spent on IO requests to these 3578 files.

Where are these files stored? Since this is not a columnar format, the C3 cache is also not used.

Have you considered using Parquet instead?
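
Not a drop-in fix for your setup, just a rough sketch of that suggestion using pyarrow (paths and sizing are assumptions): fold the thousands of small CSV files into a few large Parquet files so a scan issues far fewer IO requests.

```python
# Rough sketch: compact a directory of many small CSV files into a handful of
# larger Parquet files. Directory paths and the rows-per-file limit are
# assumptions, not tuned values.
import pyarrow.dataset as ds

csv_data = ds.dataset("/data/pds_csv", format="csv")    # the many small CSV files
ds.write_dataset(
    csv_data,
    "/data/pds_parquet",
    format="parquet",
    max_rows_per_file=10_000_000,   # a few big files instead of thousands of tiny ones
)
```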

@balaji.ramaswamy The files are stored locally, and Dremio is running on the same machine. Since the files are CSV and support append writes, Parquet isn’t a good choice for us, as it doesn’t support append mode.

FYI, I did try with multiple smaller Parquet files (since there is no append), and the results were even worse: 10 minutes to return 428 records on a dataset of just 6 GB.

I am not using the cloud at all. Are you suggesting that we should use a cloud-based approach for this kind of scenario?

@hemant.gupta CSV has limitations, and Dremio’s product roadmap is all Parquet and Iceberg. How about you explore Apache Iceberg? It allows DML.
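
Not official guidance, just a rough sketch of what the append case could look like on the Iceberg route, assuming an Iceberg catalog is already configured for pyiceberg (in ~/.pyiceberg.yaml) and the table already exists; the table identifier and file path are placeholders.

```python
# Rough sketch: append newly uploaded rows to an Iceberg table with pyiceberg.
# Assumes a catalog named "default" is configured and the table already exists.
import pyarrow.csv as pv
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("staging.uploads")

new_rows = pv.read_csv("/uploads/latest_batch.csv")  # the newly uploaded CSV batch
table.append(new_rows)                               # append-only write; existing data files are untouched
```

Dremio can query the Iceberg table directly, and because new data arrives as appended snapshots rather than rewritten CSVs, each write stays proportional to the new data instead of the whole dataset.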