Reflection Creation/Refresh taking too long


I have been trying to create a raw reflection on datasets that contain over 60 million records. The PDS is created from a CSV file, but the reflection refresh takes over 36 minutes.

I am working on an application that requires users to start applying aggregations on such huge datasets within a very short time. Reflections seem like a solution to this, but making an end user wait over half an hour every time a new file is processed would be difficult.

I looked through multiple posts on the Dremio community but could not find the right solution. I have also tried applying “Minimum refresh time” while creating the reflection. Still no luck.
Could you please suggest what I can do to reduce the reflection creation/refresh time?

Hi @reshma.cs, I am afraid you will have to use some kind of incremental/differential/partitioned update strategy here. You are talking about tens of millions of rows.

@fetanchaud: Thank you for replying… I read in Dremio’s documentation that incremental refresh is not possible for CSV files.
The data and columns are dynamic, and each file uploaded by a user will have a different schema. Hence it would be difficult to figure out on what basis we could partition the file. But I can look further into a partitioning strategy.
Are there any other options to accelerate the refresh operation for a PDS built from CSV files that are dynamic in nature?

Did you try playing with the reflection partition option? Do you need all the columns? Hoping it helps.
IMHO end users should not load such huge files directly; the files should be preprocessed beforehand in a staging step, which is a job for a data engineer.
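
As a rough illustration of the staging idea above, here is a minimal Python sketch that streams a CSV and keeps only the columns that are actually needed before the file is handed to Dremio. The function name `project_columns` and the sample column names are hypothetical, not part of any Dremio API, and whether a narrower file shortens the refresh noticeably is an assumption that would have to be measured. Converting the staged file to a columnar format such as Parquet is another common option, but it needs a third-party library and is not shown here.

```python
import csv
import io


def project_columns(src, dst, keep):
    """Stream CSV rows from src to dst, keeping only the columns in `keep`.

    `src` and `dst` are text file objects. Returns the number of data rows
    written. (Hypothetical helper for illustration only.)
    """
    reader = csv.DictReader(src)
    header = reader.fieldnames or []
    missing = [name for name in keep if name not in header]
    if missing:
        raise ValueError(f"columns not found in input: {missing}")

    # extrasaction="ignore" drops every column that is not listed in `keep`.
    writer = csv.DictWriter(dst, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()

    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows


# Usage with in-memory data; real files would be opened with newline="".
source = io.StringIO("id,region,amount,comment\n1,EU,10,a\n2,US,20,b\n")
staged = io.StringIO()
row_count = project_columns(source, staged, ["region", "amount"])
```

Because the rows are streamed one at a time, memory use stays flat even for files with tens of millions of records.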

Hi @fetanchaud,

I tried applying horizontal partitioning but could not get it to work, since the column names are dynamic. Each uploaded dataset differs from the others, and its column names are different too.

Also, the application is a data-heavy one, so users rely on it specifically for processing huge files. I was hoping for a more generic solution.
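
One generic way to cope with dynamic column names, sketched here under stated assumptions, is to sample the first rows of each uploaded file and suggest the column with the fewest distinct values as a partition candidate. The function `suggest_partition_column` and its thresholds are hypothetical; nothing in this thread confirms that such a column would speed up a reflection refresh, and a sample taken from the head of a file may not be representative of the whole file.

```python
import csv
import io
from itertools import islice


def suggest_partition_column(src, sample_rows=10000, max_distinct=50):
    """Return the sampled column with the fewest distinct values.

    Only columns with between 2 and `max_distinct` distinct values in the
    sample qualify; returns None if no column does. Ties are broken by
    column name. (Hypothetical helper for illustration only.)
    """
    reader = csv.DictReader(src)
    distinct = {name: set() for name in (reader.fieldnames or [])}

    # Inspect only the first `sample_rows` data rows of the file.
    for row in islice(reader, sample_rows):
        for name, values in distinct.items():
            values.add(row.get(name))

    candidates = [
        (len(values), name)
        for name, values in distinct.items()
        if 2 <= len(values) <= max_distinct
    ]
    return min(candidates)[1] if candidates else None


# Usage with in-memory data standing in for an uploaded file.
upload = io.StringIO(
    "id,country,amount\n1,FR,10\n2,FR,20\n3,DE,30\n4,DE,40\n"
)
candidate = suggest_partition_column(upload)
```

The caller never needs to know the column names in advance, which is the point of the sketch; the suggested column would still need to be checked against how users actually filter and aggregate the data.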