Dremio reflection creation performance tuning

Is there currently any guide on performance tuning Dremio reflections, specifically the creation of reflections when the underlying PDFS reflection storage is HDFS?

We have a use case where we create a reflection on a physical dataset (an HDFS directory) with approx 16M records, and it takes ~2.5 minutes, which is good. However, when we create a subsequent reflection on the same dataset that includes a partition column, the reflection takes 1.5 hours, even though it's using the first reflection to accelerate.

We noticed that when we add two partition columns instead of one, the time goes down to 35 minutes.
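To make the comparison concrete, the two reflections can be sketched in Dremio's reflection-management SQL roughly as below. The source path and column names here are hypothetical stand-ins for our actual dataset, so treat this as illustrative only:

```sql
-- Fast (~2.5 min): raw reflection with no partitioning
ALTER TABLE hdfs_source."events_dir" CREATE RAW REFLECTION raw_all
  USING DISPLAY (event_id, payload, event_date);

-- Slow (~1.5 h): same dataset, partitioned on the date column
ALTER TABLE hdfs_source."events_dir" CREATE RAW REFLECTION raw_by_date
  USING DISPLAY (event_id, payload, event_date)
  PARTITION BY (event_date);
```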

Our setup uses HDFS for the underlying PDFS reflection storage. We don’t see CPU or memory being the bottleneck.

Is there any further tuning we can do to improve the performance of creating reflections specifically for HDFS? I see many parameters available in Dremio options.
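For anyone wanting to browse those parameters, the support keys and their current values can be listed from the `sys.options` system table. A sketch (the `LIKE` filter and exact column set may need adjusting to what your Dremio version exposes):

```sql
-- List planner-related support keys and their current values
SELECT name, type, num_val, string_val, bool_val
FROM sys.options
WHERE name LIKE 'planner%';
```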

Hi @igreg

Is the partition on a very high cardinality column?

The dataset is for a single day, and the partition is on a date column which has cardinality 1 at this time (i.e. just data for a single date). The dataset will have more days added incrementally, hence the date partition column.

Based on the query profile, it appears the reading, sorting and writing steps are done in sequence, whereas the raw reflection (with no partition columns) runs these steps in parallel. In addition, if we add more sub-partition columns to the reflection, Dremio also performs the read, sort and write steps in parallel.

Is there any other way to parallelize the reflection creation with a partition when cardinality is very low?
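One knob that might be worth checking (an assumption on my part, not something I've verified for this case) is `planner.slice_target`, which controls how many records a phase should handle before the planner splits it into parallel fragments. With a cardinality-1 partition, the writer may be collapsing to a single fragment; lowering the target could force more parallelism:

```sql
-- Hypothetical tuning: lower the per-fragment record target so the
-- planner parallelizes the read/sort/write phases sooner.
-- Confirm the key name and default (100000) in sys.options first.
ALTER SYSTEM SET "planner.slice_target" = 10000;
```

This is a cluster-wide setting, so it would affect all queries, not just reflection builds.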