How to orchestrate reflection refresh incrementally in dremio-oss in case of overwriting a partition in a data pipeline

I have a Spark job, part of a data pipeline implemented in Apache Airflow, which runs daily and creates a new partition value in an Iceberg table partitioned by the timestamp column `event_time`. Since I may need to rerun the pipeline, I am using the overwritePartitions() method of the DataFrameWriterV2 API to update the data in the table. I am planning to create a reflection on this table and want to include the reflection refresh in the pipeline as well, using the REST APIs.
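
For context, here is a minimal sketch of the daily write step. The catalog, table, and staging path are hypothetical placeholders; only the overwritePartitions() call reflects the actual approach described above.

```python
# Minimal sketch of the daily write step (hypothetical catalog/table/path names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-load").getOrCreate()

# Daily batch for a single event_time value (source path is a placeholder).
df = spark.read.parquet("s3://bucket/staging/2024-01-01/")

# DataFrameWriterV2: overwritePartitions() replaces only the partitions
# present in df, so a rerun for the same event_time rewrites that partition
# instead of appending duplicates.
df.writeTo("my_catalog.db.events").overwritePartitions()
```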

The pipeline I am planning is:

  1. Run the Spark job
  2. Create a raw reflection, with the same partition column as the table, using the REST APIs if the reflection does not exist.
    API used to create - Reflection | Dremio Documentation
    API used to check reflection exists - Reflection | Dremio Documentation
  3. Refresh the reflection using the endpoints below, setting the type to incremental and the refresh field to `event_time` (see the sketch after this list).
    Change settings to incremental - Table | Dremio Documentation
    Trigger refresh - Table | Dremio Documentation
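
Here is a rough sketch of steps 2 and 3 using Python's requests library. The host, token, dataset id, and field names are placeholders, and the payload shapes are paraphrased from the documentation pages linked above, so verify them against your Dremio version before relying on this.

```python
# Rough sketch of steps 2-3 against the Dremio REST API v3 (placeholder
# host, token, dataset id, and field names; verify payload shapes against
# the docs linked above for your version).
import requests

BASE = "https://dremio.example.com/api/v3"   # hypothetical Dremio host
HEADERS = {"Authorization": "Bearer <PAT>", "Content-Type": "application/json"}
DATASET_ID = "<dataset-id>"                  # id of the Iceberg table in Dremio

# Step 2: create a raw reflection only if none exists on the table yet.
resp = requests.get(f"{BASE}/dataset/{DATASET_ID}/reflection", headers=HEADERS)
resp.raise_for_status()
if not resp.json().get("data"):
    requests.post(f"{BASE}/reflection", headers=HEADERS, json={
        "type": "RAW",
        "name": "events_raw",
        "datasetId": DATASET_ID,
        "enabled": True,
        "displayFields": [{"name": "event_time"}, {"name": "payload"}],
        "partitionFields": [{"name": "event_time"}],  # same partitioning as the table
    }).raise_for_status()

# Step 3a: set the table's refresh policy to incremental on event_time.
table = requests.get(f"{BASE}/catalog/{DATASET_ID}", headers=HEADERS).json()
table["accelerationRefreshPolicy"] = {
    "method": "INCREMENTAL",
    "refreshField": "event_time",
    "refreshPeriodMs": 3600000,   # 1 hour; adjust to taste
    "gracePeriodMs": 10800000,    # 3 hours
}
requests.put(f"{BASE}/catalog/{DATASET_ID}", headers=HEADERS, json=table).raise_for_status()

# Step 3b: trigger a refresh of all reflections defined on the table.
requests.post(f"{BASE}/catalog/{DATASET_ID}/refresh", headers=HEADERS).raise_for_status()
```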

The approach above works fine for normal runs (when new values of `event_time` arrive in the data). However, I came to know that the incremental refresh in step 3 does not work when the pipeline is rerun, since we overwrite data for an `event_time` value that already exists in the table. I think it is due to the limitation mentioned in this link: as I understand it, incremental refresh only picks up appended data, not rewritten partitions.

The only option that makes both scenarios work for me is a full refresh, which I don't want: it needs more resources and time, and the refresh time also increases day by day.

I tried the flow below, without any luck:

  1. Delete the existing rows from the table for the `event_time` value about to be loaded (see the sketch after this list)
  2. Create the reflection if it does not exist
  3. Refresh the reflection with type incremental and `event_time` as the incremental field
  4. Execute the Spark job
  5. Execute step 3 again
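
For concreteness, step 1 was a plain Iceberg DELETE issued from Spark, something like the following (hypothetical table name and timestamp):

```python
# Sketch of step 1 of the attempted flow (hypothetical names): delete the
# rows for the event_time about to be reloaded, hoping the later write in
# step 4 would look append-only to the incremental refresh.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pre-delete").getOrCreate()

spark.sql(
    "DELETE FROM my_catalog.db.events "
    "WHERE event_time = TIMESTAMP '2024-01-01 00:00:00'"
)
```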

I think the specific feature I am asking for is already available in version 24.2, which has not been released for dremio-oss yet. Is there a workaround I can use to avoid the full refresh in my pipeline until that feature is available in dremio-oss?
I checked the dremio-oss code base in this class, thinking I could put in some hacks, but I was unable to understand the logic that causes this behavior. There it compares the schema hash of the dataset with that of the reflection dataset.

Based on the links you posted, you did a lot of good research on how incremental reflection refresh works. Until 24.2 is available, you could consider building the reflection’s materialization yourself (incrementally) and exposing it to the planner as an external reflection.
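
Roughly, the idea would be something like the sketch below: your Spark job keeps a copy of the table up to date itself (it already does the incremental work), and you register that copy as the materialization. All names here are placeholders, and you should double-check the external reflection SQL syntax and the /api/v3/sql endpoint against the docs for your version.

```python
# Rough sketch of the external-reflection idea (placeholder names; verify
# the SQL syntax and endpoint against the docs for your Dremio version).
import requests

BASE = "https://dremio.example.com/api/v3"
HEADERS = {"Authorization": "Bearer <PAT>", "Content-Type": "application/json"}

# 1. The Spark job maintains the materialization itself, e.g. it also runs
#    df.writeTo("my_catalog.db.events_materialized").overwritePartitions()
#    so the copy stays in sync without any Dremio-side refresh.

# 2. Register the copy so the planner can substitute it for the anchor table.
sql = (
    "ALTER DATASET db.events "
    'CREATE EXTERNAL REFLECTION "events_external" '
    "USING db.events_materialized"
)
requests.post(f"{BASE}/sql", headers=HEADERS, json={"sql": sql}).raise_for_status()
```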

Thanks for the reply @Benny_Chow. Any idea when 24.2 will be available in dremio-oss? I can see the 24.2 tag for dremio-oss on Docker Hub, but not on GitHub. Does this mean it is already available for dremio-oss?

I have a few raw reflections with all the fields selected from the anchor table, and the partition column is also the same. I have noticed a performance advantage when using these reflections for my UI queries. In this scenario, do I need to create a derived table, as mentioned in this link, to make external reflections work? Alternatively, if I simply create a dummy view on top of the table itself and then create an external reflection, will I still achieve the same performance advantage? Creating derived tables makes sense for aggregate reflections and for raw reflections with a subset of columns or different partitions, but that is not applicable to my current situation.