How to orchestrate incremental reflection refresh in dremio-oss when overwriting a partition in a data pipeline

irshad-pai · October 23, 2023, 2:44pm

I have a Spark job in my data pipeline, implemented in Apache Airflow, which runs daily and creates a new partition value in an Iceberg table partitioned by the timestamp column ‘event_time’. Since I may need to rerun the pipeline, I am using the overwritePartitions() method of the DataFrameWriterV2 API to update data in the table. I plan to create a reflection on this table and want to include the reflection refresh in the pipeline as well, using the REST APIs.
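A minimal sketch of the write step described above, assuming the run's data is already in a PySpark DataFrame. The function name and the table identifier are hypothetical; `writeTo(...)` and `overwritePartitions()` are the DataFrameWriterV2 calls mentioned in the post.

```python
def overwrite_daily_partition(df, table_name):
    """Replace the partitions of `table_name` that appear in `df`.

    `df` is expected to be a pyspark.sql.DataFrame holding one run's data.
    `table_name` is a hypothetical identifier such as "catalog.db.events".
    """
    # overwritePartitions() performs a dynamic overwrite: only partitions
    # that receive rows from `df` are replaced, so rerunning the job for an
    # existing event_time swaps that partition instead of appending to it.
    df.writeTo(table_name).overwritePartitions()
```

This is what makes reruns idempotent on the table side, and it is also why the table change is a replace rather than a pure append from the reflection's point of view.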

The pipeline I am planning is:

1. Run the Spark job.
2. Create a raw reflection, with the same partition column as the table, using the REST APIs if the reflection does not exist.
   - API used to create: Reflection | Dremio Documentation
   - API used to check whether the reflection exists: Reflection | Dremio Documentation
3. Refresh the reflection using the endpoints below, setting the type to incremental and the refresh field to event_time.
   - Change settings to incremental: Table | Dremio Documentation
   - Trigger refresh: Table | Dremio Documentation
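The steps above can be sketched as a list of REST requests. This is only an illustration under assumptions: the endpoint paths (`/api/v3/reflection`, `/api/v3/catalog/{id}`, `/api/v3/catalog/{id}/refresh`) and the body field names reflect my reading of the Dremio v3 REST documentation and should be checked against the linked pages. All function names are hypothetical, and how the existing reflection names are fetched is left out.

```python
from urllib.parse import quote

API_ROOT = "/api/v3"  # assumed base path of the Dremio REST API


def create_raw_reflection(dataset_id, name, columns, partition_column):
    """Request that creates a raw reflection partitioned like the table."""
    body = {
        "type": "RAW",
        "name": name,
        "datasetId": dataset_id,
        "enabled": True,
        "displayFields": [{"name": c} for c in columns],
        "partitionFields": [{"name": partition_column}],
    }
    return ("POST", f"{API_ROOT}/reflection", body)


def set_incremental_policy(table_entity, refresh_field):
    """Request that switches the table's refresh policy to incremental.

    `table_entity` is the JSON previously fetched for the table. It is
    copied and sent back with only the refresh policy changed.
    """
    policy = dict(table_entity.get("accelerationRefreshPolicy", {}))
    policy.update({"method": "INCREMENTAL", "refreshField": refresh_field})
    updated = {**table_entity, "accelerationRefreshPolicy": policy}
    # Table ids may contain characters that need URL encoding.
    path = f"{API_ROOT}/catalog/{quote(table_entity['id'], safe='')}"
    return ("PUT", path, updated)


def trigger_refresh(table_id):
    """Request that refreshes the reflections that depend on the table."""
    return ("POST", f"{API_ROOT}/catalog/{quote(table_id, safe='')}/refresh", None)


def plan_requests(table_entity, existing_reflection_names,
                  reflection_name, columns, partition_column):
    """Return (method, path, body) tuples for steps 2 and 3, in order."""
    requests = []
    # Step 2: create the reflection only if it does not exist yet.
    if reflection_name not in existing_reflection_names:
        requests.append(create_raw_reflection(
            table_entity["id"], reflection_name, columns, partition_column))
    # Step 3: switch to incremental on the partition column, then refresh.
    requests.append(set_incremental_policy(table_entity, partition_column))
    requests.append(trigger_refresh(table_entity["id"]))
    return requests
```

Each tuple can be sent with any HTTP client from an Airflow task that runs after the Spark job; keeping the request construction separate from the sending makes the plan easy to test without a running Dremio instance.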

The approach above works fine for normal runs (when new values of event_time arrive in the data). However, I found that the incremental refresh in step #3 above does not work when rerunning the pipeline, since we overwrite data for an event_time value that already exists in the table. I think it is due to the limitation mentioned in this link.

The only option that makes both scenarios work for me is a full refresh, which I don’t want because it needs more resources and time, and the refresh time increases day by day.

I tried the flow below without any luck:

1. Delete the existing rows from the table for the event_time the run is going to process.
2. Create the reflection if it does not exist.
3. Refresh the reflection with type incremental and incremental field event_time.
4. Execute the Spark job.
5. Execute #3 again.

I think the specific feature I am asking for is already available in version 24.2, which is not released for dremio-oss yet. Is there a workaround I can use to avoid a full refresh in my pipeline until the feature is available in dremio-oss?
I already checked the dremio-oss code base in this class, thinking I could put in some hacks via code, but I was unable to understand the logic that causes this behavior. Here it is comparing the schema hash of the dataset and the reflection dataset.

Benny_Chow · October 24, 2023, 5:21pm

Based on the links you posted, you did a lot of good research on how incremental reflection refresh works. Until 24.2 is available, you could consider building the reflection’s materialization yourself (incrementally) and exposing it to the planner as an external reflection.
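A minimal sketch of the external reflection suggestion, assuming the pipeline already maintains its own materialized table (for example, a second Iceberg table written by the same Spark job). The `ALTER DATASET ... CREATE EXTERNAL REFLECTION ... USING ...` statement follows my understanding of Dremio's SQL syntax; the `/api/v3/sql` endpoint, the quoting rule, and every name below are assumptions to verify against the documentation.

```python
def quote_path(parts):
    """Double-quote each path component, doubling any embedded quotes."""
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)


def external_reflection_sql(view_path, reflection_name, target_path):
    """SQL that maps a view to a table the pipeline maintains itself.

    `view_path` and `target_path` are lists of path components; the names
    used in the example below are hypothetical.
    """
    return (
        f"ALTER DATASET {quote_path(view_path)} "
        f"CREATE EXTERNAL REFLECTION {quote_path([reflection_name])} "
        f"USING {quote_path(target_path)}"
    )


def sql_job_request(sql):
    """(method, path, body) for submitting the statement over REST (assumed endpoint)."""
    return ("POST", "/api/v3/sql", {"sql": sql})


statement = external_reflection_sql(
    ["analytics", "events_view"],
    "events_ext",
    ["lake", "db", "events_materialized"],
)
request = sql_job_request(statement)
```

With this approach the pipeline, not Dremio, is responsible for keeping the target table in sync with the view, so overwriting a partition only requires rewriting the same partition in the target table.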

irshad-pai · October 25, 2023, 2:03pm

Thanks for the reply, @Benny_Chow. Any idea when 24.2 will be available in dremio-oss? I can see the tag 24.2 available on Docker Hub for dremio-oss, but not on GitHub. Does this mean it is already available for dremio-oss? I have a few raw reflections with all the fields selected from the anchor table, and the partition column is also the same. I have noticed a performance advantage when using these reflections for my UI queries. In this scenario, do I need to create a derived table, as mentioned in this link, to make external reflections work? Alternatively, if I simply create a dummy view on top of the table itself and then create an external reflection, will I still achieve the same performance advantage? It makes sense to create derived tables in cases involving aggregate reflections and raw reflections with a subset of columns or different partitions, but that is not applicable to my current situation.
