Documentation Update Request - Reflection Refresh Algorithm Clarification

Hi Dremio Team,

I’ve been studying the reflection refresh behavior described on the “Refresh Reflections” page of the Dremio documentation and noticed that the current explanation might be misleading regarding how dependent reflections are refreshed.

Current Documentation Issue:
The documentation states that “Refreshing Reflection R5 through the API also refreshes R1, R2, R3, and R4” without mentioning any intelligent filtering. This suggests a naive cascading refresh that would ignore individual reflection refresh policies.

Actual Implementation:
After examining the source code, I found that Dremio actually implements a sophisticated algorithm in DependencyManager.shouldRefresh() that:

  1. Respects individual refresh policies (PERIOD, SCHEDULE, NEVER)
  2. Checks last refresh timestamps before triggering dependent refreshes
  3. Avoids unnecessary refreshes when dependencies haven’t actually changed
  4. Uses REFRESH_PENDING state to prevent redundant operations

For example, if Reflection R1 has a 1-month refresh period and was refreshed a week ago, triggering R5 (1-hour period) won’t unnecessarily refresh R1.

Suggestion:
Could the documentation be updated to clarify that:

  • Dependent reflection refreshes are conditionally triggered based on their individual policies
  • The system checks if a refresh is actually needed before executing it
  • Timestamps and refresh policies are considered to optimize performance

This would help developers better understand the intelligent behavior behind Dremio’s reflection refresh system and avoid concerns about inefficient cascading refreshes.

Thank you for considering this documentation improvement!

Hi @cclairmont, in the online docs, R5 and R1 both depend on Table1. If data changed in Table1 and you trigger a refresh on R5, then R1 will refresh first, followed by R5. This will happen regardless of the refresh policy set on Table1, which both R1 and R5 inherit.

The design behind this is that reflection management is trying to achieve an eventually consistent model where all reflections in the system reflect the same state of the underlying tables.
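To make the ordering concrete, here is a minimal sketch of how a refresh on R5 could be expanded into "upstream first" order. It is not Dremio's code: the `DEPENDS_ON` graph (in particular R5 depending on R1) and the `refresh_order` helper are assumptions made up for illustration, and the graph in the docs may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each key depends on the names in its set.
# This is an illustrative assumption, not the graph from the Dremio docs.
DEPENDS_ON = {
    "R1": {"Table1"},
    "R5": {"R1", "Table1"},
}


def refresh_order(target, depends_on):
    """Return the reflections to refresh, upstream first, ending with target."""
    # Collect everything the target transitively depends on.
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(depends_on.get(node, ()))
    # Order the subgraph so that dependencies come before dependents.
    sub = {n: depends_on.get(n, set()) & needed for n in needed}
    order = TopologicalSorter(sub).static_order()
    # Tables are not refreshed here; only reflections (keys of depends_on) are.
    return [n for n in order if n in depends_on]


print(refresh_order("R5", DEPENDS_ON))
```

With this assumed graph, R1 comes before R5, which matches the order described above. Note that the sketch never consults a refresh policy; the order is driven only by the dependencies.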

Since you said that you triggered a refresh on R5 but R1 did not refresh, it might be the case that when refreshing R1, Dremio detected that there were no data/snapshot changes in Table1 (assuming Table1 is an Iceberg table). If the snapshot id didn’t change, it’s a “no op” refresh, so neither R1 nor R5 is refreshed unnecessarily.
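The “no op” check can be pictured with a small sketch. This is only an illustration of the idea of comparing snapshot ids, not Dremio's implementation: the `Reflection` class, its `last_snapshot_id` field, and the `refresh` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reflection:
    name: str
    # Snapshot id of the source table seen at the previous refresh (hypothetical field).
    last_snapshot_id: Optional[int] = None


def refresh(reflection, current_snapshot_id):
    """Skip the refresh when the source table's snapshot id has not changed."""
    if reflection.last_snapshot_id == current_snapshot_id:
        return "no-op"
    # The snapshot id changed (or this is the first refresh): do the work
    # and remember which snapshot the reflection now corresponds to.
    reflection.last_snapshot_id = current_snapshot_id
    return "refreshed"


r1 = Reflection("R1")
print(refresh(r1, 100))  # first refresh, no previous snapshot recorded
print(refresh(r1, 100))  # same snapshot id as before
print(refresh(r1, 101))  # a new snapshot appeared in the source table
```

In this sketch the second call is skipped because the snapshot id is unchanged, which is the situation described above where R1 appears not to refresh.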

You can confirm whether a no op refresh occurred, or how the snapshot ids changed for each source table, in the query profiles. Under the Planning Tab → Refresh Decision section, there’s an explanation of:

  • Whether the refresh is full or incremental.
  • If incremental, which type: Snapshot based append, Partition based rebuild, Field based, or MTime based.
  • If the source tables are Iceberg, whether the data has changed and, if so, which snapshot id was used in the previous refresh and what the current snapshot id is.
  • Finally, if there are no data changes, we clearly state that there will be no refresh.

BTW, there are improvements coming in the next version of Dremio that:

  • Provide a table function that takes a reflection id and returns the refresh policies of all its underlying tables so that you can understand when a given reflection will refresh next.
  • Include in the “REFRESH REFLECTION” query profile a listing of all the source table snapshot ids or last query time. This is particularly useful when a reflection is built on another reflection and you need to know that the current reflection might not be using the latest data from the source tables.

Hi!

Thank you so much @Benny_Chow for this very detailed response!

It is very helpful to understand that Dremio actually implements an eventually consistent model with intelligent change detection at the Iceberg snapshot level. The “no-op” refresh logic based on snapshot IDs is particularly elegant.

I’m also very excited about the upcoming improvements, especially:

  • The table function to understand refresh policies of dependencies
  • Including snapshot IDs in REFRESH REFLECTION query profiles

Documentation suggestion: I hope these clarifications can be integrated into the official documentation to prevent other developers from thinking that all dependencies are blindly refreshed without checking if it’s actually necessary. A mention of the intelligent data change detection would make the behavior much clearer.

Thanks again for taking the time to explain these technical details!
