We have a standard use case with Dremio on top of parquet files on S3 serving Tableau dashboards. Since there are many concurrent users and the dashboards are complex, we have to make heavy use of reflections. The problem is how to automate reflection refreshes.
There are 3 options:
1. Refreshing a physical dataset (PDS) using the API. Dremio then refreshes all downstream reflections in the correct order. The problem is that if a reflection depends on 10 PDSs, it gets refreshed 10 times, which is slow and unnecessary.
2. Refreshing reflections on a schedule. When using a schedule, Dremio does not respect the dependency tree (even when the refresh schedule is set at the data lake level), so downstream reflections can be refreshed before upstream ones, resulting in inconsistent data.
3. Orchestrating reflection refreshes with an external process (ETL tool, workflow system). However, there is no command or API endpoint to refresh a specific reflection. There is a workaround to trigger a refresh by disabling and re-enabling the reflection via the API, but while the rebuild is running the reflection is disabled, resulting in poor performance. A simple command or API endpoint to refresh a reflection would solve the problem, but there isn't one. (Both API calls are sketched right after this list.)
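For reference, here is roughly what the two API mechanisms above look like in Python. This is a minimal sketch assuming the standard Dremio REST endpoints (`POST /apiv2/login`, `POST /api/v3/catalog/{id}/refresh`, `GET`/`PUT /api/v3/reflection/{id}`); the coordinator URL, credentials and IDs are placeholders.

```python
# Minimal sketch of the two refresh mechanisms discussed above.
# Assumes the standard Dremio REST API; host, credentials and IDs are placeholders.
import requests

DREMIO = "http://dremio-coordinator:9047"  # placeholder coordinator URL

def login(user: str, password: str) -> dict:
    """Return an auth header for subsequent API calls."""
    r = requests.post(f"{DREMIO}/apiv2/login",
                      json={"userName": user, "password": password})
    r.raise_for_status()
    return {"Authorization": f"_dremio{r.json()['token']}"}

def refresh_pds_reflections(headers: dict, dataset_id: str) -> None:
    """Option 1: refresh every reflection that depends on a physical dataset.
    If a reflection depends on N PDSs, calling this for each PDS rebuilds it N times."""
    r = requests.post(f"{DREMIO}/api/v3/catalog/{dataset_id}/refresh", headers=headers)
    r.raise_for_status()

def bounce_reflection(headers: dict, reflection_id: str) -> None:
    """Option 3 workaround: disable and re-enable a reflection to force a rebuild.
    While the rebuild runs, the reflection is disabled and cannot accelerate queries."""
    url = f"{DREMIO}/api/v3/reflection/{reflection_id}"
    for enabled in (False, True):
        body = requests.get(url, headers=headers).json()  # re-read to get the current "tag"
        body["enabled"] = enabled
        requests.put(url, headers=headers, json=body).raise_for_status()

if __name__ == "__main__":
    auth = login("admin", "secret")                    # placeholder credentials
    refresh_pds_reflections(auth, "<pds-catalog-id>")  # placeholder dataset id
    bounce_reflection(auth, "<reflection-id>")         # placeholder reflection id
```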
We are struggling to migrate from Redshift because of this and would appreciate any guidance on how to get around it.
@igor So I assume you have one reflection on the PDS and not individual reflections on these PDSs. If you refresh the VDS, are you expecting the PDSs to refresh in a certain order? Ultimately, they all need to refresh for the VDS to have the latest data.
@igor We have some of the same issues in our setup, although not to the same degree.
There is a variant of option 3, where you create an (almost) identical new reflection that builds while you keep serving queries from the old one.
However, to prevent the new reflection from simply building from the data in the "old" reflection you want to replace, you have to include a field that is not in the old reflection. A low-cardinality field (or a calculated, static one) adds the least overhead. To automate this you would need two such columns to alternate between, so automating the approach takes some work.
Currently, we use this trick manually when we have that specific need (because disabling the reflection would tank performance for us as well). I can't say whether it would be useful for you too. Depending on the performance hit you're currently taking on refreshes, it may or may not be worth the effort.
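If you ever want to script the trick, a rough sketch could look like the following. It assumes the Reflection REST API (`POST /api/v3/reflection`, `DELETE /api/v3/reflection/{id}`) and a hypothetical spare low-cardinality column (`refresh_toggle_a`) already present in the VDS; the auth header and IDs are placeholders.

```python
# Rough sketch of the "duplicate reflection" trick: build a new, almost identical
# reflection while the old one keeps serving queries, then drop the old one.
# Assumes the Dremio Reflection REST API; names, IDs and the toggle column are placeholders.
import time
import requests

DREMIO = "http://dremio-coordinator:9047"      # placeholder
HEADERS = {"Authorization": "_dremio<token>"}  # obtain via POST /apiv2/login

def create_replacement_reflection(dataset_id: str, display_fields: list[str],
                                  toggle_field: str) -> str:
    """Create a raw reflection that includes one extra (hypothetical) low-cardinality
    column so Dremio builds it from the source data, not from the old reflection."""
    body = {
        "type": "RAW",
        "name": f"replacement_{int(time.time())}",
        "datasetId": dataset_id,
        "enabled": True,
        "displayFields": [{"name": f} for f in display_fields + [toggle_field]],
    }
    r = requests.post(f"{DREMIO}/api/v3/reflection", headers=HEADERS, json=body)
    r.raise_for_status()
    return r.json()["id"]

def drop_reflection(reflection_id: str) -> None:
    """Once the replacement is available, remove the old reflection."""
    r = requests.delete(f"{DREMIO}/api/v3/reflection/{reflection_id}", headers=HEADERS)
    r.raise_for_status()

# Usage (placeholders): build the new one, wait until it is usable, then drop the old one.
# new_id = create_replacement_reflection("<vds-id>", ["col_a", "col_b"], "refresh_toggle_a")
# ... poll the new reflection's status until it is available ...
# drop_reflection("<old-reflection-id>")
```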
We do not have any reflections on PDSs. New data comes in, we refresh the PDS metadata, and that works fine. The problem is how to refresh the downstream VDS reflections in the correct order (respecting the dependency tree) while avoiding refreshing them multiple times.
What do you mean by "refresh the VDS"? AFAIK there is no command or API endpoint to refresh a particular reflection.
We are now trying out an approach with an additional layer of reflections before the "application layer". We are structuring our semantic layer as described here: Dremio Semantic Layer. End users only hit the application layer. We add a new layer between the business and application layers (we call it the access layer) with VDSs that are just copies of the application layer VDSs. We first refresh the business layer in the correct order, then the access layer, and finally the application layer (using the disable/enable workaround).
This way, even while the application layer reflections are being refreshed, there are no performance issues, because end users are being served by the access layer reflections. I'll let you know how it goes.
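Roughly, the orchestration we have in mind looks like the sketch below. It assumes the Reflection REST API; the reflection IDs are placeholders, and the availability check is an assumption about the status fields the API returns, so adjust it to whatever your Dremio version reports.

```python
# Sketch of the layer-by-layer orchestration described above: refresh the business
# layer first, then the access layer, then the application layer, waiting for each
# layer's reflections to become available before moving on.
# The availability check is an assumption about the status field names.
import time
import requests

DREMIO = "http://dremio-coordinator:9047"      # placeholder
HEADERS = {"Authorization": "_dremio<token>"}  # obtain via POST /apiv2/login

# Reflection IDs per layer, in refresh order (placeholders).
LAYERS = {
    "business":    ["<refl-id-1>", "<refl-id-2>"],
    "access":      ["<refl-id-3>"],
    "application": ["<refl-id-4>"],
}

def bounce_reflection(reflection_id: str) -> None:
    """Force a rebuild via the disable/enable workaround (see the earlier sketch)."""
    url = f"{DREMIO}/api/v3/reflection/{reflection_id}"
    for enabled in (False, True):
        body = requests.get(url, headers=HEADERS).json()  # re-read to get the current "tag"
        body["enabled"] = enabled
        requests.put(url, headers=HEADERS, json=body).raise_for_status()

def wait_until_available(reflection_id: str, poll_seconds: int = 30) -> None:
    """Poll until the reflection reports itself as usable again."""
    url = f"{DREMIO}/api/v3/reflection/{reflection_id}"
    while True:
        status = requests.get(url, headers=HEADERS).json().get("status", {})
        if status.get("availability") == "AVAILABLE":  # assumed field/value names
            return
        time.sleep(poll_seconds)

for layer in ("business", "access", "application"):
    for refl_id in LAYERS[layer]:
        bounce_reflection(refl_id)
    for refl_id in LAYERS[layer]:
        wait_until_available(refl_id)
```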
It does work, but we resorted to doing it manually in the cases where we needed to force a refresh of a reflection while keeping the old one serving queries during the rebuild.
The above said, we don’t really use reflections as much as we did before, for two reasons:
1. Iceberg is now our main table source, rather than parquet.
2. We run multiple Dremio clusters on top of the same data lake and tables.
Regarding 1., we've doubled down on Iceberg for our data lake since my last comment. Incremental reflection refreshes weren't possible for Iceberg until recently (and only if you're using Nessie as a catalog), which was a deal breaker for us. Regarding 2., we are running multiple Dremio Community installations on top of the same data, which means each cluster would have to repeat the same reflection work, which we'd like to avoid.
So, for us most of the reflections are now external reflections, managed and maintained in Spark pipelines outside of Dremio. We still use them for ad hoc stuff and small tables/joins, but only where the overhead of maintaining them is low.
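In case it's useful, registering such an external reflection after the Spark job has written the materialized table looks roughly like this. It's a sketch using Dremio's `ALTER DATASET ... CREATE EXTERNAL REFLECTION ... USING ...` statement submitted through the SQL API; all paths, names and the auth header are placeholders.

```python
# Rough sketch of registering an external reflection after a Spark pipeline has
# written the materialized table. Paths, names and the token are placeholders.
import requests

DREMIO = "http://dremio-coordinator:9047"      # placeholder
HEADERS = {"Authorization": "_dremio<token>"}  # obtain via POST /apiv2/login

# The Spark job (outside Dremio) maintains lake."spark_agg"."daily_sales_agg";
# this statement tells Dremio to use it to accelerate queries on the VDS.
SQL = """
ALTER DATASET semantic."application"."daily_sales"
CREATE EXTERNAL REFLECTION daily_sales_ext
USING lake."spark_agg"."daily_sales_agg"
"""

r = requests.post(f"{DREMIO}/api/v3/sql", headers=HEADERS, json={"sql": SQL})
r.raise_for_status()
print("submitted job:", r.json()["id"])  # the SQL API returns a job id you can track
```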