We are working on a problem and evaluating Dremio as a common layer on top of different data stores (comparing it against Hive/Druid and similar solutions).
I am not able to figure out a way to refresh a particular partition of a reflection, or how to implement rolling reflections.
Any help or suggestions would be appreciated.
This is important considering GDPR (and other similar cases): some users can ask that their PI data no longer be present in the system.
What if I create monthly aggregate reflections? The latest month's aggregate reflection would be updated incrementally, and I could do a full refresh of a previous month's aggregation if any update or delete happens in that month's data sets.
There are currently two ways to update a reflection, full or incremental: https://docs.dremio.com/acceleration/updating-reflections.html
If you do an incremental update, the previous months' data won't be affected. If you do a full refresh, the previous months' data will be overwritten.
If there is a column you don't want to include in the reflection, you can create a VDS that drops it, or you can secure it via row-level or column-level permissions in the VDS.
Could you please share your thoughts on the approach I have mentioned above?
And could you please provide a link to the relevant article if reflections have to be updated manually via the API?
Hi Lokesh, I think your idea may work. You would create a VDS for each month, then UNION ALL the months together in your rolling window.
But I think you could also create a single VDS that selects all data within your window using a date predicate, then set it to full refresh; that will keep the window up to date based on the refresh policy you set. This seems easier to implement, but a full refresh may be impractical depending on your source.
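To make the date-predicate idea concrete, here is a minimal sketch (the table and column names are hypothetical, not from this thread). In the VDS SQL itself you could equally write something like `CURRENT_DATE - INTERVAL '30' DAY`; this helper just computes the cutoff explicitly so the generated SQL is easy to inspect:

```python
from datetime import date, timedelta

def rolling_window_vds_sql(table: str, date_col: str,
                           window_days: int, today: date) -> str:
    """Build the SELECT for a VDS that keeps only the last `window_days` days.

    A full refresh of the reflection defined on this VDS slides the
    window forward automatically; no per-partition refresh is needed.
    """
    cutoff = today - timedelta(days=window_days)
    return (f"SELECT * FROM {table} "
            f"WHERE {date_col} >= DATE '{cutoff.isoformat()}'")
```

The trade-off, as noted above, is that every refresh recomputes the whole window.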
@lokeshlal here is the endpoint for refreshing all reflections based on a physical dataset:
Make sure to update that dataset's Refresh Policy to "Never Refresh", with a suitable expiration for your use case.
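For reference, a minimal sketch of triggering that refresh from Python. The base URL, dataset id, and token are placeholders; this assumes the v3 catalog `refresh` endpoint from the Dremio REST API docs, which refreshes all reflections that depend on a physical dataset:

```python
import urllib.request

BASE_URL = "http://localhost:9047"  # placeholder: your Dremio coordinator

def refresh_url(dataset_id: str) -> str:
    # POSTing to this URL asks Dremio to refresh every reflection
    # that depends on the given physical dataset.
    return f"{BASE_URL}/api/v3/catalog/{dataset_id}/refresh"

def refresh_reflections(dataset_id: str, token: str) -> None:
    req = urllib.request.Request(
        refresh_url(dataset_id),
        data=b"",
        method="POST",
        headers={"Authorization": f"_dremio{token}"},
    )
    urllib.request.urlopen(req)  # raises on a non-2xx response
```

A scheduler (cron, Airflow, etc.) can call this after each load into the source.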
Thanks for the responses.
As mentioned by @kelly, I tried multiple raw reflections (one for every month).
In this case, if I have a query spanning more than one month, it goes to the source instead of being accelerated by the reflection for that month. So I ended up creating a rolling view with the joins (which can be managed using the REST APIs).
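A small sketch of how such a rolling view could be composed (the per-month VDS names are hypothetical); the month list can be rotated by re-saving the view with a new SQL body via the REST API:

```python
def rolling_union_sql(monthly_vds: list) -> str:
    """Compose the rolling view as a UNION ALL over per-month VDSs."""
    return "\nUNION ALL\n".join(f"SELECT * FROM {name}" for name in monthly_vds)
```

Each per-month VDS keeps its own reflection, so only the newest month needs frequent refreshing.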
I also tried creating aggregation reflections, and they don't work the way I thought they would (as in other BI modeling tools). They only come into effect when the queried data falls within the specified time window; otherwise the query gets accelerated by raw reflections.
There is a huge performance difference between raw and aggregation reflections.
Is there anything I am missing?
@can Is there any future plan to include the following functionality?
Define segments within aggregation (or maybe raw) reflections and refresh a specific segment as and when required. This would be really helpful where the data is not append-only, so updates and inserts could be handled with ease.
Could you please also point me to an article describing how to define the refresh policy for aggregation (or custom) reflections?
You can define the refresh policy by going to Settings > Refresh Policy of either the raw source (the Hive connection) or the raw physical table. Screenshots of both attached.
Can you give more details on how you did this? Provide examples of the VDS for each month, as well as the VDS that does a UNION of the month-specific VDS.
Finally, confirm how you set up the raw and aggregation reflections.
To help better understand why the performance is so different between the two, have a look at the footprint size for each and share that as well.
@lokeshlal Yes, this is on our radar. Stay tuned.
@can Do we have any update on this item please? Thanks!
@sudhko not yet, but this is one of the top items on our reflection roadmap.
Is there any update on this thread?
I'm a follower of Dremio, trying to understand whether partition-specific reload is supported yet.
@Jinnah Is the request to have separate refresh schedules for RAW and AGG reflections?
Hi @balaji.ramaswamy ,
I don't have a requirement to create raw reflections, only aggregation reflections.
It looks like the reflections are fully reloaded every time there is a change in the underlying data in S3. I'm planning to leverage Dremio reflections to pre-cook some aggregates, joining a couple of dimensions with events, and the event data is growing rapidly every minute/second, so I don't want Dremio to reload the entire reflection (the pre-cooked aggregates) every time there is a new event.
The pre-cooked aggregates (the reflection) should themselves be partitioned by date/month, and I want Dremio to reload only the last n days of partitions of the reflection rather than refreshing all 5 or 10 years of data. This is because I know there won't be new events for dimensions older than n days, so there is no point in re-calculating those aggregates again and again for older data. All I want is to re-calculate the aggregates for the last n days while keeping the remaining data as-is, forever, in the reflection. [This is some sort of batch-update requirement; given that reflections are immutable, all I need is the ability to drop the last n partitions and reload only them.]
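Until partition-level refresh exists, one workaround discussed earlier in this thread is to slice the dataset into per-month VDSs and only trigger refreshes for the slices that overlap the last n days. A sketch of that scheduling logic (the `YYYY_MM` naming convention is hypothetical):

```python
from datetime import date, timedelta

def months_to_refresh(today: date, n_days: int) -> list:
    """Return YYYY_MM suffixes of the monthly slices touched by the
    last `n_days` days; only those slices' reflections need refreshing."""
    cutoff = today - timedelta(days=n_days)
    months, cur = [], date(cutoff.year, cutoff.month, 1)
    while cur <= today:
        months.append(f"{cur.year:04d}_{cur.month:02d}")
        # advance to the first day of the next month
        cur = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
    return months
```

The returned suffixes map to the per-month VDSs whose reflections you would refresh via the REST API; older slices are left untouched.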
Is this possible with Dremio?
Hi @balaji.ramaswamy , @can ,
Is there any update on this thread? Say I have reflections of aggregated data, mostly counts of a few IoT events grouped by day/week. How do I reload these aggregation reflections partially, only for the recent days/weeks, instead of reloading the whole reflection, which would be a huge recompute?
@Jinnah If the view definition does not have joins, you can try incremental refresh; if not, you may have to add VDS logic so that you only refresh the latest month/week, etc.
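One way to sketch that VDS logic is a hot/cold split (all names below are hypothetical): a small "hot" VDS covering the latest period that is refreshed frequently, and a large "cold" VDS refreshed rarely or never, with queries going to a union view over both:

```python
def hot_cold_sql(table: str, date_col: str, boundary_iso: str) -> dict:
    """Build SELECTs for a hot/cold split of one dataset; each VDS
    gets its own reflection with its own refresh policy."""
    hot = f"SELECT * FROM {table} WHERE {date_col} >= DATE '{boundary_iso}'"
    cold = f"SELECT * FROM {table} WHERE {date_col} < DATE '{boundary_iso}'"
    return {"hot": hot, "cold": cold}
```

Refreshing only the hot reflection approximates the partial reload you are after, at the cost of periodically moving the boundary and doing one full refresh of the cold side.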