Update partition of a reflection

lokeshlal · June 7, 2018, 5:00am

We are working on a problem and evaluating Dremio as a common layer on top of different data stores (comparing against Hive/Druid and similar solutions).

I am not able to figure out a way to refresh a particular partition of a reflection or how to implement rolling reflections.

Any help or suggestions would be appreciated.

Thanks,
Lokesh

lokeshlal · June 7, 2018, 5:18am

This is important for GDPR (and similar cases), where some users may ask that their PI data not be present in the system.

Thanks

lokeshlal · June 7, 2018, 11:48am

What if I create monthly aggregate reflections? The most recent month’s aggregate reflection would be updated incrementally, and I could do a full refresh of an earlier month’s aggregation if any update or delete happens in that month’s data sets.

Any suggestions…

Thanks,
Lokesh

anthony · June 7, 2018, 11:51am

There are currently 2 ways to update a reflection - full or incremental - https://docs.dremio.com/acceleration/updating-reflections.html
If you do an incremental update, the previous months’ data won’t be affected. If you do a full update, the previous months’ data will be overwritten.

If there is a column you don’t want to include in the reflection, you can create a VDS that drops it, or you can secure it via row level or column level permissions in the VDS.

lokeshlal · June 7, 2018, 12:07pm

Thanks Anthony.

Could you please share your thoughts on the approach I mentioned above?

Also, could you please provide a link to the article on updating reflections manually via the API?

Thanks

kelly · June 7, 2018, 6:54pm

Hi Lokesh, I think your idea may work. You would create a VDS for each month, then UNION ALL the months together in your rolling window.

But I think you could also create a VDS that has all data within your window as a date predicate, then set to full refresh, and that will keep the window up to date based on the refresh policy you set. This seems easier to implement, but a full refresh may be impractical depending on your source.
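The two options above can be sketched as generated SQL. This is only an illustration: the names `monthly.events_YYYY_MM`, `hive.events`, and `event_date` are hypothetical and do not come from this thread, and the helper functions are not part of any Dremio API.

```python
from datetime import date


def month_starts(end, months):
    """Return the first day of each of the last `months` months, oldest first."""
    starts = []
    year, month = end.year, end.month
    for _ in range(months):
        starts.append(date(year, month, 1))
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return starts[::-1]


def union_all_sql(monthly_views):
    """Option 1: UNION ALL over one VDS per month (view names are hypothetical)."""
    return "\nUNION ALL\n".join(f"SELECT * FROM {v}" for v in monthly_views)


def window_sql(table, date_col, start):
    """Option 2: a single VDS whose date predicate defines the window."""
    return f"SELECT * FROM {table} WHERE {date_col} >= DATE '{start.isoformat()}'"


months = month_starts(date(2018, 6, 7), 3)
views = [f"monthly.events_{m:%Y_%m}" for m in months]
print(union_all_sql(views))
print(window_sql("hive.events", "event_date", months[0]))
```

With a literal start date, the second view has to be redefined whenever the window moves. Whether a relative date expression can be used in the view instead depends on the source and is not confirmed in this thread.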

can · June 7, 2018, 7:56pm

@lokeshlal here is the end-point for refreshing all reflections based on a physical dataset:
http://docs.dremio.com/rest-api/catalog/endpoints.html#refreshing-a-catalog-entity

Make sure to set that dataset’s Refresh Policy to “Never Refresh”, with a suitable expiration for your use case.
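A minimal sketch of calling that endpoint from Python, using only the standard library. It assumes the `POST /api/v3/catalog/{id}/refresh` path and the `_dremio<token>` Authorization header described in the Dremio REST documentation of that period; the host, dataset id, and token below are placeholders, and newer Dremio versions may use a different authentication scheme.

```python
import urllib.parse
import urllib.request


def build_refresh_request(base_url, dataset_id, token):
    """Build (but do not send) the request that refreshes all reflections
    based on the physical dataset with the given catalog id."""
    quoted_id = urllib.parse.quote(dataset_id, safe="")
    url = f"{base_url.rstrip('/')}/api/v3/catalog/{quoted_id}/refresh"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the dataset is identified by the URL
        method="POST",
        headers={
            # Assumed header format; the token comes from a prior login call.
            "Authorization": f"_dremio{token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values, for illustration only.
req = build_refresh_request("http://localhost:9047/", "abc-123", "my-token")
print(req.get_method(), req.full_url)
# To actually trigger the refresh against a running server:
# urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and headers before pointing the script at a real cluster.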

lokeshlal · June 8, 2018, 5:50am

Thanks for the responses.

As mentioned by @kelly, I tried with multiple raw reflections (one for every month).

In this case, if I have a query spanning more than a single month, it goes to the source instead of being accelerated by the monthly reflections. So I ended up creating a rolling view with the joins (which can be managed using the REST APIs).

I also tried creating aggregate reflections, and they don’t work the way I expected (as in other BI modeling tools). They only take effect when the queried data falls within the specified time window; otherwise the query is accelerated by raw reflections.

There is a huge performance difference between raw and aggregate reflections.

Is there anything I am missing?

Thanks

lokeshlal · June 8, 2018, 6:02am

@can Is there any future plan to include the following functionality?

Define segments within aggregate (or maybe raw) reflections and refresh a specific segment as and when required. This would be really helpful where the data is not append-only, so that updates and inserts can be handled with ease.

Thanks

lokeshlal · June 8, 2018, 6:31am

Could you please also point me to an article on how to define the refresh policy for aggregate or custom reflections?

Thanks,

anthony · June 8, 2018, 12:32pm

You can define the refresh policy by going to Settings > Refresh Policy of either the raw source (Hive connection) or the raw physical table. Screenshots of both attached.

kelly · June 8, 2018, 2:10pm

Can you give more details on how you did this? Provide examples of the VDS for each month, as well as the VDS that does a UNION of the month-specific VDS.

Finally, confirm how you set up the relationship and aggregation reflections.

To help better understand why the performance is so different between the two, have a look at the footprint size for each and share that as well.

can · June 8, 2018, 6:12pm

@lokeshlal Yes, this is on our radar. Stay tuned.

sudhko · July 31, 2018, 10:44pm

@can Do we have any update on this item please? Thanks!

can · August 15, 2018, 6:19pm

@sudhko not yet, but this is one of the top items on our reflection roadmap.

Jinnah · September 5, 2021, 4:03am

@can,
Is there any update on this thread?
I’m following Dremio and trying to understand whether partition-specific reload is supported yet.

Thanks,

balaji.ramaswamy · September 7, 2021, 8:33am

@Jinnah Is the request to have separate refresh schedules for RAW and AGG reflections?

Jinnah · September 8, 2021, 9:37am

Hi @balaji.ramaswamy ,
I don’t have a requirement to create reflections for RAW but only for aggregates.
It looks like the reflections are fully reloaded every time the underlying data in S3 changes. I’m planning to use Dremio reflections to pre-cook some aggregates that join a couple of dimensions with events, and the event data grows rapidly, every minute or second. I don’t want Dremio to reload the entire reflection (the pre-cooked aggregates) every time there is a new event.

The pre-cooked aggregates (the reflection) should themselves be partitioned by date/month, and I want Dremio to reload only the last n days of partitions rather than refreshing all 5 or 10 years of data in the reflection. I know there won’t be new events for dimensions older than n days, so there is no point in recalculating the aggregates for older data again and again. All I want is to recalculate the aggregates for the last n days and keep the remaining data as-is, forever, in the reflection. [This is a sort of batch update requirement, but given that reflections are immutable, all I need is the ability to configure Dremio to drop the last n partitions and reload only those.]

Is this possible with Dremio?

Jinnah · December 5, 2021, 6:01am

Hi @balaji.ramaswamy , @can ,
Is there any update on this thread? Say I have reflections of aggregated data, mostly counts of a few IoT events grouped by day/week. How can I reload these aggregate reflections partially, only for the recent days/weeks, instead of reloading the whole reflection, which would be a huge recompute?

balaji.ramaswamy · December 9, 2021, 12:44am

@Jinnah If the view definition does not have joins, you can try incremental refresh. If it does, you may have to add VDS logic so that you only refresh the latest month/week, etc.
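One possible reading of the "VDS logic" suggestion is to split the data into a history view and a recent view at a cutoff date, so that only the reflection on the recent view needs frequent full refreshes. The sketch below only generates the SQL for the two views; `s3.events` and `event_date` are hypothetical names, and the thread does not confirm that queries spanning both views are accelerated.

```python
from datetime import date, timedelta


def split_view_sql(table, date_col, today, recent_days):
    """Return SQL for a 'history' view and a 'recent' view split at a cutoff.

    The two predicates are complementary, so every row with a non-null date
    lands in exactly one view.
    """
    cutoff = today - timedelta(days=recent_days)
    literal = f"DATE '{cutoff.isoformat()}'"
    return {
        "history": f"SELECT * FROM {table} WHERE {date_col} < {literal}",
        "recent": f"SELECT * FROM {table} WHERE {date_col} >= {literal}",
    }


sql = split_view_sql("s3.events", "event_date", date(2021, 12, 9), 7)
print(sql["history"])
print(sql["recent"])
```

Because the cutoff is a literal, both views would need to be redefined as the window moves, and the history reflection rebuilt at that point, so this trades per-event recomputation for an occasional larger one.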
