Reflection on the dynamic table is doing full refresh then incremental

We have a dynamic iceberg table with no partition. We created a reflection on the table with partition on the day. When ever we add a column to the table doing full refresh. As the data in the base table will hold years of data. Refresh is becoming costly.
Is there any way to tune reflections do incremental refresh then snapshot based?.

@pavankumargundeti

It looks like this is the first refresh. Have you done one full refresh? If that is not an option, can you create a VDS on data you need? example say last 1 year of data and then create a reflection on that VDS, again first refresh will be full and subsequent ones will be incremental

Hi Balaji,
My case is when table created 1st refresh is already done to partitioned raw reflection for the columns in the table. when i ingesting the data to the table it is doing incremental refesh. This is good.
Actual problem occurs if we add a new column to the table . The table is able to create a new snapshot with in milli seconds. But partitioned raw reflection on manual refresh is doing full refresh which pulls in GB’s of data. Ideally anticipate the delta sync for the schema change on the source table.
Can you help here?.
Thanks,
Pavan

Hi @pavankumargundeti,
The issue is that Dremio will try to figure out if the underlying dataset is changed before we allow incremental refresh. When you base your reflection on the physical dataset/table, and make a change to the physical dataset (such as add another column), Dremio will detect this change and do a full reflection refresh when you add the new column.

To go around this problem you can isolate the adding a new column change from the reflection by creating a view that uses the table (as Balaji said). In the view definition, do not use select *, but only select the columns that you need one by one. Then base the reflection on the newly created view, not the table. We will validate that the view definition is not changed (it is not), and that all the columns used in the view are still in the physical dataset and unchanged. Then we will allow the incremental refresh.

I agree that Dremio could handle this case better and allow incremental refresh when the newly added column is not involved in the reflection itself. I hope the workaround above helps until we resolve this.
Thanks,
Julian

Hi Julian & Balaji,
Thanks a lot for the solutions. But i am looking for a solution where view and iceberg table evolves schema with new columns addition and view reflection to do incremental refresh for newly added columns.
Looks like that is not happening.
Solution:
Looks like create multiple reflections for the new columns added to avoid full refresh on the previous created reflection.
It looks like when new column is added then full refresh is madatory on reflections.

can you confirm this?. I am looking for a solution to avoid full refresh on add column but to do incremental refresh on the reflection which is created with fixed set of columns.
Thanks,
Pavan

Hi Pavan,
If you want to add new columns to the base dataset, and those new columns are not part of the reflection it is possible to do incremental refresh. See the view approach I described in the previous post above. The view remains static too with that approach.
If you want to add new columns to the dataset and those new columns need to be part of the reflection too, it is NOT currently possible to do incremental refresh.
Thanks,
Julian

Thanks Julian. I am looking for the point 2) you mentioned. “add new columns to the dataset and those new columns need to be part of the reflection too”. That case to be incremental refresh.
Ideally reflection is also a iceberg table. iceberg table does support it without over head why reflections has to do full refresh?. It can do incremental refresh and try to optimize the data eventually?. Is this option in your roadmap?.

@Julian_Kiryakov @balaji.ramaswamy : We are also facing the exact same problem. Our Iceberg tables are also dynamic using Iceberg’s schema evolution feature. We have daily data batch processing pipelines which process previous day’s ingested data. For most of our sources schema evolution is enabled and new columns get added. So most of the time it is unnecessary full reflection refresh for us. In our scenario, the ability to specify a filter (limited to the partition where new columns are added) for the reflection refresh command would resolve this issue. Do we have any such option to specify the partitions in recent versions or in roadmap? We are using 24.2 dremio oss which doesn’t have this feature.

@irshad-pai When you said mention FILTER, you mean adding tot he VDS? and your question is if that can be automatic? Does your partition column change? WHat is the current VDS definition?

@irshad-pai Supporting incremental refresh on schema evolution is on the roadmap, however the target release date has yet to be determined.

Any ETA on the incremental refresh on schema evolution?. I hope you are addressing adding new columns to physical iceberg table and into the reflection will be incremental not full refresh?.

Thanks,
Pavan

To fix the schema evolution problem and max use of reflection :
approach 1:
Following is the approach we are going with as we have 1500 columns max per tenant. Do give the feedback on this

Iceberg table …col1,col2…column1499

(only for aggregated cols) col1view1, col2view2…col1499view11499 Worst case

(aggregate reflections) Reflectioncol1, Reflectioncol2 … Reflectioncol200 initially. ( Eviction of LRU reflection/ Consolidation of reflection to new snapshot reflection of columns)

Let us know do you see any concerns with this approach ?. We are going with fetch raw time-series data and aggregates

Approach 2:
Pivot the ingested data so that we get the fixed set of fields and used for aggregation and query time-series data.

iceberg table:
_time columns char_float. char_varchar
t1 col1 12.5 “hello”
t2 col2 13.5 “world”
t3 … ----- -------

aggregate reflection patition by day. We are going with fetch raw time-series data and analytics aggregates.

Though we analyzed pros and cons of approaches. We want to know from you to get the confidence on the approach.

Note: We have a shared dreamio cluster multi tenant.

I am posting this approaches which might help others if they fall into the same usecase.

@pavankumargundeti With approach1, if you aggregate on all columns there will be not much reduction in rows, approach2 seems better

Do you need all 1500 columns for the dashboard?

Yes, we may need aggregates for 1500 columns or even more.
Pivot may not completely help as columns keep evolving on the table and pivot always expects the fixed column name in the definition. Adding an extra column will result in a full refresh of reflection which will be costly.

With approach 1, If we maintain a view on each column and create reflections for each view(For example 1000 to 2000 reflections).
What challenges do we have to face or this is not even feasible?
Do consider not all columns will get changed under the 1-hour refresh interval.

Also, do we have any timelines from dremio on partial refresh on schema evolution of table?

Thanks

Having a reflection for every column would not scale as 2000 reflections have to be refreshed before you run the report. Have you tried using Iceberg instead of parquet so you do not have to do 2 things

  • metadata refresh
  • performance of iceberg

What is this? Also, do we have any timelines from dremio on partial refresh on schema evolution of table? are you asking if Dremio will detect schema changes? This should already be there?

Thanks
Bali

Hi Bali,
among the 2000 columns active columns are the subset might be 500 columns. As we are considering this a timeseries store with a dedicated dremio engine for incremental refreshes can that be feasible. As the reflections configured refresh is per hour assuming the no of columns refresh with footprint is minimul…ie other reflections refreshes in millsec.
We are also looking for the rest api to refresh the column reflection to trigger manually on evluating the delta snapshot api no to make 2000 columns refresh. Do you have a Rest api to refresh a view on demand?.
We are mostly looking for quering the aggregates over a year for the analytical columns max 30 at a time with iceberg table partitioned on a day at petabyte scale on which doing on the fly aggregates is not feasible.
So looking for optimal use of reflections to get max benifits with write and read. In the other approach2 the data set represent in pivot into iceberg will increase the dataset size to multifold as each column data represented in rows. Using reflections on that you will see challenges of data refresh , write latencies…etc?.

Thanks,
Pavan