Incremental data reflection with Iceberg

FYI: It seems querying the remote filesystem for the existence of a possible new version actually happens twice per refresh, which doesn’t make it any faster:

alter table iceberg metadata refresh.log.zip (1.4 KB)

For reference, v18 is the current version (18 being the content of version-hint.text) and v19 is non-existent.

In regard to:

This was actually the one I was thinking about:

Thanks for clearing that up, @Benny_Chow :+1:

Hey @Benny_Chow

Quick question for you!
For the Glue catalog, is there a way to manually refresh the metadata store to discover datasets on demand, as opposed to using a fixed interval? Based on the source, it seems the lowest frequency is 1 minute, but I would like to trigger a manual refresh. Is there a workaround for this?

Thanks!

@wundi From a metadata refresh point of view, Nessie and Arctic behave the same. That is, Dremio doesn’t cache any metadata, such as the list of tables/views or the Iceberg table metadata (e.g. the table schema in metadata.json). You can imagine it would be very hard to manage a cached copy of this metadata outside of Nessie/Arctic, as the concept of branching and tags greatly increases the volume of metadata to keep in sync.

@kyleahn For Glue/Hive, if you know ahead of time that there is a new table and Dremio has not shown it in the UI yet, I think that if you run a query directly on it, Dremio will find it in the Glue/Hive catalog, cache the metadata, and kick off an inline metadata refresh to record split information.
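For example, a minimal sketch of that workaround, assuming a hypothetical Glue source named glue and a new table sales.orders:

```sql
-- Querying the new table directly should make Dremio look it up in the
-- Glue catalog, cache its metadata, and kick off an inline metadata refresh.
SELECT COUNT(*) FROM glue.sales.orders;

-- For a table Dremio already knows about, metadata can also be refreshed
-- on demand instead of waiting for the scheduled interval.
ALTER TABLE glue.sales.orders REFRESH METADATA;
```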


When testing a table that is backed by a Nessie catalog (using services.nessie.remote-uri in dremio.conf), I do observe that all the Iceberg metadata is fetched during each query, both metadata.json as well as the manifest list and manifests. But in regard to:

At least on Dremio OSS 24.0.0, there is no request to the Nessie catalog when issuing a query. And if I modify the table from outside Dremio, Dremio continues to use the same snapshot as before and thereby does not access the latest metadata/data (which makes sense: since Nessie is not being queried, Dremio is unaware that a change has happened). When modifying the table using Dremio, Dremio is of course aware that a change has happened and fetches the current metadata pointer from Nessie. Otherwise, an ALTER TABLE REFRESH METADATA command needs to be issued, at which point the Nessie catalog is consulted.
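For reference, the manual refresh takes this shape (the table path below is just a placeholder for however the table is addressed in your setup):

```sql
-- Placeholder path: an S3 source named "s3source" with an Iceberg table "demo_table"
ALTER TABLE s3source.warehouse.demo_table REFRESH METADATA;
```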

Based on the expected behaviour you mentioned, @Benny_Chow, this sounds like a bug then. Do you want me to create a bug report on GitHub?

@Benny_Chow Here’s a screen recording of my exact process. I’m running two instances of Dremio and a Nessie catalog, and tailing the S3 requests (served by MinIO):

dremio-nessie_1024_1_5fps

It’s a bit difficult to see in the recording, but the Dremio instance not performing the INSERT statement continues using the same Iceberg snapshot and data, until a manual metadata refresh is issued.
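Roughly, the SQL in the recording follows these steps (the table path and values are hypothetical):

```sql
-- Instance A: insert a row, which creates a new Iceberg snapshot through Nessie.
INSERT INTO s3source.warehouse.demo_table (id, name) VALUES (42, 'new row');

-- Instance B: still returns the previous snapshot's data...
SELECT * FROM s3source.warehouse.demo_table;

-- ...until a manual metadata refresh is issued on instance B.
ALTER TABLE s3source.warehouse.demo_table REFRESH METADATA;

-- Now instance B sees the new row.
SELECT * FROM s3source.warehouse.demo_table;
```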

I later tried playing with the “metadata refresh” settings on the S3 source storing the Iceberg metadata and data files. When set to one minute, Nessie is queried every 1-2 minutes for the latest metadata pointer. When set to an hour, I haven’t waited long enough to observe a refresh (I’ve waited less than an hour so far).

Hi @wundi, it sounds like you have two catalogs on the same Iceberg table, which is not supported: the Nessie catalog and the Hadoop catalog from S3. Since you are querying through the S3 source, the metadata refresh policy on that source applies and you won’t always see the latest Iceberg snapshot. Nessie as a source is coming soon (similar to how Arctic is available for Dremio Cloud).