Near real-time metadata refresh

Hello,

We are currently testing the new near real-time metadata refresh functionality released in Dremio 18. I tried it out, but it does not seem to work as expected (maybe my expectation is wrong).

Both flags are enabled (screenshot attached).

I have an S3 data lake source connected to a local MinIO instance containing Parquet files. My expectation is that if I add a new file to a folder that has already been promoted to a physical dataset, I should see the new data almost instantly without having to refresh its metadata. However, this is not the case.

Is my expectation wrong, or is this a limitation of MinIO?

@igorsechyn After how long do you see the data from the new file?

@balaji.ramaswamy I just tested this in an AWS environment with an S3 bucket:

  • Delete a file from a folder promoted to a PDS in Dremio
  • Immediately run a query on the PDS

Before the query runs, I can see the metadata being updated without me doing anything (a sketch of the test is below). I will also test adding a new file with new data, but I expect that to work as well. I suppose there is some incompatibility with MinIO.
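A minimal sketch of that quick test, assuming a hypothetical S3 source named s3_lake with a promoted folder events (both names are placeholders):

  -- Delete one Parquet file directly in the S3 bucket, then query the PDS straight away.
  -- Dremio updates the dataset metadata on its own before executing the query.
  SELECT COUNT(*) FROM s3_lake.events;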

Sorry for the late response, but this seems to work as expected :slight_smile:

@igorsechyn Glad it works as expected

Hi @balaji.ramaswamy,

I have to reopen this: I did more testing, and it still does not work as expected. These are the steps I performed on an S3 data source with Parquet files:

  • A folder is promoted to a physical dataset and shows the data from all Parquet files correctly
  • Remove one Parquet file from the S3 folder
    • Expectation: the change is reflected immediately when querying through Dremio
    • Actual: the data does reflect the change and no longer shows entries from the removed file, but I can see the following error in the logs, which forces a metadata refresh:
2021-12-01 10:04:57,625 [Fabric-RPC-Offload2772] INFO  c.d.exec.work.foreman.AttemptManager - 1e58b736-7da5-a560-bcea-167364cad200: State change requested RUNNING --> FAILED, Exception com.dremio.common.exceptions.UserRemoteException: INVALID_DATASET_METADATA ERROR: One or more of the referred data files are absent [File not found 71da8638-fac1-46e7-a8a3-b32dfc929728/test/02c116ea-adb7-4a0b-b70b-7dfec43aa809/02c116ea-adb7-4a0b-b70b-7dfec43aa809-1637672446990.parquet].

SqlOperatorImpl TABLE_FUNCTION
Location 0:0:12
Fragment 0:0

[Error Id: 4137b10f-390f-420d-96bd-84ab067b9348 on dremio-executor-0.dremio-cluster-pod.dremio.svc.cluster.local:0]

  (java.lang.RuntimeException) One or more of the referred data files are absent [File not found 71da8638-fac1-46e7-a8a3-b32dfc929728/test/02c116ea-adb7-4a0b-b70b-7dfec43aa809/02c116ea-adb7-4a0b-b70b-7dfec43aa809-1637672446990.parquet].

  • Upload a new Parquet file with new data into the S3 folder
    • Expectation: the new data shows up almost immediately when querying it through Dremio
    • Actual: even after a minute I still see stale data. Only after running ALTER TABLE <pds> REFRESH METADATA do I see the new data (see the sketch below)

Can you please advise whether we are missing some configuration?

Cheers, Igor

@igorsechyn Near real-time metadata refresh refers to the incremental refresh Dremio performs since v18.0, which is very quick. However, the default metadata refresh interval is still every 1 hour, so not seeing new data within 1 minute is expected. If you want the data to be visible at once, the best approach is to run the ALTER PDS REFRESH METADATA command right after the ETL job loads data onto the lake.
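For example, a minimal sketch of that pattern as the last step of an ETL job, assuming a hypothetical promoted dataset s3_lake.sales (a placeholder; the ALTER TABLE ... REFRESH METADATA form mentioned above should behave the same way here):

  -- Run once the ETL job has finished copying new Parquet files to the lake,
  -- so the incremental refresh picks them up right away instead of waiting
  -- for the next scheduled (hourly by default) refresh.
  ALTER PDS s3_lake.sales REFRESH METADATA;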

Out of interest, what is the impact of issuing a REFRESH METADATA command on the PDS on a more regular basis?

Issuing it after a data copy process makes sense as the last step of an ETL job.

As I see it, the UI configuration for this is a bit lacking. What would be nice to see are the following options:

  1. Schedule a metadata/reflection refresh at a fixed time
  2. Schedule at x-minute intervals (for example, 5m, 10m, etc.) in addition to the existing Hourly, Weekly, etc.

Again, it all comes down to use cases, but there are use cases where refreshing more often than once an hour is actually needed.

@spireite Depending on the source, detecting changed files can be an expensive operation. Ideally you would refresh the metadata of changed datasets as you finish your ETL pipeline, for example, rather than on a set interval.

This is also why technologies like Iceberg are being integrated, to help with these types of use cases.