We are currently testing the new real-time metadata refresh functionality released in Dremio 18. I was trying it out, but it does not seem to work as expected (maybe my expectation is wrong).
Both flags are enabled
I have an S3 data lake source connected to a local MinIO instance containing Parquet files. My expectation is that if I add a new file to a folder that has already been promoted to a physical dataset, I do not need to refresh its metadata and should see the new data almost instantly. However, this is not the case.
Is my expectation wrong, or is it some limitation of MinIO?
@balaji.ramaswamy I just tested it in an AWS environment with an S3 bucket:
Delete a file from a folder promoted to a PDS in Dremio
Immediately run a query on the PDS
I can see that before the query runs, the metadata is updated without me doing anything. I will also test adding a new file with new data, but I expect that to work. I suppose there is some incompatibility with MinIO.
I have to reopen this, as I did more testing and it still does not work as expected. These are the steps I performed on an S3 data source with Parquet files:
A folder is promoted to a physical dataset and shows the data from all Parquet files correctly
Remove one Parquet file from the S3 folder
Expectation: the change is reflected immediately when querying through Dremio
Actual: the data does reflect the change and no longer shows entries from the removed file, but I can see these logs, which force a metadata refresh:
2021-12-01 10:04:57,625 [Fabric-RPC-Offload2772] INFO c.d.exec.work.foreman.AttemptManager - 1e58b736-7da5-a560-bcea-167364cad200: State change requested RUNNING --> FAILED, Exception com.dremio.common.exceptions.UserRemoteException: INVALID_DATASET_METADATA ERROR: One or more of the referred data files are absent [File not found 71da8638-fac1-46e7-a8a3-b32dfc929728/test/02c116ea-adb7-4a0b-b70b-7dfec43aa809/02c116ea-adb7-4a0b-b70b-7dfec43aa809-1637672446990.parquet].
SqlOperatorImpl TABLE_FUNCTION
Location 0:0:12
Fragment 0:0
[Error Id: 4137b10f-390f-420d-96bd-84ab067b9348 on dremio-executor-0.dremio-cluster-pod.dremio.svc.cluster.local:0]
(java.lang.RuntimeException) One or more of the referred data files are absent [File not found 71da8638-fac1-46e7-a8a3-b32dfc929728/test/02c116ea-adb7-4a0b-b70b-7dfec43aa809/02c116ea-adb7-4a0b-b70b-7dfec43aa809-1637672446990.parquet].
Upload a new Parquet file with new data into the S3 folder
Expectation: the new data shows up almost immediately when querying through Dremio
Actual: even after a minute I still see stale data. Only after running ALTER TABLE <pds> REFRESH METADATA do I see the new data
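For reference, the manual refresh I have to run looks like the statement below; the source and folder names here are only placeholders for my promoted folder, not the real path:

    -- placeholder path, substitute your own source and promoted folder
    ALTER TABLE "s3source"."testfolder" REFRESH METADATA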
Can you please advise whether we are missing some configuration?
@igorsechyn Near-real-time metadata refresh refers to the incremental refresh Dremio performs since v18.0, which is very quick. However, the default metadata refresh interval is every 1 hour, so not seeing new data within 1 minute is expected. If you want the data to be visible at once, it is best to run the ALTER PDS REFRESH METADATA command right after the ETL job loads data onto the lake.
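For example, the last step of the ETL job could issue a statement along these lines (the source and folder names below are only an illustration, replace them with your own promoted dataset path; the ALTER TABLE <path> REFRESH METADATA form you already used works the same way):

    -- illustrative path only: <source>.<promoted folder>
    ALTER PDS "s3-lake"."sales" REFRESH METADATA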
@spireite Depending on the source, detecting changed files can be an expensive operation. Ideally you would refresh the metadata of changed datasets as you finish your ETL pipeline, for example, rather than relying on a set interval.
This is also why technologies like Iceberg are being integrated, to help with these types of use cases.