Near real-time metadata refresh

Hello,

We are currently testing the new near real-time metadata refresh functionality released in Dremio 18. I tried it out, but it does not seem to work as expected (maybe my expectation is wrong).

Both flags are enabled (screenshot attached).

I have an S3 data lake source connected to a local MinIO instance containing Parquet files. My expectation is that if I add a new file to a folder that has already been promoted to a physical dataset, I should see the new data almost instantly without having to refresh its metadata. However, this is not the case.

Is my expectation wrong, or is this a limitation of MinIO?

@igorsechyn After how long do you see the data from the new file?

@balaji.ramaswamy I just tested this in an AWS environment with an S3 bucket:

  • Delete a file from a folder promoted to a PDS in Dremio
  • Immediately run a query on the PDS

Before the query runs, I can see the metadata being updated without me doing anything (a sketch of the test is below). I will also test adding a new file with new data, but I expect that to work as well. I suppose there is some incompatibility with MinIO.
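A minimal sketch of that quick test, assuming a hypothetical S3 source named s3_lake with a promoted folder events (both names are placeholders):

  -- Delete one Parquet file directly in the S3 bucket, then query the PDS straight away.
  -- Dremio updates the dataset metadata on its own before executing the query.
  SELECT COUNT(*) FROM s3_lake.events;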

Sorry for the late response, but this seems to work as expected :slight_smile:

@igorsechyn Glad it works as expected

Hi @balaji.ramaswamy,

I have to reopen this: I did more testing, and it still does not work as expected. These are the steps I performed on an S3 data source with Parquet files:

  • A folder is promoted to a physical dataset and shows the data from all Parquet files correctly
  • Remove one Parquet file from the S3 folder
    • Expectation: the change is reflected immediately when querying through Dremio
    • Actual: the data does reflect the change and no longer shows entries from the removed file, but I can see the following error in the logs, which forces a metadata refresh:
2021-12-01 10:04:57,625 [Fabric-RPC-Offload2772] INFO  c.d.exec.work.foreman.AttemptManager - 1e58b736-7da5-a560-bcea-167364cad200: State change requested RUNNING --> FAILED, Exception com.dremio.common.exceptions.UserRemoteException: INVALID_DATASET_METADATA ERROR: One or more of the referred data files are absent [File not found 71da8638-fac1-46e7-a8a3-b32dfc929728/test/02c116ea-adb7-4a0b-b70b-7dfec43aa809/02c116ea-adb7-4a0b-b70b-7dfec43aa809-1637672446990.parquet].

SqlOperatorImpl TABLE_FUNCTION
Location 0:0:12
Fragment 0:0

[Error Id: 4137b10f-390f-420d-96bd-84ab067b9348 on dremio-executor-0.dremio-cluster-pod.dremio.svc.cluster.local:0]

  (java.lang.RuntimeException) One or more of the referred data files are absent [File not found 71da8638-fac1-46e7-a8a3-b32dfc929728/test/02c116ea-adb7-4a0b-b70b-7dfec43aa809/02c116ea-adb7-4a0b-b70b-7dfec43aa809-1637672446990.parquet].

  • Upload a new Parquet file with new data into the S3 folder
    • Expectation: the new data shows up almost immediately when querying it through Dremio
    • Actual: even after a minute I still see stale data. Only after running ALTER TABLE <pds> REFRESH METADATA do I see the new data (see the sketch below)

Can you please advise whether we are missing some configuration?

Cheers, Igor

@igorsechyn Near real-time metadata refresh refers to the incremental refresh Dremio performs since v18.0, which is very quick. However, the default metadata refresh interval is still every 1 hour, so not seeing new data within 1 minute is expected. If you want the data to be visible at once, the best approach is to run the ALTER PDS REFRESH METADATA command right after the ETL job loads data onto the lake.
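For example, a minimal sketch of that pattern as the last step of an ETL job, assuming a hypothetical promoted dataset s3_lake.sales (a placeholder; the ALTER TABLE ... REFRESH METADATA form mentioned above should behave the same way here):

  -- Run once the ETL job has finished copying new Parquet files to the lake,
  -- so the incremental refresh picks them up right away instead of waiting
  -- for the next scheduled (hourly by default) refresh.
  ALTER PDS s3_lake.sales REFRESH METADATA;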

Out of interest, what is the impact of issuing a REFRESH METADATA command on the PDS on a more regular basis?

Issuing it after a data copy process makes sense as the last step of an ETL job.

As I see it, the UI configuration for this is a bit lacking. What would be nice to see are the following options:

  1. Schedule a metadata/reflection refresh at a fixed time
  2. Schedule at x-minute intervals (for example, 5m, 10m, etc.) in addition to the existing Hourly, Weekly, etc.

Again, it all comes down to use cases, but there are use cases where refreshing more often than once an hour is actually needed.

@spireite Depending on the source, detecting changed files can be an expensive operation. Ideally you would refresh the metadata of changed datasets as you finish your ETL pipeline, for example, rather than on a set interval.

This is also why technologies like Iceberg are being integrated, to help with these types of use cases.