Reflection Refresh via API Problem

Hello,

I have a pipeline that runs every 5 minutes and pushes new files to Azure Blob Storage. Those containers are source PDSs in Dremio with a Raw Reflection (incremental) set up with a 1 hr refresh (configured in the UI).

The goal is to be able to see those new files (through the PDS) ideally every 5 min (as soon as they are uploaded to Azure).

Given that Dremio caches metadata for the Azure source and the Reflections are, in effect, another layer of cache, we need to refresh the metadata and the reflection programmatically as part of the pipeline.

To do this, I wrote a Python wrapper around the API to execute a metadata refresh (SQL API) and a reflection refresh (catalog API) of the PDS.
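For readers following along, here is a minimal sketch of what those two calls might look like. It is not the actual wrapper from this post. It assumes the Dremio v3 REST endpoints `POST /api/v3/sql` and `POST /api/v3/catalog/{id}/refresh`, a login token sent as `_dremio<token>` in the `Authorization` header, and hypothetical values for the base URL, PDS path and dataset ID. The functions only build the requests; nothing is sent until `send()` is called.

```python
import json
import urllib.parse
import urllib.request


def _headers(token):
    # Assumption: token obtained from the Dremio login endpoint.
    # Other deployments may expect a different Authorization scheme.
    return {
        "Authorization": f"_dremio{token}",
        "Content-Type": "application/json",
    }


def quote_path(parts):
    # Build a dotted, double-quoted dataset path, e.g. "source"."container".
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)


def build_metadata_refresh(base_url, token, pds_path_parts):
    # Metadata refresh through the SQL API.
    sql = f"ALTER PDS {quote_path(pds_path_parts)} REFRESH METADATA"
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v3/sql",
        data=body,
        method="POST",
        headers=_headers(token),
    )


def build_reflection_refresh(base_url, token, dataset_id):
    # Reflection refresh through the catalog API, keyed by the PDS catalog ID.
    quoted_id = urllib.parse.quote(dataset_id, safe="")
    return urllib.request.Request(
        f"{base_url}/api/v3/catalog/{quoted_id}/refresh",
        data=b"",
        method="POST",
        headers=_headers(token),
    )


def send(request):
    # Performs the HTTP call; raises urllib.error.HTTPError on 4xx/5xx.
    with urllib.request.urlopen(request) as response:
        return response.status, response.read()
```

In a pipeline, the metadata refresh would be sent first, and the reflection refresh only once the metadata job has finished, so that the reflection sees the newly discovered files.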

Both calls succeed and I can see them in the Job list in the UI.

For the Reflection refresh, I see 2 jobs:

  1. REFRESH REFLECTION 'e24742f9-e61f-43b5-b5b3-7712844f40cc' AS '8755176e-1f34-4cc1-9545-9bccd57149b2'
  2. LOAD MATERIALIZATION METADATA "e24742f9-e61f-43b5-b5b3-7712844f40cc"."8755176e-1f34-4cc1-9545-9bccd57149b2"

However, when I look in the Job list at the subsequent jobs that hit that PDS Reflection, the reflection appears to be 5 hours old. I can see the Reflection being used (flame icon), and next to the name of the Reflection that was used, it says "Age: 5hr 22min". Even if I weren't refreshing via the API, it should have refreshed every hour.

Any idea what’s going on? How can I tell if the reflection was actually refreshed with the latest data?

@drew Can you check at the source level or the PDS level if “never refresh” / “never expire” has been checked? Also when you run the query, do you see the latest data assuming the query is accelerated?

I confirmed that those settings were not set. It should have refreshed when I hit the API, or at least after 1 hr, per the refresh interval.

Verifying that new data is present is tricky as there is a pipeline in front of this that processes new data. It could be working, but the Dremio UI that showed a Reflection that was 5hrs old surprised me.

@drew

Can you check your Jobs page for this reflection ID? Remember to set the job type filter to Acceleration, and see whether the job ran and for some reason failed. As you said, every refresh should have 2 jobs, REFRESH REFLECTION and LOAD MATERIALIZATION; if the LOAD job is missing, the refresh is not considered complete. Alternatively, send a JSON or Parquet download of sys.reflections, sys.materializations and sys.refreshes, along with the reflection_id or dataset name I need to look for.
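The two-job check described above can be sketched in a few lines of Python. This is only an illustration: it assumes you have already collected the SQL text of the acceleration jobs (for example, copied from the Jobs page), and the function name and return shape are hypothetical.

```python
def refresh_job_status(job_sql_texts, reflection_id):
    """Report which of the two expected refresh jobs mention the reflection."""
    refresh_seen = False
    load_seen = False
    for text in job_sql_texts:
        if reflection_id not in text:
            continue  # job belongs to a different reflection
        statement = text.strip().upper()
        if statement.startswith("REFRESH REFLECTION"):
            refresh_seen = True
        elif statement.startswith("LOAD MATERIALIZATION"):
            load_seen = True
    return {
        "refresh_reflection": refresh_seen,
        "load_materialization": load_seen,
        # A refresh only counts as complete when both jobs are present.
        "complete": refresh_seen and load_seen,
    }


jobs = [
    "REFRESH REFLECTION 'e24742f9-e61f-43b5-b5b3-7712844f40cc' "
    "AS '8755176e-1f34-4cc1-9545-9bccd57149b2'",
    'LOAD MATERIALIZATION METADATA "e24742f9-e61f-43b5-b5b3-7712844f40cc"'
    '."8755176e-1f34-4cc1-9545-9bccd57149b2"',
]
status = refresh_job_status(jobs, "e24742f9-e61f-43b5-b5b3-7712844f40cc")
```

Note that this only confirms both jobs were submitted. Whether each one succeeded still has to be read from the job state on the Jobs page.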