Reading S3 directory file not found error

Hey all,

Looking at using Dremio to query S3. I have a folder that PySpark writes files into. The path of the folder is always constant, so I set Dremio up to query it as a “directory” as per https://docs.dremio.com/data-sources/files-and-directories.html.

The first time, it works fine. However, when I run the job again, PySpark writes another file into the folder to replace the old one (with a slightly different name), and when I go to query it again with Dremio I get a "File Not Found" error.

I would have thought that Dremio would just query whatever files are in the containing directory, regardless of their names?
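
For reference, the write is roughly like this (bucket and folder names are made up, and I'm sketching from memory). I'm using overwrite mode, which as far as I know deletes the old part files and writes new ones with generated names - which would explain why the file name is slightly different on every run:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-writer").getOrCreate()
df = spark.range(100)  # stand-in for the real job's output

# Same constant folder every run, but "overwrite" deletes the previous
# part files and writes new ones with generated names like
# part-00000-<uuid>.snappy.parquet.
df.write.mode("overwrite").parquet("s3a://my-bucket/my-folder/")
```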

Hi,

Dremio caches metadata for performance reasons, and the default is to refresh every hour. If you edit the source, you will need to expand Advanced Options (we are working on making source configuration easier to use), where you can configure the metadata caching options.

(screenshot: the metadata caching settings under Advanced Options)

We usually recommend adding new files over time rather than replacing existing ones.
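
For example, something like this (a rough sketch with a made-up bucket/folder) keeps the existing part files in place, so the paths Dremio has already cached stay valid and only the new files need to be discovered on the next refresh:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-writer").getOrCreate()
df = spark.range(100)  # stand-in for the real job's output

# "append" adds new part files without deleting the old ones, so the
# files Dremio's cached metadata points at remain on S3.
df.write.mode("append").parquet("s3a://my-bucket/my-folder/")
```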

Thank you - it looks like you can only refresh every hour? Can it be any faster?

Thanks,
Tim

You can go down to 1 minute as the smallest refresh interval - note that, depending on the size of your S3 data and the network between Dremio and S3, a metadata refresh can take some time.
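
Depending on your Dremio version, you may also be able to force a refresh on demand with SQL (ALTER PDS ... REFRESH METADATA) rather than waiting for the interval - please check the docs for your version. Here is a rough sketch from Python over ODBC, assuming a configured "Dremio" DSN and a made-up source/bucket/folder path:

```python
import pyodbc

# Hypothetical DSN name; requires the Dremio ODBC driver to be set up.
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cur = conn.cursor()

# Version-dependent: forces Dremio to rescan the directory instead of
# waiting for the scheduled metadata refresh.
cur.execute('ALTER PDS "s3"."my-bucket"."my-folder" REFRESH METADATA')
conn.close()
```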

Hi again, thanks for your help - I'm having trouble finding the advanced options. Can you post some screenshots of how to get there?

Sure, you need to click on Show Advanced Options, shown here with an arrow:

(screenshot: the source dialog with Show Advanced Options indicated by an arrow)


Hey - thanks so much, that worked a treat!