Looking at using dremio to query S3. I have a folder which pyspark writes files into. The path of the folder is always constant - so I set dremio up to query it as a “directory” as per https://docs.dremio.com/data-sources/files-and-directories.html.
The first time, it works fine. However, when I run the job again, pyspark writes another file into the folder to replace the old one (with a slightly different name), and when I go to query it again with Dremio I get a "File Not Found" error.
I would have thought that dremio would just query whatever files are in the containing directory, whatever the name?
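To illustrate what is likely happening (this is a sketch of the general pattern, not Dremio's internals): Spark's overwrite mode deletes the old part files and writes new ones with freshly generated names, so any cached reference to the old file name goes stale. A minimal local-filesystem simulation, with no Spark or S3 involved:

```python
import tempfile
import uuid
from pathlib import Path

def overwrite_output(folder: Path) -> Path:
    """Mimic Spark's 'overwrite' save mode: delete the existing part
    files, then write a new part file with a generated name."""
    for old in folder.glob("part-*"):
        old.unlink()
    new_file = folder / f"part-{uuid.uuid4().hex}.parquet"
    new_file.write_text("new data")
    return new_file

folder = Path(tempfile.mkdtemp())
first = overwrite_output(folder)    # first job run
cached = first                      # a query engine caches this exact path

second = overwrite_output(folder)   # second run replaces the file

print(cached.exists())   # False - the cached path is gone -> "File Not Found"
print(second.exists())   # True  - re-listing the folder would find the new file
```

Re-listing the directory (i.e. refreshing metadata) finds the new file, which matches the behaviour described in the replies below.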
Dremio caches metadata for performance reasons, and the default is to refresh it every hour. If you edit the source, you will need to expand Advanced Options (we are working on making the source configuration easier to use), where you can configure the metadata caching options.
We usually recommend adding new files over time.
Thank you - it looks like the smallest refresh interval is one hour? Can it be faster?
You can go down to 1 minute as the smallest refresh interval - note that, depending on the size of your S3 data and the network between Dremio and S3, a metadata refresh can take some time.
Hi again, thanks for your help - I'm having trouble finding the advanced options. Can you post some screenshots showing how to get there?
Sure, you need to click on Show Advanced Options, shown here with an arrow:
Hey - thanks so much, that worked a treat!