In my team we deal with a lot of Parquet files (~100k files, ~250GB, and growing). The data is stored on a local disk as well as on S3, but we have always used the local disk since we run a single Dremio instance. We are now considering using S3 as the data source, since we need multiple Dremio instances (on different environments, not multinode installations) to access the data. However, we realized that performance on S3 is terrible, especially when first mapping the objects or updating the metadata (about 10 minutes, compared to a few seconds on disk).
We also gave MinIO a try: it was slower than disk, but still much faster than AWS S3.
While mapping S3 objects, none of the resources is anywhere near its limit (CPU, memory, network…).
We wonder why S3 discovery and metadata updates are so slow compared to both local disk and MinIO (which emulates the same S3 API), and whether there are settings we can tweak.
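For context, here is a back-of-the-envelope sketch of where we think the time could be going. S3's ListObjectsV2 returns at most 1,000 keys per page and pagination is sequential, and reading each Parquet footer is a separate GET with full network round-trip latency. The latency and parallelism numbers below are purely assumed for illustration (we don't know Dremio's internal concurrency), but they land in the same order of magnitude as the ~10 minutes we observe:

```python
import math

NUM_FILES = 100_000
KEYS_PER_LIST = 1_000     # ListObjectsV2 page size (API maximum)
LIST_RTT_S = 0.1          # assumed round-trip per LIST call (~100 ms)
FOOTER_RTT_S = 0.05       # assumed round-trip per Parquet footer GET (~50 ms)
PARALLELISM = 16          # hypothetical number of concurrent footer readers

# Pagination is sequential: each page's continuation token comes from the previous one.
list_calls = math.ceil(NUM_FILES / KEYS_PER_LIST)
list_time_s = list_calls * LIST_RTT_S

# Footer reads can be issued concurrently, so divide by the assumed parallelism.
footer_time_s = NUM_FILES * FOOTER_RTT_S / PARALLELISM

print(f"LIST pagination: {list_calls} calls, ~{list_time_s:.0f} s")
print(f"Footer reads:    ~{footer_time_s / 60:.1f} min at parallelism {PARALLELISM}")
```

With these assumptions, listing alone is cheap (~10 s), but 100k individual footer GETs dominate (~5 minutes), which would also explain why MinIO on a local network (sub-millisecond round-trips) is much faster than AWS S3 despite speaking the same API.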