AWS S3 costs caused by ListBucket requests on metadata refresh

Hi,

I’m using Dremio to read data from an S3 source. Our data is partitioned at four levels (year, month, day, and hour) to improve query performance. I have noticed that the AWS costs caused by ListBucket requests increase a lot as the Metadata Refresh interval decreases or when new datasets are created.

Is there some way to reduce this cost without lowering the metadata refresh frequency too much?

I forgot to mention that I’m using Dremio version 4.7.3.
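To see why the ListBucket cost grows as the refresh interval shrinks, here is a rough back-of-the-envelope sketch. It assumes one LIST request per hourly partition directory per refresh and a price of $0.005 per 1,000 LIST requests; both numbers are assumptions for illustration, not measurements from Dremio or a quote of current AWS pricing, and the function name is made up.

```python
def estimate_list_cost(partition_dirs, refreshes_per_day, days=30,
                       price_per_1000=0.005):
    """Return (request_count, cost) for a simplified model.

    Assumption: every refresh issues one LIST request per partition
    directory. Real listing behaviour depends on the Dremio version and
    on how many objects each directory holds.
    """
    requests = partition_dirs * refreshes_per_day * days
    return requests, requests / 1000 * price_per_1000


# One year of hourly partitions: 365 days * 24 hours.
hourly_dirs = 365 * 24

# Hypothetical setup: metadata refreshed once per hour.
requests, cost = estimate_list_cost(hourly_dirs, refreshes_per_day=24)
print(f"{requests:,} LIST requests, about ${cost:,.2f} per 30 days")
```

Under this model the request count scales linearly with both the number of partition directories and the number of refreshes, so halving the refresh interval doubles the cost for each dataset.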

@eduardoslopes Dremio 19.0 should make minimal metadata requests. Please review the release notes and documentation, starting with Dremio 18.0, on how to turn on Iceberg and unlimited splits:

https://docs.dremio.com/release-notes/1800-release-notes/

Thanks, @balaji.ramaswamy!

I’m starting tests with Dremio 18.2.0 in a dev environment. Do you know if the metadata refresh improvements work for JSON-formatted tables, or only for Iceberg-formatted tables?

@eduardoslopes The new flow is only for Parquet, ORC, and Avro files.

@balaji.ramaswamy Thanks!

Can you tell me whether there is any other way to reduce these costs while continuing to use the JSON data format? Our data has complex schemas, and it is hard for us to use formats like Parquet.

@eduardoslopes I faced the same issue when promoting our CloudTrail logs bucket to a dataset. My current workaround is to run a batch job that transforms the CloudTrail JSON logs into Parquet with the Iceberg table format. That was clunky to set up, but since then the performance has been much better and the cost of ListBucket requests is lower too.

As for latency, I experience the same thing: if you don’t refresh the Dremio metadata, you also get no information about new JSON files.
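For anyone wondering what the JSON side of such a batch job might look like, here is a minimal sketch of one possible flattening step. It is not my actual job: the function names and the sample event are made up, and it only relies on the fact that a CloudTrail log file is gzip-compressed JSON with a top-level "Records" array. Nested objects become dotted column names and lists are kept as JSON strings, which is one simple way to make a complex schema easier to store in a columnar format.

```python
import gzip
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names.

    Lists are serialized to JSON strings so every row stays flat.
    """
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row


def read_cloudtrail_rows(path):
    """Yield one flat row per event in a gzip-compressed CloudTrail file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        document = json.load(handle)
    for record in document.get("Records", []):
        yield flatten(record)


# Made-up sample event, for illustration only.
sample = {
    "eventName": "GetObject",
    "userIdentity": {"type": "IAMUser", "userName": "example"},
    "resources": [{"type": "AWS::S3::Object"}],
}
flat = flatten(sample)
print(sorted(flat))
```

The flat rows can then be handed to whatever Parquet writer the batch job uses. Writing the Parquet files and registering them in an Iceberg table are left out here because they depend on the tooling you pick.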