AWS S3 costs caused by ListBucket requests on metadata refresh

eduardoslopes · October 29, 2021, 7:53pm

Hi,

I’m using Dremio to read data from an S3 source. Our data is partitioned on four levels (year, month, day, and hour) to improve query performance. I have noticed that the AWS costs caused by ListBucket requests increase considerably as the metadata refresh interval decreases or when new datasets are created.

Is there some way to reduce this cost without lowering the metadata refresh frequency too much?
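
For a rough sense of how this cost scales, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not confirmed anywhere in this thread: that each refresh issues one paginated LIST call per leaf (hour) prefix, that a LIST page holds up to 1,000 keys, and that LIST requests cost about $0.005 per 1,000 (check the current S3 pricing for your region). The function name and its parameters are hypothetical, and the real listing behavior depends on the Dremio version.

```python
import math


def estimate_list_cost(leaf_partitions, objects_per_partition, refreshes_per_day,
                       price_per_1000_requests=0.005, keys_per_page=1000):
    """Estimate daily LIST requests and their cost for one dataset.

    Assumption (not from Dremio docs): one paginated LIST per leaf prefix
    on every metadata refresh. The default price is an assumed figure.
    """
    # Each LIST page returns at most keys_per_page keys; an empty prefix
    # still costs one request.
    pages_per_partition = max(1, math.ceil(objects_per_partition / keys_per_page))
    requests_per_refresh = leaf_partitions * pages_per_partition
    requests_per_day = requests_per_refresh * refreshes_per_day
    daily_cost = requests_per_day / 1000 * price_per_1000_requests
    return requests_per_day, daily_cost


# Example: one year of hourly partitions (year/month/day/hour),
# 50 files per hour, metadata refreshed once an hour.
leaf_partitions = 365 * 24
requests_per_day, daily_cost = estimate_list_cost(leaf_partitions, 50, 24)
print(f"{requests_per_day} LIST requests/day, about ${daily_cost:.2f}/day")
```

Under these assumptions the request count grows with both the number of hour-level partitions and the refresh frequency, so halving the refresh interval doubles the estimate, and every additional dataset adds its own total.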

eduardoslopes · November 1, 2021, 12:32pm

I forgot to mention that I’m using Dremio version 4.7.3.

balaji.ramaswamy · November 7, 2021, 4:37pm

@eduardoslopes Dremio 19.0 should make minimal metadata requests. Please review the release notes and documentation, starting with Dremio 18.0, on how to turn on Iceberg and unlimited splits:

https://docs.dremio.com/release-notes/1800-release-notes/

eduardoslopes · November 8, 2021, 2:08pm

Thanks, @balaji.ramaswamy!

I’m starting tests with Dremio 18.2.0 in a dev environment. Do you know whether the metadata refresh improvements work for JSON-formatted tables, or only for Iceberg-formatted tables?

balaji.ramaswamy · November 10, 2021, 6:57am

@eduardoslopes The new flow applies only to Parquet, ORC, and Avro files.

eduardoslopes · November 10, 2021, 12:01pm

@balaji.ramaswamy Thanks!

Can you explain whether there is any other way to reduce these costs while continuing to use the JSON data format? Our data has complex schemas, and it is hard for us to use formats like Parquet.

chulucninh09 · November 10, 2021, 2:01pm

@eduardoslopes I faced the same issue when promoting our CloudTrail logs bucket to a dataset. My current workaround is to run a batch job that transforms the CloudTrail JSON logs into Parquet with the Iceberg table format. It was clunky to set up, but since then performance has been much better and the ListBucket cost is much lower too.

As for latency, I see the same thing: if you don’t refresh the Dremio metadata, you get no information about new JSON files.
