Dremio S3 metadata storage

Hi, we are auditing our S3 costs connected to Dremio usage. Looking into S3, we found that >90% of the costs go to an S3 bucket created by Dremio:
like dremio-me-.....e4bfc8263e2

Analyzing its folders, I found that the biggest folder is the metadata folder, which takes 99% of the total space. (I stopped scanning after reaching 7 TB and >1.3M objects, but I guess it is close to the whole size of the bucket, which is >20 TB.)

My question is: can I control the size of this bucket/folder with any settings? Can I (or do I need to) do any cleanup there?
Can I identify which datasets contribute more to its size than others?

Build
25.0.7-202407181421320869-2632b04f
Edition
AWS Edition (activated)

@vladislav-stolyarov

The Dremio version you are on should have Iceberg snapshot cleanup. Are your files in Parquet or Iceberg format?

If Iceberg, you can run the VACUUM TABLE command to expire snapshots.
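For example, a minimal sketch, assuming the VACUUM TABLE ... EXPIRE SNAPSHOTS syntax of recent Dremio releases (the table path, timestamp, and retention value here are only placeholders):

-- hypothetical Iceberg table path; adjust the cutoff and retention to your needs
VACUUM TABLE "glue_source"."my_db"."my_iceberg_table"
EXPIRE SNAPSHOTS older_than '2024-10-01 00:00:00.000' retain_last 5;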

If Parquet, then every time you run "ALTER PDS REFRESH METADATA" (or your background refresh runs), old snapshots should be cleaned up. By default 8 days are kept, along with the defaults below:

MIN_SNAPSHOTS_TO_KEEP = 5;
MAX_NUM_SNAPSHOTS_TO_EXPIRE = 15;
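For reference, a manual refresh of a single table looks roughly like this (the path is hypothetical; both the ALTER PDS and ALTER TABLE forms should be accepted):

ALTER PDS "glue_source"."my_db"."my_parquet_table" REFRESH METADATA;
-- equivalent newer syntax
ALTER TABLE "glue_source"."my_db"."my_parquet_table" REFRESH METADATA;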

Also, for both Iceberg and Parquet, metadata.json files will not all get cleaned up:

For Parquet, the last 100 metadata.json files will be kept; this cannot be changed.

For Iceberg, the last 100 copies will also be kept, but this can be overridden by setting the table property (not a support key) write.metadata.delete-after-commit.enabled.
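A sketch of what that could look like, assuming your Dremio version supports SET TBLPROPERTIES on Iceberg tables (the table path is hypothetical, and write.metadata.previous-versions-max is the related standard Iceberg property that controls how many old metadata files are retained):

ALTER TABLE "glue_source"."my_db"."my_iceberg_table" SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '100'
);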

If there is a certain PDS you no longer need, you can run ALTER PDS <PDS_NAME> FORGET METADATA; this will remove everything for that table.
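For example (the path is hypothetical):

ALTER PDS "glue_source"."my_db"."unused_table" FORGET METADATA;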

Our datasets are:

  1. A Glue database with partitioned S3 Parquet-based Glue tables and 1-2 Iceberg tables.
  2. 1 MS SQL and 1 PostgreSQL database.
  3. Native Dremio S3 datasets.
    Querying sys.tables shows 1400 tables.

I am wondering whether metadata storage can explode because of:

  1. In case of the relational database sources, we use 10-15 tables in our virtual datasets, while the total number of tables + views is >200 per database. Does it query/store metadata for too many objects? Or is relational database metadata lightweight?
  2. In case of the Glue data source, the same: we use a small number of tables, while the Glue database contains twenty times more tables.

P.S.: I am just trying to understand why the metadata folder of Dremio's internal S3 bucket contains 20 TB, which is more than the size of all datasets attached to Dremio described above.

@vladislav-stolyarov

RDBMS metadata does not go to S3; it goes to the local disk on your coordinator. Only Parquet metadata goes to S3.

It most likely looks like all Glue tables are getting refreshed. How many entries does information_schema."tables" contain?

Do you mean how many of the entries are Glue tables?

Filtering with where TABLE_SCHEMA like 'Aws glue%' gives 346 tables.

The overall count of records in information_schema."tables" is 1701, but as far as I can see it contains VDS entries as well.
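For reference, a per-schema breakdown of tables can be pulled with something like the query below (column names as exposed by Dremio's information_schema; adjust the filter as needed):

SELECT TABLE_SCHEMA, COUNT(*) AS table_count
FROM information_schema."tables"
WHERE TABLE_TYPE <> 'VIEW'
GROUP BY TABLE_SCHEMA
ORDER BY 2 DESC;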

@vladislav-stolyarov information_schema."tables" will not have views. 346 tables is not a lot. Can you send us the biggest folder (zipped) under the metadata folder? Probably too many snapshots due to too many refreshes?

The information_schema."tables" table has a TABLE_TYPE column that does contain 'view' records. Maybe you meant sys.tables.

Probably too many snapshots due to too many refreshes?

Metadata refreshes - yes, we do metadata refreshes programmatically many times per day.

Under the dremio/metadata folder there are >200 subfolders with a {guid} as their name. I guess each such subfolder refers to one dataset?

I am not sure it is feasible at the moment to send the whole folder. For example, one of the biggest folders is 500 GB; even compressed it would take >10 GB. Maybe I can simply send you some examples from it.

For example, the dremio/metadata/7f34b55e-5e67-47dc-b404-f7bd95f53ffb/metadata/ folder.
It contains 14523 files. It seems the majority are JSON files, plus some Avro files.

The earliest file is from 2023. I can see that new files are generated every 2-3 hours, and sometimes there are bigger gaps of 5 hours to 1-2 days. Each new file is larger than the previous version. The Avro files are all small, so I'll neglect them for now.

What is strange is that I found a very big time gap in the JSON file sequence.

This is an example file dated July 31, 2024:
12346-627e7328-dae3-4d7b-ac1e-216c906ee3f8.metadata.json

and the next JSON file is
18105-4cf727dd-ea8c-4045-b6fb-360e123f83d6.metadata.json
and it is dated October 24, 2024.

I will attach both files. What is interesting is that inside each file there is:
"table-uuid" : "1c4b902e-839b-4fab-85d5-ed839ec76a0c",
And I tried to query this ID in sys.tables, with no luck finding anything.
Also, the last JSON file is dated October 25.

From the JSON I can see that it is a regular Parquet Glue table.

P.S.: In that time period in October, I modified the Glue data source by changing the Allowed Databases property, and it warned me that something would be recreated.

metadata.zip (3.9 MB)

@vladislav-stolyarov Yes, your metadata.json is large. Do you have complex types like STRUCT or ARRAY in your schema?

No, in general only simple types, but this exact dataset has 2 list columns. The files I attached contain the schema.
Am I right that one folder contains all metadata files for a single dataset? The folder I attached the example files from,
dremio/metadata/7f34b55e-5e67-47dc-b404-f7bd95f53ffb/metadata/
contains the metadata files for a single Glue table.
All JSON files have an integer prefix that makes them consecutive, and each new file is about 4 KB larger than the previous one, which makes me think that each file contains all the information from the previous one. So the folder has more than 10k metadata JSON files with 500 GB of total size, while the Glue table itself is as small as 1-2 GB. That's a bit weird. Why does Dremio need so many historical files... or is something broken in the cleanup process?
For the experiment, I took 2 sequential files from 2023 (because they are more or less small compared to the 70 MB files now) and compared them using merge tools.
01804-66b929cd-0e45-4fcf-9de7-446307433020.metadata.json
and
01805-e8f37565-8c96-40e4-ada7-3d9a33318453.metadata.json

What I found:
last-updated-ms and last-column-id have changed; schema.schema-id has changed.
And now the very curious part - the IDs of the two last LIST columns have changed in the schema element for some reason, and judging by their values, these IDs change on every metadata refresh regardless.


Also, the schemas array is extended with a new copy of the schema.

current-snapshot-id is changed, and a new entry is added to the snapshots array and to snapshot-log.

And the metadata-log is extended with a new record pointing to the previous metadata JSON file. BUT the first record in the log is removed, and it looks like it preserves the 100 latest items in the array. This correlates with what you said about 100 metadata.json files. BUT
the entry it removed was

"metadata-file" : "s3://dremio-me-047e62a8-7fab-4b4d-825e-3f022ee4e3c4-6e18fc350d1989ab/dremio/metadata/7f34b55e-5e67-47dc-b404-f7bd95f53ffb/metadata/01704-ff5eb5de-ba41-408f-912b-5e86e9bf66c4.metadata.json"

And 01704-ff5eb5de-ba41-408f-912b-5e86e9bf66c4.metadata.json still exists, which means metadata files are not cleaned up.

So the issues I can see from this:

  1. Metadata JSON files above the 100 count were not deleted and accumulated.
  2. LIST columns caused a schema change on every metadata refresh for no reason.
  3. The metadata JSON file keeps the whole history of schema changes, making the file huge over time.
  4. The same goes for snapshots.
    Important note - all those files are from 2023; we were using an older Dremio version at that time (maybe 24, not sure).

UPDATE: I have checked the latest files in the sequence and can confirm that indeed the last 100 metadata files are preserved, so it seems there was a bug at some point in time and all those thousands of files are orphans that accumulated over time.
BUT in the latest files I can still see that the snapshots array keeps growing endlessly, unlike the metadata-log array that keeps only the last 100 records, and this keeps the JSON file huge even for datasets that have no LIST columns and no huge schemas array.

Is there any feedback?
We have been paying 200 USD every month (for at least the last year) for those extra TBs of data occupied by thousands of metadata.json files above the 100-file limit per dataset. How can I fix/remove the orphan metadata JSON files per dataset?

@vladislav-stolyarov it looks like there are 2 issues:

  • Large metadata.json due to STRUCT/LIST columns - this issue is fixed and will be part of the next CE release
  • Are you seeing more than 100 metadata.json files for the same table?

Yes, exactly: thousands of files instead of 100 for older datasets.
As I wrote above, it looks like there was a bug in older versions that is fixed now, because now there is a gap in the sequential metadata JSON files and only the last 100 are kept as expected - so at some time in the past the 101st file was not removed and files kept accumulating; after an upgrade Dremio started deleting the 101st file, but of course it never cleaned up the older ones.

@vladislav-stolyarov Looks like the old ones have to be cleaned manually. You can take a backup and clean them; worst case, if there is a reference, we can restore. After a few weeks of the cluster running without issues, you can remove the backup you took.

Should I only remove the JSON files in each folder beyond the last 100 files (ordered by filename/date)?
Also, is there a way to map a JSON file to a Dremio dataset? There are two IDs I can find, but it seems neither of them equals table_id in sys.tables:
the GUID inside the folder S3 path dremio/metadata/7f34b55e-5e67-47dc-b404-f7bd95f53ffb/metadata/
and the GUID inside metadata.json: "table-uuid" : "1c4b902e-839b-4fab-85d5-ed839ec76a0c"
It would be helpful to be able to map a folder to a table.

And two other questions:
Wouldn't ALTER TABLE FORGET METADATA on an existing Glue table remove all files, including the old orphans? If yes, then I could execute it for all tables in my catalog to do a full reset (see the sketch below).
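If it works that way, the statements could be generated with something like this sketch (it assumes no database or table names contain literal dots, and the output should be reviewed before running it):

-- generate one FORGET METADATA statement per promoted Glue table
SELECT 'ALTER PDS "' || REPLACE(TABLE_SCHEMA, '.', '"."') || '"."' || TABLE_NAME || '" FORGET METADATA;' AS stmt
FROM information_schema."tables"
WHERE TABLE_SCHEMA LIKE 'Aws glue%'
  AND TABLE_TYPE = 'TABLE';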

And the second question I already mentioned above: in the Glue data source settings you can change Allowed Databases to filter databases. Changing this setting shows the following warning:

Will clicking Yes clean up all existing metadata folders? The problem is that I did this recently for my Glue data source and the metadata S3 bucket size hasn't dropped, BUT it seems new folders have appeared in the metadata bucket.

@vladislav-stolyarov Yes, FORGET METADATA should clean up the folders that are part of that PDS. From the profile you can backtrack on S3 to find which UUID belongs to a dataset, but not the other way around.

What if I simply delete huge folders from the metadata folder, e.g. dremio/metadata/0c572946-e6a8-4af7-86c4-72e60812b3ee? Will that break Dremio, or will it identify that the metadata folder for the dataset is missing and create a new one (initialize metadata)?