Hello,
I have Iceberg tables on a Cloudian S3 bucket connected with Dremio. After I execute a VACUUM statement via a Dremio SQL query, I can still see the vXX.metadata.json files on my file system, and they keep growing in size. All other files (.avro, snapshot, and data files) are correctly deleted according to the timestamp I provided.
Is there a reason for this behavior, or an option to delete these files too?
E.g.:
VACUUM TABLE tablepath
EXPIRE SNAPSHOTS older_than '2024-09-20 00:00:00.000'
Thanks in advance,
Sebastian
@Sebastian Only the metadata.json files belonging to expired snapshots would be cleared. What is the timestamp on the metadata.json file that you think should be deleted?
Hi Balaji,
the timestamp of the metadata.json files I would expect to be deleted is older than the one provided in the query. The same occurs if I provide a retain_last: all files except the metadata.json files are correctly deleted up to the provided count / timestamp, and only the metadata.json files remain, as in the screenshot below.
criteria: older_than '2024-07-31 11:37:00.000'
@Sebastian
The EXPIRE SNAPSHOTS query doesn't clean up the metadata.json files, because they are not part of the old snapshots. There are two ways you can clean up old metadata.json files (see the sketch below):
- Set the Iceberg table property write.metadata.delete-after-commit.enabled. Then Iceberg will remove old metadata files once their number exceeds a given limit, 100 by default (controlled by write.metadata.previous-versions-max; see the Iceberg table properties documentation for more).
- Use Spark's RemoveOrphanFiles (remove_orphan_files) procedure, which can help to vacuum those orphaned metadata files.
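For illustration, a minimal sketch of both options run from Spark SQL against an Iceberg catalog. The catalog name my_catalog, the table identifier db.tbl, the limit of 20 previous versions, and the cutoff timestamp are placeholders; adjust them to your environment.

-- Option 1: prune old metadata.json files automatically on each new table commit
ALTER TABLE my_catalog.db.tbl SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '20'
);

-- Option 2: remove files no longer referenced by the table metadata
CALL my_catalog.system.remove_orphan_files(
  table => 'db.tbl',
  older_than => TIMESTAMP '2024-09-20 00:00:00.000'
);

Note that option 1 only takes effect on subsequent commits to the table, since old metadata files are deleted as part of each new commit.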
Hello Balaji,
thanks for your answer.
How is the removal triggered in option 1 once I have set the table properties?
@Sebastian Are you saying you have set the table property and it is not working as expected?
If I do not have to trigger the file deletion separately, then yes: I have set the properties as above and still see standalone metadata.json files that exceed the limit of 20 specified in the properties.