Hello,
I have Iceberg tables on a Cloudian S3 bucket connected to Dremio. After I execute a VACUUM statement via a Dremio SQL query, I can still see the vXX.metadata.json files on my file system, and they keep growing in size. All other files (.avro, snapshot, and data files) are correctly deleted according to the timestamp I provided.
Is there a reason for this behavior or an option to delete these files too?
E.g.:
VACUUM TABLE tablepath
EXPIRE SNAPSHOTS older_than '2024-09-20 00:00:00.000'
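For reference, a sketch of the two expiry criteria Dremio's VACUUM TABLE supports (tablepath is a placeholder for the actual table path; adjust the values to your retention policy):

```sql
-- Expire snapshots older than a given timestamp (as above)
VACUUM TABLE tablepath
EXPIRE SNAPSHOTS older_than '2024-09-20 00:00:00.000';

-- Alternatively, keep only the N most recent snapshots
VACUUM TABLE tablepath
EXPIRE SNAPSHOTS retain_last 10;
```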
Thanks in advance,
Sebastian
@Sebastian Only the metadata.json files for expired snapshots would be cleared. What is the timestamp on the metadata.json file you think should be deleted?
Hi Balaji,
the timestamp of the metadata.json files I would expect to be deleted is older than the one provided in the query. The same occurs if I provide a retain_last: all files except the metadata.json files are correctly deleted up to the provided amount / timestamp, and only the metadata.json files remain, as in the screenshot below.
criteria: older_than '2024-07-31 11:37:00.000'
@Sebastian
The ExpireSnapshots query doesn't clean up the metadata.json files, because they are not part of the old snapshots. There are two ways you can clean up old metadata.json files:
- Set the Iceberg table property write.metadata.delete-after-commit.enabled. Iceberg will then remove old metadata files whenever their number exceeds a given limit (100 by default). (More reading.)
- Use Spark's RemoveOrphanFiles query, which can help vacuum those orphaned metadata files.
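A sketch of both options (the table and catalog names are placeholders; the property keys are standard Iceberg table properties, and remove_orphan_files is Iceberg's Spark procedure):

```sql
-- Option 1: let Iceberg delete old metadata.json files on commit
ALTER TABLE db.tbl SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '20'  -- keep at most 20 previous versions
);

-- Option 2 (Spark SQL): remove files no longer referenced by the table metadata
CALL my_catalog.system.remove_orphan_files(
  table => 'db.tbl',
  older_than => TIMESTAMP '2024-09-20 00:00:00.000'
);
```

Note that option 1 only takes effect on subsequent commits to the table, so already-accumulated metadata.json files may need a one-time cleanup via option 2.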
Hello Balaji,
thanks for your answer.
How is the removal triggered in option 1 once I have set the table properties?
@Sebastian Are you saying you have set the table property and it is not working as expected?
If the file deletion does not need to be triggered manually, then yes: I set the properties as above and still have standalone metadata.json files, which exceed the limit of 20 specified in the properties.
@Sebastian How many days do those 20 metadata.json files span? Can you add a listing?
Hello Balaji,
the active snapshots span about two months. I ran the last OPTIMIZE & VACUUM statements at the beginning of October, resulting in the inactive metadata.json files below.
Any chance you can send the output of the below commands from Spark SQL?
show create table <table_name>
describe formatted table <table_name>
I am only using Dremio SQL to work with Iceberg tables, here are the results from running these queries in Dremio (my table is a copy of sys.recent_jobs that I am using to test Iceberg functions).
show create table TEST."ICEBERG_PARTITION"."iceberg_test_jobs";

path | sql_definition
[TEST.ICEBERG_PARTITION.iceberg_test_jobs] | SELECT * FROM TEST.ICEBERG_PARTITION.iceberg_test_jobs
describe table TEST."ICEBERG_PARTITION"."iceberg_test_jobs";

COLUMN_NAME | DATA_TYPE | IS_NULLABLE | NUMERIC_PRECISION | NUMERIC_SCALE
job_id | CHARACTER VARYING | YES | |
status | CHARACTER VARYING | YES | |
query_type | CHARACTER VARYING | YES | |
user_name | CHARACTER VARYING | YES | |
queried_datasets | CHARACTER VARYING | YES | |
scanned_datasets | CHARACTER VARYING | YES | |
attempt_count | INTEGER | YES | 32 | 0
submitted_ts | TIMESTAMP | YES | |
attempt_started_ts | TIMESTAMP | YES | |
metadata_retrieval_ts | TIMESTAMP | YES | |
planning_start_ts | TIMESTAMP | YES | |
query_enqueued_ts | TIMESTAMP | YES | |
engine_start_ts | TIMESTAMP | YES | |
execution_planning_ts | TIMESTAMP | YES | |
execution_start_ts | TIMESTAMP | YES | |
final_state_ts | TIMESTAMP | YES | |
submitted_epoch_millis | BIGINT | YES | 64 | 0
attempt_started_epoch_millis | BIGINT | YES | 64 | 0
metadata_retrieval_epoch_millis | BIGINT | YES | 64 | 0
planning_start_epoch_millis | BIGINT | YES | 64 | 0
query_enqueued_epoch_millis | BIGINT | YES | 64 | 0
engine_start_epoch_millis | BIGINT | YES | 64 | 0
execution_planning_epoch_millis | BIGINT | YES | 64 | 0
execution_start_epoch_millis | BIGINT | YES | 64 | 0
final_state_epoch_millis | BIGINT | YES | 64 | 0
planner_estimated_cost | DOUBLE | YES | 53 |
rows_scanned | BIGINT | YES | 64 | 0
bytes_scanned | BIGINT | YES | 64 | 0
rows_returned | BIGINT | YES | 64 | 0
bytes_returned | BIGINT | YES | 64 | 0
accelerated | BOOLEAN | YES | |
queue_name | CHARACTER VARYING | YES | |
engine | CHARACTER VARYING | YES | |
error_msg | CHARACTER VARYING | YES | |
query | CHARACTER VARYING | YES | |
is_profile_incomplete | BOOLEAN | YES | |
execution_allocated_bytes | BIGINT | YES | 64 | 0
execution_cpu_time_millis | BIGINT | YES | 64 | 0

(The EXTENDED_PROPERTIES, MASKING_POLICY, and SORT_ORDER_PRIORITY columns are empty for every row and are omitted here.)
@balaji.ramaswamy We updated to 25.1.1, and now I get the opposite result when running OPTIMIZE and VACUUM.
The metadata.json files are now correctly deleted, but the snapshot and data files remain in the S3 storage.
In the screenshot below, the last valid snapshots after VACUUM are v38 (left) and v41 (right). I changed nothing in the code or table properties we discussed earlier. Has anything changed with the new Dremio version?
And is there an option to remove legacy / outdated / invalid Iceberg files from the storage using Dremio SQL (like Spark's "removeOrphanFiles")?
@Sebastian There is a MIN_SNAPSHOTS_TO_KEEP = 5 setting.
AFAIK, the metadata.json files will not get cleaned by the VACUUM command in the new version; let me check whether there is a change in behavior.