Metadata.json remains after Iceberg vacuum statement

Hello,
I have Iceberg tables on a Cloudian S3 bucket connected to Dremio. After I execute a VACUUM statement via a Dremio SQL query, I can still see the vXX.metadata.json files on my file system, and they keep growing in size. All other files (.avro, snapshot, and data files) are correctly deleted according to the timestamp I provided.
Is there a reason for this behavior or an option to delete these files too?

E.g.:
VACUUM TABLE tablepath
EXPIRE SNAPSHOTS older_than '2024-09-20 00:00:00.000'

Thanks in advance,
Sebastian


@Sebastian Only the metadata.json files belonging to expired snapshots would be cleared. What is the timestamp on the metadata.json file you think should have been deleted?

Hi Balaji,
the timestamp of the metadata.json files I would expect to be deleted is older than the one provided in the query. The same happens if I provide a retain_last: all files except the metadata.json files are correctly deleted down to the provided count / timestamp, and only the metadata.json files remain, as in the screenshot below.

criteria: older_than '2024-07-31 11:37:00.000'
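
For reference, a retain_last variant of the same statement would look roughly like this (a sketch following the syntax above; the table path and the count of 5 are placeholders):

VACUUM TABLE tablepath
EXPIRE SNAPSHOTS retain_last 5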

@Sebastian

The EXPIRE SNAPSHOTS query doesn't clean up the metadata.json files, because they are not part of the old snapshots. There are two ways you can clean old metadata.json files:

  1. Set the Iceberg table property write.metadata.delete-after-commit.enabled. Iceberg will then remove old metadata files automatically once they exceed a given number, controlled by write.metadata.previous-versions-max (100 by default). See the Iceberg documentation on write properties for details.

  2. Use Spark's remove_orphan_files procedure, which can help to vacuum those orphan metadata files. (A sketch of both options is included below.)
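
A minimal sketch of both options in Spark SQL, assuming an Iceberg catalog named my_catalog and a table db.tbl (both hypothetical names); the retention count and the cutoff timestamp are placeholders:

-- Option 1: prune old metadata.json files automatically on each commit.
-- write.metadata.previous-versions-max controls how many versions are kept (100 by default).
ALTER TABLE my_catalog.db.tbl SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '20'
);

-- Option 2: remove files that are no longer referenced by the table metadata.
CALL my_catalog.system.remove_orphan_files(
  table => 'db.tbl',
  older_than => TIMESTAMP '2024-07-31 11:37:00.000'
);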

Hello Balaji,
thanks for your answer.
How is the removal in option 1 triggered once I have set the table properties?

@Sebastian Are you saying you have set the table property and it is not working as expected?

If the file deletion does not need to be triggered manually, then yes: I set the properties as above and still have standalone metadata.json files that exceed the limit of 20 specified in the properties.

@Sebastian How many days do those 20 metadata.json files span? Can you add a listing?

Hello Balaji,
the active snapshots span about two months. I ran the last OPTIMIZE & VACUUM statements at the beginning of October, resulting in the inactive metadata.json files below.

[screenshot: listing of the remaining metadata.json files]

Any chance you can send the output of the below commands from Spark SQL?

show create table <table_name>

describe formatted table <table_name>

I am only using Dremio SQL to work with Iceberg tables, here are the results from running these queries in Dremio (my table is a copy of sys.recent_jobs that I am using to test Iceberg functions).

show create table TEST."ICEBERG_PARTITION"."iceberg_test_jobs";

path sql_definition
[TEST.ICEBERG_PARTITION.iceberg_test_jobs] SELECT * FROM TEST.ICEBERG_PARTITION.iceberg_test_jobs

describe table TEST."ICEBERG_PARTITION"."iceberg_test_jobs";

COLUMN_NAME DATA_TYPE IS_NULLABLE NUMERIC_PRECISION NUMERIC_SCALE EXTENDED_PROPERTIES MASKING_POLICY SORT_ORDER_PRIORITY
job_id CHARACTER VARYING YES
status CHARACTER VARYING YES
query_type CHARACTER VARYING YES
user_name CHARACTER VARYING YES
queried_datasets CHARACTER VARYING YES
scanned_datasets CHARACTER VARYING YES
attempt_count INTEGER YES 32 0
submitted_ts TIMESTAMP YES
attempt_started_ts TIMESTAMP YES
metadata_retrieval_ts TIMESTAMP YES
planning_start_ts TIMESTAMP YES
query_enqueued_ts TIMESTAMP YES
engine_start_ts TIMESTAMP YES
execution_planning_ts TIMESTAMP YES
execution_start_ts TIMESTAMP YES
final_state_ts TIMESTAMP YES
submitted_epoch_millis BIGINT YES 64 0
attempt_started_epoch_millis BIGINT YES 64 0
metadata_retrieval_epoch_millis BIGINT YES 64 0
planning_start_epoch_millis BIGINT YES 64 0
query_enqueued_epoch_millis BIGINT YES 64 0
engine_start_epoch_millis BIGINT YES 64 0
execution_planning_epoch_millis BIGINT YES 64 0
execution_start_epoch_millis BIGINT YES 64 0
final_state_epoch_millis BIGINT YES 64 0
planner_estimated_cost DOUBLE YES 53
rows_scanned BIGINT YES 64 0
bytes_scanned BIGINT YES 64 0
rows_returned BIGINT YES 64 0
bytes_returned BIGINT YES 64 0
accelerated BOOLEAN YES
queue_name CHARACTER VARYING YES
engine CHARACTER VARYING YES
error_msg CHARACTER VARYING YES
query CHARACTER VARYING YES
is_profile_incomplete BOOLEAN YES
execution_allocated_bytes BIGINT YES 64 0
execution_cpu_time_millis BIGINT YES 64 0

@balaji.ramaswamy We updated to 25.1.1, and now I get the opposite result when running OPTIMIZE and VACUUM.
The metadata.json files are now correctly deleted, but the snapshot and data files remain in the S3 storage.
In the screenshot below, the last valid snapshots after VACUUM are v38 (left) and v41 (right). I changed nothing in the code or table properties we discussed earlier. Has anything changed with the new Dremio version?
And is there an option to remove legacy / outdated / invalid Iceberg files from the storage using Dremio SQL (like Spark's remove_orphan_files)?

@Sebastian There is a MIN_SNAPSHOTS_TO_KEEP = 5. AFAIK, metadata.json will not get cleaned by the VACUUM command in the new version; let me check if there is a change in behavior.