OPTIMIZE and VACUUM table cannot reduce iceberg metadata file size

Hi Team,

I have an iceberg table called table_ingestion_log which gets updated frequently.

I have it in my Minio Storage and I observe that it’s size grows rapidly.

I tried using OPTIMIZE TABLE once a day, but then it is not working. It’s size still keeps increasing.

My only recourse is to RECREATE table once in a while, so that it will lose all metadata (just keeping the size small)

Below shows the size of the table a few minutes after recreation. You can see that it’s the metadata size increases and a lot of small files get created (which I don’t want)

minio@minio-prod:~$ du -sh /mnt/data/finance/ctrl/table_ingestion_log/*
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32882-9839-1f6c-6e8c-15d9ffe1fd00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32884-6f8b-ae6a-1ff3-ed11077c5500
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32888-6fe1-a694-d415-f2460ffea700
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32889-c044-9d76-bf6d-0cc0aecbcf00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a3288e-1a56-bf84-aa60-f4e96d4cb500
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32891-4fe8-5585-1f54-50117d57aa00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32895-cb96-eb8d-1b2e-4f0a586bad00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32897-4e83-b94a-8167-327fc3d8bc00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a3289a-8081-1902-925f-056f87eac100
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a3289c-c5e4-7fb6-6ee9-3f8fee5a1300
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a328a0-ae1b-937d-9ae3-d8edc05f3f00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a328a3-4b17-e18a-6621-b58fdd4c6b00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a328a7-2e7d-498d-93b8-2195b9b7cf00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a328aa-44c4-34d5-a823-b9db057d4700
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a328ad-dbfe-8fed-cb07-1a3d1be4ec00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a328af-665c-61fa-46ca-d55a5ec64a00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a328b1-56ee-d7f2-4cfa-20d039a03300
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a328b3-14e1-c4f4-3948-6eae52b7d800
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a3293e-722b-c337-b63a-fd06b51c1200
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32947-4d5c-a043-6911-afa02399cd00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329aa-12eb-37e4-57ed-476f1b37b800
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329ab-444b-0d7e-0bda-6f849acd9400
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329ad-aa6f-5d81-219c-24f127a6a800
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329af-4d35-4f13-7a9b-7589d8990d00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329b1-a5e5-d742-3a54-92ce40ec5300
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329b2-909d-d1e0-40b5-4f2880ce2900
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329b7-55cc-d3db-7396-d95a19940200
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329bb-771c-6a86-4590-ea461976b400
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329bf-c289-684a-8e91-7a587b49ad00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329c2-6845-9bd0-fd88-b79b5a816800
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329c4-0302-9da2-b862-153c0c7d9b00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329c5-c685-492b-e040-75c60ca76d00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329cb-80f6-cee1-b615-233cd09e3700
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329cc-271c-adaa-8561-d2bdef301700
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329cd-568b-d942-f1c9-542b429d7300
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329d0-690c-6709-659f-9c3bf1e61300
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329d1-8c97-97c8-e835-39ec9b567d00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329d6-a736-6e1c-b88e-60b24917dd00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329d9-4c40-5e6d-0e4b-506842150c00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329e5-60b4-295e-548b-c0dfe4436c00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329f9-e348-fdbc-7151-0fb6b011fc00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a329fa-d646-0475-670a-57fddad2ad00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32ade-905b-5242-a0d6-bbf974aefd00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32ade-b50e-5cff-2a0a-ff1362ef1600
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32ae1-452a-fadd-74d7-3778bebbf300
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32ae2-7f77-f80b-7e04-a610b1552400
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32ae4-4a5e-9db2-14fb-64054f826300
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32ae4-e4e2-dc69-1775-e548ddb37000
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32ae7-169f-0c1b-770b-bbdbfc406500
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32ae8-7102-ef9c-9b6a-955ea6d69d00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32aea-77b8-5057-627e-74fb1800bc00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32aea-a4a9-8792-974c-0b6c9f7ef100
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32aed-1e70-479d-222a-3f15b649d000
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32aee-79a2-930a-5b68-3d2b06956300
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32aef-aa94-d1a2-1224-95c072ca4b00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32af1-2dbf-4468-09fc-7fdbed1e0700
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32af3-3f16-ff30-a04a-936dfe365b00
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32af3-daf1-901d-0469-491b5ce5e400
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32b77-9dea-e2c1-46a2-b4ee245cb700
12K	/mnt/data/finance/ctrl/table_ingestion_log/18a32b81-78de-3fa3-9169-90bf3c762c00
100K	/mnt/data/finance/ctrl/table_ingestion_log/18a32bd6-e86d-d4b3-b3bb-d360fd637c00
1.7M	/mnt/data/finance/ctrl/table_ingestion_log/metadata

and this is my table properties

SHOW TBLPROPERTIES Minio.finance.ctrl.table_ingestion_log;

I have read through this page (OPTIMIZE TABLE | Dremio Documentation) and tried all these commands.

ALTER TABLE Minio.finance.ctrl.table_ingestion_log  SET TBLPROPERTIES (
            'history.expire.min-snapshots-to-keep' = '0',
           'history.expire.max-snapshot-age-ms' = '0',
            'write.metadata.delete-after-commit.enabled' = 'true',
            'write.metadata.previous-versions-max' = '0'
             );

VACUUM TABLE {table_name} EXPIRE SNAPSHOTS OLDER_THAN '{current_time_str}' retain_last 1 ;




OPTIMIZE TABLE Minio.finance.ctrl.table_ingestion_log  REWRITE MANIFESTS ;

OPTIMIZE TABLE Minio.finance.ctrl.table_ingestion_log  REWRITE DATA (MIN_FILE_SIZE_MB=100, MAX_FILE_SIZE_MB=1000, TARGET_FILE_SIZE_MB=512);

I set ‘{current_time_str}’ to the current time to expire everything, but then the metadata still grows.

WHat have I done wrong?

I am doing some tests here.

I created a dummy table called delete_me and then I keep inserting records into it.

Then I checked the snapshot_history and can see snapshots are created

select * from TABLE ( table_history ( 'Minio.finance.ctrl.delete_me' ) );

Then I ran vacuum table

VACUUM TABLE Minio.finance.ctrl.delete_me EXPIRE SNAPSHOTS OLDER_THAN'2024-12-14 09:21:00' RETAIN_LAST 1;

where the OLD_THAN is the current time.

When I go back to my Minio storage, I observe that NEW FILES are created (those with the XLDTR suffix). Meanwhile, the metadata file size did not really change.

How can I reduce the size of metadata file ? looks like vacuum table just adds more files, making it even bigger?

TBLPROPERTIES of my delete_me table


SHOW TBLPROPERTIES Minio.finance.ctrl.delete_me;

While performing OPTIMIZE on another table, I observe that it is ignoring the parquet files for compaction

I ran the below because it has some small files

OPTIMIZE TABLE Minio.finance.l0.macrotrends_market_capitalisation_history 
   REWRITE DATA USING BIN_PACK (TARGET_FILE_SIZE_MB=10000,  MAX_FILE_SIZE_MB=100000, MIN_INPUT_FILES=1);

But that dremio’s logs show that it is ignoring the parquet data files (because they are closed)

=========

My only recourse is to recreate the table (i.e. Using CTAS)

Before

AFTER

image

The number of data files dropped significantly, even when the number of rows remain the same

For now, I can only recreate the table to achieve compaction + reduce metadata file size.

Could I get some guidance on how to use VACUUM and OPTIMIZE correctly to achieve the below?

  • reduce metadata folder size (which contains avro files)
  • remove all snapshots
  • compact data files (parquet files) into only a few files ?

Thank you

@Ken That ignore was for a SELECT path. Are you able to send that log file and your OPTIMIZE job profile?

Thanks @balaji.ramaswamy .

I re-ran the below OPTIMIZE query just a moment ago.

OPTIMIZE TABLE Minio.finance.l0.macrotrends_market_capitalisation_history 
   REWRITE DATA USING BIN_PACK (TARGET_FILE_SIZE_MB=10000,  MAX_FILE_SIZE_MB=100000, MIN_INPUT_FILES=1)

9f067776-f3e5-4434-89c5-32d13d18ce3a.zip (30.3 KB)

I already re-created my table, so now it has few data files.

Here is another example

I expired the snapshots with VACUUM and it created the XLDIR files.

Then I ran the below OPTIMIZE command

OPTIMIZE TABLE Minio.finance.ctrl.delete_me 
   REWRITE DATA USING BIN_PACK (TARGET_FILE_SIZE_MB=10000,  MAX_FILE_SIZE_MB=100000, MIN_INPUT_FILES=1);

a091a6e0-163a-4ef8-b439-8f682c746ef5.zip (27.5 KB)

But the files do not seem compacted nor are XLDIR files removed.

@balaji.ramaswamy

Does dremio have an operation for removing orphan files?

I read OPTIMZE compacts data files and VACUUM removes snapshots (by removing references only).

Once the snapshot references got removed, will there be orphan data files? If so, how can I remove them using dremio?

Or does VACUUM also remove orphan files?

Thank you! :pray:

I found this answer from @ajay098

And once I changed the compression to gzip, it is working like a charm!!!

In short, run this

ALTER TABLE TABLE_NAME SET TBLPROPERTIES (
            'write.parquet.compression-codec' = 'gzip'
             );

and then do the normal VACUUM with OLDER_THAN and RETAIN_LAST;

@Ken Checking on this, will get back to you on this