@balaji.ramaswamy
I am using two Iceberg tables as an illustration.
The one in l0 is constantly getting new data via MERGE.
I perform vacuum (expire snapshot retain 1) and optimise after every MERGE.
However, its data size still grows to 31GiB.
I ran this SQL to determine the file_size_in_bytes
select sum("record_count"), sum("file_size_in_bytes") from TABLE( table_files( 'Minio.finance.l0.XXXX' ) );
It says it is around 3.5GB (which is VERY DIFFERENT from the actual size of 31GiB).
========
However, after I recreate the table (and put it in backup_2025_03_26):
create table Minio.finance.backup_2025_03_26."XXXXXX"
as (SELECT * FROM Minio.finance.l0."XXXXXX")
The file size shrank to 3.3GiB. This is more in line with what the original statistics say.
Note that I have always expired all snapshots and it should not bloat to 31GiB.
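One way to double-check this would be to list the snapshots that still remain on the table. This is a minimal sketch, assuming Dremio's table_snapshot metadata function is available in this version, and reusing the masked table name from the table_files query above:

```sql
-- Sketch: list the snapshots Iceberg still tracks for the table.
-- 'Minio.finance.l0.XXXX' is the masked table name from the query above.
SELECT *
FROM TABLE( table_snapshot( 'Minio.finance.l0.XXXX' ) );
```

If this returns a single row, only one snapshot is still referenced by the table metadata, and any extra data on disk would not belong to a live snapshot.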
I feel that I hit something similar to what is reported here (Iceberg file size on dremio - #13 by dacopan)
=======
In the original Iceberg table (with regular MERGE of new data), there are multiple data file folders. Those XLDIR directories were created after the vacuum operation, where snapshots got deleted.
However, I think there is still dangling data left over from those deleted snapshots in the data file folders.
Meanwhile, in the recreated table (via CTAS), there’s just ONE data file folder.
So I suppose that many of the data file folders in the original Iceberg table are actually useless (otherwise they would have been copied to the new Iceberg table via CTAS).