I confirm that this approach works!
In short, the approach is to delete data files that are no longer tracked by Dremio but still remain in Minio.
Step 1. Query the data_files in Dremio
select "file_path" from table( table_files('XXXX'))
This returns the list of data files currently tracked by Dremio.
Step 2. Use the list_objects method of the Minio client to list all the objects under the Iceberg table folder
objects = self.client.list_objects(
    bucket_name,
    prefix=bucket_subpath,
    recursive=recursive
)
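To make Step 2 concrete, here is a minimal sketch that collects the object names into a plain list. The helper name list_object_names is hypothetical, and client is assumed to be a minio.Minio instance (or anything exposing the same list_objects method).

```python
def list_object_names(client, bucket_name, prefix):
    """Return the names of all objects stored under the given prefix.

    Assumption: client behaves like minio.Minio, whose list_objects()
    yields items with .object_name and .is_dir attributes.
    """
    objects = client.list_objects(bucket_name, prefix=prefix, recursive=True)
    # Keep real objects only; skip any directory placeholders.
    return [obj.object_name for obj in objects if not obj.is_dir]
```

With recursive=True the listing descends into every subfolder of the table, so the result covers both data and metadata files.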
Step 3. Reconcile the two lists by matching each data file tracked by Dremio (Step 1) against the objects listed in Minio (Step 2)
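The reconciliation can be sketched as a set difference. This is only an illustration under assumptions: the helper names are hypothetical, and it assumes the file_path values from Dremio are either URIs of the form scheme://bucket/key or plain paths that may start with the bucket name, while Minio reports keys relative to the bucket. Check the actual path format in your environment before relying on it.

```python
def to_object_key(file_path, bucket_name):
    """Reduce a path reported by Dremio to a bucket-relative object key.

    Assumption: file_path looks like 's3://bucket/key' or 'bucket/key'.
    """
    if "://" in file_path:
        # Drop the scheme, then drop the bucket (first path component).
        rest = file_path.split("://", 1)[1]
        return rest.split("/", 1)[1] if "/" in rest else ""
    key = file_path.lstrip("/")
    prefix = bucket_name + "/"
    return key[len(prefix):] if key.startswith(prefix) else key


def find_untracked(tracked_paths, object_names, bucket_name):
    """Return object names present in Minio but absent from Dremio's list."""
    tracked = {to_object_key(p, bucket_name) for p in tracked_paths}
    return sorted(name for name in object_names if name not in tracked)
```

Normalizing both sides to the same key format matters here: if the formats differ, nothing matches and every object would look untracked.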
Step 4. Delete every object in Minio that is NOT tracked by Dremio, except the metadata files. Use the remove_object method.
self.client.remove_object(
    bucket_name=bucket_name,
    object_name=file_path,
)
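Since deletion is irreversible, a dry-run mode is worth having. The sketch below is one possible shape, not the exact code I run: delete_untracked is a hypothetical helper, client is assumed to behave like minio.Minio, and metadata files are assumed to live under a metadata/ folder inside the table folder.

```python
def delete_untracked(client, bucket_name, untracked_names, dry_run=True):
    """Delete untracked objects, always leaving metadata files alone.

    Returns the names that were deleted (or would be, when dry_run=True).
    """
    selected = []
    for name in untracked_names:
        # Assumption: Iceberg metadata sits under a 'metadata/' folder.
        if "/metadata/" in "/" + name:
            continue
        if not dry_run:
            client.remove_object(bucket_name=bucket_name, object_name=name)
        selected.append(name)
    return selected
```

Running it first with dry_run=True and reviewing the returned names gives a chance to catch a path mismatch before anything is removed.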
Validation
- File size dropped
I have a Superset dashboard that tracks the actual table size in Minio by running du -sh once every hour. The graph below shows that all table folders dropped significantly in size after the above logic was applied.
Below is another graph, which shows the percentage of disk space available on the whole Minio VM. It is obtained by running df -h and looking at the / mount.
- Row count increases
Most importantly, the row count did not drop for any table.
Below is the growth in row count for two of the tables.