We have a reflection over a VDS that involves an AWS Glue table.
The reflection is incremental (reflection partition column (with truncate) = Glue table partition column).
The Reflections tab shows: Footprint: 594.96 MB (226.55 GB), i.e. Total Footprint = 226 GB.
The total size of the Glue table's underlying files is 2.7 GB, while the VDS filters away 60% of the records.
Reflection is updated every hour.
From the Dremio wiki:

> **Total Footprint**
> Shows the current size, in kilobytes, of all of the existing materializations of the Reflection. More than one materialization of a Reflection can exist at the same time, so that refreshes do not interrupt running queries that are being satisfied by the Reflection.
I guess this is not the case here. Why is the total footprint so huge? (I suspect it has been growing constantly over time.) How can we control its size? Is there a way to do the cleanup?
@vladislav-stolyarov Look at the reflection refresh job and take the first UUID, which is the reflection id. Under the location configured in dremio.conf under dist:/// there will be a folder called accelerator; under that, the first UUID is a folder, and under that, the second UUID is the materialization id. Are you able to do a `du -sh *` from the reflection folder?
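There is no direct `du -sh *` for an S3-backed distributed store, but the same per-subfolder totals can be computed from an object listing. A minimal sketch (the bucket/prefix names are placeholders, and the boto3 part is left as a comment because it needs AWS credentials):

```python
from collections import defaultdict

def folder_sizes(objects):
    """Aggregate object sizes by the first path component under a prefix.

    `objects` is an iterable of (relative_key, size) pairs, e.g. from an
    S3 listing. Returns a dict: top-level subfolder name -> total bytes.
    """
    totals = defaultdict(int)
    for key, size in objects:
        top = key.split("/", 1)[0]
        totals[top] += size
    return dict(totals)

# With boto3, the pairs could come from a paginated listing, e.g.:
#   s3 = boto3.client("s3")
#   pages = s3.get_paginator("list_objects_v2").paginate(
#       Bucket="my-dremio-bucket",                       # placeholder
#       Prefix="dremio/accelerator/<reflection_id>/")    # placeholder
#   objects = ((o["Key"], o["Size"])
#              for page in pages for o in page.get("Contents", []))

if __name__ == "__main__":
    sample = [
        ("mat-uuid-1_0/part-0.parquet", 100),
        ("mat-uuid-1_0/part-1.parquet", 50),
        ("mat-uuid-2_0/part-0.parquet", 70),
    ]
    print(folder_sizes(sample))
    # -> {'mat-uuid-1_0': 150, 'mat-uuid-2_0': 70}
```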
In my dremio.conf I have a slightly different scheme. I guess you meant this one? paths.accelerator = "dremioS3:///dremio-me-...../dremio/accelerator"
If I go to that folder and search for the subfolder matching the reflection_id, I can find it.
Its size is: total number of objects 33,152, total size 74.4 GB.
As an experiment I took another VDS whose reflection is not incremental, unlike the problematic one above, and found:
Its total footprint is 6x the current footprint, which is normal.
On S3 it has 7 subfolders; 6 of them match a materialization_id (from the materializations table for this reflection_id), and only one folder is orphaned.
The subfolder names are in the format {materialization_id}_0.
By contrast, my problematic reflection's folder has a single subfolder with thousands of subfolders inside. What is interesting is that none of those subfolder names match any materialization_id.
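The cross-check above (folder names vs. the materializations table) can be sketched as a small helper; the `{materialization_id}_0` naming is taken from the observation above, everything else is illustrative:

```python
def find_orphans(subfolders, materialization_ids):
    """Return subfolder names (format '{materialization_id}_0') that do not
    correspond to any known materialization id."""
    known = set(materialization_ids)
    return sorted(
        name for name in subfolders
        if name.rsplit("_", 1)[0] not in known
    )

if __name__ == "__main__":
    folders = ["aaa_0", "bbb_0", "ccc_0"]  # names listed from S3
    mats = ["aaa", "bbb"]                  # ids from the materializations table
    print(find_orphans(folders, mats))
    # -> ['ccc_0']
```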
So I guess incremental reflections follow a different storage pattern.
Hi @vladislav-stolyarov, this is a known bug that is specific to INCREMENTAL reflections. Reflections track stats for each refresh job, and they are not added up correctly when there is compaction between refreshes. There's a sys.refreshes table with more details about these refreshes.
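To inspect the refresh history mentioned above, a query along these lines could be a starting point (the `reflection_id` filter column is an assumption; check the table's actual schema with `DESCRIBE sys.refreshes` first):

```sql
-- Refresh history for the problematic reflection
-- (replace <reflection_id> with the first UUID from the refresh job)
SELECT *
FROM sys.refreshes
WHERE reflection_id = '<reflection_id>'
```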