Multiple copies of reflection files

We have a moderately large dataset (676GB) in S3, and we have created a raw reflection for it with all columns and “full refresh” set. The footprint of the reflection, according to the Reflections tab, is close to the source size (629GB).

But we’ve found that the reflection is actually consuming over 6TB in the reflection store. I looked at the file dates, and I found a full copy of the reflection dated two days ago (the last successful refresh), totaling 674GB. I also see occasional files for preceding days, plus 287GB for one day last October and around 300GB each for three days last July. I’m wondering if this happens because a refresh has failed, and both the previous (successful) and new (partial) results are kept. Is this correct? And if so, how do I purge those old duplicates?

The reflection data files are parquet files. They all sit under a top-level, reflection-specific folder, with sub-folders for each materialization. Only the data files in the most recent materialization are needed. It’s possible you had old materializations that weren’t cleaned up due to job failures. It’s safe to remove the old materialization directories under the same reflection.
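That "keep only the most recent materialization" rule can be sketched as a newest-wins filter. A minimal, hypothetical example in Python (the directory names and timestamps are made up; in practice you would take them from your S3 listing, and note this only applies to materializations under the same reflection folder):

```python
def old_materializations(dir_mtimes):
    """Given {directory_name: last_modified_epoch} for one reflection's
    materialization sub-folders, return every directory except the most
    recently modified one (assumed to be the live materialization)."""
    if not dir_mtimes:
        return []
    newest = max(dir_mtimes, key=dir_mtimes.get)
    return sorted(d for d in dir_mtimes if d != newest)

# Hypothetical materialization directories under one reflection folder:
dirs = {
    "517ca2c8-8286-441b-a5af-f9347a20f71e_0": 1700000000,  # most recent refresh
    "65e4d9ff-2678-4915-ad49-88848ba2830d_0": 1690000000,  # stale
}
print(old_materializations(dirs))  # → ['65e4d9ff-2678-4915-ad49-88848ba2830d_0']
```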

Thanks, Benny! Is there any way to automate the cleanup, or at least automate identifying obsolete folders? I’ve found what seems to be over 1TB of invalid files, but I’ve had to do a lot of cross-referencing.

I think I have two problems: obsolete files for reflections that still exist, and files for reflections that no longer exist. It seems that the directory structure includes the reflection ID, e.g. a directory named 253b0eba-4501-4e92-8e52-ff51e3d97d75.

That ID aligns to a reflection_id I found in sys.materializations. So can I assume that if there is a directory with a key that isn’t in sys.reflections, then the directory can be deleted?

Yes, exactly. If there is a reflection_id in your filesystem that is not in sys.materializations, it is definitely safe to delete. There’s also the possibility of old materializations for valid reflections.
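That check can be scripted rather than cross-referenced by hand. A minimal sketch (the second directory name here is a made-up placeholder; in practice the live IDs come from querying sys.materializations and the directory names from your S3 listing):

```python
def obsolete_reflection_dirs(accelerator_dirs, live_reflection_ids):
    """Return top-level reflection directories whose ID no longer
    appears in sys.materializations (safe-to-delete candidates)."""
    return sorted(d for d in accelerator_dirs if d not in live_reflection_ids)

# Hypothetical example data:
dirs = {
    "253b0eba-4501-4e92-8e52-ff51e3d97d75",  # still live
    "0f0f0f0f-0000-0000-0000-000000000000",  # placeholder: no longer in sys.materializations
}
live = {"253b0eba-4501-4e92-8e52-ff51e3d97d75"}

print(obsolete_reflection_dirs(dirs, live))  # → ['0f0f0f0f-0000-0000-0000-000000000000']
```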

Here is a SQL query that returns the directories that contain “live” reflection files, i.e. the directories you don’t want to delete. Note that the file path has a * in it because we suffix the job attempt number to the materialization_id. Any directory NOT matching a row of this query can be safely cleaned up:

select
  './accelerator/' || m.reflection_id || '/' || m.materialization_id || '_*' as path
from sys.materializations m
left outer join sys.reflections r on m.reflection_id = r.reflection_id
where m.state = 'DONE'
order by 1 desc
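Automating the comparison means matching each materialization directory name against the materialization_id-plus-wildcard patterns the query returns. A sketch in Python using the standard library's fnmatch (the second directory name is a made-up placeholder for an obsolete materialization):

```python
from fnmatch import fnmatch

def deletable_dirs(materialization_dirs, live_patterns):
    """Return directories matching none of the live 'materialization_id_*'
    patterns, i.e. candidates for cleanup."""
    return sorted(
        d for d in materialization_dirs
        if not any(fnmatch(d, pat) for pat in live_patterns)
    )

# Hypothetical example: one live materialization, one obsolete one.
dirs = [
    "f3813699-8e5f-414c-80d1-74f38f2a81d1_0",
    "aaaaaaaa-1111-2222-3333-444444444444_0",  # placeholder: not in query results
]
patterns = ["f3813699-8e5f-414c-80d1-74f38f2a81d1_*"]

print(deletable_dirs(dirs, patterns))  # → ['aaaaaaaa-1111-2222-3333-444444444444_0']
```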

Thanks, Benny, that query is really helpful.

But before I start deleting stuff I want to make sure it’s really obsolete. I found something strange:

In sys.reflections, reflection_id is ‘137a1bb6-8fe8-4258-8122-ca262320c498’ and materialization_id is ‘f3813699-8e5f-414c-80d1-74f38f2a81d1’. But I don’t see a folder for that materialization_id in the reflection store. Instead I see:
PRE 517ca2c8-8286-441b-a5af-f9347a20f71e_0/
PRE 65e4d9ff-2678-4915-ad49-88848ba2830d_0/

I queried the VDS and checked the raw profile. The profile shows it reading from the 517ca2c8-8286-441b-a5af-f9347a20f71e_0 directory, even though that ID doesn’t exist in sys.materializations. Is there another mapping of materialization_ids somewhere?

Hi Joe,

sys.materializations shows a unique ID corresponding to each refresh of the reflection. For a full refresh, this ID typically corresponds to the file path in the reflection store; each subsequent full refresh generates a new materialization ID and a new file path in the reflection store (technically, they are separate Iceberg tables). An incremental refresh, however, re-uses the existing Iceberg table, which is when the materialization ID and the file path can diverge.

In your example, it sounds like this is an incremental reflection?