IOException: No space left on device during table rewrite

Hi, when I try to rebuild a table I get an error that suggests we are running out of space somewhere, but we have plenty of available space everywhere. Can you please help me diagnose the issue? Thanks.

IOException: No space left on device

55b58e3a-4131-4b49-8613-b9f024594a3f.zip (101.6 KB)

Thanks
Jaro

@jaroslav_marko Looks like it failed on EXTERNAL_SORT. Can you check the disk size under the data directory on the executors? Or, if you have configured spill to a specific location, check there. One thing to note: spill files are deleted once a query fails with no space available.
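For example, assuming the executors run as a Kubernetes StatefulSet in the mlops-lakehouse namespace (as the hostnames in your profile suggest), something like the sketch below could show the free space and any configured spill paths. The data mount and config path are the defaults from the official images and are assumptions; adjust them to your deployment.

```bash
# Free space on an executor's data volume (repeat for each executor pod);
# /opt/dremio/data is the default data mount and may differ in your chart.
kubectl -n mlops-lakehouse exec dremio-executor-0 -- df -h /opt/dremio/data

# Is a dedicated spill location configured? Look for paths.spilling in
# dremio.conf (default config path in the official images).
kubectl -n mlops-lakehouse exec dremio-executor-0 -- \
  grep -A 3 spilling /opt/dremio/conf/dremio.conf
```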

Hi @balaji.ramaswamy I have upgraded the executors from 100GB to 1TB of storage each. Spilling is probably on, because I see the spill sign next to some queries (can I check it somewhere?).
I have run it again, but it failed with a different error.

AttemptManager dremio-master-0.dremio-cluster-pod.mlops-lakehouse.svc.cluster.local no longer active. Cancelling fragment 1914442a-2223-0edd-c23e-c040f3713800:1:52

82e47e76-09ee-43a6-bc92-64b09c6cc64d.zip (102.7 KB)

Thanks Jaro

@jaroslav_marko

This is a completely different error; it looks like your master went down. Can you please check the logs to see whether it was restarted, or whether you actually see a shutdown-thread?
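If you are on Kubernetes, something along these lines could tell you whether the coordinator was restarted. The pod name and namespace are taken from the error message above; this is a sketch, not an exact procedure.

```bash
# Has the coordinator container been restarted? Check the RESTARTS column.
kubectl -n mlops-lakehouse get pod dremio-master-0

# Recent events: OOMKilled, eviction, failed liveness probes, etc.
kubectl -n mlops-lakehouse describe pod dremio-master-0

# Logs of the previous container instance (only available if the container
# was restarted in place); look for a shutdown-thread near the end.
kubectl -n mlops-lakehouse logs dremio-master-0 --previous | tail -n 200
```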

Hi @balaji.ramaswamy unfortunately we have lost the logs :frowning:
I ran the job again and it did not finish after 17 hours, so I canceled it.
I am trying again, but the table is not so big that this operation should be a problem.
By the way, when the optimize process fails, what happens to the files it has already created? How do I clean them up?
thanks
Jaro

Those are considered ‘orphan files’. We should have a mechanism to discard these half-baked files, similar to the docs link I’ve attached:

iceberg-tables-compaction-expiring-snapshots-and-more/
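For reference, the snapshot-expiry step from the article above can be run as Dremio SQL. This is only a sketch with a placeholder table name, timestamp, and retention count; the exact VACUUM syntax, and whether it covers files left behind by a failed OPTIMIZE (which may never have been committed to any snapshot), depend on your Dremio version, so please verify against the docs for your release.

```sql
-- Sketch only: expire older snapshots so files that are no longer referenced
-- by any snapshot become eligible for cleanup. Table name, timestamp, and
-- retain_last value are placeholders.
VACUUM TABLE my_source.my_schema.my_table
  EXPIRE SNAPSHOTS older_than '2024-01-01 00:00:00.000' retain_last 5
```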

Hi @balaji.ramaswamy, thanks for the guide on how to maintain Iceberg tables.

Coming back to the original issue → I let the job run until it failed (>18h). See the profile attached. Can you help me identify the error?

6f1e85f3-ee15-4710-8848-b08e7166ee65.zip (122.4 KB)

Best regards
Jaro

@jaroslav_marko Looks like dremio-executor-5.dremio-cluster-pod.mlops-lakehouse.svc.cluster.local went unresponsive. Do you have the server.log and GC logs from when the error happened?
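If you need to pull them off the pod, something like this sketch could work, assuming a standard Kubernetes deployment. The /opt/dremio paths are assumptions; the actual log location depends on how logging is configured in your chart.

```bash
# Console output of the executor (add --previous if the container restarted).
kubectl -n mlops-lakehouse logs dremio-executor-5 > executor-5-stdout.log

# Locate server.log and GC log files inside the pod ...
kubectl -n mlops-lakehouse exec dremio-executor-5 -- \
  find /opt/dremio -name 'server*.log*' -o -name 'gc*.log*'

# ... then copy them out (adjust the source path to whatever find reports).
kubectl -n mlops-lakehouse cp dremio-executor-5:/opt/dremio/log ./executor-5-logs
```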