Hi, when I try to rebuild a table I get an error that suggests we are running out of space somewhere, but we have plenty of available space everywhere. Can you please help me diagnose the issue? Thanks.
IOException: No space left on device
55b58e3a-4131-4b49-8613-b9f024594a3f.zip (101.6 KB)
Thanks
Jaro
@jaroslav_marko Looks like the failure is on EXTERNAL_SORT. Can you check the disk size under data on the executors? Or, if you have configured spill to go to a specific location, check there. One thing to note is that spill files are deleted once the query fails with no space available.
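If it helps, here is a rough way to check from outside the pods (a sketch assuming the default Helm chart pod names, the mlops-lakehouse namespace from your hostnames, and the default /opt/dremio mount paths; adjust to your deployment):

    # Free space on one executor's local data volume (repeat for each executor pod)
    kubectl -n mlops-lakehouse exec dremio-executor-0 -- df -h /opt/dremio/data

    # If a dedicated spill location is configured, it is typically set under paths.spilling in dremio.conf
    kubectl -n mlops-lakehouse exec dremio-executor-0 -- grep -A2 spilling /opt/dremio/conf/dremio.conf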
Hi @balaji.ramaswamy I have upgraded the executors from 100 GB to 1 TB of storage each. Spilling is probably happening, because I see the spill sign next to some queries (can I check this somewhere?).
I ran the job again, but it failed with a different error.
AttemptManager dremio-master-0.dremio-cluster-pod.mlops-lakehouse.svc.cluster.local no longer active. Cancelling fragment 1914442a-2223-0edd-c23e-c040f3713800:1:52
82e47e76-09ee-43a6-bc92-64b09c6cc64d.zip (102.7 KB)
Thanks Jaro
@jaroslav_marko
This is a completely different error; it looks like your master went down. Can you please check the logs to see whether it was restarted, or whether you actually see a shutdown-thread?
Hi @balaji.ramaswamy unfortunately we have lost the logs.
I ran the job again and it did not finish after 17 hours, so I canceled it.
I am trying again, but the table is not so big that this operation should be a problem.
By the way, when the optimize process fails, what happens to the files it has already created? How do I clean them up?
thanks
Jaro
Those are considered ‘orphan files’. We should have a mechanism to discard these half-baked files, similar to the docs link I’ve attached:
iceberg-tables-compaction-expiring-snapshots-and-more/
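If you want to see what those leftovers look like on disk, here is a minimal sketch (assuming the table sits on a filesystem path you can reach; the location below is hypothetical):

    # Hypothetical table location; replace with your table's actual storage path
    TABLE_DIR=/mnt/lakehouse/my_table
    # Data files written recently (e.g. by the failed OPTIMIZE) that the current table metadata may not reference
    find "$TABLE_DIR/data" -name '*.parquet' -mtime -1 -ls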
Hi @balaji.ramaswamy thanks for the guide on how to maintain Iceberg tables.
Coming back to the original issue: I let the job run until it failed (>18 h). See the attached profile. Can you help me identify the error?
6f1e85f3-ee15-4710-8848-b08e7166ee65.zip (122.4 KB)
best regards
Jaro
@jaroslav_marko Looks like dremio-executor-5.dremio-cluster-pod.mlops-lakehouse.svc.cluster.local went unresponsive. Do you have the server.log and GC logs from when the error happened?
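If the pod gets restarted before you can grab them, here is a rough way to pull the files off first (a sketch assuming the default log location /opt/dremio/log inside the container; the GC log name and paths may differ in your deployment):

    # Copy the server log and GC log off the executor
    kubectl cp mlops-lakehouse/dremio-executor-5:/opt/dremio/log/server.log ./executor-5-server.log
    kubectl cp mlops-lakehouse/dremio-executor-5:/opt/dremio/log/server.gc ./executor-5-server.gc
    # If the pod already restarted, the previous container's stdout may still be available
    kubectl -n mlops-lakehouse logs dremio-executor-5 --previous > executor-5-previous.log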