Cache Cleanup --- Failed to create directory for spilling. Please check that the spill location is accessible and confirm read, write & execute permissions

Large reflections are all failing with this error:
Failed to create directory for spilling. Please check that the spill location is accessible and confirm read, write & execute permissions.

dremio.conf

services.executor.enabled: false
debug.dist.caching.enabled: true
paths.local: "/var/lib/dremio"
paths.results: "pdfs://"${paths.local}"/data/results"

# Web server encryption
#services.coordinator.web.ssl.enabled: true
#services.coordinator.web.ssl.auto-certificate.enabled: true
#services.coordinator.web.port: 443
services.coordinator.master.embedded-zookeeper.enabled: false
zookeeper: "172.31.14.149:2181"
paths.accelerator = "dremioS3:///dremio-me-df9038ad-0166-40a9-81b5-5776e3d83b2b-202f450f5cc8814f/dremio/accelerator"
paths.uploads = "dremioS3:///dremio-me-df9038ad-0166-40a9-81b5-5776e3d83b2b-202f450f5cc8814f/dremio/uploads"
paths.downloads = "dremioS3:///dremio-me-df9038ad-0166-40a9-81b5-5776e3d83b2b-202f450f5cc8814f/dremio/downloads"
paths.scratch = "dremioS3:///dremio-me-df9038ad-0166-40a9-81b5-5776e3d83b2b-202f450f5cc8814f/dremio/scratch"
provisioning.coordinator.enableAutoBackups = "true"
paths.metadata = "dremioS3:///dremio-me-df9038ad-0166-40a9-81b5-5776e3d83b2b-202f450f5cc8814f/dremio/metadata"
registration.publish-host: "172.31.14.149"
provisioning.ec2.efs.mountTargetIpAddress = "172.31.10.119"
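
For context, as we understand it the executors' spill location is driven by a paths.spilling list in dremio.conf; the lines below are only a sketch of what we think a worker-side override pointing spill at a bigger local volume would look like (the /mnt/c2/spill path is an assumption, not our current setting, and would need to be writable by the dremio user):

# Sketch only, not our current config: spill on the larger local NVMe volume
# /mnt/c2/spill is an assumed directory; it must exist with rwx for the dremio user
paths.local: "/var/lib/dremio"
paths.spilling: ["/mnt/c2/spill"]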
[root@ip- /]# df
Filesystem             1K-blocks      Used        Available Use% Mounted on
devtmpfs                65248416         0         65248416   0% /dev
tmpfs                   65259188         0         65259188   0% /dev/shm
tmpfs                   65259188       676         65258512   1% /run
tmpfs                   65259188         0         65259188   0% /sys/fs/cgroup
/dev/nvme0n1p1          52416492   5340132         47076360  11% /
/dev/nvme1n1           309504832  91407148        204953388  31% /mnt/c1
/dev/nvme2n1           576608940     74892        547220792   1% /mnt/c2
172.31.10.119:/ 9007199254739968 182820864 9007199071919104   1% /var/dremio_efs
tmpfs                   13051840         0         13051840   0% /run/user/0

In /var/lib/
drwxr-xr-x 2 dremio dremio 28 Sep 21 21:26 dremio
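
To sanity-check the permissions the error message complains about, we were going to run something like this on an executor (the /var/lib/dremio/spill path is an assumption; it should be whatever the spill location actually resolves to on that node):

# Check ownership/permissions of the assumed spill location
ls -ld /var/lib/dremio /var/lib/dremio/spill
# Try creating and removing a directory as the dremio service user
sudo -u dremio mkdir -p /var/lib/dremio/spill/perm_test && sudo -u dremio rmdir /var/lib/dremio/spill/perm_test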

Worker nodes are hitting the limit on /mnt/c1, with many others nearing 95% usage

Filesystem             1K-blocks      Used        Available Use% Mounted on
devtmpfs                16221432         0         16221432   0% /dev
tmpfs                   16232200         0         16232200   0% /dev/shm
tmpfs                   16232200       600         16231600   1% /run
tmpfs                   16232200         0         16232200   0% /sys/fs/cgroup
/dev/nvme0n1p1           8376300   4779240          3597060  58% /
172.31.10.119:/ 9007199254739968 182988800 9007199071751168   1% /var/dremio_efs
/dev/nvme1n1           288238156 271484068          2089268 100% /mnt/c1
tmpfs                    3246444         0          3246444   0% /run/user/0
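
To see what is actually filling /mnt/c1 on the full workers, we were planning to run something like this (just a sketch; the directory layout under /mnt/c1 is whatever the CloudFormation template created, which we have not mapped out):

# Per-directory usage on /mnt/c1, largest first (spill vs. cloud cache vs. anything else)
sudo du -xh --max-depth=2 /mnt/c1 | sort -rh | head -20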

What is the best caching policy? Our system was deployed with AWS CloudFormation; is cloud caching the best solution? And what is the best way to clean up the cache?

@mary.lang The failure is not in the cache but in spilling; are you able to send us the profile for the failed job?

@balaji.ramaswamy
b05b26b0-3bda-4845-a65c-6dda1bd7d587.zip (26.7 KB)

Our stack was deployed with AWS CloudFormation, and I also mistakenly rebooted the coordinator EC2 instance last week, which I believe reset a lot of our settings. I'm not sure if you can speak to all the impacts that may have had. The spill issue may have come up coincidentally around the time of that reboot, or the two could be related.

We are also continuing to get "No space left on device" errors.

It seems some nodes are filling up /mnt/c1, which holds the spilling directory along with the cache.

Is it safe to clear this? What is the best policy for clearing spill files?
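
If it is safe, what we had in mind is deleting only spill files that are clearly stale, roughly like this, and only while no queries are running on the node (the /mnt/c1/spill path and the one-day cutoff are assumptions on our part):

# Sketch: remove spill files older than one day from the assumed spill directory
sudo find /mnt/c1/spill -xdev -type f -mtime +1 -print -delete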

A recent profile for "No space left on device", with this error in the profile:
Error: Failed to spill partition

65d350ea-e07f-45f2-bb18-c09977dd39c7.zip (391.5 KB)

Is it true that spilling happens when there is not enough memory? Could the issue then be resolved by increasing job memory?

For the engine in question, both the Queue Memory Limit per Node and the Job Memory Limit per Node are set to 10 TB.

We are also considering moving to cloud caching - would this be helpful?
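
From our reading of the docs, cloud cache (C3) is configured per executor in dremio.conf along these lines; the property names and paths below are our understanding only and would need to be verified against the documentation before we change anything:

# Sketch only: point the columnar cloud cache at the large local NVMe volume
# Property names as we understand them from the docs; /mnt/c2 is an assumed path
services.executor.cache.path.db: "/mnt/c2/cloudcache/db"
services.executor.cache.path.fs: ["/mnt/c2/cloudcache/fs"]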