Dremio is a very interesting project to explore; kudos for what looks like a really promising product.
I am trying to understand Dremio's memory consumption. I see that a query writes a lot of data to the results and spill directories on the VM. The spill space is freed once the query completes, but the data in results persists. What type of data is saved in results? As I understand it, Dremio doesn’t save query results; it calculates them on the fly using Apache Arrow columnar in-memory data processing and returns them to the corresponding client application.
+1 on this question, I was about to create a new topic to ask something similar!
I was under the impression that Dremio does not store any data, apart from some metadata. However, I also see the data directory growing fast in size, and I’m wondering what is saved there. For us, data security is a concern, and we would strongly prefer having no sensitive data saved or cached in Dremio.
Got a reply from @naren on an email thread.
As per the Dremio team:
If data doesn’t fit into memory, it is moved to the spill directory, and the spill space is freed after the job completes. The results directory contains a job results cache, so that when you click on a job to view its results they can be retrieved for you. You can change this: https://docs.dremio.com/advanced-administration/job-results-cleanup.html.
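To see which of the two directories is actually consuming disk space, a small script along these lines may help. This is only a sketch: the directory paths below are hypothetical placeholders (use the locations from your own dremio.conf), and `dir_size_bytes` and `report` are illustrative helper names, not part of Dremio.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.path.getsize(full)
            except OSError:
                # A file may vanish while a query cleans up its spill data.
                pass
    return total


def report(paths):
    """Return {label: size_in_bytes} for each labelled directory."""
    return {label: dir_size_bytes(p) for label, p in paths.items()}


if __name__ == "__main__":
    # Hypothetical locations; substitute the ones from your installation.
    sizes = report({
        "results": "/var/lib/dremio/results",
        "spill": "/var/lib/dremio/spill",
    })
    for label, size in sizes.items():
        print(f"{label}: {size / 1024**2:.1f} MiB")
```

Running it before and after a large query should show the spill figure dropping back down while the results figure stays, if the behaviour matches the description above.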
One follow-up question on this. What happens if the spill directory is mapped to an HDFS location; will the HDFS spill space also be cleared after the job completes?
I notice it is not getting cleared, and it reached 100 GB in just a couple of hours. I have set paths.spilling to an HDFS location in dremio.conf.
The spill files should get cleared once the query completes (whether it succeeds or fails). The next time this happens, can you check whether there were other running jobs? When you restarted Dremio, did the files go away?
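One way to check whether leftover spill files belong to running jobs is to list files that have not been modified for longer than any query could plausibly run. The sketch below assumes the spill directory is reachable as a local or mounted path (it does not talk to HDFS), and the helper name `stale_files`, the example path, and the one-hour threshold are all hypothetical choices. It only lists files and deletes nothing.

```python
import os
import time


def stale_files(spill_dir, max_age_seconds, now=None):
    """Return (path, age_seconds) for files older than max_age_seconds, oldest first."""
    now = time.time() if now is None else now
    found = []
    for root, _dirs, files in os.walk(spill_dir):
        for name in files:
            full = os.path.join(root, name)
            try:
                age = now - os.path.getmtime(full)
            except OSError:
                # File was removed between listing and stat; skip it.
                continue
            if age > max_age_seconds:
                found.append((full, age))
    found.sort(key=lambda item: item[1], reverse=True)
    return found


if __name__ == "__main__":
    # Hypothetical path and threshold; adjust to your setup.
    for path, age in stale_files("/var/lib/dremio/spill", 3600):
        print(f"{age / 3600:.1f} h  {path}")
```

If files show up here while no jobs are running, that would suggest they were orphaned rather than still in use.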
I would not recommend having spill directories on HDFS unless you set replication to 1 for that spill directory. Usually HDFS has replication set to 3.
Spill data is temporary, so you don’t want it to be replicated, which would increase storage usage (even temporarily) by 3x.
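As a rough illustration of the cost, the arithmetic looks like this. It assumes the 100 GB mentioned earlier in the thread is the logical (pre-replication) size, which the thread does not confirm; `raw_hdfs_usage_gb` is just an illustrative name.

```python
def raw_hdfs_usage_gb(logical_gb, replication=3):
    """Raw disk space consumed across the cluster for a given logical size."""
    return logical_gb * replication


if __name__ == "__main__":
    print(raw_hdfs_usage_gb(100))                 # 300 with replication 3
    print(raw_hdfs_usage_gb(100, replication=1))  # 100 with replication 1
```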
I am also facing a similar issue: the spill directory is sometimes not cleared when a query is cancelled or fails with a foreman exception or a channel closed error. Please help with this.
I am using Dremio 3.3.1, with the spill directory pointing to a NAS in k8s.