Dremio space usage

Good Morning Team,

Dremio is a very interesting project to explore; kudos for what looks like a really promising product.

I am trying to understand memory consumption by Dremio. I see that a query writes a lot of data to the results and spill directories on the VM. The spill space is freed once the query completes, but the data persists in results. What type of data is saved in results? As per my understanding, Dremio doesn’t save query results; it calculates them on the fly using Apache Arrow columnar in-memory data processing and shows them to the corresponding client application.

Please clarify my thoughts.

Thanks

+1 on this question, I was about to create a new topic to ask something similar!

I was under the impression that Dremio does not store any data, apart from some metadata. However, I also see the data directory growing fast in size and I’m wondering what is saved there. For us, security of the data is a concern, and we would strongly prefer having no sensitive data saved or cached in Dremio.

Got a reply from @naren on an email thread.
As per the Dremio team:

If data doesn’t fit into memory, it is moved to the spill directory, and the spill space is freed after the job completes. The results directory contains the job results cache, so that when you click on a job to view its results they are retrieved for you. You can change this: https://docs.dremio.com/advanced-administration/job-results-cleanup.html.

Also, I think job results can be viewed only in the enterprise version, as I can’t find this feature in the community one.

@naren Please confirm whether my understanding is correct.

Results are only cached for queries issued through Dremio’s web GUI. Queries over JDBC/ODBC/REST will not have the results cached.

You can use file system permissions to secure data from spilling, jobs, and data reflections.
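
As a rough illustration of the file system permissions approach, here is a minimal Python sketch that restricts a directory tree to its owner. The function name `restrict_to_owner` and the example path are assumptions for illustration, not anything Dremio provides, and it assumes a POSIX file system and that the script runs as the user that owns the Dremio directories.

```python
import os
import stat


def restrict_to_owner(directory):
    """Make every directory and file under `directory` accessible to the owner only."""
    for current, _dirnames, filenames in os.walk(directory):
        # Directories need the execute bit so the owner can still traverse them (0o700).
        os.chmod(current, stat.S_IRWXU)
        for name in filenames:
            # Plain files only need owner read/write (0o600).
            os.chmod(os.path.join(current, name), stat.S_IRUSR | stat.S_IWUSR)


# Hypothetical usage; substitute the spill, results, or reflection path from your own dremio.conf:
# restrict_to_owner("/var/lib/dremio/spill")
```

Files that Dremio creates later get their mode from the process umask, so a restrictive umask for the Dremio service user would be needed as well.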

I think I have this right, someone will correct me if I’m wrong. :slight_smile:

@kelly Thanks for making it more clear.

One follow-up question on this: what happens if the spill directory is mapped to an HDFS location? Will the HDFS spill space also be cleared after the job completes?
I notice it’s not getting cleared, and it reached 100 GB in only a couple of hours. I set paths.spilling to an HDFS location in dremio.conf.

Thanks.

Hi @Monika_Goel

The spill files should get cleared once the query completes (whether it succeeds or fails). When this happens next time, can you check whether there were other running jobs? When you restarted Dremio, did the files go away?

Thanks
@balaji.ramaswamy

I would not recommend having spill directories on HDFS unless you set replication to 1 for that spill directory. Usually HDFS has replication set to 3.
Spill data is temporary, so you don’t want it to be replicated and increase data storage (even temporarily) by 3x.
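
To put a number on the replication point, here is a tiny Python sketch of the arithmetic. The helper name `raw_hdfs_usage_gb` is made up for illustration, and it simply assumes that raw disk usage is the logical spill size multiplied by the replication factor.

```python
def raw_hdfs_usage_gb(logical_gb, replication):
    """Raw cluster storage consumed by data of `logical_gb` at the given replication factor."""
    return logical_gb * replication


# The 100 GB of spill mentioned earlier in this thread:
print(raw_hdfs_usage_gb(100, 3))  # 300 (replication 3)
print(raw_hdfs_usage_gb(100, 1))  # 100 (replication 1)
```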

@balaji.ramaswamy
No Dremio jobs were running and I restarted Dremio; the spill space is still not cleared.
I need to remove the spill files manually.
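
For anyone doing the same manual cleanup, here is a hedged Python sketch of a dry-run helper that finds spill files older than a cutoff. The function name, the 24-hour default, and the example path are all assumptions; it only works on a locally mounted path (not directly on HDFS), and it should only be run with `delete=True` when no queries are running, since live queries own their spill files.

```python
import os
import time


def find_stale_spill_files(spill_dir, max_age_hours=24.0, delete=False, now=None):
    """Return files under spill_dir last modified more than max_age_hours ago.

    With delete=False (the default) nothing is removed; this is a dry run.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for current, _dirnames, filenames in os.walk(spill_dir):
        for name in filenames:
            path = os.path.join(current, name)
            if os.stat(path).st_mtime < cutoff:
                stale.append(path)
                if delete:
                    os.remove(path)
    return sorted(stale)


# Hypothetical usage: review the list first, then rerun with delete=True.
# for path in find_stale_spill_files("/var/lib/dremio/spill"):
#     print(path)
```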

@Monika_Goel

When this happens again, would you be able to send me the listing of the files?
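
One way to produce such a listing is sketched below in Python. The function names and the example path are assumptions for illustration, and the script only sees a locally mounted directory; for a spill path on HDFS, the output of `hdfs dfs -ls -R` on that path gives the same information.

```python
import os
from datetime import datetime


def list_spill_files(spill_dir):
    """Return (relative_path, size_bytes, modified_time) tuples, oldest file first."""
    rows = []
    for current, _dirnames, filenames in os.walk(spill_dir):
        for name in filenames:
            path = os.path.join(current, name)
            info = os.stat(path)
            rows.append((
                os.path.relpath(path, spill_dir),
                info.st_size,
                datetime.fromtimestamp(info.st_mtime),
            ))
    rows.sort(key=lambda row: row[2])
    return rows


def total_size(rows):
    """Sum of the file sizes in bytes."""
    return sum(size for _path, size, _mtime in rows)


# Hypothetical usage:
# rows = list_spill_files("/var/lib/dremio/spill")
# for path, size, mtime in rows:
#     print(f"{mtime:%Y-%m-%d %H:%M}  {size:>14}  {path}")
# print("total bytes:", total_size(rows))
```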

Thanks
@balaji.ramaswamy

@balaji.ramaswamy

Inside the spill directory there is a directory ‘esort-2431cecd-2899-5e81-41ac-eb39184de700.1.0.3’, which contains the files run00454, run00455, …, run00725, plus merge00000 and merge00001.

Total space used by spill is 52.8G.

@Monika_Goel

What are the timestamps on these files? Were they created just now, or a while back?

@balaji.ramaswamy
Files run00454 through run00648 have yesterday’s timestamp, around 2018-10-22 21:42, while the others have today’s timestamp.

I am also facing a similar issue: the spill directory is sometimes not cleared when a query is cancelled or fails with a foreman exception or a channel closed error. Please help with this.

I am using Dremio 3.3.1, with the spill directory pointing to a NAS in k8s.

Thanks

@mahejava

Can you please point K8s to a later Dremio release, 4.3 or above, and see if this problem still exists?