I need some clarity on the concern below:
How does Dremio manage memory and hard disk space when executing queries against a virtual dataset? For example: I have a huge dataset in my source, I created a virtual dataset on it, and now I am running SQL operations against it multiple times to fetch output. In this case, will Dremio keep consuming disk space to store the virtual dataset's results? If it does, how can I make sure that usage does not exceed the disk capacity of the cluster Dremio is installed on?
Is only the metadata stored for these VDSs, or does the data also get stored in Dremio? If data is stored, which storage does it consume, and how can we track that usage?
If you are running queries in the UI or through the REST API, job results (roughly the first million rows) are stored to disk. They are written under the metadata location (http://docs.dremio.com/deployment/metadata-storage.html) in a directory called "results". The job results are deleted daily.
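If you want to keep an eye on how much disk the job-result store is using, a simple script can sum the directory's size and warn past a threshold. A minimal sketch — the path and threshold below are illustrative assumptions; substitute your own metadata location:

```python
import os

# Assumption: results live at <metadata location>/results; adjust for your install.
RESULTS_DIR = "/var/lib/dremio/db/results"
THRESHOLD_BYTES = 50 * 1024**3  # illustrative alert threshold: 50 GiB

def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

if __name__ == "__main__":
    used = dir_size_bytes(RESULTS_DIR)
    print(f"results directory uses {used / 1024**3:.2f} GiB")
    if used > THRESHOLD_BYTES:
        print("WARNING: job-result storage exceeds threshold")
```

You could run this from cron on the coordinator node; since results are deleted daily, usage should normally stay bounded on its own.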
What about the data in the "spaces"? When we write a new query, or perform operations like joins and aggregations, and save the result into a space, does the actual result data get stored and consume cluster space, or is only the metadata saved and not the actual data?
Does Dremio pull data from the source and save it in its memory? How does this work?
Dremio only persists metadata. Every time you run a query, Dremio loads the data from the source into memory and processes it there; the source data itself is not stored. The exception is reflections: if you use reflections, Dremio persists them, and they do use up disk space.
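To see this flow concretely, here is a minimal sketch of submitting a SQL statement through Dremio's v3 REST API. The host, port, token, and dataset name are illustrative assumptions (the endpoint path and the `_dremio<token>` authorization header follow the REST API docs); the helper only builds the request so the example stays self-contained:

```python
import json
import urllib.request

def build_sql_request(base_url, token, sql):
    """Build (but do not send) a POST request for Dremio's v3 SQL endpoint."""
    url = f"{base_url}/api/v3/sql"
    payload = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Assumption: token previously obtained from the login endpoint.
            "Authorization": f"_dremio{token}",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_sql_request(
        "http://localhost:9047",                    # assumption: default coordinator port
        "example-token",                            # hypothetical auth token
        "SELECT * FROM myspace.my_vds LIMIT 10",    # hypothetical VDS path
    )
    # Sending this request returns a job id; fetching that job's results is
    # what lands rows in the on-disk "results" directory mentioned above.
    print(req.full_url)
```

The query itself runs against the source data in executor memory; only the returned job results (and any reflections) touch Dremio's disk.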