The /api/v3/sql endpoint lets us submit a SQL query and returns a job ID, which we can then use to fetch the results later, in a paged manner, and multiple times if needed.
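For anyone new to this flow, here is a minimal sketch of what I mean, assuming a local coordinator at localhost:9047 and a previously obtained auth token (both placeholders):

```python
import time
import requests

DREMIO = "http://localhost:9047"   # placeholder coordinator URL
HEADERS = {
    "Authorization": "_dremio<auth-token>",  # placeholder token (obtained via login)
    "Content-Type": "application/json",
}

# 1. Submit the query; the response only carries a job ID, not the rows.
job_id = requests.post(
    f"{DREMIO}/api/v3/sql",
    json={"sql": "SELECT * FROM sys.options"},
    headers=HEADERS,
).json()["id"]

# 2. Poll until the job reaches a terminal state.
while True:
    state = requests.get(f"{DREMIO}/api/v3/job/{job_id}", headers=HEADERS).json()["jobState"]
    if state in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(1)

# 3. Page through the cached results; this can be repeated any number of times.
offset, limit = 0, 500
while state == "COMPLETED":
    page = requests.get(
        f"{DREMIO}/api/v3/job/{job_id}/results",
        params={"offset": offset, "limit": limit},
        headers=HEADERS,
    ).json()
    for row in page["rows"]:
        print(row)
    offset += limit
    if offset >= page["rowCount"]:
        break
```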
This is a very good mechanism. However, a few questions regarding the underlying architecture:
How long are these SQL query job results available? Can we expect them to persist permanently (even across server restarts, etc.)?
Where are these results stored (in memory, in some disk-based cache, or …)?
How do we clean up old jobs and their associated query results so that we can reclaim the memory/storage space? Is there a job-removal API for successful jobs that also deletes the cached query results?
Job results are stored on disk, so they survive a restart. By default we keep them for 24 hours; you can configure the age of job results here. The documentation is out of date and says 30 days, but that was changed a while back; I'll open a ticket to fix that.
Jobs themselves are kept around for 30 days by default, configured by jobs.max.age_in_days.
The cleanup is done daily by a background task.
Results are stored under the results directory in the dist location specified in dremio.conf. You can safely delete files there; Dremio handles deleted job results.
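Since files there can be deleted safely, a small external cleanup script is also an option. A rough sketch (the dist path and the one-directory-per-job layout are assumptions about a particular deployment, not documented guarantees):

```python
import shutil
import time
from pathlib import Path

# Placeholder: the "results" directory under the dist location from dremio.conf (paths.dist).
RESULTS_DIR = Path("/var/lib/dremio/data/pdfs/results")
MAX_AGE_SECONDS = 24 * 60 * 60   # mirror the 24-hour default mentioned above

cutoff = time.time() - MAX_AGE_SECONDS
for entry in RESULTS_DIR.iterdir():
    # Assumption: each job's results live in their own subdirectory, last touched
    # when the job finished; remove the whole directory once it is older than the cutoff.
    if entry.is_dir() and entry.stat().st_mtime < cutoff:
        shutil.rmtree(entry)
        print(f"removed stale job results: {entry.name}")
```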
Is there any way the results.max.age_in_days setting can be changed programmatically (through the REST API)?
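In the meantime, one workaround I'm considering (untested, and assuming this support key can be changed with ALTER SYSTEM SET) is to push the setting through the SQL API itself:

```python
import requests

DREMIO = "http://localhost:9047"   # placeholder coordinator URL
HEADERS = {
    "Authorization": "_dremio<auth-token>",  # placeholder token
    "Content-Type": "application/json",
}

# Submit the ALTER SYSTEM statement as a regular SQL job.
resp = requests.post(
    f"{DREMIO}/api/v3/sql",
    json={"sql": 'ALTER SYSTEM SET "results.max.age_in_days" = 1'},
    headers=HEADERS,
)
print(resp.json())   # returns a job ID like any other query
```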
Saving the SQL query results to disk opens up very interesting workflow possibilities, especially with AI/ML pipelines. However, it would also require somewhat finer-grained control over the lifetime of the results to use the storage effectively.
Please consider the points below as feature-request suggestions:
Controlling the result expiry on a per-job basis (overriding the global setting when needed).
i.e., at job creation time, the user can specify an expiry for the query results, and the results will expire (and be cleaned up) after that time.
Allowing relative expiry times, i.e., computing the expiry as ("last accessed date" + max age) rather than ("creation date" + max age).
This way, frequently accessed query results automatically live longer, while rarely used results get cleaned up earlier.
Comparing the results of two versions of an algorithm (fine-tuning) is a very common requirement in AI/ML, and the above features would directly support it.
The job results are stored primarily for the UI when listing results, and they are kept in a file format designed for that purpose. We don't currently consider it something other tools should consume, and we may break compatibility with old results. Results are also pruned to about 1 million rows.
If you want more control over how results are stored, I would suggest using CTAS to write the results somewhere yourself and managing them that way.
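For example, a hedged sketch of that approach over the same SQL API (the "$scratch" target, table name, and source dataset are placeholders; any writable space or source would work):

```python
import requests

DREMIO = "http://localhost:9047"   # placeholder coordinator URL
HEADERS = {
    "Authorization": "_dremio<auth-token>",  # placeholder token
    "Content-Type": "application/json",
}

# Persist the result set into a location you control, then drop it yourself when done.
ctas = 'CREATE TABLE "$scratch".algo_v2_results AS SELECT * FROM myspace.my_dataset'
requests.post(f"{DREMIO}/api/v3/sql", json={"sql": ctas}, headers=HEADERS)

# Later, once the comparison is finished:
requests.post(
    f"{DREMIO}/api/v3/sql",
    json={"sql": 'DROP TABLE "$scratch".algo_v2_results'},
    headers=HEADERS,
)
```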