We are running the latest version (24.0.0-202302100528110223-3a169b7c). The master node storage keeps growing even though we have jobs.max.age_in_days and results.max.age_in_days set to 1 day. These have worked on previous versions.
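For reference, the two retention settings mentioned above (these are support keys, typically set via Settings > Support in the UI rather than in dremio.conf; the 1-day values shown are the ones described in this post):

```
jobs.max.age_in_days = 1
results.max.age_in_days = 1
```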
It looks like most of the storage is taken up by the catalog, i.e. the .log and .sst files. Running dremio-admin clean frees up the space. Is there another support setting I should be looking at?
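For anyone following along, a sketch of the cleanup procedure (the flags below are from memory and may differ by version, so check `dremio-admin clean -h` before running anything; note that this is offline maintenance, so the coordinator must be stopped first):

```shell
# Stop the coordinator before running offline maintenance
./bin/dremio stop

# List the cleanup options available in this version
./bin/dremio-admin clean -h

# Example invocation (hypothetical flags, verify with -h):
# delete orphan entries and compact the RocksDB-backed KV store
./bin/dremio-admin clean -o -c

./bin/dremio start
```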
Unrelated: stopping and starting the service did not work without changing dremio.conf; we had to disable SSL. It makes me wonder how it was able to start in the first place.
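For context, the change we made was along these lines in dremio.conf (key path from memory; verify against the configuration reference for your version):

```
services.coordinator.web.ssl: {
  # Disabling SSL was required for the service to start again
  enabled: false
}
```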
@mlilius Thanks for the report. Your jobs and profiles are only 1.2 GB, so it looks like the remaining files are needed for recovery in case of an unexpected failure. Can you increase the mount to 30 GB and check whether usage grows beyond the current 15 GB?
@balaji.ramaswamy Sorry, I mistyped: the EBS-backed storage is 50 GB. I have not needed to run the clean yet, but usage is creeping close; it is now consuming 31 GB. I ran another kvstore/report and attached the results here.
We’re seeing a higher usage than expected on the master node as well. I’m not proud to admit it, but we actually took an outage on Dremio last week, because the ~200GB allocated was all used and we weren’t monitoring usage on that disk closely enough.
We have configured jobs.max.age_in_days=90 in order to have proper history, so a higher-than-normal usage was expected. That said, this Dremio installation is probably only serving a few thousand queries per day, so I was surprised about the usage.
For now, we have moved the master storage to a 4TB drive, until we get to the bottom of it (we run on-prem, so increasing the backing volume is not as easy as EBS). We did observe ~60GB being used overnight, which I imagine was a RocksDB compaction gone awry. Based on that, a good rule of thumb is probably to have at least 50% space free, in order to be safe.
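That 50%-free rule of thumb can be written down as a quick sizing check (a sketch based on the numbers in this thread, not an official Dremio recommendation; `recommended_volume_gb` is a name I made up for illustration):

```python
def recommended_volume_gb(steady_state_gb: float, headroom_fraction: float = 0.5) -> float:
    """Size the metadata volume so that at least `headroom_fraction` of it
    remains free at steady state, leaving room to absorb transient RocksDB
    compaction spikes like the ~60 GB one we observed overnight."""
    return steady_state_gb / (1.0 - headroom_fraction)

# e.g. ~200 GB steady-state usage with 50% free headroom -> a 400 GB volume
print(recommended_volume_gb(200.0))
```

With the default 50% headroom this simply doubles the steady-state usage, which matches the "at least 50% space free" suggestion above.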
Much appreciated @balaji.ramaswamy - I’ll run this and get some insight. I’m on vacation the next two weeks, but I’ll chip in when I’m back.
This, as far as I recall, requires taking the master offline while running that maintenance. That would be an issue for us, as I imagine it would be for many. Hopefully that shouldn't be necessary on a regular basis?
@mlilius I checked your report; the options above have already cleaned up as expected. The issue is that all the .sst and .log files are needed for recovery, so a restart clears them automatically. 15 GB seems too small for your activity level; I see close to 16K jobs per day. Any chance you can increase it to 50 GB?