AWS Dremio community edition drive filling up

We are running the latest version (24.0.0-202302100528110223-3a169b7c). The master node storage keeps growing even though we have jobs.max.age_in_days and results.max.age_in_days set to 1 day. These settings worked on previous versions.

It looks like most of the storage is taken up by the catalog (the .log and .sst files). Running dremio-admin clean frees up the space. Is there another support setting I should be looking at?

Unrelated: stopping and starting the service did not work until we changed the dremio.conf file to disable SSL. It makes me wonder how it was able to start initially.


@mlilius Is the disk volume 15 GB? Can you run the API command below and send the output zip file? That will show us what is using up all the space.

curl --location --request GET 'localhost:9047/apiv2/kvstore/report?store=none' \
--header 'Authorization: _dremioirr8hj6qnpc3tfr3omiqvev51c' > kvstore_report.zip

@balaji.ramaswamy yes it is 15 GB. That command gave an empty file; I was able to get the report after obtaining a fresh authorization token. (2.0 KB)
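For anyone else who gets an empty file back: a stale token produces an empty response, so fetch a fresh one via the login endpoint first. A sketch, assuming a local coordinator on the default port, `jq` installed, and placeholder credentials (the `_dremio` header prefix matches the token format shown above):

```shell
# Log in and extract the session token from the JSON response
# (userName/password are placeholders -- use your own admin account).
TOKEN=$(curl -s -X POST 'http://localhost:9047/apiv2/login' \
  --header 'Content-Type: application/json' \
  --data '{"userName": "admin", "password": "your-password"}' \
  | jq -r '.token')

# The Authorization header is the token prefixed with "_dremio".
curl --location --request GET 'localhost:9047/apiv2/kvstore/report?store=none' \
  --header "Authorization: _dremio${TOKEN}" > kvstore_report.zip
```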

@mlilius Thanks for the report. Your jobs and profiles are only 1.2 GB, so it looks like these files are being kept for recovery in case of an unexpected failure. Can you increase the mount to 30 GB and see whether usage keeps growing past 15 GB?

@balaji.ramaswamy sorry, I mistyped. The EBS-backed storage is 50 GB. I have not needed to run the clean yet, but usage is creeping close: it is now consuming 31 GB. I ran another kvstore/report and attached the results here. (2.0 KB)


We’re seeing higher usage than expected on the master node as well. I’m not proud to admit it, but we actually took an outage on Dremio last week because the ~200 GB allocated was fully used and we weren’t monitoring usage on that disk closely enough.

We have configured jobs.max.age_in_days=90 in order to have proper history, so a higher-than-normal usage was expected. That said, this Dremio installation is probably only serving a few thousand queries per day, so I was surprised about the usage.

For now, we have moved the master storage to a 4 TB drive until we get to the bottom of it (we run on-prem, so increasing the backing volume is not as easy as EBS). We did observe ~60 GB being consumed overnight, which I imagine was a RocksDB compaction gone awry. Based on that, a good rule of thumb is probably to keep at least 50% of the volume free, in order to be safe.

Much appreciated @balaji.ramaswamy - I’ll run this and get some insight. I’m on vacation the next two weeks, but I’ll chip in when I’m back.

This, as far as I recall, requires taking the master offline while running that maintenance. That would be an issue for us, as I imagine it would be for many. Hopefully it shouldn’t be necessary on a regular basis?

@balaji.ramaswamy @wundi yes, it does require the master to be offline, causing downtime. It is not ideal, but it works for now to prevent unscheduled downtime until the root cause can be found.
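For reference, the rough offline-clean sequence looks like this. The service name assumes a systemd-managed install; adjust it (and the clean flags) for your environment:

```shell
# Take the master coordinator offline before running maintenance
sudo systemctl stop dremio

# Clean runs against the local KV store while Dremio is stopped
dremio-admin clean -j=1 -o -p -c -i

# Bring the master back up
sudo systemctl start dremio
```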

@wundi @mlilius A bare clean only prints the report and does not clean anything. What is happening is that, because you are restarting, the SST files that are no longer needed for recovery are getting dropped.

@balaji.ramaswamy Sorry for the confusion. The complete command I am running is dremio-admin clean -j=1 -o -p -c -i.
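For other readers, my understanding of what those flags do (verify against `dremio-admin clean --help` on your version, as option behavior can change between releases):

```shell
# -j=1  delete job history older than 1 day
# -o    delete orphaned entries in the KV store
# -p    delete orphaned profiles
# -c    compact the underlying store
# -i    reindex the data
dremio-admin clean -j=1 -o -p -c -i
```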

I’ve attached the results from the command if that helps. (2.7 KB)

@mlilius I checked your report; the above options have already cleaned as expected. The issue is that the .sst and .log files are all needed for recovery, so a restart clears them automatically. 15 GB seems too small for your level of activity (I see close to 16K jobs per day). Any chance you can increase it to 50 GB?