AWS Dremio community edition drive filling up

We are running the latest version (24.0.0-202302100528110223-3a169b7c). The master node storage keeps growing even though we have jobs.max.age_in_days and results.max.age_in_days set to 1 day. These settings worked on previous versions.

It looks like most of the storage is taken up by the catalog (the .log and .sst files). Running dremio-admin clean frees up the space. Is there another support setting I should be looking at?

Unrelated: stopping and starting the service did not work until we changed the dremio.conf file to disable SSL. It makes me wonder how it was able to start initially.


@mlilius Is the disk volume 15 GB? Can you run the API command below and send the output zip file? That will show us what is using up all the space.

curl --location --request GET 'localhost:9047/apiv2/kvstore/report?store=none' \
--header 'Authorization: _dremioirr8hj6qnpc3tfr3omiqvev51c' > kvstore_report.zip

@balaji.ramaswamy yes it is 15 GB. That command gave an empty file; I was able to get the report after obtaining a fresh authorization token. (2.0 KB)
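For anyone else who gets an empty file back: a stale token produces an empty response, so fetch a fresh one via the login endpoint first. A sketch, assuming a local coordinator on the default port, `jq` installed, and placeholder credentials (the `_dremio` header prefix matches the token format shown above):

```shell
# Log in and extract the session token from the JSON response
# (userName/password are placeholders -- use your own admin account).
TOKEN=$(curl -s -X POST 'http://localhost:9047/apiv2/login' \
  --header 'Content-Type: application/json' \
  --data '{"userName": "admin", "password": "your-password"}' \
  | jq -r '.token')

# The Authorization header is the token prefixed with "_dremio".
curl --location --request GET 'localhost:9047/apiv2/kvstore/report?store=none' \
  --header "Authorization: _dremio${TOKEN}" > kvstore_report.zip
```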

@mlilius Thanks for the report. Your jobs and profiles are only 1.2 GB, so it looks like these files are being kept for recovery in case of an unexpected failure. Can you increase the mount to 30 GB and see whether usage keeps growing past 15 GB?

@balaji.ramaswamy sorry, I mistyped. The EBS-backed storage is 50 GB. I have not needed to run the clean yet, but usage is creeping close: it is now consuming 31 GB. I ran another kvstore/report and attached the results here. (2.0 KB)


We’re seeing higher usage than expected on the master node as well. I’m not proud to admit it, but we actually took an outage on Dremio last week because the ~200 GB allocated was fully used and we weren’t monitoring usage on that disk closely enough.

We have configured jobs.max.age_in_days=90 in order to have proper history, so a higher-than-normal usage was expected. That said, this Dremio installation is probably only serving a few thousand queries per day, so I was surprised about the usage.

For now, we have moved the master storage to a 4 TB drive until we get to the bottom of it (we run on-prem, so increasing the backing volume is not as easy as EBS). We did observe ~60 GB being consumed overnight, which I imagine was a RocksDB compaction gone awry. Based on that, a good rule of thumb is probably to keep at least 50% of the volume free, in order to be safe.

Much appreciated @balaji.ramaswamy - I’ll run this and get some insight. I’m on vacation the next two weeks, but I’ll chip in when I’m back.

This, as far as I recall, requires taking the master offline while running that maintenance. That would be an issue for us, as I imagine it would be for many. Hopefully it shouldn’t be necessary on a regular basis?

@balaji.ramaswamy @wundi yes, it does require the master to be offline, causing downtime. It is not ideal, but it works for now to prevent unscheduled downtime until the root cause can be found.
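For reference, the rough offline-clean sequence looks like this. The service name assumes a systemd-managed install; adjust it (and the clean flags) for your environment:

```shell
# Take the master coordinator offline before running maintenance
sudo systemctl stop dremio

# Clean runs against the local KV store while Dremio is stopped
dremio-admin clean -j=1 -o -p -c -i

# Bring the master back up
sudo systemctl start dremio
```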

@wundi @mlilius A bare clean only prints the report and does not clean anything. What is happening is that, because you are restarting, the SST files that are no longer needed for recovery are getting dropped.

@balaji.ramaswamy Sorry for the confusion. The complete command I am running is dremio-admin clean -j=1 -o -p -c -i.
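For other readers, my understanding of what those flags do (verify against `dremio-admin clean --help` on your version, as option behavior can change between releases):

```shell
# -j=1  delete job history older than 1 day
# -o    delete orphaned entries in the KV store
# -p    delete orphaned profiles
# -c    compact the underlying store
# -i    reindex the data
dremio-admin clean -j=1 -o -p -c -i
```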

I’ve attached the results from the command if that helps. (2.7 KB)

@mlilius I checked your report; the above options have already cleaned as expected. The issue is that the .sst and .log files are all needed for recovery, so a restart clears them automatically. 15 GB seems too small for your level of activity (I see close to 16K jobs per day). Any chance you can increase it to 50 GB?