We have a setup of Dremio on AWS. We run a lot of API requests against Dremio using Arrow Flight and create a ton of virtual and physical datasets on top of CSV and Parquet files.
The error we get when the disk fills up is:
exception: org.rocksdb.RocksDBException: While appending to file: /var/lib/dremio/db/catalog/002260.log: No space left on device
I have added screenshots below of:
- the version of Dremio running on AWS
- what the system disk allocation looked like when the db was full
- the tail of the LOG file in ${DREMIO_HOME}/db/catalog
I've also set these parameters in the Support settings.
So far, the only effective way I have found to fix RocksDB filling up has been to restart the Dremio deployment via CloudFormation, which clears the RocksDB files without affecting the data.
Please advise whether there is an automated way to prevent RocksDB from filling up and blocking all functionality and logins.
Thanks!
@amar_ikigai During a restart, RocksDB flushes the memtables and removes the WAL (.log) files that are no longer needed for recovery. What is the total size of the disk?
Can you run the below API and send the zip file it creates? This can be done while Dremio is up and running.
curl --location --request GET 'localhost:9047/apiv2/kvstore/report?store=none' \
--header 'Authorization: _dremioirr8hj6qnpc3tfr3omiqvev51c' > kvstore_summary.zip
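The Authorization value is a Dremio session token. A minimal sketch of obtaining one through the v2 login endpoint and passing it to the kvstore report (the user name, password, and host here are placeholders; adjust to your environment):
# Log in and capture the session token (placeholder credentials)
TOKEN=$(curl -s -X POST 'http://localhost:9047/apiv2/login' \
  --header 'Content-Type: application/json' \
  --data '{"userName": "admin", "password": "your-password"}' \
  | sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p')
# Dremio expects the token prefixed with "_dremio" in the Authorization header
curl --location --request GET 'localhost:9047/apiv2/kvstore/report?store=none' \
  --header "Authorization: _dremio${TOKEN}" > kvstore_summary.zip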
Hi @balaji.ramaswamy,
We have two different Dremio instances:
- one for development (this one), with a 50 GB disk
- one for production, with a 150 GB disk
The development instance has tons of .log files that keep eating up all the disk space. We are worried the same will happen in production, which would be a huge hit for our customers, so we want an automated and safe way to clear these RocksDB files within the catalog folder without downtime for our Dremio instances.
Here is the KVstore summary
kvstore_summary 2.zip (2.5 KB)
@amar_ikigai From your report, none of the internal stores are using any significant space. At the time you took this report, what was the disk usage?
Hi @balaji.ramaswamy,
The disk space is filling up pretty quickly. Here are some summaries:
# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 508K 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/nvme1n1 50G 28G 20G 59% /mnt/c1
# du -sch *    (inside /var/lib/dremio)
80M cm
28G db
104K etc
16K lost+found
36K results
12K s3Backup
12K spilling
28G total
# du -sch *    (inside /var/lib/dremio/db)
232K blob
28G catalog
232K metadata
24M search
28G total
# du -sch *.log    (inside /var/lib/dremio/db/catalog)
74M 004125.log
73M 004128.log
...
73M 005264.log
73M 005268.log
73M 005270.log
71M 005272.log
26G total
Looking through the live LOG file as well, there is some information in the DB Stats section:
** DB Stats **
Uptime(secs): 531096.8 total, 3338.4 interval
Cumulative writes: 8890K writes, 8890K keys, 8797K commit groups, 1.0 writes per commit group, ingest: 25.80 GB, 0.05 MB/s
Cumulative WAL: 8890K writes, 0 syncs, 8890516.00 writes per sync, written: 25.80 GB, 0.05 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 13K writes, 13K keys, 13K commit groups, 1.0 writes per commit group, ingest: 72.51 MB, 0.02 MB/s
Interval WAL: 13K writes, 0 syncs, 13915.00 writes per sync, written: 0.07 MB, 0.02 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
Here we can see that the cumulative WAL has grown to roughly 26 GB (25.80 GB written, matching the ~26 GB of .log files above), which doesn't make a lot of sense to me.
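To measure how fast these .log files are growing, a rough sketch (assumes the /var/lib/dremio/db/catalog path from the error above; the one-hour interval is arbitrary):
# Sum the WAL .log files twice, an hour apart, to estimate the growth rate
CATALOG=/var/lib/dremio/db/catalog
before=$(du -cm "$CATALOG"/*.log | tail -1 | cut -f1)
sleep 3600
after=$(du -cm "$CATALOG"/*.log | tail -1 | cut -f1)
echo "WAL grew by $((after - before)) MB in the last hour"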
I did some digging into the RocksDB OPTIONS ini files that govern the RocksDB settings on the Dremio coordinator node.
A few options jump out at me:
max_total_wal_size=0
WAL_size_limit_MB=0
db_write_buffer_size=0
max_log_file_size=0
Based on these, it seems the RocksDB WAL can grow to an arbitrarily large size, and that .log files can accumulate indefinitely before being flushed to SST files.
Reference: https://github.com/facebook/rocksdb/blob/23af6786a997d3592e8a68f1a8d9e0699a6eae36/include/rocksdb/options.h#L621
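These values can be cross-checked against the options file that RocksDB itself persists in the catalog directory (a sketch; the path assumes the same layout as in the error above):
# RocksDB writes its effective settings to OPTIONS-<N> files inside the DB directory
grep -E 'max_total_wal_size|WAL_size_limit_MB|db_write_buffer_size|max_log_file_size' \
  /var/lib/dremio/db/catalog/OPTIONS-*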
Main Question: Is there a safe way to modify these parameters and trigger a WAL flush of the .log files through Dremio settings, or is this something I would have to tinker with manually?
Secondary Question: **Is there a hidden support key that we could configure to stop the WAL from growing bigger than n GB?**
@amar_ikigai Changes to RocksDB settings are not tested.
hi!
what was the solution you went with eventually? I am facing the same issue! It would be awesome if you could share your approach briefly!
thanks,
kyle
@kyleahn Does a restart clear the files? If so, that tells us they are just files needed for recovery in case Dremio goes down unexpectedly.
I would also like to understand where the space is used. What is the total disk space where the 'db' folder is created? Does this mount also have the Dremio cloud cache (C3) files configured?
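One way to check (a sketch; the data path is taken from the error earlier in the thread, and the dremio.conf location is an assumption that varies by install):
df -h /var/lib/dremio                     # which mount holds the Dremio local data
du -sh /var/lib/dremio/*                  # which subdirectory is consuming it
grep -i 'cache' /etc/dremio/dremio.conf   # C3 cache paths, if configured (config path assumed)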
Are you able to run the below API and send us the output?
curl --location --request GET 'localhost:9047/apiv2/kvstore/report?store=none' \
--header 'Authorization: _dremioirr8hj6qnpc3tfr3omiqvev51c' > kvstore_summary.zip
I was able to set jobs.max.age_in_days and results.max.age_in_days to a lower value, and that resolved the issue. But I do still see the disk space filling up quite aggressively. Currently the master node has a 32 GB disk, and we use Dremio only with Glue Iceberg tables with external reflections. Should I still expect 32 GB to fill up every day when the query volume isn't high (less than 500 queries a day)?
kvstore_summary.zip (2.0 KB)
@kyleahn I am asking about the disk space: are you saying the free space is 32 GB? The profiles and jobs stores are using about 70 GB, and you need some room to store recovery files, so I would allocate a 150 GB to 200 GB disk. Now that you have reduced the jobs/profiles retention, usage will come down. You can also force an offline cleanup of jobs by shutting down the coordinator and then running dremio-admin clean -j 7, which will clean all jobs/profiles older than 7 days.
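A sketch of that offline cleanup (the systemd service name and DREMIO_HOME path are assumptions; on the AWS CloudFormation deployment, stop and start Dremio however you normally do):
# The coordinator must be down before running dremio-admin clean
sudo systemctl stop dremio
# Purge jobs and profiles older than 7 days from the KV store
${DREMIO_HOME}/bin/dremio-admin clean -j 7
sudo systemctl start dremio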
jobs
basic rocks store stats
* Estimated Number of Keys: 8215060
* Estimated Live Data Size: 3459823720
* Total SST files size: 4091536398
* Pending Compaction Bytes: 0
* Estimated Blob Count: 0
* Estimated Blob Bytes: 0
profiles
basic rocks store stats
* Estimated Number of Keys: 7977327
* Estimated Live Data Size: 71978302888
* Total SST files size: 72334638944
* Pending Compaction Bytes: 0
* Estimated Blob Count: 0
* Estimated Blob Bytes: 0