Dremio db folder access costs are out of control

Hey guys

What is the db folder used for? Why does it generate so much IO?

We have a Dremio environment with the following rough folder sizes:
/opt/dremio/data/db/catalog: 156 GB
/opt/dremio/data/db/search: 13 GB

For the last 2 full days, our EFS IO charges were $227 and $222 USD. Over the last year, we have had spikes of $900 USD per day for EFS IO. I wasn’t keeping an eye on the charges back then, so I don’t know what the Dremio catalog sizes were at those times.

But for the last 2 full days, the total IO traffic associated with the EFS volume was 7.9 TB and 7.7 TB. The read portion of that IO traffic was 7.7 TB and 7.4 TB respectively for those 2 days. This correlates well with the AWS Cost Explorer finding that most of these IO charges are for IO reads and not IO writes.
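
For reference, the read/write split comes from the CloudWatch EFS metrics; a query along these lines reproduces it (the filesystem ID and the dates below are placeholders):

# Daily EFS read bytes; swap DataReadIOBytes for DataWriteIOBytes to get the write side
aws cloudwatch get-metric-statistics \
--namespace AWS/EFS \
--metric-name DataReadIOBytes \
--dimensions Name=FileSystemId,Value=fs-XXXXXXXX \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-03T00:00:00Z \
--period 86400 \
--statistics Sum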

What could be causing so much IO?

Our paths.local folder is mapped to AWS EFS. If I understand it correctly, EFS is a form of NFS, which Dremio supports. Our paths.dist folder is mapped to S3, so all the reflection data storage and access should be billed as part of the S3 charges and not EFS.
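
For reference, the relevant dremio.conf entries are along these lines (a sketch; the bucket name below is a placeholder, and the exact dist URI depends on how the S3 distributed storage is configured):

# dremio.conf (paths only, sketch)
paths: {
  local: "/opt/dremio/data",                  # lives on the EFS mount
  dist: "dremioS3:///example-bucket/dremio"   # reflection/accelerator data goes to S3
}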

At one stage, I thought the EFS charges were related to Dremio’s default setup of the spill folder being mapped to ${paths.local}/spill. In a desperate attempt to make these EFS IO charges go away, even though it’s not explicitly stated by Dremio as being supported, I remapped the spill folder to an EC2 instance storage volume. This didn’t help, so spilling from larger queries is not the cause either.
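
The remap itself is just a spill-path override in dremio.conf, something like the below (the instance-store mount point is a placeholder):

# dremio.conf override (sketch); replaces the default ${paths.local}/spill
paths.spilling: ["/mnt/instance-store/dremio-spill"]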

One more thing to mention: the “reflection.cloud.cache.enabled” support key has been set to false to avoid EFS IO. Since it is a support key, I didn’t think a Dremio restart was necessary; if a restart is required for this change to take effect, please let me know. Our EFS IO costs haven’t changed in any noticeable way.

This might be a separate topic, but why does Dremio by default map the cloud cache folder to a folder under paths.local, when Dremio also recommends that paths.local be mapped to shared storage like NFS? People running Dremio in the cloud would most likely be using some form of metered NFS solution. This could have been another way to bankrupt us, but for the moment it looks like that’s not what’s causing the high EFS IO.
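
An alternative to disabling it would presumably be to relocate the cloud cache onto local disk. A sketch of what that dremio.conf override might look like, assuming the path keys from the cloud cache (C3) configuration docs and a hypothetical mount at /mnt/c3 (the key names should be verified against your Dremio version):

# dremio.conf (sketch; key names assumed from the C3 configuration docs)
services: {
  executor: {
    cache: {
      path: {
        db: "/mnt/c3",
        fs: ["/mnt/c3"]
      }
    }
  }
}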

Back to the main topic. In our current state, if I’m understanding everything correctly, there is nothing obviously IO-intensive mapped to the paths.local folder.

We do have a lot of reflections defined, and they all refresh regularly on a schedule. Based on my understanding, the data associated with those reflections is saved to paths.dist, which is backed by S3. So why is the “/opt/dremio/data/db/catalog” folder so huge? As far as I can tell, all the EFS IO is related to access to files under that catalog folder.
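
To back that up, here is a quick breakdown of what the catalog folder actually contains. If I understand correctly, it is the RocksDB KV store directory, so the bulk should be .sst table files plus .log write-ahead logs (paths as per our layout above):

# count the RocksDB .sst table files and total their size, then do the same for the .log WAL files
find /opt/dremio/data/db/catalog -name '*.sst' | wc -l
find /opt/dremio/data/db/catalog -name '*.sst' -print0 | du -ch --files0-from=- | tail -1
find /opt/dremio/data/db/catalog -name '*.log' -print0 | du -ch --files0-from=- | tail -1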

Are we doing something wrong, or are Dremio reflections just a scam? When reflections are turned on, we can avoid hits on the backing data source by having the data cached in S3-based storage. But the IO associated with maintaining the reflection definitions in the catalog folder generates so much IO that a metered NFS service will cost many times more than hits to the backing data source itself.

So is the only way for Dremio to be cost effective in a cloud-based deployment to run it in a single-node arrangement, with the paths.local folder mapped to either EBS storage or instance storage?

Please advise us on how we can avoid bankruptcy. Thanks.

In the words of Jackie Wilson
♫your love, keep on, lifting our-running-costs♫
♫higher and higher♫

Here are the EFS IO costs for the subsequent days:

  • $244
  • $270
  • $245

The EFS IO read metrics for those days were:

  • 8.2 TB
  • 9.1 TB
  • 8.3 TB

But I do have some good news to report. On Sunday, the master container crashed. I don’t yet have a full day of stats to report after the restart, but looking at the hourly figures so far, it looks like the IO read rates are significantly lower.

So this observation confirms our suspicion that this ever-increasing IO activity isn’t directly related to our usage of Dremio, i.e. it is related to something Dremio is doing internally to maintain its internal data structures associated with the virtual datasets and reflections we have defined in that environment.

I also forgot to mention in the original post: our Dremio version is 24.2.2. Thanks.

@asdf01 Can you please run a KVstore report on the RocksDB and send us the output? This can be done while Dremio is up.

curl --location --request GET 'localhost:9047/apiv2/kvstore/report?store=none' \
--header 'Authorization: _dremioirr8hj6qnpc3tfr3omiqvev51c' > kvstore_summary.zip
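
The token in the Authorization header is a session token (the “_dremio” prefix followed by the token value). If you need a fresh one, the login endpoint returns it, something like the below (the username and password are placeholders):

# fetch a session token; the "token" field in the JSON response goes after "_dremio" in the Authorization header
curl -s --request POST 'localhost:9047/apiv2/login' \
--header 'Content-Type: application/json' \
--data '{"userName": "admin", "password": "changeme"}'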

Also please send output of below

du -sh /opt/dremio/data
du -sh /opt/dremio

Thanks
Bali

Hi @balaji.ramaswamy. Thanks for your interest.

run a KVstore report

I have sent that to you at the work email address you supplied in one of the previous posts.

send output of below
du -sh /opt/dremio/data

144G /opt/dremio/data/

du -sh /opt/dremio

145G /opt/dremio/

Thanks

@asdf01

Thanks, I will check the KVstore report. I made a small mistake in the command; can you please do the below instead?

cd /opt/dremio
du -sh *
cd /opt/dremio/data
du -sh *
cd /opt/dremio/data/db
du -sh *

Thanks
Bali

@asdf01 About 100 GB is being used by profiles, jobs, the Nessie store, and the reflection store. How many reflections do you have?

Some space is also needed by the .sst and .log files to bring Dremio back to a consistent state when Dremio is not stopped properly.

Please send the output of the commands above, and hopefully that should tell us the missing ~ 60 GB.

Hi Bali. Thanks for your continued interest.

do the below instead
cd /opt/dremio
du -sh *

36K bin
32K conf
135G data
4.0K dremio.conf.additions
4.0K dremio-env.additions
558M jars
744K lib
476K licenses
321M plugins
4.0K preDmStartup.sh
16K share

cd /opt/dremio/data
du -sh *

113M cm
134G db
1.8M spill
129M zk

cd /opt/dremio/data/db
du -sh *

184K blob
131G catalog
184K metadata
3.4G search

Please ignore the /opt/dremio/data/spill folder. Spill was remapped to a folder outside of /opt/dremio/data a while ago; what’s there now is just some older files from before the remapping change.

How many reflections do you have

11686

I don’t yet have a full day of stats to report after the restart

Here are the 2 full days of stats after the restart:
EFS charges:

  • $119
  • $118

EFS read IO metrics:

  • 3.1 TB
  • 4.0 TB

There might be some date-boundary mismatch between the Cost Explorer numbers and the CloudWatch EFS metrics, but it looks like the EFS read IO usage and costs dropped suddenly due to the restart and are increasing again over time.

hopefully that should tell us the missing ~ 60 GB

Just in case there is a misunderstanding: we are not at all concerned about the EFS storage space usage. What is killing us is the EFS read IO charges. Thanks.