Dremio distributed storage grows too large

Especially in the accelerator cache path.

Our Dremio cluster so far only reads a specific set of about 2TB of data on S3, growing slowly at roughly 10GB a day. Yet the size of /my-bucket/dremio/accelerator/ has now become a staggering 44TB! Is there any way to rotate or clean up stale accelerator objects that are no longer usable?

Wow, that is surprising!

Which version of Dremio are you running, and can you tell us more about how it is deployed and how many nodes you have?

We’re using Community Edition 3.0.6. Our current setup is:

  • 1 dedicated master
  • 1 dedicated coordinator
  • 3 dedicated ZooKeeper nodes
  • 8-20 worker nodes with auto-scaling
  • distributed storage on a dedicated S3 bucket, with fs.s3a.connection.maximum set to 1500

Wow, @chafidz! That’s a lot of reflection data. A few questions come to mind here:

  1. How many reflections do you have enabled? You can do a quick count from Admin → Reflections or by running the query below in Dremio:

SELECT COUNT(*) FROM sys.reflections WHERE status <> 'DISABLED'

  2. What are the sizes of these reflections? Reflections are logical units, each with its own reflection ID and a set of materializations that represent the physical files associated with that reflection. As a reflection is refreshed, it gets a new materialization set that persists until its expiration. You can see all the materializations and their sizes using:

SELECT * FROM sys.materializations
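To see which reflections are consuming the most space, you can aggregate the per-materialization sizes. A sketch, assuming your version exposes a size column (named `size_bytes` here; the exact column name varies by Dremio version, so check the output of `SELECT * FROM sys.materializations` first):

```sql
-- Total materialization footprint per reflection.
-- size_bytes is an assumption; verify the column name against
-- your sys.materializations schema.
SELECT reflection_id,
       COUNT(*)        AS materialization_count,
       SUM(size_bytes) AS total_bytes
FROM sys.materializations
GROUP BY reflection_id
ORDER BY total_bytes DESC
```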

  3. These reflections reference upstream physical datasets in your various sources. What are the reflection refresh settings for the sources and the individual datasets? You can find this out by selecting a source or a PDS within it, clicking the cog wheel icon, and going to the “Reflection Refresh” tab. Do you have “Never expire” checked for anything?

One correction, @chafidz: if you are using Community Edition, Admin → Reflections is not available to you (it is an Enterprise-only feature). Use the SQL queries against the system tables that I provided to see your reflections.

Thanks @ben, much appreciated. By the way, I see some reflection_ids with status DELETED, but the objects still remain in S3. Are those safe to delete?

Hello @chafidz,

Where are you seeing DELETED? In sys.materializations under the state column, or in sys.reflections under the status column? I don’t believe the latter should ever have that value…

Dremio should be cleaning up the expired materializations files, though it will leave some directories behind. You should not have to do any manual cleanup.

It’s from sys.reflections

Sorry, it should be sys.materializations. Anyway, we kind of ‘solved’ it by manually deleting any materializations with DELETED status but an absurdly far-future expiration date.
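For anyone else hitting this, a query along these lines can list the suspect materializations before doing any manual S3 cleanup. This is a sketch: the column names are assumptions and may differ by Dremio version (for example, the expiration column may be named `expires` in some releases), so verify them against your own sys.materializations schema first:

```sql
-- List materializations marked DELETED whose expiration is still in the
-- future, i.e. files Dremio has not cleaned up yet.
-- Column names (state, "expiration") are assumptions; check your schema.
SELECT reflection_id, materialization_id, "expiration"
FROM sys.materializations
WHERE state = 'DELETED'
  AND "expiration" > CURRENT_TIMESTAMP
ORDER BY "expiration" DESC
```

Cross-check the reflection_id values this returns against the directory names under /my-bucket/dremio/accelerator/ before deleting anything.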