We have a dataset in Parquet, and we’ve used Dremio to add some new columns and change some others, without the need for an ETL process, so Dremio has been pretty useful for this task.
We’ve saved the results as a Dataset (I guess that’s a Virtual Dataset), and we want to create an Aggregation Reflection from that, with 8 dimensions and just one measure. The Parquet files were created with three partition levels (year, month, and day), and those three fields are used as dimensions as well.
The dataset is not that big, at 10 000 million rows, and even though the machine where Dremio is running has a good deal of RAM, the reflection never finishes; it always throws a “Query was cancelled because it exceeded the memory limits set by the administrator” error. We’ve set dremio-env like:
DREMIO_MAX_MEMORY_SIZE_MB=128000
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=64000
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=32768
which seems like plenty. The question would be: is there any known limit to the number of rows/dimensions/cardinality that can be used for an Aggregation Reflection? A second question: how can we force the creation of an Aggregation Reflection on a Virtual Dataset? So far, we are doing it by removing all reflections and creating them again.
We are using Dremio 3.3 on a CentOS server; we’ll be glad of any help you can provide!
It’s a standalone on-premise deployment, with the Coordinator and Executor on the same node, with 48 CPUs and 128 GB of RAM. I’ve deleted all the aggregation reflections, but I can recreate them until the same error happens again and copy the query profile.
I’ve realised that adding dimensions one by one works, until a seventh dimension is added. Then it keeps refreshing the reflection for several days, even though the data hasn’t changed. I’m attaching a query profile of one of those refreshes: df64426e-073d-4dbb-944e-39f9dac2a563.zip (179,9 KB)
You are not encountering any limit Dremio imposes on the number of rows/dimensions/cardinality, but, as you’ve seen, you can run out of memory resources while building the reflection. In theory, you can scale your cluster larger and the reflection should build. For the profile you attached (which is still running), it looks like you are using 26 GB, which is approaching the 32 GB direct-memory limit you have specified. Generally, you will want to give more direct memory than heap memory to a Dremio executor (while it’s the opposite for the coordinator). You may want to consider running a separate coordinator and executor so you can tune their memory requirements independently.
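To make the “more direct than heap” advice concrete, here is a sketch of what a dremio-env split might look like for a combined coordinator + executor node with 128 GB of RAM. The specific numbers are illustrative assumptions, not official sizing guidance; tune them for your workload:

```shell
# dremio-env -- illustrative memory split for a single 128 GB node
# running both coordinator and executor roles.
# Reflection builds execute in direct memory, so it gets the larger
# share here; these exact values are assumptions, not recommendations.

DREMIO_MAX_HEAP_MEMORY_SIZE_MB=16000     # JVM heap: planning, metadata
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=96000   # direct: query execution, reflection builds
# Roughly 16 GB is left unassigned as headroom for the OS and page cache.
```

Setting heap and direct explicitly like this avoids relying on the derived split from a single combined `DREMIO_MAX_MEMORY_SIZE_MB` value.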
One way would be to disable and re-enable the reflection, which may be what you are doing currently. To do this, you’d go to the Reflections tab of the VDS Settings, disable the reflection, Save, and then go back and re-enable it.
Generally, you control the reflection refresh behavior on a source or on the individual PDS within a source. If a VDS references a particular PDS, so do the reflections that are built on that VDS. You can go to the Reflection Refresh tab of the PDS’s Settings window and click “Refresh Now”. This will cause all reflections that depend on that PDS to rebuild.
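If you’d rather script that “Refresh Now” step than click through the UI, the v3 REST API has a per-dataset refresh endpoint. A rough sketch with `curl` follows; the host, credentials, and dataset path are placeholders you’d replace with your own, and the token parsing via `sed` is a quick hack rather than a robust JSON parser:

```shell
# Sketch: trigger a refresh of all reflections that depend on a PDS.
# HOST, the credentials, and the dataset path are placeholders.
HOST="http://localhost:9047"

# 1) Log in to obtain an auth token (replace userName/password).
TOKEN=$(curl -s -X POST "$HOST/apiv2/login" \
  -H "Content-Type: application/json" \
  -d '{"userName": "admin", "password": "secret"}' \
  | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')

# 2) Look up the PDS id by its catalog path (here: mysource/mytable).
DATASET_ID=$(curl -s "$HOST/api/v3/catalog/by-path/mysource/mytable" \
  -H "Authorization: _dremio$TOKEN" \
  | sed -n 's/.*"id":"\([^"]*\)".*/\1/p')

# 3) Refresh every reflection that depends on that physical dataset.
curl -s -X POST "$HOST/api/v3/catalog/$DATASET_ID/refresh" \
  -H "Authorization: _dremio$TOKEN"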
Understood, thanks Ben. The reflection finally ended, and after a while (a couple of hours) it was refreshed again. Is that behaviour normal? The data didn’t change at all.
An administrator can specify the desired Refresh Policy for any physical dataset or data source – determining the refresh interval and expiration of reflections. All reflections based on a physical dataset or source will be refreshed accordingly. Refresh Policy options for a physical dataset will override the value for the source.
Dremio will refresh Data Reflections at the provided refresh interval and serve them until the provided expiration.
If you never want a dataset’s reflections to refresh, or you want to manage them “manually”, you can set the policy to “Never Refresh” and then click “Refresh Now” whenever you want to start building those reflections.