Aggregation Reflection Limits

forwardkeys_it · October 8, 2019, 11:25pm

We have a dataset in parquet, and we’ve used Dremio to add some new columns here and change some others there, without the need of an ETL, so Dremio has been pretty useful for this task.
We’ve saved the results as a Dataset (I guess that’s a Virtual Dataset), and we want to create an Aggregation Reflection from it, with 8 dimensions and just one measure. The parquet data has been created with three partition levels (year, month and day), and those three fields are used as dimensions as well.
The dataset is not that big, with 10,000 million rows, and even though the machine where Dremio is running has a good deal of RAM, the reflection never finishes; it always throws a “Query was cancelled because it exceeded the memory limits set by the administrator” error. We’ve set the dremio_conf like this:
DREMIO_MAX_MEMORY_SIZE_MB=128000
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=64000
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=32768
which seems to be more than enough. The first question is: is there any known limit to the number of rows/dimensions/cardinality that can be used for an Aggregation Reflection? A second question is: how can we force the creation of an Aggregation Reflection on a Virtual Dataset? So far, we are doing it by removing all reflections and creating them again.

We are using Dremio 3.3 on a CentOS server. We’d be glad of any help you can provide!

Venugopal_Menda · October 14, 2019, 5:44am

Hi @forwardkeys_it

How many executors are running in this Dremio cluster, and how much RAM and how many CPUs does each executor have?
Can you share the query profile for this?

@Venugopal_Menda

forwardkeys_it · October 14, 2019, 12:53pm

It’s a standalone on-premise server, a single Coordinator + Executor node, with 48 CPUs and 128 GB of RAM. I’ve deleted all the aggregation reflections, but I can recreate them until the same thing happens, and copy the query profile.
I’ve realised that adding dimensions one by one works, until a seventh dimension is added. Then it keeps refreshing the reflection for several days, even if the data hasn’t changed. I’m attaching a query profile of one of those refreshes:
df64426e-073d-4dbb-944e-39f9dac2a563.zip (179,9 KB)

ben · October 14, 2019, 8:19pm

You are not encountering any limit Dremio imposes on the number of rows/dimensions/cardinality, but, as you’ve seen, you can run out of memory resources while creating the reflection. In theory, you can scale your cluster larger and the reflection should build. For the profile you attached (which is still running), it looks like you are using 26 GB, which is approaching the 32 GB limit you have specified for direct memory. Generally, you will want to give a Dremio executor more direct memory than heap memory (while it’s the opposite for the coordinator). You may want to consider having a separate coordinator and executor so you can tune their memory requirements separately.
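To make the memory point concrete, here is a minimal Python sketch that checks the split from the original post. The function names and the two rules it applies are only illustrative assumptions (they are not part of Dremio); it simply encodes "heap plus direct should fit in the total" and "an executor usually wants more direct memory than heap".

```python
def review_executor_memory(total_mb, heap_mb, direct_mb):
    """Return a list of warnings for an executor memory split (hypothetical helper)."""
    warnings = []
    # Heap and direct memory together should fit inside the overall budget.
    if heap_mb + direct_mb > total_mb:
        warnings.append("heap + direct exceeds the total memory budget")
    # Rule of thumb from this thread: executors want more direct memory than heap.
    if direct_mb <= heap_mb:
        warnings.append("executor has no more direct memory than heap")
    return warnings


def direct_usage_ratio(used_gb, limit_gb):
    """Fraction of the direct memory limit that is currently in use."""
    return used_gb / limit_gb


# Values taken from the original post and the attached profile.
print(review_executor_memory(total_mb=128000, heap_mb=64000, direct_mb=32768))
print(direct_usage_ratio(used_gb=26, limit_gb=32))
```

With the posted settings the first check passes (64000 + 32768 fits in 128000), but the second one flags that heap is larger than direct memory, and the profile is already at roughly 81% of the direct memory limit.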

As for forcing the creation of the reflection, one way would be to disable and re-enable it, which may be what you are doing currently. To do this, you’d go to the Reflection tab of the VDS Settings, disable the reflections, Save, and then go back and recreate them.

Generally, you control the reflection refresh behavior on a source or on the individual PDS within a source. If a VDS references a particular PDS, so will the reflections that are built on that VDS. You can go to the Reflection Refresh tab of the PDS’s Settings window and click “Refresh now”. This will cause reflections that depend on the PDS to rebuild.

forwardkeys_it · October 15, 2019, 3:55pm

Understood, thanks Ben. The reflection finally ended, and after a while (a couple of hours) it was refreshed again. Is that behaviour normal? The data didn’t change at all.

ben · October 15, 2019, 5:39pm

@forwardkeys_it, the reflection will refresh regardless of whether the data has been updated if you have the source PDS configured to refresh on some time interval. See our documentation on the subject for more information:

An administrator can specify the desired Refresh Policy for any physical dataset or data source – determining the refresh interval and expiration of reflections. All reflections based on a physical dataset or source will be refreshed accordingly. Refresh Policy options for a physical dataset will override the value for the source.

Dremio will refresh Data Reflections at the provided refresh interval and serve them until the provided expiration.

If you never want a dataset’s reflections to refresh, or you want to manage them “manually”, you can set the refresh policy to “Never Refresh” and then click “Refresh now” when you want to start building those reflections.
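If you would rather trigger the manual refresh from a script than from the UI, something along these lines may work. This is only a sketch: it assumes your Dremio version exposes the catalog refresh endpoint (`POST /api/v3/catalog/{id}/refresh`) and accepts a `_dremio`-prefixed token in the `Authorization` header, so please check the REST API documentation for your release first. The host, token and dataset id below are placeholders.

```python
import urllib.request


def build_refresh_request(base_url, token, dataset_id):
    """Build (but do not send) a request asking Dremio to refresh the
    reflections that depend on a physical dataset.

    Assumption: the endpoint path and the "_dremio" token prefix match
    your Dremio version; verify both against its REST API documentation.
    """
    url = f"{base_url.rstrip('/')}/api/v3/catalog/{dataset_id}/refresh"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"_dremio{token}"},
    )


# Placeholder values, for illustration only.
request = build_refresh_request(
    base_url="http://localhost:9047",
    token="YOUR_TOKEN",
    dataset_id="YOUR_PDS_ID",
)
print(request.get_method(), request.full_url)

# To actually send it (requires a running Dremio and a valid token):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Combined with the “Never Refresh” policy, this would let you rebuild the reflections only when you know the parquet data has changed.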

forwardkeys_it · October 16, 2019, 8:45am

Ok, it makes sense. Thanks for your help @ben !
