Queries that used to run on v4.7 error out with a memory exception on v11

We recently moved our Dremio Marketplace deployment from v4.7 to v11, and queries that previously ran within a few seconds and with very limited memory now fail with an insufficient-memory error, even though the actual memory used on the old version was far lower (on the order of 10x or more) than what v11 estimates.

Wanted to understand the cause and whether there's any possible fix.

Hi @udaykrishna5

Are you able to provide job profiles from both versions?

Thanks
Bali

Sure

v4.7_runs_0adcfcb1-5048-4bb4-8530-4b147815db6b.zip (600.0 KB)
insufficient_memory_fail_7b40191a-fef2-4fee-8ed4-97de08b6b1c4.zip (1.6 MB)

@udaykrishna5

Query plans can change between versions, and that can cause this. I see you have 5 executors; what is the direct memory on each of them?
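For reference, on a standard install the per-node memory limits are set in conf/dremio-env on each executor; the lines below only illustrate where to look (the values are placeholders, not a recommendation):

# Hypothetical dremio-env entries on an executor; real values depend on instance size
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=4096
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192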

I have the same issue, having moved from v4.6 to v4.9 on AWS. The master coordinator is a single m5d.large instance, so 32GB RAM; 8GB is allocated for the Dremio UI by default, and the 24GB that remains is quickly used up planning the query. To clarify: the master coordinator node has 32GB of RAM, and the query profile shows it running out of memory at around 24GB, so with Dremio's default settings the UI etc. is using 8GB while the query planner runs out at 24GB.
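As I understand it, planning happens in the coordinator's heap rather than direct memory, so one knob I'm considering (my assumption, not a confirmed fix; the value is illustrative) is raising the heap in conf/dremio-env on the master coordinator:

# Illustrative only: give the planner more heap room on the master coordinator
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=16384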

I don’t have a query profile from the earlier version; I have sent @b-rock the profile from v4.9.

Is there a way to select a less efficient query plan, merely to save memory on the master coordinator? I’m fine with throwing more engine resources at the problem, or are we always stuck paying a higher base cost to run the master coordinator node? And can we add or remove master coordinator nodes without a new build?

@datocrats-org

Can you please elaborate on “so 32GB RAM; 8GB is allocated for the Dremio UI by default, and the 24GB that remains is quickly used up planning the query”?

Are you talking about heap and direct memory? If yes, can you please add the line below to dremio-env on the master coordinator, restart Dremio, run the same query, and send us the GC logs:

DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:MaxGCPauseMillis=500 -XX:InitiatingHeapOccupancyPercent=25 -XX:+PrintClassHistogramBeforeFullGC -XX:+PrintClassHistogramAfterFullGC"
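For reference, the steps would look roughly like the following (the paths are assumptions for a typical Linux RPM install, so adjust them to your deployment; GC output typically lands in server.gc files under the log directory):

# Assumed locations for a standard install; verify on your deployment
sudo vi /opt/dremio/conf/dremio-env     # add the DREMIO_JAVA_SERVER_EXTRA_OPTS line above
sudo systemctl restart dremio           # restart so the new JVM options take effect
ls /var/log/dremio/server.gc*           # GC logs to attach to your reply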

Thanks
Bali

This issue was consistent irrespective of the node type I chose for my engines; I tried everything from m5d.xlarge (16GB) to c5d.18xlarge (144GB). On v4.6, a 16GB instance seems to do the job fine.