Job failing - beyond system capacity during query planning

Error: “Job was cancelled because the query went beyond system capacity during query planning. Please simplify the query.”

The query seems simple enough. It has 4 tables, including 3 vds with reflection.

d4495e47-8dd1-4de4-ae3d-ee627e2971b6.zip (449.3 KB)

@desi

The issue is partly due to the fact there are too many reflections that match this query. Dremio is trying to cost out each of these reflections and hence causing the plan to hit the limit. Reflections are great but we need to create efficient and less number of them. I also see you are on 4.5 and in 4.8+ we have default reflections that will reduce the costing by a big factor

Here is an white paper on Reflection Best Practices

Without any changes, without any coordinator reboot, the next day the same query executed without errors. GC issue? While it was erroring out, the server log gives a generic message, as this:

2021-01-28 18:26:31,065 [1fecfe2d-7451-5405-915e-afb83295f900/0:foreman-planning] ERROR c.d.s.commandpool.CommandWrapper - command 1fecfe2d-7451-5405-915e-afb83295f900/
0:foreman-planning failed

The GC log around same time has this:
2021-01-25T18:25:38.381+0000: 19194.518: [GC (Allocation Failure) [PSYoungGen: 516292K->2596K(510976K)] 844609K->331048K(2407936K), 0.0081440 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
2021-01-25T18:26:19.328+0000: 19235.466: [GC (Allocation Failure) [PSYoungGen: 510415K->3389K(505856K)] 838868K->331987K(2402816K), 0.0086859 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
2021-01-25T18:27:01.230+0000: 19277.367: [GC (Allocation Failure) [PSYoungGen: 505661K->12304K(509440K)] 834259K->341462K(2406400K), 0.0084330 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]

Below are java params:

[opc@dlvm001 dremio]$ head server.gc
Java HotSpot™ 64-Bit Server VM (25.251-b08) for linux-amd64 JRE (1.8.0_251-b08), built on Mar 11 2020 23:20:16 by “java_re” with gcc 7.3.0
Memory: 4k page, physical 123485908k(121474964k free), swap 8388604k(8388604k free)
CommandLine flags: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dremio -XX:InitialHeapSize=1975774528 -XX:MaxDirectMemorySize=81604378624 -XX:MaxHeapSize=34359738368 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+StartAttachListener -XX:+UseParallelGC

@desi Do not see a full GC. Are you able to send us the profile of the successful profile?

0cdc8ba8-a2c9-4159-be3b-81016f25df28.zip (709.6 KB)

@desi

Am I missing something? Both profiles are successful ones

I guess that is the tricky part - the original error messages comes while the profile is successful.

@desi

Where do you see the error message?

@balaji.ramaswamy I am getting this error for queries where I use rank() over (

But the puzzling issue is this is only happening for one of the VDS and not for all VDS.
We are using Dremio 4.7

I will mail you the profile, please let me know what can we do to resolve this.

2021-06-25 05:00:23,254 [1f2a9d1a-3be7-7421-8069-2e7a3eaadb00/0:foreman-planning] INFO  c.d.exec.planner.MaxNodesListener - User Error Occurred [ErrorId: becfdedc-725f-4571-9094-aa92b62ed7e8]
com.dremio.common.exceptions.UserException: Job was cancelled because the query went beyond system capacity during query planning. Please simplify the query.
	at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:802)
	at com.dremio.exec.planner.MaxNodesListener.relEquivalenceFound(MaxNodesListener.java:43)
	at org.apache.calcite.plan.volcano.RelSubset.add(RelSubset.java:274)
	at org.apache.calcite.plan.volcano.RelSet.add(RelSet.java:147)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.addRelToSet(VolcanoPlanner.java:1653)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1599)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:864)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:886)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:100)
	at org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:336)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1502)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:864)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:886)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:1777)
	at org.apache.calcite.plan.volcano.VolcanoRuleCall.transformTo(VolcanoRuleCall.java:135)
	at org.apache.calcite.plan.RelOptRuleCall.transformTo(RelOptRuleCall.java:234)
	at org.apache.calcite.rel.rules.FilterSetOpTransposeRule.onMatch(FilterSetOpTransposeRule.java:93)
	at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:658)
	at com.dremio.exec.planner.DremioVolcanoPlanner.findBestExp(DremioVolcanoPlanner.java:83)
	at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
	at com.dremio.exec.planner.sql.handlers.PrelTransformer.lambda$transform$2(PrelTransformer.java:497)
	at com.dremio.exec.planner.sql.handlers.PrelTransformer.doTransform(PrelTransformer.java:511)
	at com.dremio.exec.planner.sql.handlers.PrelTransformer.transform(PrelTransformer.java:504)
	at com.dremio.exec.planner.sql.handlers.PrelTransformer.convertToDrel(PrelTransformer.java:242)
	at com.dremio.exec.planner.sql.handlers.PrelTransformer.convertToDrel(PrelTransformer.java:315)
	at com.dremio.exec.planner.sql.handlers.query.NormalHandler.getPlan(NormalHandler.java:47)
	at com.dremio.exec.planner.sql.handlers.commands.HandlerToExec.plan(HandlerToExec.java:59)
	at com.dremio.exec.work.foreman.AttemptManager.plan(AttemptManager.java:422)
	at com.dremio.exec.work.foreman.AttemptManager.lambda$run$1(AttemptManager.java:331)
	at com.dremio.service.commandpool.CommandWrapper.run(CommandWrapper.java:62)
	at com.dremio.context.RequestContext.run(RequestContext.java:95)
	at com.dremio.common.concurrent.ContextMigratingExecutorService.lambda$decorate$3(ContextMigratingExecutorService.java:199)
	at com.dremio.common.concurrent.ContextMigratingExecutorService$ComparableRunnable.run(ContextMigratingExecutorService.java:180)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@unni We have a limit on the number of plan nodes as huge plans can cause excessive heap usage and hence cause the Dremio coordinator to go unstable. This query plan is hitting that limit, please try to split up the query into multiple VDS’s