Reflection refresh stuck planning, crashing dremio

I’m trying to refresh reflections but they keep getting stuck at the Planning phase.
A few minutes later dremio starts becoming unresponsive until I stop the container.

Logs suggest the planning is running out of heap space, trying to cancel any running queries but not finding any.

Same reflections seemed to be fine last week (checking to see when we upgraded to dremio v19, could be related).

/var/logs/dremio/json/server.json has the following:

{“timestamp”:“2021-11-29 17:03:49,551”,“host”:“dremio”,“thread”:“qtp1499489673-386”,“levelName”:“WARN”,“levelValue”:30000,“logger”:“com.dremio.dac.server.admin.profile.AccelerationWrapper”,“message”:“failed to get reflection dataset path”,“stack_trace”:"java.lang.NullPointerException: null
at com.dremio.dac.server.admin.profile.AccelerationWrapper.getReflectionDatasetPath(AccelerationWrapper.java:48)
at com.dremio.dac.server.admin.profile.ProfileWrapper.getDatasetGroupedLayoutList(ProfileWrapper.java:466)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Could this be the cause? Not sure where to go from that error - querying the api everything looks ok. Going to try reverting to v18 and see if it goes away

@Saltxwater The NPE should not cause the heap issue. You can pass the below flag to DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:+PrintClassHistogramBeforeFullGC in dremio-env (only coordinator), restart the coordinator, reproduce the issue and send us the GC log files under the log folder on the coordinator

@balaji.ramaswamy
Digging into this further I’m seeing a lot of strange behaviour of Dremio.
I have a VDS (named “Join_Pnl_2015_2019”)which is joining 3 PDS.
If I run SELECT * FROM "Join_Pnl_2015_2019" then the UI preview runs fine…
If I run SELECT * FROM "Join_Pnl_2015_2019" WHERE d_date > '2018-12-31' then the job gets stuck on planning.

If I instead “edit” the VDS to view the underlying sql and add the where clause there then it works fine…

If you could provide any insight as to what is going on I would be extremely grateful! When it’s stuck in planning then dremio ui slows down before crashing.
Regarding GC:
Is there an additional flag I need to pass to actually create the GC logs? I’m seeing a lot like the following:

2021-11-30T02:08:51.654+0000: [GC concurrent-mark-start]
2021-11-30T02:08:55.473+0000: [GC pause (G1 Evacuation Pause) (young), 0.3538868 secs]
   [Parallel Time: 323.4 ms, GC Workers: 48]
      [GC Worker Start (ms): Min: 729154.6, Avg: 729155.5, Max: 729156.2, Diff: 1.6]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.6, Max: 4.9, Diff: 4.9, Sum: 28.0]
      [Update RS (ms): Min: 17.6, Avg: 21.2, Max: 27.0, Diff: 9.4, Sum: 1019.6]
         [Processed Buffers: Min: 5, Avg: 13.7, Max: 18, Diff: 13, Sum: 656]
      [Scan RS (ms): Min: 100.2, Avg: 105.7, Max: 107.0, Diff: 6.8, Sum: 5071.4]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 193.3, Avg: 193.9, Max: 194.7, Diff: 1.4, Sum: 9305.8]
      [Termination (ms): Min: 0.0, Avg: 0.2, Max: 0.2, Diff: 0.2, Sum: 8.0]
         [Termination Attempts: Min: 1, Avg: 39.2, Max: 53, Diff: 52, Sum: 1882]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 4.9]
      [GC Worker Total (ms): Min: 320.9, Avg: 321.6, Max: 322.5, Diff: 1.5, Sum: 15437.8]
      [GC Worker End (ms): Min: 729477.1, Avg: 729477.1, Max: 729477.3, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 7.1 ms]
   [Other: 23.4 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 4.5 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 9.1 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 8.8 ms]
   [Eden: 5696.0M(5696.0M)->0.0B(5696.0M) Survivors: 832.0M->832.0M Heap: 115.9G(128.0G)->112.1G(128.0G)]
 [Times: user=15.51 sys=0.08, real=0.36 secs]
2021-11-30T02:08:59.565+0000: [GC pause (G1 Evacuation Pause) (young), 0.3570937 secs]
   [Parallel Time: 329.6 ms, GC Workers: 48]
      [GC Worker Start (ms): Min: 733246.5, Avg: 733247.5, Max: 733248.2, Diff: 1.7]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.6, Max: 4.9, Diff: 4.9, Sum: 28.3]
      [Update RS (ms): Min: 14.8, Avg: 18.2, Max: 19.2, Diff: 4.4, Sum: 871.4]
         [Processed Buffers: Min: 10, Avg: 11.8, Max: 17, Diff: 7, Sum: 565]
      [Scan RS (ms): Min: 103.2, Avg: 104.3, Max: 105.2, Diff: 2.0, Sum: 5004.7]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 203.5, Avg: 204.3, Max: 205.0, Diff: 1.5, Sum: 9808.5]
      [Termination (ms): Min: 0.0, Avg: 0.4, Max: 0.5, Diff: 0.5, Sum: 18.3]
         [Termination Attempts: Min: 1, Avg: 100.6, Max: 127, Diff: 126, Sum: 4831]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.2, Sum: 5.4]
      [GC Worker Total (ms): Min: 327.1, Avg: 327.8, Max: 328.7, Diff: 1.6, Sum: 15736.6]
      [GC Worker End (ms): Min: 733575.2, Avg: 733575.3, Max: 733575.5, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 7.4 ms]
   [Other: 20.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 4.3 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 7.2 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 7.6 ms]
   [Eden: 5696.0M(5696.0M)->0.0B(5696.0M) Survivors: 832.0M->832.0M Heap: 117.7G(128.0G)->114.0G(128.0G)]
 [Times: user=15.87 sys=0.09, real=0.35 secs]
2021-11-30T02:09:03.722+0000: [GC concurrent-mark-end, 12.0672890 secs]
2021-11-30T02:09:03.723+0000: [GC remark 2021-11-30T02:09:03.723+0000: [Finalize Marking, 0.0036129 secs] 2021-11-30T02:09:03.726+0000: [GC ref-proc, 0.0044050 secs] 2021-11-30T02:09:03.731+0000: [Unloading, 0.0238963 secs], 0.1156182 secs]
 [Times: user=4.71 sys=0.01, real=0.12 secs]
2021-11-30T02:09:03.840+0000: [GC cleanup 119G->119G(128G), 0.2599309 secs]
 [Times: user=11.76 sys=0.01, real=0.26 secs]
2021-11-30T02:09:04.100+0000: [GC concurrent-cleanup-start]

Managed to get to the bottom of this.

There was a separate VDS which used Join_Pnl_2015_2019 and applied a number of complicated filters (large where clause).

Queries on Join_Pnl would go through planning which was comparing the where clause supplied with the where clause on the filter VDS. This comparison causes dremio to do lots of processing, running out of heap space (128Gb) and ultimately failing.

Lesson: don’t have lots of complicated where clauses, even in what feels like an unrelated VDS