Dremio crashing, unclear why

Hi,

We’ve recently upgraded to community version 24.3.0-202312190021150029-52db2faf, and are seeing periodic crashes.

I’ve noticed 2 errors in the logs (all logs and config attached). This one appears regularly:

2024-02-07 13:48:48,070 [e2 - 1a3c772d-3681-578e-2e69-518c5c7a6c00:frag:2:1] INFO c.d.e.expr.fn.HiveFunctionRegistry - Failed to find a hive function for given FunctionCall: ‘FunctionCall [func=castFLOAT8, args=[ValueVectorReadExpression [fieldId=TypedFieldId [fieldIds=[7], remainder=null]]]]’ and the argument number is ‘0’ and the error is
java.lang.UnsupportedOperationException: Type UNION[REQUIRED] not supported as arguement to Hive UDFs

And this appears in server.out at the time of the crash:

LLVM ERROR: Unable to allocate section memory!

Any suggestions? Thanks!

dremio_20240207.zip (343.4 KB)

Which version of Java are you using?

openjdk version “1.8.0_382”

@mildlyth

Your OpenJDK version seems to be OK.

# Java VM: OpenJDK 64-Bit Server VM (25.382-b05 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  0x00007f6782124280
#
# Core dump written. Default location: /u/bodev/dremio/core or core.26587
#
# An error report file with more information is saved as:
# /u/bodev/dremio/hs_err_pid26587.log

If you look at the last lines of the error, there was a memory fault and an error report file was written. Are you able to upload that?

hs_err_pid26587.zip (1.1 MB)

Error file attached

A further note on this: available memory on the host falls with every reflection refresh. We have 2 reflections that refresh every 3 hours and 4 that refresh every 6 hours, and we sometimes trigger refreshes manually.
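To put numbers on that drop, available memory could be sampled around each refresh. The following is only a minimal sketch: it assumes a Linux host whose /proc/meminfo exposes a MemAvailable field, the function names are hypothetical, and the sample values are made up purely to show the parsing.

```python
import time
from datetime import datetime


def mem_available_mib(meminfo_text):
    """Extract MemAvailable from /proc/meminfo-style text and return it in MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The line looks like "MemAvailable:    8388608 kB".
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")


def log_available_memory(samples, interval_s, path="/proc/meminfo"):
    """Print a timestamped MemAvailable reading `samples` times."""
    for _ in range(samples):
        with open(path) as f:
            mib = mem_available_mib(f.read())
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} MemAvailable={mib} MiB")
        time.sleep(interval_s)


# Made-up sample text, used here only to demonstrate the parser.
sample = (
    "MemTotal:       65808076 kB\n"
    "MemFree:         1203456 kB\n"
    "MemAvailable:    8388608 kB\n"
)
available = mem_available_mib(sample)
print(available)
```

Calling something like `log_available_memory(samples=360, interval_s=60)` shortly before a scheduled refresh would give a timeline to line up against the refresh times and the crash.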

@mildlyth Can you please send me “ps -ef | grep dremio” output and also the server.gc log in your Dremio log folder

Plus the profile for Job ID# 1a3ec8c8-6013-67bf-6e72-9e9f1ef45200

ps -ef output:

bodev 8948 1 11 09:35 pts/0 00:36:58 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.382.b05-1.el7_9.x86_64/jre/bin/java -Djava.util.logging.config.class=org.slf4j.bridge.SLF4JBridgeHandler -Djava.library.path=/u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/lib -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/log/server.gc -Ddremio.log.path=/u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/log -Ddremio.plugins.path=/u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/plugins -Xmx12384m -XX:MaxDirectMemorySize=20384m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/log -Dio.netty.maxDirectMemory=0 -Dio.netty.tryReflectionSetAccessible=true -DMAPR_IMPALA_RA_THROTTLE -DMAPR_MAX_RA_STREAMS=400 -XX:+UseG1GC -XX:+PrintClassHistogramBeforeFullGC -XX:+PrintClassHistogramAfterFullGC -cp /u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/conf:/u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/jars/:/u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/jars/ext/:/u/bodev/dremio/dremio-community-24.3.0-202312190021150029-52db2faf/jars/3rdparty/* com.dremio.dac.daemon.DremioDaemon
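For reference, the two memory caps in that command line can be totaled with a short script. This is a rough sketch rather than anything Dremio provides: the function name is hypothetical, and the command line is shortened to the relevant flags from the ps output above.

```python
import re

# Multipliers for the JVM size suffixes k, m, g (case-insensitive).
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def jvm_memory_limits(cmdline):
    """Return the heap and direct-memory caps, in bytes, found in a JVM command line."""
    patterns = {
        "heap": r"-Xmx(\d+)([kmgKMG]?)",
        "direct": r"-XX:MaxDirectMemorySize=(\d+)([kmgKMG]?)",
    }
    limits = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, cmdline)
        if match:
            value, unit = match.groups()
            # No suffix means the value is already in bytes.
            limits[name] = int(value) * _UNITS.get(unit.lower(), 1)
    return limits


# Shortened to the memory-related flags from the ps output.
cmdline = "java -Xmx12384m -XX:MaxDirectMemorySize=20384m -XX:+UseG1GC"
limits = jvm_memory_limits(cmdline)
total_mib = sum(limits.values()) // 1024**2
print(limits["heap"] // 1024**2, limits["direct"] // 1024**2, total_mib)
```

Heap plus direct memory comes to 32768 MiB (32 GiB). As far as I understand, native allocations made outside the JVM, such as the memory LLVM needs for generated code, are not covered by either cap, so the host needs headroom beyond that total.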

When I looked up the profile, I got the error “Query failed as Dremio was restarted. Details and profile information for this job may be missing.”. Raw Profile appears as blank. It was a reflection refresh.

I shared the GC log in the original post. Here’s today’s log, from before a crash:
server.gc.zip (98.0 KB)

@mildlyth

We would need all the GC logs, as this new one was written after the log rolled over following the crash.

At this point this looks like a Gandiva memory fault. Can you please send me the query from the UI?

At a time when the system is not in use, can you please run that query and see if the system crashes?

Thanks
Bali