ExecutionSetupException: One or more nodes lost connectivity during query

While running a query on Dremio 4.6.1 installed on Kubernetes, we are getting the following error message from Dremio UI:

ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremio-executor-2.dremio-cluster-pod.dremio.svc.cluster.local:0].

Here are the logs from the mentioned worker:
dremio-executor-2-logs.zip (6.7 KB)

Dremio-env config has the following settings:
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=13384
DREMIO_MAX_HEAP_MEMORY_SIZE_MB is not set
We are using 10 workers with 16G / 8c each
1 Master Coordinator with the same config
Zookeeper with 1G / 1c
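
As a rough sanity check on these numbers, here is a minimal sketch of the memory budget per worker. It assumes the 16G worker size is also the pod's memory limit and that 16G means 16384 MB; neither is confirmed above, and the helper name is only illustrative.

```python
def remaining_mb(container_limit_mb, direct_mb, heap_mb=0):
    """Memory left in the container after direct memory and any explicit heap."""
    return container_limit_mb - direct_mb - heap_mb


# Assumption: the pod memory limit equals the 16G worker size (16 * 1024 MB).
WORKER_LIMIT_MB = 16 * 1024
# From dremio-env: DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=13384
DIRECT_MB = 13384

if __name__ == "__main__":
    # DREMIO_MAX_HEAP_MEMORY_SIZE_MB is not set, so no explicit heap is subtracted.
    print(remaining_mb(WORKER_LIMIT_MB, DIRECT_MB))
```

Under those assumptions, whatever heap the JVM ends up using, plus thread stacks and other native allocations, has to fit in the remainder.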

Any idea what's causing this behavior?

Looking at the logs of the crashing worker, here is the output just before the crash:

An irrecoverable stack overflow has occurred.
Please check if any of your loaded .so files has enabled executable stack (see man page execstack(8))
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f41cdac4fa8, pid=1, tid=0x00007f41dc2ed700
#
# JRE version: OpenJDK Runtime Environment (8.0_262-b10) (build 1.8.0_262-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.262-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  0x00007f41cdac4fa8
#
# Core dump written. Default location: /opt/dremio/core or core.1
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

[error occurred during error reporting , id 0xb] 

@asakmedops

Have you configured GC logging to a separate file? Please add this to dremio-master.yaml and dremio-executor.yaml, restart the pods, reproduce the issue, and send us the GC logs.

  - name: DREMIO_JAVA_EXTRA_OPTS_TEMP
    value: >-
      -Xloggc:/opt/dremio/data
      -XX:+UseGCLogFileRotation
      -XX:NumberOfGCLogFiles=5
      -XX:GCLogFileSize=4000k
      -XX:+PrintClassHistogramBeforeFullGC
      -XX:+HeapDumpOnOutOfMemoryError
      -XX:+UseG1GC
      -XX:G1HeapRegionSize=32M
      -XX:MaxGCPauseMillis=500
      -XX:InitiatingHeapOccupancyPercent=25

Thanks
Bali

I just configured GC logging based on the config above.

Here are the logs for the failing dremio-executor:

server.gc.logs.zip (9.4 KB)

GC logs for the dremio-master pod:
dremio-master.gc.logs.zip (6.8 KB)

Even with Dremio 11, this issue still persists.

@asakmedops

I do not see the “PrintClassHistogramBeforeFullGC” output in the executor GC logs. Can you please add the parameters above? That applies to V1.

If you are using V2, do the following:

  • Open values.yaml
  • Add the below under the appropriate section, executor or coordinator

extraStartParams: >-
  -Xloggc:/opt/dremio/data/gc.log
  -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=5
  -XX:GCLogFileSize=4000k
  -XX:+PrintClassHistogramBeforeFullGC
  -XX:+PrintClassHistogramAfterFullGC
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/opt/dremio/data
  -XX:+UseG1GC
  -XX:G1HeapRegionSize=32M
  -XX:MaxGCPauseMillis=500
  -XX:InitiatingHeapOccupancyPercent=25
  -XX:ErrorFile=/opt/dremio/data/hs_err_pid%p.log
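
Once the rotated GC logs exist, a quick way to check whether any full collections were recorded (and so whether the class histograms should have been printed) might look like the sketch below. It assumes the Java 8 log lines contain the literal text "[Full GC" and that the rotated files match /opt/dremio/data/gc.log*; the function name is only illustrative.

```python
import glob


def count_full_gcs(lines):
    """Count log lines that report a full collection.

    Assumption: Java 8 GC logs mark these lines with the literal "[Full GC".
    """
    return sum(1 for line in lines if "[Full GC" in line)


if __name__ == "__main__":
    # Assumption: rotated logs sit next to the -Xloggc path configured above.
    for path in sorted(glob.glob("/opt/dremio/data/gc.log*")):
        with open(path, errors="replace") as handle:
            print(path, count_full_gcs(handle))
```

A count of zero for every file would mean no full GC was logged, in which case no histogram output is expected either.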