ExecutionSetupException: One or more nodes lost connectivity during query

asakmedops · July 31, 2020, 7:40am

While running a query on Dremio 4.6.1 installed on Kubernetes, we get the following error message in the Dremio UI:

ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremio-executor-2.dremio-cluster-pod.dremio.svc.cluster.local:0].

Here are the logs from the mentioned worker:
dremio-executor-2-logs.zip (6.7 KB)

Dremio-env config has the following settings:
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=13384
DREMIO_MAX_HEAP_MEMORY_SIZE_MB is not set
We are using 10 workers of 16G / 8c each
1 Master Coordinator with the same config
Zookeeper with 1G/ 1c
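
For reference, the memory budget implied by these settings can be worked out roughly as below. This is only a sketch: it assumes "16G" means 16 GiB (16384 MB) and that this figure is also the pod's memory limit, neither of which is stated above. The variable names are illustrative.

```python
# Rough memory budget for one executor pod, using the figures quoted above.
# Assumption: "16G" means 16 GiB and is also the pod's memory limit.
POD_MEMORY_MB = 16 * 1024      # 16384 MB (assumed)
DIRECT_MEMORY_MB = 13384       # DREMIO_MAX_DIRECT_MEMORY_SIZE_MB from the post

# Whatever is left has to cover the JVM heap (not set explicitly here),
# thread stacks, metaspace, and any other native allocations.
remaining_mb = POD_MEMORY_MB - DIRECT_MEMORY_MB
direct_share = DIRECT_MEMORY_MB / POD_MEMORY_MB

print(f"left for heap and native overhead: {remaining_mb} MB")
print(f"direct memory share of pod: {direct_share:.0%}")
```

Under those assumptions, direct memory alone takes most of the pod, so the headroom for everything else is comparatively small.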

Any idea what's causing this behavior?

Following the logs of the crashing worker, here is the output just before the crash:

An irrecoverable stack overflow has occurred.
Please check if any of your loaded .so files has enabled executable stack (see man page execstack(8))
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f41cdac4fa8, pid=1, tid=0x00007f41dc2ed700
#
# JRE version: OpenJDK Runtime Environment (8.0_262-b10) (build 1.8.0_262-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.262-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  0x00007f41cdac4fa8
#
# Core dump written. Default location: /opt/dremio/core or core.1
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

[error occurred during error reporting , id 0xb]
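
The key facts in a banner like this (signal, pid, and the type of the problematic frame) can be pulled out with a few lines of Python. This is a minimal sketch: `summarize_crash` is a hypothetical helper, not part of any Dremio or JDK tooling, and the sample text is copied from the banner above. The frame-type letters follow the legend HotSpot prints in its crash reports, where `C` marks native code.

```python
import re

# Frame-type letters as listed in the HotSpot crash report legend.
FRAME_TYPES = {
    "C": "native code",
    "V": "VM code",
    "v": "VM code",
    "j": "interpreted Java code",
    "J": "compiled Java code",
}

def summarize_crash(banner: str) -> dict:
    """Extract signal, pc, pid and problematic-frame type from a crash banner."""
    summary = {}
    sig = re.search(
        r"#\s+(SIG\w+)\s+\((0x[0-9a-f]+)\) at pc=(0x[0-9a-f]+), pid=(\d+)",
        banner,
    )
    if sig:
        summary["signal"] = sig.group(1)
        summary["pc"] = sig.group(3)
        summary["pid"] = int(sig.group(4))
    # The frame type is the single letter on the line after "Problematic frame:".
    frame = re.search(r"# Problematic frame:\s*\n#\s+(\w)\s+", banner)
    if frame:
        summary["frame_type"] = FRAME_TYPES.get(frame.group(1), "unknown")
    return summary

banner = """\
#  SIGSEGV (0xb) at pc=0x00007f41cdac4fa8, pid=1, tid=0x00007f41dc2ed700
# Problematic frame:
# C  0x00007f41cdac4fa8
"""
print(summarize_crash(banner))
```

Here the frame type resolves to native code, which matches the banner's own note that the crash happened outside the Java Virtual Machine.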

balaji.ramaswamy · August 14, 2020, 7:52am

@asakmedops

Have you configured GC logging to a separate file? Please add this to dremio-master.yaml and dremio-executor.yaml, restart the pods, reproduce the issue, and send us the GC logs.

name: DREMIO_JAVA_EXTRA_OPTS_TEMP
value: >-
  -Xloggc:/opt/dremio/data
  -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=5
  -XX:GCLogFileSize=4000k
  -XX:+PrintClassHistogramBeforeFullGC
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:+UseG1GC
  -XX:G1HeapRegionSize=32M
  -XX:MaxGCPauseMillis=500
  -XX:InitiatingHeapOccupancyPercent=25

Thanks
Bali

asakmedops · August 14, 2020, 1:31pm

I just configured GC logging based on the config above.

Here are the executor logs for the failing dremio-executor:

server.gc.logs.zip (9.4 KB)

GC logs for the dremio-master pod:
dremio-master.gc.logs.zip (6.8 KB)

asakmedops · December 7, 2020, 5:47pm

Even with Dremio 11, this issue still persists

balaji.ramaswamy · December 8, 2020, 4:30am

@asakmedops

I do not see the “PrintClassHistogramBeforeFullGC” output in the executor GC logs. Can you please add the parameters above? Those are for V1.

If you are using V2, do the following:

Open values.yaml
Add the following under the appropriate section, executor or coordinator:

extraStartParams: >-
  -Xloggc:/opt/dremio/data/gc.log
  -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=5
  -XX:GCLogFileSize=4000k
  -XX:+PrintClassHistogramBeforeFullGC
  -XX:+PrintClassHistogramAfterFullGC
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/opt/dremio/data
  -XX:+UseG1GC
  -XX:G1HeapRegionSize=32M
  -XX:MaxGCPauseMillis=500
  -XX:InitiatingHeapOccupancyPercent=25
  -XX:ErrorFile=/opt/dremio/data/hs_err_pid%p.log
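
After restarting the pods, it may help to confirm that the flags actually reached the JVM before reproducing the issue. The sketch below compares a list of expected flags against a JVM command line string. It assumes you have already obtained that string yourself, for example from the process listing inside the pod. `missing_flags` is a hypothetical helper and the `sample` command line is made up for illustration.

```python
# Flags from the parameters above whose presence we want to confirm.
REQUIRED_FLAGS = [
    "-XX:+PrintClassHistogramBeforeFullGC",
    "-XX:+PrintClassHistogramAfterFullGC",
    "-XX:+HeapDumpOnOutOfMemoryError",
    "-XX:+UseGCLogFileRotation",
]

def missing_flags(command_line: str, required=REQUIRED_FLAGS) -> list:
    """Return the required flags that do not appear in the command line."""
    present = set(command_line.split())
    return [flag for flag in required if flag not in present]

# Made-up command line for illustration only, not taken from a real pod.
sample = "java -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+HeapDumpOnOutOfMemoryError"
print(missing_flags(sample))
```

An empty list means every expected flag is present; anything else names the flags that did not take effect.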
