1 of the 4 worker nodes is dead

Exceeded timeout (30000) while waiting after sending work fragments to remote nodes. Sent 3 and only heard response back from 2 nodes

I experience the above error frequently; I guess it occurs when running complex queries concurrently. I have limited the number of concurrent queries from 8 to 2, but a node still dies.

Could you please advise how to limit the memory allocated to each query, so as to avoid a worker node failing and impacting other queries? Thanks a lot~

Hi @crazyisjen,

In the community version you have the Query Memory Control settings you can use (not per query, but separate limits for small and large queries): Queue Control | Dremio Documentation

Thanks, Bogdan

Thanks a lot!
I would like to ask more about the Query Threshold.

I saw that the default of 30000000 is enabled. What are the units of this figure, and is it a ceiling on the memory each query may consume?
I also checked one large, successfully completed query that showed 3401436399 in query cost; why was it not stopped if it exceeded the ceiling?

thanks a lot for help :smiley:

@crazyisjen The error you are getting means the executor did not respond. This could be due to 2 reasons:

  • Full GC pause
  • RPCs between Dremio nodes taking too much time
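The first cause can often be confirmed directly from the GC log: a single Full GC pause longer than the 30000 ms fragment timeout would explain the missing response. A minimal sketch of how to scan for such pauses, assuming the classic JDK 8 `-XX:+PrintGCDetails` log format (the timestamps and field layout here are illustrative, not Dremio-specific):

```python
import re

# Matches the trailing ", <seconds> secs" field of a Full GC entry, e.g.:
#   2023-05-01T10:16:02.456+0000: 74.951: [Full GC (Ergonomics) ..., 31.2345678 secs]
PAUSE_RE = re.compile(r"\[Full GC.*?,\s*([0-9.]+)\s*secs\]")

def long_full_gc_pauses(log_lines, threshold_secs=30.0):
    """Return Full GC pause durations (in seconds) that exceed the timeout."""
    pauses = []
    for line in log_lines:
        m = PAUSE_RE.search(line)
        if m:
            secs = float(m.group(1))
            if secs > threshold_secs:
                pauses.append(secs)
    return pauses

sample = [
    "2023-05-01T10:15:30.123+0000: 42.618: [GC (Allocation Failure) ..., 0.0456789 secs]",
    "2023-05-01T10:16:02.456+0000: 74.951: [Full GC (Ergonomics) ..., 31.2345678 secs]",
]
print(long_full_gc_pauses(sample))  # → [31.2345678]
```

Any pause reported here that is longer than the 30 s RPC timeout is a likely culprit; the minor GC line in the sample is ignored because only Full GC entries stop the world long enough to matter for this error.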

When this happens next time, please send the following here:

  • The job profile of the job that failed
  • server.log from the executor mentioned in the error that did not respond
  • GC logs from the executor mentioned in the error that did not respond

If you are on K8s, then use the below parameters in values.yaml to first move GC logging to a PVC.

Open values.yaml and add the following under the appropriate section (executor and/or coordinator):

extraStartParams: >-
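The thread is cut off before the actual parameter values. As a reference only, a minimal sketch of what GC-logging flags under `extraStartParams` might look like, assuming JDK 8 GC flags and a PVC mounted at /opt/dremio/data (both are assumptions, not values taken from this thread):

```yaml
# Hypothetical example: the JVM flags and the /opt/dremio/data mount path
# are assumptions, not values quoted from this thread.
executor:
  extraStartParams: >-
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -Xloggc:/opt/dremio/data/gc-%t.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=4m
```

Writing the rotated GC logs to the persistent volume means they survive a pod restart, which is exactly what you need when an executor dies and you want to inspect its last GC activity.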