1 of the 4 worker nodes is dead

Exceeded timeout (30000) while waiting after sending work fragments to remote nodes. Sent 3 and only heard response back from 2 nodes

I experience the above error frequently; I guess it occurs when running complex queries concurrently. I have limited the number of concurrent queries from 8 to 2, but a node still dies.

Could you please advise how to limit the memory allocated to each query, so as to avoid a worker node failing and impacting other queries? Thanks a lot~

Hi @crazyisjen,

In the community version you have the Query Memory Control settings you can use (not per query, but separate limits for small and large queries): Queue Control | Dremio Documentation

Thanks, Bogdan

Thanks a lot!
I would like to ask more about the Query Threshold.

I saw that the default of 30000000 is enabled. What are the units of this figure, and is it a ceiling on the memory each query may consume?
I also checked one large, successfully completed query that showed 3401436399 in query cost; why was it not stopped if it exceeded the ceiling?

thanks a lot for help :smiley:

@crazyisjen The error you are getting means the executor did not respond. This could be due to 2 reasons:

  • Full GC pause
  • RPCs between Dremio nodes taking too much time
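The first cause can often be confirmed directly from the GC log: a single Full GC pause longer than the 30000 ms fragment timeout would explain the missing response. A minimal sketch of how to scan for such pauses, assuming the classic JDK 8 `-XX:+PrintGCDetails` log format (the timestamps and field layout here are illustrative, not Dremio-specific):

```python
import re

# Matches the trailing ", <seconds> secs" field of a Full GC entry, e.g.:
#   2023-05-01T10:16:02.456+0000: 74.951: [Full GC (Ergonomics) ..., 31.2345678 secs]
PAUSE_RE = re.compile(r"\[Full GC.*?,\s*([0-9.]+)\s*secs\]")

def long_full_gc_pauses(log_lines, threshold_secs=30.0):
    """Return Full GC pause durations (in seconds) that exceed the timeout."""
    pauses = []
    for line in log_lines:
        m = PAUSE_RE.search(line)
        if m:
            secs = float(m.group(1))
            if secs > threshold_secs:
                pauses.append(secs)
    return pauses

sample = [
    "2023-05-01T10:15:30.123+0000: 42.618: [GC (Allocation Failure) ..., 0.0456789 secs]",
    "2023-05-01T10:16:02.456+0000: 74.951: [Full GC (Ergonomics) ..., 31.2345678 secs]",
]
print(long_full_gc_pauses(sample))  # → [31.2345678]
```

Any pause reported here that is longer than the 30 s RPC timeout is a likely culprit; the minor GC line in the sample is ignored because only Full GC entries stop the world long enough to matter for this error.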

When this happens next time, please send the following here:

  • The job profile of the job that failed
  • server.log from the executor mentioned in the error that did not respond
  • GC logs from the executor mentioned in the error that did not respond

If you are on K8s, then use the below parameters in values.yaml to first move GC logging to a PVC.

Open values.yaml and add the following under the appropriate section (executor and/or coordinator):

extraStartParams: >-
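The thread is cut off before the actual parameter values. As a reference only, a minimal sketch of what GC-logging flags under `extraStartParams` might look like, assuming JDK 8 GC flags and a PVC mounted at /opt/dremio/data (both are assumptions, not values taken from this thread):

```yaml
# Hypothetical example: the JVM flags and the /opt/dremio/data mount path
# are assumptions, not values quoted from this thread.
executor:
  extraStartParams: >-
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -Xloggc:/opt/dremio/data/gc-%t.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=4m
```

Writing the rotated GC logs to the persistent volume means they survive a pod restart, which is exactly what you need when an executor dies and you want to inspect its last GC activity.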