Exceeded timeout (30000) while waiting after sending work fragments to remote nodes. Sent 1 and only heard response back from 0 nodes

Filipe.Souza · November 22, 2024, 7:26pm

Hello everyone.
I’m facing this error in Dremio Cloud.

CONNECTION
b3e4344e-fdd5-4bda-a00e-3609c60afa9f.zip (69.6 KB)
ERROR: Exceeded timeout (30000) while waiting after sending work fragments to remote nodes. Sent 1 and only heard response back from 0 nodes

Node(s) that did not respond 10.13.5.198

Do you know what’s happening?

balaji.ramaswamy · November 25, 2024, 6:54pm

@Filipe.Souza The node at the listed IP did not respond to another Dremio node. The cause could be an RPC taking too long or a full GC pause.

If you look at the time taken for “Starting” under “State Durations” in the Query tab of the raw profile, you will see it was 30 seconds:

Starting:
30,007ms

Usually this should be less than 100 ms. This tells me the executors are very busy, so much so that even assigning the work fragment takes 30 s.
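As a rough illustration of the check described above, here is a minimal Python sketch. It assumes the “State Durations” values were copied by hand from the profile into a dictionary; the helper names `parse_ms` and `starting_is_slow` are made up for this example and are not part of any Dremio API.

```python
# Hypothetical helpers: compare the "Starting" duration from a query profile
# against the roughly 100 ms expectation mentioned in the reply.

EXPECTED_STARTING_MS = 100  # "usually less than 100 ms"


def parse_ms(text: str) -> int:
    """Convert a duration such as '30,007ms' into whole milliseconds."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned.endswith("ms"):
        cleaned = cleaned[:-2]
    return int(cleaned)


def starting_is_slow(durations: dict, limit_ms: int = EXPECTED_STARTING_MS) -> bool:
    """Return True when the 'Starting' state took longer than limit_ms."""
    return parse_ms(durations["Starting"]) > limit_ms


# Value taken from the profile in this thread.
profile_durations = {"Starting": "30,007ms"}
print(parse_ms(profile_durations["Starting"]))
print(starting_is_slow(profile_durations))
```

A “Starting” value close to the 30000 ms in the error message suggests the query spent the whole timeout waiting for the executor to acknowledge its fragment.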

Can you please check whether there was a long GC pause, or look in the executor log (from when this query ran) for any WARN messages?
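If a GC log from the executor is available, a scan for long pauses could look like the sketch below. It assumes the JVM writes GC events in the JDK unified logging style (lines that contain `Pause ...` and end with a duration in `ms`); the actual format depends on the JVM version and flags the executor runs with. The 1000 ms threshold and the sample lines are illustrative only and do not come from this cluster.

```python
import re

# Assumed line shape (JDK unified GC logging), for example:
# [98.765s][info][gc] GC(8) Pause Full (Allocation Failure) 1024M->512M(2048M) 3456.789ms
PAUSE_RE = re.compile(r"Pause (?P<kind>\w+).*?(?P<ms>\d+(?:\.\d+)?)ms\s*$")


def long_pauses(lines, threshold_ms=1000.0):
    """Return (kind, duration_ms, line) for every GC pause at or above threshold_ms."""
    hits = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match is None:
            continue  # not a GC pause line
        duration_ms = float(match.group("ms"))
        if duration_ms >= threshold_ms:
            hits.append((match.group("kind"), duration_ms, line.rstrip()))
    return hits


# Synthetic sample lines, made up for this example.
sample = [
    "[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 5.123ms",
    "[98.765s][info][gc] GC(8) Pause Full (Allocation Failure) 1024M->512M(2048M) 3456.789ms",
]
for kind, duration_ms, line in long_pauses(sample):
    print(f"{kind} pause of {duration_ms} ms: {line}")
```

To use it on a real file, pass the open file object instead of `sample`, then compare the timestamps of any hits with the time the query ran.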

Filipe.Souza · November 28, 2024, 1:16pm

Hi @balaji.ramaswamy
In AWS we cannot access the logs of the machines created by Dremio Cloud; we can only check resource usage, as below.

Is this GC configuration parameterized within the machines?
If so, is there any way to make this change?
All machines are managed exclusively by Dremio itself.

balaji.ramaswamy · December 5, 2024, 5:52am

@Filipe.Souza The coordinator is managed by Dremio, but the executors are not. The logs are something that we may be able to review; let me get back to you.
