Connection errors, queries fail intermittently

Hi,

I am getting the error below intermittently when many queries are fired concurrently. We are running Dremio 4.7 on Kubernetes with a master coordinator and a secondary coordinator.

Can you please suggest possible ways to debug this issue?

CONNECTION ERROR: Exceeded timeout (5000) while waiting after sending work fragments to remote nodes. Sent 1 and only heard response back from 0 nodes.

Attaching the profile
54f14452-2842-4302-889f-8586b34e96c5.zip (8.3 KB)

2021-04-12 12:40:47,836 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 684ms.
2021-04-12 12:40:47,836 [FABRIC-2] WARN c.d.services.fabric.FabricClient - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 684ms.
2021-04-12 12:40:48,561 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 722ms.
2021-04-12 12:40:48,562 [FABRIC-2] WARN c.d.services.fabric.FabricClient - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 723ms.
2021-04-12 12:40:48,597 [out-of-band-observer] INFO query.logger - {"queryId":"1f8bc03e-8d8b-2e5b-3341-0106ce7eda00","queryText":" SELECT\n reflection_id\n FROM\n sys.reflections\n WHERE\n type='RAW' AND status <> 'DISABLED' AND\n replace(replace(dataset, '"', ''), '.', '') = replace(replace('marco_marcodashboardsystemmarco_marcodashboard_marcodatamarcoreacttransformedpoc_roidb_transfd_copy', '"', ''), '.', '')\n","start":1618231233078,"finish":1618231248594,"outcome":"COMPLETED","username":"stagindataapi"}
10.156.0.65 - - [12/Apr/2021:12:40:48 +0000] "GET /api/v3/reflection/d39a3e9d-8dea-4b84-992b-90102702958a/ HTTP/1.1" 200 940 "-" "Java/1.8.0_66"
10.156.0.65 - - [12/Apr/2021:12:40:48 +0000] "GET /api/v3/reflection/07dc2d2a-e485-44c6-8502-5efb0ae544c6/ HTTP/1.1" 200 680 "-" "Java/1.8.0_66"
2021-04-12 12:40:49,468 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 873ms.
2021-04-12 12:40:49,469 [FABRIC-2] WARN c.d.services.fabric.FabricClient - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 874ms.
2021-04-12 12:40:49,906 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 813ms.
2021-04-12 12:40:49,906 [FABRIC-5] WARN c.d.services.fabric.FabricServer - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 813ms.
2021-04-12 12:40:50,171 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 702ms.
2021-04-12 12:40:50,171 [FABRIC-2] WARN c.d.services.fabric.FabricClient - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 702ms.
10.55.3.1 - - [12/Apr/2021:12:40:50 +0000] "GET / HTTP/1.1" 200 2663 "-" "kube-probe/1.18+"
2021-04-12 12:40:50,631 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 724ms.
2021-04-12 12:40:50,631 [FABRIC-5] WARN c.d.services.fabric.FabricServer - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 724ms.
2021-04-12 12:40:50,931 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 759ms.
2021-04-12 12:40:50,931 [FABRIC-2] WARN c.d.services.fabric.FabricClient - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 759ms.
2021-04-12 12:40:51,410 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 779ms.
2021-04-12 12:40:51,411 [FABRIC-5] WARN c.d.services.fabric.FabricServer - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 779ms.
2021-04-12 12:40:51,661 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 729ms.
2021-04-12 12:40:51,661 [FABRIC-2] WARN c.d.services.fabric.FabricClient - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 730ms.
2021-04-12 12:40:52,352 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 691ms.
2021-04-12 12:40:52,352 [FABRIC-2] WARN c.d.services.fabric.FabricClient - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 691ms.
2021-04-12 12:40:53,033 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 13 of rpc type 2 took longer than 500ms. Actual duration was 680ms.
2021-04-12 12:40:53,033 [FABRIC-2] WARN c.d.services.fabric.FabricClient - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 680ms.

@unni

One of your executors either ran out of heap or did a full GC. Check the GC logs on the executors to see whether a full GC occurred. Also check whether the log folder on any of the executors contains a heap dump (usually a file with a .bin or .hprof extension).
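
If it helps, here is a minimal Python sketch of those two checks, to be run against a copy of each executor's log folder. Several things in it are assumptions rather than Dremio specifics: the function names are made up for illustration, the `*gc*` file pattern is only a guess at how the GC log is named in your setup, and the "Full GC" / "Pause Full" markers depend on which JVM version and GC logging flags are in use, so adjust them to match your logs.

```python
import re
from pathlib import Path

# Assumed markers: "Full GC" is what Java 8 GC logs typically print,
# "Pause Full" is what JDK 9+ unified GC logging prints.
FULL_GC_PATTERN = re.compile(r"Full GC|Pause Full")

# Extensions mentioned above for heap dumps.
HEAP_DUMP_SUFFIXES = {".hprof", ".bin"}


def find_heap_dumps(log_dir):
    """Return files under log_dir that look like heap dumps, sorted by path."""
    return sorted(
        p for p in Path(log_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in HEAP_DUMP_SUFFIXES
    )


def find_full_gc_lines(gc_log):
    """Return (line_number, text) for each line that looks like a full GC."""
    hits = []
    with open(gc_log, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if FULL_GC_PATTERN.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits


def check_executor_logs(log_dir, gc_log_glob="*gc*"):
    """Summarize heap dumps and full-GC lines for one executor's log folder.

    gc_log_glob is a hypothetical default; pass the pattern that matches
    the GC log file name used in your deployment.
    """
    full_gc = {}
    for gc_log in sorted(Path(log_dir).rglob(gc_log_glob)):
        if gc_log.is_file():
            hits = find_full_gc_lines(gc_log)
            if hits:
                full_gc[str(gc_log)] = hits
    return {"heap_dumps": find_heap_dumps(log_dir), "full_gc": full_gc}
```

Calling `check_executor_logs("path/to/executor/logs")` returns a dictionary with a `heap_dumps` list and a `full_gc` mapping from GC log file to matching lines. Comparing the timestamps on those lines with the time of the CONNECTION ERROR should show whether a long pause lines up with the failed query.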

Thanks
Bali