Hi, can someone help me identify why this error is happening?
We only see it with large queries and aren't sure why.
thanks
There are also other flavors of the same error happening:
ChannelClosedException: [FABRIC]: Channel closed /198.18.101.211:45678 ↔ /198.18.101.215:42096 (fabric server)
The above-mentioned IPs are, for example, Executor 2 and Executor 4.
Have you checked whether the executors are crashing / restarting when the error occurs? If so, I’d look for the exit code – 137 (out of memory) maybe?
I’ve seen cases in production where executors were killed because of out-of-memory situations at the OS level. Other cases with similar error messages were caused by high CPU load on the executor node(s), leading to timeouts when the executor was meant to reply to the coordinator (the usual timeout for that in Dremio is 5000 ms).
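If you’re running on Kubernetes (the executor pod DNS names in this thread suggest so), here is a minimal sketch for checking whether the executor containers were last terminated with exit code 137. It uses the official `kubernetes` Python client; the label selector `app=dremio-executor` is an assumption, so adjust it to whatever your Helm chart actually sets.

```python
# Sketch only: assumes a Kubernetes deployment, the "kubernetes" Python client,
# and the hypothetical label selector "app=dremio-executor".
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("mlops-lakehouse", label_selector="app=dremio-executor")
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        if term is not None:
            # Exit code 137 = SIGKILL, typically an OOM kill by the kernel or kubelet.
            print(f"{pod.metadata.name}: restarts={cs.restart_count}, "
                  f"last exit code={term.exit_code}, reason={term.reason}")
        else:
            print(f"{pod.metadata.name}: restarts={cs.restart_count}, no prior termination")
```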
As Dr Tim says, dremio-executor-4.dremio-cluster-pod.mlops-lakehouse.svc.cluster.local went unresponsive. Any chance you can send us the server.log, server.out (if writing to a persistent disk), and the GC logs from that server? They should give us clues as to why it went unresponsive.
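If the logs go to a persistent disk, something like the following can pull them off the executor pod before a redeploy removes them. This is a rough sketch: the log directory /opt/dremio/data/log and the GC log file name are assumptions, so check your dremio.conf / Helm values for the actual paths.

```python
# Sketch only: assumes the "kubernetes" Python client and that Dremio writes its
# logs to /opt/dremio/data/log on the executor pod (verify against your config).
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

pod = "dremio-executor-4"        # the unresponsive executor from the error above
namespace = "mlops-lakehouse"

# "server.gc" is the assumed GC log name; it may differ depending on your JVM options.
for log_file in ["server.log", "server.out", "server.gc"]:
    contents = stream(
        v1.connect_get_namespaced_pod_exec,
        pod,
        namespace,
        command=["cat", f"/opt/dremio/data/log/{log_file}"],  # assumed log path
        stderr=True, stdin=False, stdout=True, tty=False,
    )
    with open(log_file, "w") as out:
        out.write(contents)
```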
Hey @balaji.ramaswamy, due to several redeployments I have already lost the logs, but I have another case.
It is a simple UPDATE on a big table that should be able to both parallelize and be split into manageable chunks, so there is no reason to overload the executors. During the operation, CPU and memory utilization never went above 30%, and it still failed after 7 minutes.