ChannelClosedException: [FABRIC]: Channel closed null <--> null (fabric client) on large query

Hi, can someone help me identify why this error is happening?
We only experience it with large queries and are not sure why.
Thanks

There are also other flavors of the same error, for example:
ChannelClosedException: [FABRIC]: Channel closed /198.18.101.211:45678 ↔ /198.18.101.215:42096 (fabric server)
The above-mentioned IPs are, for example, Executor 2 and Executor 4.

877592cf-a4e3-4545-99f6-112961fc0559.zip (115.7 KB)

Jaro

Hi Jaroslav,

Have you checked whether the executors are crashing or restarting when the error occurs? If so, I’d look at the exit code – maybe 137 (out of memory)?

I’ve seen cases in production where executors were killed because of out-of-memory situations at the OS level. Other cases with similar error messages were caused by high CPU load on the executor node(s), leading to timeouts when the executor was meant to reply to the coordinator (the usual timeout in Dremio for that is 5000 ms).
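
If you are running on Kubernetes (the executor hostnames later in this thread suggest you are), a quick way to check for OOM kills is to look at the last termination state of the executor pods. Below is only a rough sketch of that idea in Python; the namespace and label selector are my assumptions, so adjust them to your deployment.

```python
# Rough sketch: report restart counts and the exit code of the last terminated
# container for each Dremio executor pod. Exit code 137 usually means the
# container was killed (commonly by the OOM killer).
# Assumes kubectl access; the namespace and label selector are assumptions.
import json
import subprocess

NAMESPACE = "mlops-lakehouse"            # assumed from the hostnames in this thread
LABEL_SELECTOR = "app=dremio-executor"   # adjust to whatever labels your chart uses

out = subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", LABEL_SELECTOR, "-o", "json"],
    capture_output=True, text=True, check=True,
).stdout

for pod in json.loads(out)["items"]:
    name = pod["metadata"]["name"]
    for cs in pod.get("status", {}).get("containerStatuses", []):
        print(f"{name}: restarts={cs.get('restartCount', 0)}")
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated:
            print(f"  last termination: exitCode={terminated.get('exitCode')} "
                  f"reason={terminated.get('reason')}")
```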

Best, Tim

@jaroslav_marko

As Dr Tim says, dremio-executor-4.dremio-cluster-pod.mlops-lakehouse.svc.cluster.local went unresponsive. Any chance you can send us the server.log, server.out (if you are writing to a persistent disk) and GC logs from that server? They should give us clues on why it went unresponsive.

Hey @balaji.ramaswamy, due to several redeployments I have already lost the logs, but I have another case.
It is a simple update on a big table that should be able to both parallelize and be split into manageable chunks, so there is no reason to overload the executors. During the operation, CPU and memory utilization were not higher than 30%, and it still failed after 7 minutes.

Attaching the profile, and I would kindly ask you for some help.
d819c814-10b0-4814-a5ed-4f781194210d.zip (345.2 KB)

jaro

@jaroslav_marko, @tid and @balaji.ramaswamy
Any update on this? I am also running into this issue.

@jaroslav_marko When 198.18.101.218 tried to communicate with 198.18.101.217, it timed out. This could be because of a GC pause or an RPC taking too long.

Do you have GC logs and server.log on 198.18.101.217 from when this query ran? Can you check whether there was either a long Young GC or a Full GC, or any messages for this Job ID in server.log on that node (198.18.101.217)?
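
If it helps, here is a rough sketch of how you could scan for both things on that node. It assumes the GC log was written with JDK unified logging (-Xlog:gc*), where pause lines end in a duration such as 356.123ms; the log paths, pause threshold, and job ID below are placeholders you would need to replace with your own values.

```python
# Rough sketch: flag long GC pauses and pull job-related lines from server.log.
# Assumes unified JVM GC logging, where pause lines look roughly like:
#   [...] GC(42) Pause Young (Normal) ... 356.123ms
# Paths, threshold, and the job ID are placeholders -- adjust for your node.
import re

GC_LOG = "/opt/dremio/log/gc.log"          # placeholder path
SERVER_LOG = "/opt/dremio/log/server.log"  # placeholder path
JOB_ID = "your-job-id-here"                # the Job ID from the query profile
PAUSE_THRESHOLD_MS = 1000.0                # pauses near the 5000 ms RPC timeout are suspect

pause_re = re.compile(r"Pause (Young|Full).*?([\d.]+)ms")

with open(GC_LOG) as f:
    for line in f:
        m = pause_re.search(line)
        if m and float(m.group(2)) >= PAUSE_THRESHOLD_MS:
            print("LONG GC PAUSE:", line.rstrip())

with open(SERVER_LOG) as f:
    for line in f:
        if JOB_ID in line:
            print("JOB LOG:", line.rstrip())
```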