Hi, can someone help me identify why this error is happening?
We only see it with large queries and aren't sure why.
thanks
There are also other flavors of the same error happening:
ChannelClosedException: [FABRIC]: Channel closed /198.18.101.211:45678 ↔ /198.18.101.215:42096 (fabric server)
The above-mentioned IPs are, for example, Executor 2 and Executor 4.
Have you checked whether the executors are crashing / restarting when the error occurs? If so, I’d look for the exit code – 137 (out of memory) maybe?
I’ve seen cases in production where executors were killed because of out-of-memory situations at the OS level. Other cases with similar error messages were caused by high CPU load on the executor node(s), leading to timeouts when the executor was meant to reply to the coordinator (the usual timeout for that in Dremio is 5000 ms).
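If you’re running on Kubernetes (the executor pod DNS names in this thread suggest so), here is a minimal sketch for checking whether the executor containers were last terminated with exit code 137. It uses the official `kubernetes` Python client; the label selector `app=dremio-executor` is an assumption, so adjust it to whatever your Helm chart actually sets.

```python
# Sketch only: assumes a Kubernetes deployment, the "kubernetes" Python client,
# and the hypothetical label selector "app=dremio-executor".
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("mlops-lakehouse", label_selector="app=dremio-executor")
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        if term is not None:
            # Exit code 137 = SIGKILL, typically an OOM kill by the kernel or kubelet.
            print(f"{pod.metadata.name}: restarts={cs.restart_count}, "
                  f"last exit code={term.exit_code}, reason={term.reason}")
        else:
            print(f"{pod.metadata.name}: restarts={cs.restart_count}, no prior termination")
```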
As Dr Tim says, dremio-executor-4.dremio-cluster-pod.mlops-lakehouse.svc.cluster.local went unresponsive. Any chance you can send us the server.log, server.out (if writing to a persistent disk), and the GC logs from that server? They should give us clues as to why it went unresponsive.
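If the logs go to a persistent disk, something like the following can pull them off the executor pod before a redeploy removes them. This is a rough sketch: the log directory /opt/dremio/data/log and the GC log file name are assumptions, so check your dremio.conf / Helm values for the actual paths.

```python
# Sketch only: assumes the "kubernetes" Python client and that Dremio writes its
# logs to /opt/dremio/data/log on the executor pod (verify against your config).
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

pod = "dremio-executor-4"        # the unresponsive executor from the error above
namespace = "mlops-lakehouse"

# "server.gc" is the assumed GC log name; it may differ depending on your JVM options.
for log_file in ["server.log", "server.out", "server.gc"]:
    contents = stream(
        v1.connect_get_namespaced_pod_exec,
        pod,
        namespace,
        command=["cat", f"/opt/dremio/data/log/{log_file}"],  # assumed log path
        stderr=True, stdin=False, stdout=True, tty=False,
    )
    with open(log_file, "w") as out:
        out.write(contents)
```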
Hey @balaji.ramaswamy, due to several redeployments I have already lost the logs, but I have another case.
It is a simple UPDATE on a big table that should be able to both parallelize and be split into manageable chunks, so there is no reason to overload the executors. During the operation, CPU and memory utilization never went above 30%, and it still failed after 7 minutes.