Hello everybody!
We have a Dremio cluster with one master node and 5 executors that are provisioned through Yarn.
Because 2 of the executors were not running, we stopped them via the UI (the Provisioning page in the Admin/Cluster section). Unfortunately, unlike other times, the status remained “Stopping” and the executors were not actually stopped (see the image below).
So far, restarting the Dremio server, Yarn, or Zookeeper has not helped. We currently do not see the Dremio app being started in Yarn, so it has not actually started provisioning resources for a new one, but we do see the old one stopped. Also, we do not see any errors in server.log or server.out related to executors, other than:
ERROR c.d.exec.work.foreman.AttemptManager - IllegalStateException: Error: No executors are available
(which we understand, because we don’t see them running in the UI).
Do you have any idea what could cause Dremio to be unable to provision the executors and to stay stuck in a “Stopping” state without a clear reason or error?
Thank you for your time!
@mirelagrigoras What version of Dremio is this? Are you able to run the command ps -ef | grep dremio
on the Hadoop data nodes and see if the Dremio process is still running?
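A minimal sketch of running that check on several data nodes at once, assuming passwordless SSH from the edge node. The host names and the function names are placeholders for illustration, not something from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical data node host names; replace with your own.
NODES="datanode1 datanode2 datanode3"

# Keep only the lines of `ps -ef` style input that mention Dremio.
# The [d] bracket means the filter never matches its own grep command line,
# so it also works on local `ps -ef` output.
filter_dremio() {
  grep -i '[d]remio' || true   # "|| true": no match is not treated as a failure
}

# Print the Dremio processes found on every node (assumes passwordless SSH).
check_nodes() {
  for node in $NODES; do
    echo "== $node =="
    ssh "$node" 'ps -ef' | filter_dremio
  done
}
```

Calling check_nodes prints one section per host; an empty section means no Dremio process was found on that node.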
@balaji.ramaswamy Thank you for the reply!
The version we use is 4.1.8 Community Edition. Yes, on the edge (master node), the Dremio process is still running. On the other nodes we don’t see one, and there is currently no Dremio application running in Yarn either, as it was before the last restart.
Please let me know if I could provide any other useful information.
Thank you!
@mirelagrigoras I have seen some corner cases where the previous shutdown of the application left orphaned containers still running; this can cause a new provisioning to get stuck. That is why I wanted to make sure that no Dremio processes are running on any of the data nodes, even though the application itself is not running.
Also, when you say you tried to stop the Dremio cluster, I assume you also tried restarting the coordinator, and that after the coordinator comes back up and you log on to the UI, the engine still shows as “Stopping”?
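To double-check the Yarn side as well, here is a small sketch that pulls Dremio application IDs out of the output of yarn application -list. It assumes the application name contains "dremio" in some letter case, which may not hold for every setup; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the IDs of Yarn applications whose listing line mentions "dremio".
# Assumption: the application name contains "dremio" (any letter case).
dremio_app_ids() {
  awk 'tolower($0) ~ /dremio/ && $1 ~ /^application_/ { print $1 }'
}

# Usage (requires the yarn CLI on the PATH):
#   yarn application -list -appStates ALL | dremio_app_ids
# A leftover application could then be stopped with:
#   yarn application -kill <application id>
```

Listing with -appStates ALL also shows finished and killed applications, so check the state column before killing anything.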