We’ve run into circumstances where YARN kills one container and brings up Dremio in a new container, but Dremio does not die on the old container. This is related to an issue that was fixed in 3.1.7. The difference is that in our case Dremio doesn’t even try to shut down, so the /live endpoint that the YarnWatchdog checks still returns cleanly, and the instance does not get shut down until a Hadoop admin kills it for us.
From the 3.1.7 release notes, on the previous bug fix:
In YARN deployments, executor processes in containers sometimes do not exit cleanly and remain active.
Resolved by implementing a watchdog to watch Dremio processes and an HTTP health check to kill executor processes that do not shut down cleanly.
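For context, the watchdog-plus-health-check approach described in the release notes can be sketched roughly as follows. This is a hypothetical illustration, not Dremio’s actual YarnWatchdog code; the URL, port, and failure threshold are all assumptions:

```python
# Hypothetical sketch of a watchdog-style liveness check.
# NOT Dremio's actual YarnWatchdog; the endpoint URL, port, and
# failure threshold below are assumptions for illustration only.
import urllib.request
import urllib.error


def is_live(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def should_kill(check, max_failures: int = 3) -> bool:
    """Decide to kill the executor after `max_failures` consecutive
    failed health checks (a real watchdog would sleep between polls
    and keep looping while the process stays healthy)."""
    failures = 0
    while failures < max_failures:
        if check():
            return False  # healthy; a real loop would reset and keep polling
        failures += 1
    return True


# Example usage with an assumed local executor port:
# should_kill(lambda: is_live("http://localhost:8080/live"))
```

The point of the `/live` check is exactly the failure mode described above: if the process is wedged but the endpoint still returns cleanly, a watchdog like this never fires.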
This is a slight variation of the problem, where YARN sometimes cleans up the bundled jar files but not the actual Dremio process. The Dremio coordinator thinks it is still a valid container and sends work to it. Do you see more nodes when you compare the list under “Node Activity” with the output of “select hostname from sys.nodes”? The extra hosts under sys.nodes (but not shown under Admin > Node Activity) are the orphaned ones.
@balaji.ramaswamy Yes, we see the node under sys.nodes, but not under the YARN node list.
We’ve used this method in the past to identify which hosts need to be cleaned up, but there is no way that I am aware of for us to remotely shut down the Dremio processes. We end up contacting a Hadoop administrator who can log in to the host and kill the Dremio process.
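The comparison above is easy to script once you have the two host lists in hand. A minimal sketch, assuming you’ve already fetched sys.nodes (e.g. via a JDBC/ODBC query) and the YARN node list (e.g. via `yarn node -list` or the ResourceManager REST API); the fetching itself is omitted and the hostnames are made up:

```python
# Minimal sketch: find orphaned executors by set difference.
# Assumes the two host lists were fetched separately (not shown);
# hostnames below are made-up examples.

def find_orphans(sys_nodes_hosts, yarn_hosts):
    """Hosts present in Dremio's sys.nodes but absent from YARN's node list."""
    return sorted(set(sys_nodes_hosts) - set(yarn_hosts))


dremio_hosts = ["node1.example.com", "node2.example.com", "node3.example.com"]
yarn_hosts = ["node1.example.com", "node3.example.com"]
print(find_orphans(dremio_hosts, yarn_hosts))  # ['node2.example.com']
```

This only identifies the orphans; as noted, actually killing them still requires someone with shell access to the host.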
Totally agree with you. We are working internally to address this issue in a coming release.
@balaji.ramaswamy I’ve seen a few small bug fix releases come out recently. Any update on the timeline for this fix?
@patricker, it is being actively worked on and should be in one of the next releases. It is a big change, so we currently do not have an ETA.
This should be fixed in 4.1.8.
I agree. I reviewed the code improvements for this in 4.1.8, and we are working on testing it.