YARN Executor Fails to Shut Down and Keeps Running Jobs

We’ve run into circumstances where YARN kills one container and brings up Dremio in a new container, but Dremio does not die on the old one. This is related to an issue fixed in 3.1.7. The difference is that in our case Dremio doesn’t even try to shut down, so the /live endpoint that the YarnWatchdog checks still returns cleanly, and the instance does not get shut down until a Hadoop admin kills it for us.

From the 3.1.7 release notes, describing the earlier bug fix:

In YARN deployments, executor processes in containers sometimes do not exit cleanly and remain active.
Resolved by implementing a watchdog to watch Dremio processes and an HTTP health check to kill executor processes that do not shut down cleanly.
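
For context, that health check is essentially an HTTP GET against the executor’s /live endpoint. Here is a minimal sketch of an equivalent probe; the host and port below are placeholders, since the liveness port is deployment-specific:

```python
# Minimal sketch of a liveness probe like the one YarnWatchdog performs.
# HOST and LIVENESS_PORT are placeholders; use the executor's actual host
# and the liveness port configured for your deployment.
import urllib.error
import urllib.request

HOST = "executor-host.example.com"   # placeholder
LIVENESS_PORT = 9090                 # placeholder; deployment-specific

def is_live(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if the executor's /live endpoint answers with HTTP 200."""
    url = f"http://{host}:{port}/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    # An orphaned executor that never began shutting down still answers 200
    # here, which is why the watchdog leaves it running.
    print("live" if is_live(HOST, LIVENESS_PORT) else "not live")
```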

@patricker

This is a slight variation of the problem, where YARN sometimes cleans up the bundled jar files but not the actual Dremio process. The Dremio coordinator still thinks the container is valid and sends work to it. Do you see more nodes in “SELECT hostname FROM sys.nodes” than in the list under “Node Activity”? The extra hosts that appear under sys.nodes (but not under Admin > Node Activity) are the orphaned ones.
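
For example, you can pull the coordinator’s view of registered nodes with the query below, then compare the result against the hosts shown under Admin > Node Activity:

```sql
-- Hosts the coordinator currently considers registered nodes.
-- Any hostname returned here but absent from Admin > Node Activity
-- is an orphaned container.
SELECT hostname
FROM   sys.nodes
ORDER  BY hostname;
```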

Thanks
@balaji.ramaswamy

@balaji.ramaswamy Yes, we see the node under sys.nodes, but not under the YARN node list.
We’ve used this method in the past to identify which hosts need to be cleaned up, but there is no way that I am aware of to remotely shut down the Dremio processes. We end up contacting a Hadoop administrator who can log in to the host and kill the Dremio process.
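
For anyone else hitting this, the manual cleanup looks roughly like the sketch below. This is hypothetical, not an official procedure; it assumes the orphaned executor shows up in `jps -l` with a main class containing “dremio”, so confirm the PID before killing anything:

```python
# Hypothetical sketch of the manual cleanup: find the orphaned Dremio
# executor's JVM on the host and kill it. Assumes the process appears in
# `jps -l` with a main class containing "dremio"; confirm the PID first.
import os
import signal
import subprocess

def find_dremio_pids() -> list[int]:
    # `jps -l` prints "<pid> <fully.qualified.MainClass>" for each local JVM.
    out = subprocess.run(["jps", "-l"], capture_output=True, text=True, check=True)
    return [int(line.split()[0])
            for line in out.stdout.splitlines()
            if "dremio" in line.lower()]

for pid in find_dremio_pids():
    print(f"Found Dremio JVM with PID {pid}")
    # os.kill(pid, signal.SIGKILL)  # uncomment only after verifying the PID
```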

@patricker

Totally agree with you. We are working internally to address this issue in a coming release.

Thanks
@balaji.ramaswamy