YARN Executor Fails to Shut Down and Tries to Keep Running Jobs

We’ve run into circumstances where YARN kills one container and brings Dremio up on a new container, but the Dremio process on the old container does not die. This is related to an issue fixed in 3.1.7. The difference in our case is that Dremio doesn’t even try to shut down, so the /live endpoint that the YarnWatchdog checks still returns cleanly, and the instance does not get shut down until a Hadoop admin kills it for us.

From the 3.1.7 release notes, on the previous bug fix:

In YARN deployments, executor processes in containers sometimes do not exit cleanly and remain active.
Resolved by implementing a watchdog to watch Dremio processes and an HTTP health check to kill executor processes that do not shut down cleanly.
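To make the failure mode concrete: the watchdog pattern from that release note only helps when the health probe actually fails. Below is a minimal sketch of that pattern (the function names, the retry count, and the probing logic are my illustration, not Dremio’s actual implementation). In our case the /live probe keeps succeeding, so a watchdog built like this would never decide to kill the orphaned process:

```python
import urllib.request
import urllib.error

def is_live(url: str, timeout: float = 5.0) -> bool:
    """Probe an HTTP liveness endpoint such as /live.

    Returns True when the endpoint answers 200, i.e. the process
    claims to be healthy -- even if YARN has already abandoned it.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def should_kill(probe, max_failures: int = 3) -> bool:
    """Watchdog decision: kill only after max_failures consecutive failed probes.

    `probe` is any zero-argument callable returning True for healthy;
    a probe that keeps returning True (our situation) never triggers a kill.
    """
    for _ in range(max_failures):
        if probe():
            return False
    return True
```

This is exactly why the 3.1.7 fix doesn’t cover our case: the decision is driven entirely by probe failures, and our orphaned executor still reports healthy.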

@patricker

This is a slight variation of the problem, where YARN sometimes cleans up the bundled jar files but not the actual Dremio process. The Dremio coordinator thinks it is still a valid container and sends work to it. Do you see more nodes when you compare the list under “Node Activity” with the output of `select hostname from sys.nodes`? The extra hosts under sys.nodes (but not shown under Admin > Node Activity) are the orphaned ones.
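If it helps, the comparison can be sketched as a simple set difference once you have both host lists in hand (hostnames below are made up for illustration):

```python
def orphaned_executors(sys_nodes_hosts, node_activity_hosts):
    """Hosts listed in sys.nodes but absent from Node Activity are
    likely orphaned executor processes."""
    return sorted(set(sys_nodes_hosts) - set(node_activity_hosts))

orphans = orphaned_executors(
    ["hdp01", "hdp02", "hdp03"],  # from: select hostname from sys.nodes
    ["hdp01", "hdp02"],           # hosts shown on the Node Activity page
)
# orphans == ["hdp03"]
```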

Thanks
@balaji.ramaswamy

@balaji.ramaswamy Yes, we see the node under sys.nodes, but not under the YARN node list.
We’ve used this method in the past to identify which hosts need to be cleaned up, but there is no way for us to remotely shut down the Dremio processes that I am aware of. We end up contacting a Hadoop administrator who can log in to the host and kill the Dremio process.
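For anyone else in the same spot, what the admin does on the host amounts to finding the leftover PID and signaling it. A rough sketch, assuming a Linux host with pgrep available; the "DremioDaemon" pattern is a guess at the process name, so verify what the process actually looks like on your cluster before killing anything:

```python
import os
import signal
import subprocess

def find_dremio_pids(pattern: str = "DremioDaemon") -> list:
    """Return PIDs whose full command line matches `pattern`.

    Uses pgrep -f, so the pattern is matched against the whole
    command line. "DremioDaemon" is illustrative -- confirm the real
    process name on your hosts first.
    """
    result = subprocess.run(["pgrep", "-f", pattern],
                            capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]

def stop_orphan(pid: int) -> None:
    """Ask the orphaned process to exit gracefully (SIGTERM);
    an admin would escalate to SIGKILL only if it ignores this."""
    os.kill(pid, signal.SIGTERM)
```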

@patricker

Totally agree with you. We are working internally to address this issue in a coming release.

Thanks
@balaji.ramaswamy

@balaji.ramaswamy I’ve seen a few small bug fix releases come out recently. Any update on the timeline for this fix?

@patricker, it is being actively worked on and should land in one of the next releases. It is a big change, so we currently do not have an ETA.

@patricker

This should be fixed in 4.1.8.

I agree. I reviewed the code improvements for this in 4.1.8, and we are working on testing it.