We’ve run into circumstances where YARN kills one container and brings up Dremio on a new container, but Dremio does not die on the old container. This is related to an issue that was fixed in 3.1.7. The difference is that in our case Dremio doesn’t even try to shut down, so the /live endpoint that the YarnWatchdog checks still returns cleanly, and the instance does not get shut down until a Hadoop admin kills it for us.
From the 3.1.7 release notes, on the previous bug fix:
In YARN deployments, executor processes in containers sometimes do not exit cleanly and remain active.
Resolved by implementing a watchdog to watch Dremio processes and a HTTP health check to kill executor processes that do not shutdown cleanly.
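
To illustrate why that fix doesn't help in our case, here is a rough sketch of how I understand a liveness-based watchdog to behave. This is not Dremio's actual code: the function name `should_kill` and the `max_failures` threshold are made up for illustration, and I'm assuming the watchdog only kills the process after some number of consecutive failed health probes.

```python
from typing import Iterable


def should_kill(probe_results: Iterable[bool], max_failures: int = 3) -> bool:
    """Hypothetical watchdog decision, NOT Dremio's implementation.

    probe_results: outcome of each health probe in order
                   (True = the liveness endpoint answered cleanly).
    max_failures:  assumed number of consecutive failures before a kill.
    """
    consecutive = 0
    for ok in probe_results:
        # A clean response resets the failure streak.
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= max_failures:
            return True
    return False


# Process that started shutting down but hung: probes fail, watchdog kills it.
print(should_kill([True, False, False, False]))

# Our case: the process never starts shutting down, so every probe succeeds.
print(should_kill([True] * 10))
```

Under these assumptions the second call never triggers a kill, which matches what we see: as long as the old executor keeps answering the health check, nothing on the Dremio side removes it.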