YARN Executor Fails to Shut Down and Tries to Keep Running Jobs

We’ve run into circumstances where YARN kills one container and brings Dremio up on a new container, but the Dremio process on the old container does not die. This is related to an issue fixed in 3.1.7. The difference in our case is that Dremio doesn’t even try to shut down, so the /live endpoint that the YarnWatchdog checks still returns cleanly, and the instance does not get shut down until a Hadoop admin kills it for us.

From the 3.1.7 release notes, on the previous bug fix:

In YARN deployments, executor processes in containers sometimes do not exit cleanly and remain active.
Resolved by implementing a watchdog to watch Dremio processes and an HTTP health check to kill executor processes that do not shut down cleanly.
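To make the failure mode concrete: the watchdog pattern from that release note only helps when the health probe actually fails. Below is a minimal sketch of that pattern (the function names, the retry count, and the probing logic are my illustration, not Dremio’s actual implementation). In our case the /live probe keeps succeeding, so a watchdog built like this would never decide to kill the orphaned process:

```python
import urllib.request
import urllib.error

def is_live(url: str, timeout: float = 5.0) -> bool:
    """Probe an HTTP liveness endpoint such as /live.

    Returns True when the endpoint answers 200, i.e. the process
    claims to be healthy -- even if YARN has already abandoned it.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def should_kill(probe, max_failures: int = 3) -> bool:
    """Watchdog decision: kill only after max_failures consecutive failed probes.

    `probe` is any zero-argument callable returning True for healthy;
    a probe that keeps returning True (our situation) never triggers a kill.
    """
    for _ in range(max_failures):
        if probe():
            return False
    return True
```

This is exactly why the 3.1.7 fix doesn’t cover our case: the decision is driven entirely by probe failures, and our orphaned executor still reports healthy.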

@patricker

This is a slight variation of the problem, where YARN sometimes cleans up the bundled jar files but not the actual Dremio process. The Dremio coordinator thinks it is still a valid container and sends work to it. Do you see more nodes when you compare the list under “Node Activity” with the output of `select hostname from sys.nodes`? The extra hosts under sys.nodes (but not shown under Admin > Node Activity) are the orphaned ones.
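If it helps, the comparison can be sketched as a simple set difference once you have both host lists in hand (hostnames below are made up for illustration):

```python
def orphaned_executors(sys_nodes_hosts, node_activity_hosts):
    """Hosts listed in sys.nodes but absent from Node Activity are
    likely orphaned executor processes."""
    return sorted(set(sys_nodes_hosts) - set(node_activity_hosts))

orphans = orphaned_executors(
    ["hdp01", "hdp02", "hdp03"],  # from: select hostname from sys.nodes
    ["hdp01", "hdp02"],           # hosts shown on the Node Activity page
)
# orphans == ["hdp03"]
```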

Thanks
@balaji.ramaswamy

@balaji.ramaswamy Yes, we see the node under sys.nodes, but not under the YARN node list.
We’ve used this method in the past to identify which hosts need to be cleaned up, but there is no way for us to remotely shut down the Dremio processes that I am aware of. We end up contacting a Hadoop administrator who can log in to the host and kill the Dremio process.
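For anyone else in the same spot, what the admin does on the host amounts to finding the leftover PID and signaling it. A rough sketch, assuming a Linux host with pgrep available; the "DremioDaemon" pattern is a guess at the process name, so verify what the process actually looks like on your cluster before killing anything:

```python
import os
import signal
import subprocess

def find_dremio_pids(pattern: str = "DremioDaemon") -> list:
    """Return PIDs whose full command line matches `pattern`.

    Uses pgrep -f, so the pattern is matched against the whole
    command line. "DremioDaemon" is illustrative -- confirm the real
    process name on your hosts first.
    """
    result = subprocess.run(["pgrep", "-f", pattern],
                            capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]

def stop_orphan(pid: int) -> None:
    """Ask the orphaned process to exit gracefully (SIGTERM);
    an admin would escalate to SIGKILL only if it ignores this."""
    os.kill(pid, signal.SIGTERM)
```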

@patricker

Totally agree with you. We are working internally to address this issue in a coming release.

Thanks
@balaji.ramaswamy

@balaji.ramaswamy I’ve seen a few small bug fix releases come out recently. Any update on the timeline for this fix?

@patricker, it is being actively worked on and should land in one of the next releases. It is a big change, so we currently do not have an ETA.

@patricker

This should be fixed in 4.1.8.

I agree. I reviewed the code improvements for this in 4.1.8, and we are working on testing it.