Some jobs not cancelling via API?

Our Dremio cluster runs on EKS. To catch and preemptively kill long-running jobs submitted from our BI tool, we run a program every 30 seconds that uses the API to log in as the BI tool user, queries the sys.jobs table to identify the jobs we want to kill, and then calls the cancel job endpoint for each of them.
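For context, here is a minimal sketch of that loop in Python. It assumes the standard Dremio REST endpoints (/apiv2/login for auth, /api/v3/sql to submit the sys.jobs query, /api/v3/job/{id} and /api/v3/job/{id}/results to poll and read it, and /api/v3/job/{id}/cancel to cancel). The URL, credentials, and the WHERE clause on sys.jobs are placeholders, not our real filter:

```python
import time
import requests

DREMIO_URL = "https://dremio.example.com:9047"  # placeholder coordinator URL
BI_USER = "bi_tool_user"                        # placeholder credentials
BI_PASSWORD = "..."

# Example predicate only -- replace with whatever identifies the jobs to kill.
LONG_RUNNING_SQL = """
SELECT job_id
FROM sys.jobs
WHERE status = 'RUNNING'
  AND user_name = 'bi_tool_user'
"""

def login():
    # POST /apiv2/login returns a token used in the Authorization header.
    resp = requests.post(f"{DREMIO_URL}/apiv2/login",
                         json={"userName": BI_USER, "password": BI_PASSWORD})
    resp.raise_for_status()
    return {"Authorization": f"_dremio{resp.json()['token']}"}

def run_sql(headers, sql):
    # The sys.jobs query is itself submitted as a job: wait for it, then read results.
    job_id = requests.post(f"{DREMIO_URL}/api/v3/sql",
                           headers=headers, json={"sql": sql}).json()["id"]
    while True:
        state = requests.get(f"{DREMIO_URL}/api/v3/job/{job_id}",
                             headers=headers).json()["jobState"]
        if state in ("COMPLETED", "FAILED", "CANCELED"):
            break
        time.sleep(1)
    results = requests.get(f"{DREMIO_URL}/api/v3/job/{job_id}/results",
                           headers=headers).json()
    return results.get("rows", [])

def cancel_job(headers, job_id):
    # POST /api/v3/job/{id}/cancel returns 204 when the request is accepted.
    resp = requests.post(f"{DREMIO_URL}/api/v3/job/{job_id}/cancel",
                         headers=headers)
    print(job_id, resp.status_code)

if __name__ == "__main__":
    while True:
        headers = login()
        for row in run_sql(headers, LONG_RUNNING_SQL):
            cancel_job(headers, row["job_id"])
        time.sleep(30)
```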

This seems to work some (maybe most) of the time, but I'm noticing that for certain jobs it does not, even though I receive the same 204 response I get when it does work. Or it will work, but only after some number of attempts against the cancel job endpoint. Sometimes a job will not cancel via the API at all, and it also won't cancel from the job details page in the UI. In those cases we have to force a restart of our master node to clear the hanging jobs (we have a single master/coordinator setup).

Is there any advice on how to approach or troubleshoot this issue? I looked at the logs on the master and executor nodes where these jobs are running, and nothing relevant stood out to me other than some entries flagging the jobs as "long running" after a certain point.

The job details and query profile might give you an indication of what types of jobs are not responding to the cancel API. Maybe you can try to narrow down at what point of planning or execution this is happening for you. It could also be tied to a specific operator. If you post a few query profiles of jobs that aren't cancelling, I might be able to tell you how far the job has progressed.

@elijahproffitt On a busy system, the cancel request might not have reached the executors. Open the job profile, expand the phases, and scroll to the right to see which phases/threads are running. Then pick an executor where the fragment status is still RUNNING and review the server.log on that executor to see if there was an issue with the cancel. As @Benny_Chow pointed out, the profile is a good place to start.
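One way to confirm whether a cancel actually landed, rather than trusting the 204 alone, is to poll the job state afterwards and only treat it as done once it reaches a terminal state. A minimal sketch, reusing the same login headers and the /api/v3/job/{id} endpoint as above (the timeout value is arbitrary):

```python
import time
import requests

def wait_for_cancel(headers, dremio_url, job_id, timeout_s=60):
    """Re-check the job state after a cancel; a 204 only means the request was accepted."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        state = requests.get(f"{dremio_url}/api/v3/job/{job_id}",
                             headers=headers).json()["jobState"]
        if state in ("CANCELED", "COMPLETED", "FAILED"):
            return state
        time.sleep(5)
    # Still running after the timeout: these are the jobs worth chasing in the
    # executor's server.log and in the profile's RUNNING fragments.
    return "STILL_RUNNING"
```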