Some jobs not cancelling via API?

elijahproffitt · February 18, 2025, 3:14pm

Our Dremio cluster runs on EKS. To catch and preemptively kill long-running jobs sent to Dremio from our BI tool, we run a program every 30 seconds that uses the API to log in as the BI tool user, queries the sys.jobs table to identify the jobs we want to kill, and then calls the cancel job endpoint for each job.
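For readers following along, the watchdog described above could be sketched roughly like this. It is only an illustration under assumptions, not the poster's actual program: the row fields (`job_id`, `status`, `submitted_epoch_s`), the `BASE_URL` value, and the helper names are hypothetical, and the login and sys.jobs query steps are left out. The cancel path `/api/v3/job/{id}/cancel` is the cancel job endpoint of the Dremio REST API.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional

# Assumption: coordinator address; replace with your own.
BASE_URL = "http://localhost:9047"


def pick_long_running(rows: Iterable[dict], max_seconds: float,
                      now: Optional[float] = None) -> List[str]:
    """Return ids of RUNNING jobs older than max_seconds.

    Assumption: each row is a dict built from the sys.jobs query result,
    with hypothetical keys job_id, status and submitted_epoch_s.
    """
    now = time.time() if now is None else now
    return [
        row["job_id"]
        for row in rows
        if row["status"] == "RUNNING"
        and now - row["submitted_epoch_s"] > max_seconds
    ]


def cancel_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the cancel endpoint URL for one job."""
    return f"{base_url}/api/v3/job/{job_id}/cancel"


def http_post(url: str, auth_header: str) -> int:
    """POST with an empty body and return the HTTP status code.

    auth_header is whatever Authorization value your login step produced.
    """
    req = urllib.request.Request(
        url, data=b"", method="POST",
        headers={"Authorization": auth_header},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def cancel_jobs(job_ids: Iterable[str],
                post: Callable[[str], int]) -> Dict[str, int]:
    """Call the cancel endpoint for each job and record the status code."""
    return {job_id: post(cancel_url(job_id)) for job_id in job_ids}
```

In a real run, `post` would be something like `lambda url: http_post(url, auth_header)`. Passing it in as a callable keeps the selection logic easy to test without a cluster.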

This seems to work some (maybe most) of the time, but I’m noticing that for certain jobs it does not work, even though I receive the same 204 response I get when it does work. Or it works, but only after some number of calls to the cancel job endpoint. Sometimes the jobs will not cancel via the API calls, and also won’t cancel when you try to cancel them from the job details page in the UI. In these cases, we have to force a restart of our master node to clear the hanging jobs (we have a single master / coordinator setup).
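Since the 204 only acknowledges the cancel request, one way to tell which jobs actually stopped is to poll the job status afterwards. The sketch below assumes a caller-supplied `get_state` function (hypothetical) that returns the `jobState` string from `GET /api/v3/job/{id}`. The timeout and poll interval are arbitrary illustration values.

```python
import time
from typing import Callable

# Job states treated as "finished" here; anything else means still in flight.
TERMINAL_STATES = {"COMPLETED", "CANCELED", "FAILED"}


def wait_for_terminal(job_id: str,
                      get_state: Callable[[str], str],
                      timeout_s: float = 30.0,
                      poll_s: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> str:
    """Poll a job's state until it is terminal or the timeout expires.

    Returns the last state seen, so the caller can log jobs that are
    still RUNNING or stuck in CANCELLATION_REQUESTED.
    """
    deadline = clock() + timeout_s
    state = get_state(job_id)
    while state not in TERMINAL_STATES and clock() < deadline:
        sleep(poll_s)
        state = get_state(job_id)
    return state


def is_stuck(final_state: str) -> bool:
    """True when the job never reached a terminal state."""
    return final_state not in TERMINAL_STATES
```

Logging the ids for which `is_stuck` returns true gives a concrete list of jobs whose profiles are worth pulling.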

Is there any advice for how to approach or troubleshoot this issue? I looked at the logs on the master / executor nodes where these jobs are running and nothing relevant stood out to me other than some logs detecting the jobs being “long running” after a certain point.

Benny_Chow · February 19, 2025, 3:19am

The job details and query profile might give you an indication of what types of jobs are not responding to the cancel API. Maybe you can try to narrow down at what point of planning or execution this is happening for you. It could also be tied to a specific operator. If you post a few query profiles of jobs that aren’t cancelling, I might be able to tell you how far the job has progressed.

balaji.ramaswamy · February 19, 2025, 11:18pm

@elijahproffitt On a busy system, the cancel request might not have reached the executors. If you open the job profile and expand the phases, you can scroll to the right to see which phases/threads are still running. Pick an executor where the fragment status is still RUNNING and review the server.log on that executor to see if there was an issue with the cancel. As @Benny_Chow pointed out, the profile is a good place to start.
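As a rough aid for the server.log review, a small filter like the one below can pull out the lines that mention a given job. It assumes the job id appears verbatim in the relevant log lines, and that you have already copied server.log off the executor; where the file lives depends on your deployment. The function names are hypothetical.

```python
from typing import Iterable, List


def lines_for_job(log_lines: Iterable[str], job_id: str) -> List[str]:
    """Return the log lines that mention job_id (trailing newline stripped)."""
    return [line.rstrip("\n") for line in log_lines if job_id in line]


def cancel_mentions(job_lines: Iterable[str]) -> List[str]:
    """Of those lines, keep the ones that mention cancellation at all."""
    return [line for line in job_lines if "cancel" in line.lower()]
```

For example, `lines_for_job(open("server.log", encoding="utf-8"), job_id)` iterates over the file line by line. If `cancel_mentions` comes back empty for an executor whose fragments are still RUNNING, that would be consistent with the cancel request never reaching it.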
