Our Dremio cluster runs on EKS. To catch and preemptively kill long-running jobs sent to Dremio from our BI tool, we run a program every 30 seconds that uses the API to log in as the BI tool user, queries the sys.jobs table to identify the jobs we want to kill, and then calls the cancel job endpoint for each of those jobs.
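For anyone trying to reproduce this, here is a minimal sketch of that kind of loop body, not the exact program. It assumes the login, SQL, job status, job results and cancel endpoints as I understand them from the Dremio REST API docs. The host name, the `send` helper, and the `job_id` column name are illustrative assumptions, so check them against your own version.

```python
import json
import time
import urllib.request

BASE_URL = "http://dremio-coordinator:9047"  # hypothetical coordinator address


def call(method, path, token=None, body=None):
    """Send one JSON request and return (HTTP status, parsed body or None).

    urllib raises HTTPError on 4xx/5xx responses, so only success codes return here.
    """
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = "_dremio" + token
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req, timeout=10) as resp:
        raw = resp.read()
        return resp.status, (json.loads(raw) if raw else None)


def login(user, password, send=call):
    """Log in as the BI tool user and return the session token."""
    _, body = send("POST", "/apiv2/login", body={"userName": user, "password": password})
    return body["token"]


def find_job_ids(sql, token, send=call, max_polls=30, poll_delay=1.0):
    """Run the selection query against sys.jobs and return the job ids it yields.

    Assumes the result rows carry a `job_id` column (an assumption, adjust to your query).
    """
    _, submitted = send("POST", "/api/v3/sql", token=token, body={"sql": sql})
    query_id = submitted["id"]
    for _ in range(max_polls):
        _, job = send("GET", f"/api/v3/job/{query_id}", token=token)
        if job["jobState"] == "COMPLETED":
            _, results = send("GET", f"/api/v3/job/{query_id}/results", token=token)
            return [row["job_id"] for row in results["rows"]]
        if job["jobState"] in ("CANCELED", "FAILED"):
            return []
        time.sleep(poll_delay)
    return []  # the selection query itself did not finish in time


def cancel_jobs(job_ids, token, send=call):
    """Call the cancel endpoint for each job and record the HTTP status returned."""
    statuses = {}
    for job_id in job_ids:
        status, _ = send("POST", f"/api/v3/job/{job_id}/cancel", token=token)
        statuses[job_id] = status
    return statuses
```

The `send` parameter exists only so the HTTP layer can be swapped out for a stub when testing. In a real run you would call `login(...)`, pass your own sys.jobs query to `find_job_ids(...)`, and hand the result to `cancel_jobs(...)` every 30 seconds.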
This seems to work some (maybe most) of the time, but I'm noticing that for certain jobs it does not work, even though I receive the same 204 response I get when it does work. In other cases it works, but only after several calls to the cancel job endpoint. Sometimes the jobs will not cancel via the API at all, and they also won't cancel from the job details page in the UI. In these cases, we have to force a restart of our master node to clear the hanging jobs (we have a single master / coordinator setup).
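Since the 204 alone does not tell these cases apart, one way to narrow it down is to check the job state after each cancel call and log which jobs never reach a terminal state. The sketch below assumes the job status endpoint reports a `jobState` field with values such as `CANCELED`, `COMPLETED` and `FAILED`. The `get_state` callable is a hypothetical wrapper around that lookup.

```python
import time

# States after which a job is no longer running (names assumed from the job status API).
TERMINAL_STATES = {"COMPLETED", "CANCELED", "FAILED"}


def wait_for_terminal_state(job_id, get_state, attempts=5, delay=2.0, sleep=time.sleep):
    """Poll a job's state after a cancel request and return the last state seen.

    `get_state(job_id)` is a hypothetical callable returning the job's current
    state string. If the returned state is not terminal, the cancel did not
    take effect within `attempts` checks.
    """
    state = None
    for attempt in range(attempts):
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            return state
        if attempt < attempts - 1:
            sleep(delay)  # give the coordinator time to propagate the cancel
    return state
```

Recording the job id, the number of checks, and the final state for each stuck job would give something concrete to correlate with the coordinator and executor logs.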
Is there any advice on how to approach or troubleshoot this issue? I looked at the logs on the master and executor nodes where these jobs were running, and nothing relevant stood out to me other than some entries flagging the jobs as “long running” after a certain point.