Job status different between Jobs UI and Query & Planning

Hi,

Last week we deployed additional executors to our Dremio cluster.

After requesting a cancellation, when I go to ‘Query and Planning’ the job status is listed as cancelled.

However, I am currently running into issues where I will request a job cancellation and the job remains as ‘running’ in the Jobs landing page and when I filter by status. This query will continue to be in progress forever and will count against the query concurrency limit.

The result of this is reflections/queries timing out, and everything remaining queued/timing out behind them, so Dremio no longer processes anything. This happens with Dremio coordinator/executors sitting at <5% utilization.

Even Query Type: Internal by Nodes is taking > 1 hour in some cases.

We tried resetting the coordinator, but this issue continues to happen.

Jake

@jdingler

When this happens next time, kindly send us the profile of the job that says CANCELLED. I am wondering if there are a few fragments on some executors that were not cancelled.

The phases in the profile would tell us if that is the case.

How to share a Dremio query profile

Thanks
@balaji.ramaswamy

Hi @balaji.ramaswamy,

Attached is a query profile for one of the queries where it would be ‘in progress’ forever even though I cancelled the job.

Jake

05818a81-6958-4a34-a0e8-0ca73c053a9d.zip (1.5 MB)

Hi @balaji.ramaswamy

We are still running into this issue and continue to restart our cluster every day. Any tips on how to resolve this issue?

Jake

Hi @balaji.ramaswamy

Here is another job profile for this error.

Thanks!
Jake

5f76f575-c65b-455a-8299-96a72d928164.zip (967.5 KB)

The Jobs UI lists it as running, the Jobs Profile lists it as cancelled, but the Jobs API still lists it as running.

Even if I submit a cancellation POST request to the Jobs API (and receive a 204), it does not cancel the job.

All of the affected jobs have "queryType": "ACCELERATOR_CREATE".

I am still seeing this issue, where jobs run for days and are never cancelled.

In Jobs it is still running.
In Jobs > Profile > Query it says ‘Cancelled’.
When I send a GET request to api/v3/job/{id}, it is still marked as ‘running’.
When I send a POST request to api/v3/job/{id}/cancel, I get a 204, but nothing ever cancels.
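
The GET and POST checks above can be scripted so the mismatch is easy to capture while it is happening. This is only a minimal sketch: `DREMIO_URL`, `AUTH_HEADER`, and the helper names are hypothetical, and it assumes the status response carries a `jobState` field whose terminal values are `CANCELED`, `COMPLETED`, and `FAILED`. Adjust these to match your deployment.

```bash
#!/usr/bin/env bash
# Sketch only: DREMIO_URL, AUTH_HEADER, and the "jobState" field name and
# its terminal values are assumptions, not taken from this thread.
DREMIO_URL="${DREMIO_URL:-http://localhost:9047}"
AUTH_HEADER="${AUTH_HEADER:-Authorization: Bearer REPLACE_ME}"

# Extract the jobState value from a job-status JSON document on stdin.
job_state() {
  sed -n 's/.*"jobState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Request cancellation, then poll until the job reaches a terminal state
# or the attempts run out.
# Usage: cancel_and_wait JOB_ID [ATTEMPTS] [SECONDS_BETWEEN_POLLS]
cancel_and_wait() {
  local job_id="$1" attempts="${2:-30}" interval="${3:-10}" state="" i
  # The cancel endpoint only acknowledges the request; it does not confirm it.
  curl -s -o /dev/null -X POST -H "$AUTH_HEADER" \
    "$DREMIO_URL/api/v3/job/$job_id/cancel"
  for ((i = 1; i <= attempts; i++)); do
    state=$(curl -s -H "$AUTH_HEADER" "$DREMIO_URL/api/v3/job/$job_id" | job_state)
    echo "poll $i: ${state:-unknown}"
    case "$state" in
      CANCELED|COMPLETED|FAILED) return 0 ;;
    esac
    sleep "$interval"
  done
  echo "job $job_id still reported as ${state:-unknown} after $attempts polls" >&2
  return 1
}
```

Calling `cancel_and_wait <job-id> 30 10` polls for roughly five minutes; a non-zero exit status means the API never reported a terminal state, which is the behaviour described above.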

Another query profile; this time it's a node.sys query that has been hanging for > 48 hours.

@doron Any ideas?

6b969e85-fa5d-4e8b-904d-5113cca7f04e.zip (24.5 KB)

@jdingler

As in other databases, cancelling a job is only a request to cancel. It would be good if we could get the server.log from the executors and, if possible, some jstacks from an executor to see whether it is stuck on something. To take the jstacks, log in to the executor as the same user that runs the Dremio process and run the script below. Please send us the server.log and the jstack outputs captured while the problem is actually happening.

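# Set DREMIO_PID to the process ID of the Dremio executor JVM before running.
for i in $(seq -w 3 1 300)
do
  # Take one thread dump per second, each in its own numbered file.
  jstack -l "$DREMIO_PID" > ThreadDump$i.txt
  sleep 1
done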