My Dremio cluster is configured to run up to 4 large queries and up to 100 small queries, as you can see below:
However, I’ve noticed that when a large query is running, the small queries submitted after it wait for it to finish before they start executing. Look at this example:
Does anyone know why this happens?
From the screenshot I do not see any queries enqueued. I see all of them spinning, which means they are executing. Am I missing something here?
Hi @balaji.ramaswamy! These spinning queries are very small, and they usually return the result set in one second or less. However, until the large query finishes, they remain in this status. I’ve seen this behavior with different queries and I wonder if it’s a bug.
My current cluster has 1 coordinator and 2 executors.
Would you be able to send me the profile of one such spinning job?
Sorry for the late reply. I’m attaching the query profile. It was supposed to be a very small query, one that usually returns its results in 2 seconds or less. At the time I extracted the profile, the query had been running for over 1 hour.
6afc69aa-3213-4964-b355-f5ca267bb5e2.zip (25.7 KB)
Hi @balaji.ramaswamy! Any findings on this issue? I’m still facing this problem.
From the profile you last attached, it looks like you lost one of the execution threads on ip-10-12-11-180.ec2.internal. You can see this in the profile under the 01 Phase, where all the other threads have quickly completed their work, while a final one is stuck “SENDING”. This means the coordinator is waiting to hear back from that thread but has not received any update. The query can’t complete until it gets that final piece of work.
If you see this again, you should find these “SENDING” threads in the profile and look at the logs from the executor they are located on. Some query might be generating an exception that kills a thread on that node.
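If it helps, the check described above can be roughly sketched in Python. This assumes the profile has been exported as JSON and that each thread entry is an object with a `state` string and an `endpoint.address` host name. Those field names, and the `executor-a.example` host, are assumptions for illustration and may not match the real profile layout in your Dremio version. Only the stuck host name is taken from the profile discussed in this thread.

```python
from collections import Counter


def find_entries_in_state(node, target_state="SENDING"):
    """Recursively yield every JSON object whose 'state' equals target_state.

    Walking the whole tree avoids depending on the exact nesting of the
    profile, which is not confirmed here.
    """
    if isinstance(node, dict):
        if node.get("state") == target_state:
            yield node
        for value in node.values():
            yield from find_entries_in_state(value, target_state)
    elif isinstance(node, list):
        for value in node:
            yield from find_entries_in_state(value, target_state)


def hosts_with_stuck_threads(profile, target_state="SENDING"):
    """Count matching thread entries per executor host."""
    counts = Counter()
    for entry in find_entries_in_state(profile, target_state):
        endpoint = entry.get("endpoint")  # assumed field name
        if isinstance(endpoint, dict):
            address = endpoint.get("address", "<unknown>")  # assumed field name
        else:
            address = "<unknown>"
        counts[address] += 1
    return counts


# Hypothetical, hand-written stand-in for a parsed profile (not real output).
sample_profile = {
    "fragmentProfile": [
        {
            "majorFragmentId": 1,
            "minorFragmentProfile": [
                {"minorFragmentId": 0, "state": "FINISHED",
                 "endpoint": {"address": "executor-a.example"}},
                {"minorFragmentId": 1, "state": "SENDING",
                 "endpoint": {"address": "ip-10-12-11-180.ec2.internal"}},
            ],
        }
    ]
}

for host, count in hosts_with_stuck_threads(sample_profile).items():
    print(f"{host}: {count} thread(s) still in SENDING")
```

For a real profile you would load the exported file with `json.load` and pass the result in place of `sample_profile`. If the profile encodes states as numbers rather than strings, `target_state` would need to be changed accordingly. The hosts it reports are the executors whose logs are worth checking first.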
That being said, you are using Dremio 3.1.x and we’ve made a number of improvements in the way Dremio handles these “bad” threads and “stuck” jobs in our 3.2.x release. I highly recommend upgrading if possible.
Great, @ben! Thank you for the follow-up. I’ll upgrade my Dremio cluster and let you know if the problem occurs again.