We have a problem with CPU usage staying high on some executors even though no jobs are being processed in Dremio. The only way to clear it is to delete the executor so it can restart fresh. We could not identify a specific job as the cause; it can start even when all jobs are successful. It hurts the performance of new jobs, cancelling jobs takes longer…
E.g. CPU usage on a K8s node running 1 executor with the problem (only 1 pod on this K8s worker, pod/dremio-executor-0):
Thread dump when CPU usage is off (obtained by running for i in $(seq -w 1 1 300); do jstack -l $i > ThreadDump$i.txt; sleep 1; done): TreadDump1.zip (6.9 KB)
Server logs generated by pod/dremio-executor-0 are definitely odd, with huge amounts of debug logs generated per minute. Sample from when it has the CPU problem: executor0_logs.zip (3.1 MB)
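With a few hundred dumps, it can help to count which threads are RUNNABLE in most of them, since those are the likeliest CPU consumers. The following is only a minimal sketch, assuming the standard jstack text layout (a quoted thread name on the header line, then a `java.lang.Thread.State:` line). The function names and the `worker-1`/`worker-2` thread names are hypothetical and not taken from the attached dumps.

```python
import re
from collections import Counter

# Header line of a thread entry in jstack output: starts with the quoted name.
HEADER_RE = re.compile(r'^"(?P<name>[^"]+)"')
# State line printed under each Java thread, e.g. "java.lang.Thread.State: RUNNABLE".
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)")


def runnable_threads(dump_text):
    """Return the names of threads reported as RUNNABLE in one dump."""
    names = []
    current = None
    for line in dump_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group("name")
            continue
        state = STATE_RE.match(line)
        if state and current is not None:
            if state.group(1) == "RUNNABLE":
                names.append(current)
            current = None  # state consumed; wait for the next header
    return names


def count_runnable(dumps):
    """Count, per thread name, in how many dumps it was RUNNABLE."""
    counts = Counter()
    for text in dumps:
        counts.update(set(runnable_threads(text)))
    return counts


# Hypothetical two-thread dump, only to show the expected input shape.
SAMPLE_DUMP = (
    '"worker-1" #10 prio=5 os_prio=0 tid=0x01 nid=0x02 runnable\n'
    "   java.lang.Thread.State: RUNNABLE\n"
    "\tat com.example.Busy.loop(Busy.java:1)\n"
    "\n"
    '"worker-2" #11 prio=5 os_prio=0 tid=0x03 nid=0x04 waiting on condition\n'
    "   java.lang.Thread.State: WAITING (parking)\n"
)

if __name__ == "__main__":
    for name, hits in count_runnable([SAMPLE_DUMP]).most_common():
        print(name, hits)
```

To use it on real data, read each ThreadDump file into a string and pass the list to `count_runnable`; threads that are RUNNABLE in nearly every dump while no job is running are the ones worth inspecting first.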
@allCag Very recently I found similar behavior where queries took more time to execute, and it turned out that the root debug logger was on. Can you send us your logback.xml from the executor so we can validate?
@allCag Can we go back to the default logback.xml, restart the executors, and then see whether the behavior stays the same? Although I do not see any debug enabled in the logback.xml you have sent.
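One way to double-check a logback.xml for verbose loggers is a small script rather than reading it by eye. This is a rough sketch, assuming the level is given either as a `level` attribute or as a nested `<level value="..."/>` element, both of which Logback accepts. The function name and the sample logger names are hypothetical, and levels set through variable substitution (`${...}`) are not resolved here.

```python
import xml.etree.ElementTree as ET

# Levels treated as "verbose" for the purpose of this check.
VERBOSE_LEVELS = {"DEBUG", "TRACE", "ALL"}


def debug_loggers(logback_xml):
    """Return (logger name, level) pairs for loggers set to a verbose level."""
    config = ET.fromstring(logback_xml)
    found = []
    for elem in config.iter():
        if elem.tag not in ("logger", "root"):
            continue
        # Level may be an attribute or a nested <level value="..."/> element.
        level = elem.get("level")
        if level is None:
            child = elem.find("level")
            if child is not None:
                level = child.get("value")
        if level and level.strip().upper() in VERBOSE_LEVELS:
            name = "ROOT" if elem.tag == "root" else elem.get("name", "")
            found.append((name, level.strip().upper()))
    return found


# Hypothetical configuration, only to show both ways of declaring a level.
SAMPLE_CONFIG = """<configuration>
  <logger name="com.example.a" level="DEBUG"/>
  <logger name="com.example.b"><level value="info"/></logger>
  <root><level value="debug"/></root>
</configuration>"""

if __name__ == "__main__":
    for name, level in debug_loggers(SAMPLE_CONFIG):
        print(name, level)
```

An empty result means no logger in the file is explicitly set to DEBUG or lower, which would match the observation that debug is not enabled in the configuration.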
We observed less recurrence of the problem when reducing to 1 concurrent reflection (it still happens but not for all executors). For this test with the factory default logback.xml we stressed Dremio allowing 3 concurrent reflections.
@allCag Those messages can be ignored. I am not able to follow why you have debug logging on. How is that related to the concurrency settings? If you are saying that at concurrency 3 the CPU is always high, it could be related. Can we try this
If the problem occurs on version 22, then it may be due to the TaskPool used. If you look at the code, there are two implementations of TaskPool: one in the dremio-ce-sabot-sheduler module, and the second in the Dremio core. By default, the implementation from dremio-ce-sabot-sheduler is included.
As a workaround that does not require modifying the open-source code: delete dremio-ce-sabot-sheduler from the /jar folder, and the TaskPool implementation from the Dremio core will be used instead.
On version 24.1.0 the problem seems to have gone away.