We have a problem with CPU usage staying high on an executor even though no jobs are being processed in Dremio. The only way we have found to clear it is to delete the executor so it can restart fresh. We could not identify a specific job as the cause; it can start even if all jobs are successful. It has an adverse impact on the performance of new jobs, cancelling jobs takes longer…
E.g. CPU usage on a K8s node with 1 executor showing the problem (only 1 pod on this K8s worker, pod/dremio-executor-0):
Thread dump when CPU usage is off (obtained by running for i in $(seq -w 1 1 300); do jstack -l $i > ThreadDump$i.txt; sleep 1; done):
TreadDump1.zip (6.9 KB)
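As written, the loop above passes the loop counter `$i` to jstack as the process ID, so a dump is only produced on iterations where the counter happens to match a live JVM. A minimal sketch of a variant that keeps the PID fixed and only varies the output file name is shown below. The helper name `capture_dumps` and the `DUMP_CMD` override are hypothetical, introduced here for illustration only.

```shell
# capture_dumps PID [COUNT] [INTERVAL]
# Hypothetical helper: takes COUNT thread dumps of one JVM, INTERVAL seconds apart.
# DUMP_CMD is an illustrative override; it defaults to "jstack -l".
capture_dumps() {
  local pid="$1" count="${2:-300}" interval="${3:-1}"
  local dump_cmd="${DUMP_CMD:-jstack -l}"
  local i
  for i in $(seq -w 1 "$count"); do
    # The PID stays fixed; only the zero-padded file suffix changes.
    # $dump_cmd is left unquoted on purpose so "jstack -l" splits into words.
    $dump_cmd "$pid" > "ThreadDump${i}.txt"
    sleep "$interval"
  done
}
```

For example, `capture_dumps 1 300 1` would take 300 dumps of PID 1, one second apart. Whether the executor JVM really is PID 1 inside the container is an assumption worth checking first, for instance with `ps` or `jps`.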
Server logs generated by pod/dremio-executor-0 are definitely odd, with huge amounts of DEBUG logs generated per minute. Sample taken while it had the CPU problem:
executor0_logs.zip (3.1 MB)
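To put a number on "huge amounts of DEBUG logs per minute", something like the following sketch could be run against a server log. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp and carries the level as a space-delimited `DEBUG` token; the actual pattern depends on the logback.xml in use. The helper name `debug_per_minute` is hypothetical.

```shell
# debug_per_minute LOGFILE
# Hypothetical helper: prints "YYYY-MM-DD HH:MM <count>" for DEBUG lines.
# Assumes the first 16 characters of each line are the timestamp
# truncated to the minute (e.g. "2023-01-01 12:34").
debug_per_minute() {
  awk '/ DEBUG / { counts[substr($0, 1, 16)]++ }
       END { for (m in counts) print m, counts[m] }' "$1" | sort
}
```

Comparing the per-minute counts from a healthy executor with those from one showing the CPU problem would indicate whether the log volume really differs between the two.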
Thanks for your help!
@allCag Very recently I found a similar behavior where queries took more time to execute, and it turned out that the root debug logger was on. Can you send us your logback.xml from the executor so we can validate?
Hi @balaji.ramaswamy ,
the configurations we use are as below:
logback.zip (582 bytes)
dremio-executor.zip (1.7 KB)
The logger values we use on this Dremio environment are the following:
@allCag Can we go back to the default logback.xml, restart the executors, and then see if the behavior stays the same? Although I do not see any debug enabled in the logback.xml you have sent.
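A quick way to double-check a logback.xml for verbose loggers is to search for any level attribute or `<level value="…"/>` element that mentions debug or trace. The sketch below is a rough text match, not an XML parse, and the helper name `find_verbose_loggers` is hypothetical. It also cannot tell what a `${…}` variable substitution resolves to at runtime, so a level injected through a system property would need to be checked separately.

```shell
# find_verbose_loggers FILE
# Hypothetical helper: prints line-numbered matches where a quoted value
# following "level" contains "debug" or "trace" (case-insensitive).
# Returns a non-zero status when nothing matches (grep's own behavior).
find_verbose_loggers() {
  grep -niE 'level[^>]*"[^"]*(debug|trace)[^"]*"' "$1"
}
```

An empty result from `find_verbose_loggers logback.xml` would be consistent with no debug logger being enabled in the file itself.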
@balaji.ramaswamy we used the default logback.xml contents available at dremio-cloud-tools/charts/dremio_v2/config at master · dremio/dremio-cloud-tools · GitHub and ran many reflections to stress Dremio.
We got the CPU problem on all executors this time with the following logs:
We observed less recurrence of the problem when reducing to 1 concurrent reflection (it still happens but not for all executors). For this test with the factory default logback.xml we stressed Dremio allowing 3 concurrent reflections.
However, we tried many different queue control and memory settings and still got the issue… Could it be linked?
@allCag Those messages can be ignored. I am not able to follow why you have debug logging on. How is that related to the concurrency settings? If you are saying that CPU is always high at concurrency 3, it could be related. Can we try this:
- Remove all debug from logback.xml
- Restart coordinator and executors
- Run with concurrency 1
- Run with concurrency 2
- Run with concurrency 3