We have recurring job crashes for a reason we can't identify. The job in Dremio says that executor1 lost its connection, but I cannot find the cause in the logs, and the Master node also crashes a few seconds before or after!
Dremio version 18.2 in a Kubernetes environment.
Specs of nodes:
1 Master, 55 GB (45 GB heap, 8 GB direct memory). I increased the heap significantly as a test, without any more success.
3 Executors, 55 GB each (10 GB heap, 43 GB direct memory)
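For reference, here is a minimal sketch of how that heap/direct split maps onto the standard dremio-env memory settings. The numbers simply mirror the sizes above; in our Kubernetes setup these values are injected through the deployment rather than edited by hand, so treat this as illustrative only:

```bash
# dremio-env (illustrative sketch; values mirror the sizes quoted above)

# Master / coordinator: 45 GB heap, 8 GB direct memory
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=46080
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192

# Executors: 10 GB heap, 43 GB direct memory
# DREMIO_MAX_HEAP_MEMORY_SIZE_MB=10240
# DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=44032
```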
Attached are debug logs + GC logs for all nodes: logs.zip (137.8 KB)
How can I send you the job profile privately?
Note: I did not get any hs_err_pid%p.log file on the Master or Executor nodes despite this setup.
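For context, "this setup" refers to JVM flags along these lines (a sketch; the variable name and paths are assumptions about how extra JVM options are passed in this deployment):

```bash
# Sketch of the crash/heap-dump related JVM flags we have in mind
# (hook variable and paths are assumptions about this deployment)
DREMIO_JAVA_SERVER_EXTRA_OPTS="$DREMIO_JAVA_SERVER_EXTRA_OPTS \
  -XX:ErrorFile=/opt/dremio/data/hs_err_pid%p.log \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/opt/dremio/data"
```

Note that if the process is killed by the kernel OOM killer (SIGKILL), the JVM never gets a chance to write an hs_err_pid*.log or an hprof dump, which could explain why nothing shows up.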
Can you help us identify the root cause of these node disconnections? Here is what we have already tried:
increasing memory on the executor nodes from 55 to 60 GB
stripping the SQL query down to the bone as a test (i.e. selecting only a few fields of the underlying raw reflection, a 40 GB reflection)
varying the volume of data (more or fewer days of December from the underlying raw reflection) → at most 3 days of data works at once!
On the 1st run the Master crashed; on the 2nd run only 1 executor did.
Also, we can see that the executors' memory is heavily used even though no heavy job is running on the nodes (it's a test environment). Where is the memory being used? By Dremio?
E.g. Executor 2 is on worker 5, with nothing else running on this node.
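One way to see where that memory actually sits is to compare what Kubernetes reports with the JVM's resident size and the OS-level cache inside the pod. A rough sketch, assuming ps and free are available in the image (namespace, pod and node names are placeholders for our environment):

```bash
# Placeholders: namespace "dremio", pod "dremio-executor-1", node "worker-5"
kubectl top node worker-5
kubectl top pod dremio-executor-1 -n dremio

# Inside the executor pod: JVM resident memory vs. OS buffers/cache
kubectl exec -n dremio dremio-executor-1 -- ps -o pid,rss,cmd -C java
kubectl exec -n dremio dremio-executor-1 -- free -m   # look at the "buff/cache" column
```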
Hello @balaji.ramaswamy,
Thanks for your OOM killer idea, which pointed us in the right direction.
However, as I explained in the previous post, there is a memory problem in the pod when we accumulate jobs.
The problem is as follows:
we reboot all nodes to start fresh memory-wise
we run the simple reflection job
Job is a success:
So why is this reflection cached? Is there a configuration setting to change this behaviour?
Following this 1st test, the same reflection is cancelled and run again. Memory usage spikes in the executors, and the Master node crashed after 23 minutes of processing.
Now the cached memory is still not released on the executor nodes… we are stuck in a bad loop here!
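To check whether that "cached" figure is really memory held by the Dremio process or just Linux page cache left over from scanning the 40 GB reflection, the cgroup stats inside an executor pod can be split into anonymous memory vs. cache. A sketch, assuming cgroup v1 (the pod name is a placeholder):

```bash
# cgroup v1 path shown; on cgroup v2 the file is /sys/fs/cgroup/memory.stat
kubectl exec -n dremio dremio-executor-1 -- \
  sh -c 'grep -E "^(rss|cache) " /sys/fs/cgroup/memory/memory.stat'
```

Page cache is normally reclaimable by the kernel under pressure, so if most of the figure is "cache" rather than "rss", it is not the same thing as the JVM holding on to heap or direct memory.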
Thanks in advance for your help.
@balaji.ramaswamy
We send all logs to stdout, so everything should be in the files (logback.yaml is included so you can check what we have set up).
GC files are included for the master coordinator and 1 executor.
The dmesg.out of the master coordinator node is included.
No hprof files were found. logs.zip (389.8 KB)
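For completeness, this is roughly how the kernel log can be captured and filtered for OOM-killer activity (a sketch; how you reach the worker node depends on the cluster):

```bash
# On the worker node hosting the master pod: human-readable timestamps so
# they can be lined up with the GC and server logs
dmesg -T > dmesg.out
grep -iE "out of memory|oom-killer|killed process" dmesg.out
```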
Test jobs:
Also, there are some strange errors in the executor logs at the end of test 1:
2022-01-20 12:11:12,499 [FABRIC-rpc-event-queue] DEBUG c.dremio.exec.store.dfs.PDFSProtocol - File /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300 does not exist
com.dremio.common.exceptions.UserException: File /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300 does not exist
Indeed, the reflection job succeeded (jobId 1e16bcd5-46e3-7b83-61ff-210a8aba0300) and the reflection works well for a query; a results folder can also be found. But for the Load Materialization metadata job there is no results folder on the executors.
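As a quick way to confirm this on each executor, something along these lines can be run (pod names are placeholders; the path and job id are the ones from the log line above):

```bash
for pod in dremio-executor-0 dremio-executor-1 dremio-executor-2; do
  echo "== $pod"
  kubectl exec -n dremio "$pod" -- \
    ls -ld /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300 \
    || echo "results folder not found on $pod"
done
```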
I don't see the Unix messages and GC files for the master. The master log you attached does not contain any startup sequence and, more importantly, not a single query is printed. It also covers only 8 minutes of activity.
First line
2022-01-24 14:13:06,836 [scheduler-22] DEBUG c.d.s.r.ReflectionServiceImpl - periodic refresh
Last line
2022-01-24 14:21:46,909 [scheduler-9] DEBUG c.d.s.s.LocalSchedulerService - Handling Instant null
Can you please do the following:
Run the job that crashes the master and send:
Job profile
Master server log that covers the crash
Master GC log that covers that time window
Also, please mention what time the crash happened and the time zone that time is given in.
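For a Kubernetes deployment, one possible way to gather those artifacts is sketched below (namespace, pod name and GC log path are assumptions; --previous captures the container instance that actually crashed if the pod has already restarted):

```bash
NS=dremio   # placeholder namespace

# Master server log (stdout) covering the crash
kubectl logs -n "$NS" dremio-master-0 --timestamps > master-server.log
kubectl logs -n "$NS" dremio-master-0 --previous --timestamps > master-server-previous.log

# Master GC log, if it is written to a file inside the pod (path is an assumption)
kubectl cp "$NS"/dremio-master-0:/opt/dremio/data/gc.log ./master-gc.log

# Note the crash time in an unambiguous time zone when reporting it
date -u
```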