Executor and Master Nodes crash on query

Hello,

We have recurring job crashes for a reason we can’t identify. The job in Dremio says that executor1 lost connection, but I cannot find the cause in the logs, and the Master node also crashes a few seconds before or after!

Dremio version 18.2 in a Kubernetes environment.
Specs of nodes:

  • 1 Master: 55 GB (45 GB heap, 8 GB direct memory). I strongly increased the heap for testing, without any more success.
  • 3 Executors: 55 GB each (10 GB heap, 43 GB direct memory)

Attached are debug logs + GC logs for all nodes.
logs.zip (137.8 KB)
How can I send you the job profile privately?

Note: I did not get any hs_err_pid%p.log file on the Master or Executor nodes despite this setup:
(screenshot of the setup)
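For reference, an hs_err_pid file is only written when the JVM itself crashes; if the Linux oom-killer SIGKILLs the process, the JVM gets no chance to write it, which would explain its absence. A sketch of the kind of setting involved (the variable name and path follow the usual dremio-env conventions and are an assumption here; the directory must be writable and persisted, or the file is lost on pod restart):

    # dremio-env (or the equivalent extra start parameters in the Helm chart);
    # %p is replaced by the JVM with the process id
    DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:ErrorFile=/opt/dremio/log/hs_err_pid%p.log"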

Can you help us identify the root cause of the node disconnections?

Regards,
Alex

Hello,
I ran additional tests on this matter:

  • increasing memory on the executor nodes from 55 to 60 GB
  • stripping the SQL query down to the bone as a test (i.e. selecting only a few fields of the available underlying raw reflection, a 40 GB reflection)
  • varying the volume of data (more or fewer days of December from the underlying raw reflection) → at most 3 days of data works at once!

Still stuck!
Thanks for your help!

@allCag I am wondering if the oom-killer is kicking in. Would you be able to send us /var/log/messages and /var/log/messages.1?
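On nodes that do not write /var/log/messages (common on some distributions and container-optimized images), the same evidence can usually be pulled from the kernel ring buffer or the systemd journal; a minimal sketch, run on the Kubernetes worker node rather than inside the pod:

    # Look for oom-killer activity in the kernel ring buffer
    dmesg -T | grep -iE 'oom|killed process'
    # Or, where systemd-journald is in use
    journalctl -k | grep -i oom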

Hello @balaji.ramaswamy,

I checked each node and can’t find a messages log. Do I need to configure something to get them?
(screenshot attached)

I got the following logs at the time of a crash; as you can see, there is no pressure on memory…


On the 1st run the Master crashed; on the 2nd run only one executor did.

Also, we can observe that the memory of the executors is heavily used when in fact no heavy job is running on the nodes (it’s a test environment). Where is the memory being used? By Dremio?
E.g. Executor 2 is on worker 5, with nothing else running on this node.


top on Executor 2:
(screenshot of top output)
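To tell whether that usage is Dremio process memory or reclaimable Linux page cache, a couple of standard commands help (a sketch; run on the worker node or inside the pod, and note the DremioDaemon pattern is the daemon's main class name):

    # Process memory vs page cache
    free -h
    grep -E 'MemAvailable|Buffers|^Cached' /proc/meminfo
    # Dremio's own resident memory, for comparison
    ps -eo rss,args | grep '[D]remioDaemon' | awk '{printf "%.1f GB RSS\n", $1/1024/1024}'

Memory reported under buff/cache is the kernel caching files that were recently read or written; it is reclaimed automatically under pressure and is separate from the JVM heap and direct memory limits.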

Hello @balaji.ramaswamy,
Thanks for your oom-killer idea, which pointed us in the right direction.
However, as I explained in the previous post, there is a memory problem in the pod when we accumulate jobs.
The problem is as follows:

  1. we reboot all nodes to start fresh memory-wise
  2. we run the simple reflection job
    The job is a success:

    The executor now shows a lot of cached memory that is not released!

    (screenshot of memory usage)
  3. As a test, I deleted the accelerator folder of the reflection that just ran. This releases all the cached memory on the node:



    So why is this reflection cached? Is there a configuration to change this behaviour? (One way to check and release this cache is sketched below, after this list.)
    Following this 1st test, the same reflection is cancelled and run again. Memory usage spikes on the executors and the Master node crashed after 23 minutes of processing.
    (screenshot of memory usage)
    Now the cached memory is still not released on the executor nodes… we are in a bad loop here!
    Thanks in advance for your help.
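One way to confirm that this “cached” memory is just the Linux page cache holding the materialization files, and to release it as a test without deleting anything (a sketch; needs root on the worker node, and the accelerator path is inferred from the results path seen in the logs, so adjust it to your paths.dist):

    # Size of the local reflection materializations
    du -sh /opt/dremio/data/pdfs/accelerator
    # Flush clean page cache (test only; harmless, but reads are temporarily slower), then re-check
    sync; sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
    free -h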

@allCag

Those are great details that bring us closer to the issue. Can you please send the below from the coordinator? (A quick collection sketch follows this list.)

  • server.out
  • On the coordinator, from the Linux prompt, the output of dmesg -T, e.g. dmesg -T > dmesg.out
  • are there any files in the coordinator log folder with an .hprof extension?
  • GC files (server.gc, server.gc[1-5]) from the coordinator
  • server.log (covering the time the master crashed) from the coordinator
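A minimal way to collect these in one go from the coordinator (a sketch; paths assume the default /opt/dremio/log layout and should be adjusted to the logback configuration):

    cd /opt/dremio/log
    dmesg -T > dmesg.out                  # kernel messages with readable timestamps
    ls -l *.hprof                         # any heap dumps?
    tar czf coordinator-diag.tgz server.out server.log server.gc* dmesg.out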

@balaji.ramaswamy
We send all logs to stdout, so everything should be in the files (logback.yaml is included so you can check what we have set up).
GC files are included for the master coordinator and 1 executor.
dmesg.out of the master coordinator node is included.
No hprof logs found.
logs.zip (389.8 KB)
Test jobs:

Also, there are some strange errors in the executor logs at the end of test 1:
2022-01-20 12:11:12,499 [FABRIC-rpc-event-queue] DEBUG c.dremio.exec.store.dfs.PDFSProtocol - File /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300 does not exist
com.dremio.common.exceptions.UserException: File /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300 does not exist
Indeed, the reflection job succeeded (jobId 1e16bcd5-46e3-7b83-61ff-210a8aba0300) and the reflection works well for a query; a results folder can also be found. But for the Load Materialization metadata job there is no results folder on the executors.
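For anyone hitting the same message, a minimal way to check whether the results folder exists on a given executor (the path is taken from the log line above; adjust if your paths differ):

    # On each executor pod
    ls -ld /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300
    ls -lt /opt/dremio/data/pdfs/results | head    # most recently modified result folders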

@allCag Before we dive deeper into your cluster:

  • Executor has a physical RAM of 40000000k ~ 38 GB
  • Direct memory on the executor is -XX:MaxDirectMemorySize=34359738368 = 32 GB
  • Heap memory on the executor is -XX:MaxHeapSize=8589934592 = 8GB

There is no memory left for the OS (8 GB heap + 32 GB direct = 40 GB, more than the 38 GB available).

  • Coordinator has a physical RAM of 40000000k ~ 38 GB
  • Direct memory on the coordinator is -XX:MaxDirectMemorySize=10737418240 = 10 GB
  • Heap memory on the coordinator is -XX:MaxHeapSize=32212254720 = 30 GB

There is no memory left for the OS (30 GB heap + 10 GB direct = 40 GB, more than the 38 GB available).

[Tue Jan 18 00:29:22 2022] msgr-worker-2 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=993

Let’s make the heap on the coordinator 16 GB and direct memory 12 GB.
Let’s make the heap on the executor 8 GB and direct memory 26 GB.

Now, if the coordinator crashes, let’s see why; and if the executor runs out of memory, let’s analyze profiles to find out why.
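For reference, one way this split is usually expressed (a sketch; the variable names below are the standard dremio-env ones, and on Kubernetes they are typically driven by the Helm chart values rather than edited by hand):

    # Coordinator
    DREMIO_MAX_HEAP_MEMORY_SIZE_MB=16384      # 16 GB heap
    DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=12288    # 12 GB direct
    # Executors
    DREMIO_MAX_HEAP_MEMORY_SIZE_MB=8192       # 8 GB heap
    DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=26624    # 26 GB direct

That leaves roughly 38 − (16 + 12) = 10 GB of headroom on the coordinator and 38 − (8 + 26) = 4 GB on each executor.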

Thanks
Bali

@balaji.ramaswamy here are the logs of the job that crashed the Master Node (this time).
logs_master_ko.zip (357,7 Ko)

Resource setup:

Node       Total memory on Dremio pod (MB)   Heap memory (MB)   Direct memory (MB)   OS memory (MB)
Master     38912                             16384              18432                4096
Executor   38912                             8192               26624                4096
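To double-check what the JVMs actually picked up inside each pod, the start-up flags can be read back from the running process (a sketch; the pattern matches Dremio's main class com.dremio.dac.daemon.DremioDaemon and may need adjusting):

    # Inside the pod: print the heap / direct memory options the daemon was started with
    ps -eo args | grep '[D]remioDaemon' | tr ' ' '\n' | grep -E 'Xmx|Xms|MaxHeapSize|MaxDirectMemorySize'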

Thanks,
Alex

@alex.shi

I don’t see the Unix messages and GC files for the master. The master log you have attached does not contain any startup sequence and, more importantly, not a single query printed. It also covers only 8 minutes of activity.

First line
2022-01-24 14:13:06,836 [scheduler-22] DEBUG c.d.s.r.ReflectionServiceImpl - periodic refresh

Last line
2022-01-24 14:21:46,909 [scheduler-9] DEBUG c.d.s.s.LocalSchedulerService - Handling Instant null

Can you please do the following?
Run the job that crashes the master and send:

  • Job profile
  • master server log that covers the crash
  • Master GC log that covers the time

Also, please mention what time the crash happened and the time zone that time is given in.
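A minimal way to pin down the crash time and its time zone on the coordinator, and to pull the matching window out of the log (a sketch; the timestamp below is only an example):

    date && date -u       # local time plus the same instant in UTC
    # Example: extract the minutes around a crash at 14:2x on 2022-01-24
    grep '^2022-01-24 14:2' server.log > crash-window.log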