Executor and Master Nodes crash on query

Hello,

We have recurring job crashes for a reason we can’t identify. The job in Dremio says that executor1 lost connection, but I cannot find the cause in the logs, and the Master node also crashes a few seconds before or after!

Dremio version 18.2 in a Kubernetes environment.
Specs of nodes:

  • 1 Master: 55 GB (45 GB heap, 8 GB direct memory). I strongly increased the heap for testing, without any more success.
  • 3 Executors: 55 GB each (10 GB heap, 43 GB direct memory)

Attached are debug logs + GC logs for all nodes.
logs.zip (137.8 KB)
How can I send you the job profile privately?

Note: I did not get any hs_err_pid%p.log file on the Master or Executor nodes despite this setup:
(screenshot of the setup)
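For reference, an hs_err_pid file is only written when the JVM itself crashes; if the Linux oom-killer SIGKILLs the process, the JVM gets no chance to write it, which would explain its absence. A sketch of the kind of setting involved (the variable name and path follow the usual dremio-env conventions and are an assumption here; the directory must be writable and persisted, or the file is lost on pod restart):

    # dremio-env (or the equivalent extra start parameters in the Helm chart);
    # %p is replaced by the JVM with the process id
    DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:ErrorFile=/opt/dremio/log/hs_err_pid%p.log"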

Can you help us identify the root cause of the node disconnections?

Regards,
Alex

Hello,
I ran additional tests on this matter:

  • increasing memory on the executor nodes from 55 to 60 GB
  • stripping the SQL query down to the bone as a test (i.e. selecting only a few fields of the available underlying raw reflection, a 40 GB reflection)
  • varying the volume of data (more or fewer days of December from the underlying raw reflection) → at most 3 days of data works at once!

Still stuck!
Thanks for your help!

@allCag I am wondering if the oom-killer is kicking in. Would you be able to send us /var/log/messages and /var/log/messages.1?
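On nodes that do not write /var/log/messages (common on some distributions and container-optimized images), the same evidence can usually be pulled from the kernel ring buffer or the systemd journal; a minimal sketch, run on the Kubernetes worker node rather than inside the pod:

    # Look for oom-killer activity in the kernel ring buffer
    dmesg -T | grep -iE 'oom|killed process'
    # Or, where systemd-journald is in use
    journalctl -k | grep -i oom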

Hello @balaji.ramaswamy,

I checked each node and can’t find a messages log. Do I need to configure something to get them?
(screenshot attached)

I got the following logs at the time of a crash; as you can see, there is no pressure on memory…


On the 1st run the Master crashed; on the 2nd run only one executor did.

Also, we can observe that the memory of the executors is heavily used when in fact no heavy job is running on the nodes (it’s a test environment). Where is the memory being used? By Dremio?
E.g. Executor 2 is on worker 5, with nothing else running on this node.


top on Executor 2:
(screenshot of top output)
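To tell whether that usage is Dremio process memory or reclaimable Linux page cache, a couple of standard commands help (a sketch; run on the worker node or inside the pod, and note the DremioDaemon pattern is the daemon's main class name):

    # Process memory vs page cache
    free -h
    grep -E 'MemAvailable|Buffers|^Cached' /proc/meminfo
    # Dremio's own resident memory, for comparison
    ps -eo rss,args | grep '[D]remioDaemon' | awk '{printf "%.1f GB RSS\n", $1/1024/1024}'

Memory reported under buff/cache is the kernel caching files that were recently read or written; it is reclaimed automatically under pressure and is separate from the JVM heap and direct memory limits.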

Hello @balaji.ramaswamy,
Thanks for your oom-killer idea, which pointed us in the right direction.
However, as I explained in the previous post, there is a memory problem in the pod when we accumulate jobs.
The problem is as follows:

  1. we reboot all nodes to start fresh memory-wise
  2. we run the simple reflection job
    The job is a success:

    The executor now shows a lot of cached memory that is not released!

    (screenshot of memory usage)
  3. As a test, I deleted the accelerator folder of the reflection that just ran. This releases all the cached memory on the node:



    So why is this reflection cached? Is there a configuration to change this behaviour? (One way to check and release this cache is sketched below, after this list.)
    Following this 1st test, the same reflection is cancelled and run again. Memory usage spikes on the executors and the Master node crashed after 23 minutes of processing.
    (screenshot of memory usage)
    Now the cached memory is still not released on the executor nodes… we are in a bad loop here!
    Thanks in advance for your help.
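One way to confirm that this “cached” memory is just the Linux page cache holding the materialization files, and to release it as a test without deleting anything (a sketch; needs root on the worker node, and the accelerator path is inferred from the results path seen in the logs, so adjust it to your paths.dist):

    # Size of the local reflection materializations
    du -sh /opt/dremio/data/pdfs/accelerator
    # Flush clean page cache (test only; harmless, but reads are temporarily slower), then re-check
    sync; sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
    free -h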

@allCag

Those are great details that bring us closer to the issue. Can you please send the below from the coordinator? (A quick collection sketch follows this list.)

  • server.out
  • On the coordinator, from the Linux prompt, the output of dmesg -T, e.g. dmesg -T > dmesg.out
  • are there any files in the coordinator log folder with an .hprof extension?
  • GC files (server.gc, server.gc[1-5]) from the coordinator
  • server.log (covering the time the master crashed) from the coordinator
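A minimal way to collect these in one go from the coordinator (a sketch; paths assume the default /opt/dremio/log layout and should be adjusted to the logback configuration):

    cd /opt/dremio/log
    dmesg -T > dmesg.out                  # kernel messages with readable timestamps
    ls -l *.hprof                         # any heap dumps?
    tar czf coordinator-diag.tgz server.out server.log server.gc* dmesg.out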

@balaji.ramaswamy
We send all logs to stdout, so everything should be in the files (logback.yaml is included so you can check what we have set up).
GC files are included for the master coordinator and 1 executor.
dmesg.out of the master coordinator node is included.
No hprof logs found.
logs.zip (389.8 KB)
Test jobs:

Also, there are some strange errors in the executor logs at the end of test 1:
2022-01-20 12:11:12,499 [FABRIC-rpc-event-queue] DEBUG c.dremio.exec.store.dfs.PDFSProtocol - File /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300 does not exist
com.dremio.common.exceptions.UserException: File /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300 does not exist
Indeed, the reflection job succeeded (jobId 1e16bcd5-46e3-7b83-61ff-210a8aba0300) and the reflection works well for a query; a results folder can also be found. But for the Load Materialization metadata job there is no results folder on the executors.
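For anyone hitting the same message, a minimal way to check whether the results folder exists on a given executor (the path is taken from the log line above; adjust if your paths differ):

    # On each executor pod
    ls -ld /opt/dremio/data/pdfs/results/1e16bcd5-46e3-7b83-61ff-210a8aba0300
    ls -lt /opt/dremio/data/pdfs/results | head    # most recently modified result folders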

@allCag Before we dive deeper into your cluster:

  • Executor has a physical RAM of 40000000k ~ 38 GB
  • Direct memory on the executor is -XX:MaxDirectMemorySize=34359738368 = 32 GB
  • Heap memory on the executor is -XX:MaxHeapSize=8589934592 = 8GB

There is no memory left for the OS (8 GB heap + 32 GB direct = 40 GB, more than the 38 GB available).

  • Coordinator has a physical RAM of 40000000k ~ 38 GB
  • Direct memory on the coordinator is -XX:MaxDirectMemorySize=10737418240 = 10 GB
  • Heap memory on the coordinator is -XX:MaxHeapSize=32212254720 = 30 GB

There is no memory left for the OS (30 GB heap + 10 GB direct = 40 GB, more than the 38 GB available).

[Tue Jan 18 00:29:22 2022] msgr-worker-2 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=993

Let’s make the heap on the coordinator 16 GB and direct memory 12 GB.
Let’s make the heap on the executor 8 GB and direct memory 26 GB.

Now, if the coordinator crashes, let’s see why; and if the executor runs out of memory, let’s analyze profiles to find out why.
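For reference, one way this split is usually expressed (a sketch; the variable names below are the standard dremio-env ones, and on Kubernetes they are typically driven by the Helm chart values rather than edited by hand):

    # Coordinator
    DREMIO_MAX_HEAP_MEMORY_SIZE_MB=16384      # 16 GB heap
    DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=12288    # 12 GB direct
    # Executors
    DREMIO_MAX_HEAP_MEMORY_SIZE_MB=8192       # 8 GB heap
    DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=26624    # 26 GB direct

That leaves roughly 38 − (16 + 12) = 10 GB of headroom on the coordinator and 38 − (8 + 26) = 4 GB on each executor.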

Thanks
Bali

@balaji.ramaswamy here are the logs of the job that crashed the Master Node (this time).
logs_master_ko.zip (357,7 Ko)

Resource setup:

Node       Total memory on Dremio pod (MB)   Heap memory (MB)   Direct memory (MB)   OS memory (MB)
Master     38912                             16384              18432                4096
Executor   38912                             8192               26624                4096
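To double-check what the JVMs actually picked up inside each pod, the start-up flags can be read back from the running process (a sketch; the pattern matches Dremio's main class com.dremio.dac.daemon.DremioDaemon and may need adjusting):

    # Inside the pod: print the heap / direct memory options the daemon was started with
    ps -eo args | grep '[D]remioDaemon' | tr ' ' '\n' | grep -E 'Xmx|Xms|MaxHeapSize|MaxDirectMemorySize'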

Thanks,
Alex

@alex.shi

I don’t see the Unix messages and GC files for the master. The master log you have attached does not contain any startup sequence and, more importantly, not a single query printed. It also covers only 8 minutes of activity.

First line
2022-01-24 14:13:06,836 [scheduler-22] DEBUG c.d.s.r.ReflectionServiceImpl - periodic refresh

Last line
2022-01-24 14:21:46,909 [scheduler-9] DEBUG c.d.s.s.LocalSchedulerService - Handling Instant null

Can you please do the following?
Run the job that crashes the master and send:

  • Job profile
  • master server log that covers the crash
  • Master GC log that covers the time

Also, please mention what time the crash happened and the time zone that time is given in.
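A minimal way to pin down the crash time and its time zone on the coordinator, and to pull the matching window out of the log (a sketch; the timestamp below is only an example):

    date && date -u       # local time plus the same instant in UTC
    # Example: extract the minutes around a crash at 14:2x on 2022-01-24
    grep '^2022-01-24 14:2' server.log > crash-window.log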