Any way to extend timeout on slow running machines to avoid Node lost master status error

asdf01 · August 18, 2025, 11:34pm

We have some Dremio envs set up on slow-running machines that have paging turned on. We are happy for those envs to run slowly, but we would like them not to fall over. Every now and then the master node falls over with the "node has lost the master status" error.

We have tried setting zk.client.session.timeout to 30 minutes to avoid this error. But is that the setting for the connection in the other direction, determining when ZooKeeper should give up on the client? Is there a timeout setting that makes the executor nodes wait a bit longer for the master to respond?
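One thing that may be worth ruling out, sketched here under stated assumptions: a ZooKeeper server does not grant whatever session timeout the client asks for. It clamps the request between its own minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The sketch below assumes an external ZooKeeper ensemble with tickTime at 2000 ms and no overrides; the function name is made up for illustration, and none of this is confirmed for this cluster or for a Dremio embedded ZooKeeper. It also would not by itself explain a difference between 24.2.2 and 26.0.0.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_session_timeout_ms=None,
                                  max_session_timeout_ms=None):
    """Return the session timeout a ZooKeeper server would grant.

    Hypothetical helper for illustration. The defaults mirror the
    documented ZooKeeper server defaults: min = 2 * tickTime and
    max = 20 * tickTime. tick_time_ms=2000 is an assumed server setting.
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    # The server clamps the client's request into [lower, upper].
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    requested = 30 * 60 * 1000  # the 30-minute client setting, in milliseconds
    # With assumed server defaults the request is capped at 20 * tickTime.
    print(negotiated_session_timeout_ms(requested))
    # If the server's maxSessionTimeout is raised to match, the request survives.
    print(negotiated_session_timeout_ms(requested, max_session_timeout_ms=requested))
```

If this applies, the client-side setting alone would not be enough, and the server-side maxSessionTimeout would need to be raised as well.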

balaji.ramaswamy · August 21, 2025, 5:31pm

@asdf01 There are multiple pieces of info here. "Node lost master status" just means the master did not respond within the ZK timeout; it could be GC related, or the CPU could be at 100% so that it is not responding. The executor is not in play here. We do not recommend turning on swap for Dremio. Do you have the master log from when this happened? I can take a look and see if we find some clues.

asdf01 · August 26, 2025, 11:44pm

Hey @balaji.ramaswamy thanks for looking into this issue. It looks like I might have misunderstood the issue slightly when I initially created the post.

The master node crashes and restarts we have experienced recently have all been on 26.0.0 nodes. We run a mixture of 24.2.2 nodes and 26.0.0 nodes.

Do you know if the “zk.client.session.timeout” config is still respected in the 26.0.0 version?

Node lost master status just means the master did not respond within the ZK timeout, it could be GC related or CPU 100% that it is not responding

We do not recommend turning on swap for Dremio

I understand your position, but we are trying to make do with the limited resources that we have. Turning on swap is the approach we are investigating right now. I understand that the Dremio staff would not like to recommend an approach which reflects badly on Dremio’s performance. But considering our running cost limitations, it would be good if you could help us explore the full extent of the possibilities of this approach.

Currently it looks like 24.2.2 is respecting that 30 minute timeout setting but 26.0.0 is not. It would be good if you could give us some insight into what the equivalent setting might be in 26.0.0.

Thanks

balaji.ramaswamy · September 1, 2025, 4:41pm

@asdf01 Extending the ZK client timeout should not have changed, but it will only be effective if that is the root cause. In your log file, do you see “SUSPENDED” or “LOST”? Any chance you can send the master log the next time this happens?
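A minimal sketch of how the log check could be done, assuming the master log is a plain-text file and the connection states appear as the uppercase words SUSPENDED, LOST, or RECONNECTED. The function name and the log path passed on the command line are hypothetical, and the pattern may also match unrelated lines that happen to contain those words.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Uppercase connection-state words to look for (assumed to appear verbatim).
STATE_PATTERN = re.compile(r"\b(SUSPENDED|LOST|RECONNECTED)\b")


def find_zk_state_lines(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, state, text) for each line mentioning a state."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STATE_PATTERN.search(line)
        if match:
            hits.append((number, match.group(1), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python find_zk_states.py <path-to-master-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, state, text in find_zk_state_lines(handle):
            print(f"{number}: [{state}] {text}")
```

Comparing the timestamps of any SUSPENDED and LOST lines with the time of the restart gives a rough idea of how long the master was unresponsive before it gave up its master status.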
