We have some Dremio environments set up on slow machines that have paging (swap) turned on. We are happy for those environments to run slowly, but we would like them not to fall over. Every now and then the master node falls over with the “node has lost the master status” error.
We have tried setting zk.client.session.timeout to 30 minutes to avoid this error. But is that the setting for the connection in the other direction, determining when ZooKeeper should give up on the client? Is there a timeout setting that makes the executor nodes wait a bit longer for the master to respond?
@asdf01 There are multiple pieces of info here. “Node lost master status” just means the master did not respond within the ZK timeout. The master may not be responding because of GC pauses or because the CPU is at 100%. The executor is not in play here. We do not recommend turning on swap for Dremio. Do you have the master log from when this happened? I can take a look and see if we can find some clues.
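For context on why a very large client-side value may not take effect: a stock ZooKeeper server bounds the session timeout a client requests between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The Python sketch below only illustrates that arithmetic. The function name is made up for this example, and whether the ZooKeeper instance behind a given Dremio cluster runs with these defaults is an assumption that would need to be checked.

```python
# Illustration only: how a stock ZooKeeper server bounds a requested session timeout.
# Assumption: ZooKeeper defaults (tickTime = 2000 ms, min = 2 * tickTime,
# max = 20 * tickTime). A real cluster may override any of these values.

def negotiated_timeout_ms(requested_ms, tick_time_ms=2000, min_ms=None, max_ms=None):
    """Return the session timeout the server would grant for a client request."""
    low = 2 * tick_time_ms if min_ms is None else min_ms
    high = 20 * tick_time_ms if max_ms is None else max_ms
    # The server clamps the requested value into the [low, high] range.
    return max(low, min(requested_ms, high))


thirty_minutes_ms = 30 * 60 * 1000
print(negotiated_timeout_ms(thirty_minutes_ms))  # prints 40000 with the defaults above
```

Under those assumed defaults, a 30 minute request would be granted as 40 seconds, so it may be worth checking the server-side limits as well as the client setting.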
Hey @balaji.ramaswamy, thanks for looking into this issue. It looks like I slightly misunderstood the issue when I initially created the post.
The master node crashes and restarts we have experienced recently have all been on 26.0.0 nodes. We run a mixture of 24.2.2 nodes and 26.0.0 nodes.
Do you know if the “zk.client.session.timeout” config is still respected in the 26.0.0 version?
Node lost master status just means the master did not respond within the ZK timeout, it could be GC related or CPU 100% that it is not responding
We do not recommend turning on swap for Dremio
I understand your position, but we are trying to make do with the limited resources that we have, and turning on swap is the approach we are investigating right now. I understand that Dremio staff would not want to recommend an approach that reflects badly on Dremio’s performance. But given our running-cost limitations, it would be good if you could help us explore the full extent of what this approach can do.
Currently it looks like 24.2.2 is respecting the 30-minute timeout setting but 26.0.0 is not. It would be good if you could give us some insight into what the equivalent setting might be in 26.0.0.
@asdf01 The behavior of the extended ZK client timeout should not have changed, but it will only be effective if the timeout is the root cause. In your log file, do you see “SUSPENDED” or “LOST”? Any chance you can send the master log the next time this happens?
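A quick way to check for those states is to scan the master log for them. The Python sketch below assumes the states appear as the plain uppercase words SUSPENDED, LOST, or RECONNECTED somewhere in a log line. The sample lines and the function name are hypothetical and are not taken from a real Dremio log.

```python
import re

# Connection-state keywords to look for. The exact log line format is an assumption.
STATE_PATTERN = re.compile(r"\b(SUSPENDED|LOST|RECONNECTED)\b")


def find_zk_state_lines(lines):
    """Return (line_number, state, text) for every line that mentions a state."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STATE_PATTERN.search(line)
        if match:
            hits.append((number, match.group(1), line.rstrip("\n")))
    return hits


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "2024-01-01 10:00:00 INFO  connection state changed: SUSPENDED\n",
    "2024-01-01 10:00:05 INFO  query planning finished\n",
    "2024-01-01 10:00:45 WARN  connection state changed: LOST\n",
]

for number, state, text in find_zk_state_lines(sample_lines):
    print(f"line {number}: {state}: {text}")
```

To run it against a real log, pass an open file object instead of the sample list, for example `find_zk_state_lines(open(path_to_master_log))`, where `path_to_master_log` is wherever your master node writes its server log.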