Hi everyone,
I’m currently setting up a Dremio cluster with multiple coordinator and executor nodes. However, I’ve run into an issue: when the master node goes down, the standby coordinators do not automatically take over as the master.
Dremio Cluster Setup
My Dremio cluster consists of:
- 1 Master Coordinator Node
- 2 Standby Coordinator Nodes
- 2 Executor Nodes
- 3 External Zookeeper Nodes
- NFS for metadata storage
- MinIO bucket for distributed storage
Dremio Configuration
Below are the configurations for each node:
1. Master Coordinator Node (dremio.conf)
paths: {
  local: "/data/dremio-metadata"
  dist: "dremioS3:///dremio-distributed"
}
services: {
  web-admin {
    host: "0.0.0.0"
    port: 9191
  }
  coordinator {
    enabled: true
    master.enabled: true
    master.embedded-zookeeper.enabled: false
    client-endpoint {
      port: 31010
    }
  }
  executor {
    enabled: false
  }
  flight {
    use_session_service: true
  }
}
zookeeper: "dremio-1.example.internal:2181,dremio-2.example.internal:2181,dremio-3.example.internal:2181/dremio"
2 & 3. Standby Coordinator Nodes (dremio.conf)
paths: {
  local: "/data/dremio-metadata"
  dist: "dremioS3:///dremio-distributed"
}
services: {
  web-admin {
    host: "0.0.0.0"
    port: 9191
  }
  coordinator {
    enabled: true
    master.enabled: false
    master.embedded-zookeeper.enabled: false
    client-endpoint {
      port: 31010
    }
  }
  executor {
    enabled: false
  }
  flight {
    use_session_service: true
  }
}
zookeeper: "dremio-1.example.internal:2181,dremio-2.example.internal:2181,dremio-3.example.internal:2181/dremio"
4 & 5. Executor Nodes (dremio.conf)
paths: {
  local: "/data/dremio-metadata"
  dist: "dremioS3:///dremio-distributed"
}
services: {
  web-admin {
    host: "0.0.0.0"
    port: 9191
  }
  coordinator {
    enabled: false
    master.enabled: false
    master.embedded-zookeeper.enabled: false
    client-endpoint {
      port: 31010
    }
  }
  executor {
    enabled: true
  }
  flight {
    use_session_service: true
  }
}
zookeeper: "dremio-1.example.internal:2181,dremio-2.example.internal:2181,dremio-3.example.internal:2181/dremio"
6. Zookeeper Configuration (zoo.cfg)
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/data/zookeeper
dataLogDir=/var/log/zookeeper
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
maxClientCnxns=60
standaloneEnabled=false
admin.enableServer=true
server.1=dremio-1.example.internal:2888:3888;2181
server.2=dremio-2.example.internal:2888:3888;2181
server.3=dremio-3.example.internal:2888:3888;2181
Each Zookeeper node has a unique myid file:
- Zookeeper Node 1: 1
- Zookeeper Node 2: 2
- Zookeeper Node 3: 3
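Before testing failover, I verify that the external ZooKeeper ensemble itself is healthy and that the Dremio nodes have registered in it. This is only a rough sketch of how I check it, assuming ZooKeeper is installed under /opt/zookeeper (adjust paths to your layout). The srvr four-letter command is whitelisted by default on recent ZooKeeper versions; other four-letter commands may need 4lw.commands.whitelist in zoo.cfg.

# Ask each ZooKeeper node whether it is the leader or a follower
echo srvr | nc dremio-1.example.internal 2181 | grep Mode
echo srvr | nc dremio-2.example.internal 2181 | grep Mode
echo srvr | nc dremio-3.example.internal 2181 | grep Mode

# Alternatively, on each ZooKeeper host
/opt/zookeeper/bin/zkServer.sh status

# Confirm the Dremio services register under the /dremio chroot
/opt/zookeeper/bin/zkCli.sh -server dremio-1.example.internal:2181 ls /dremio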
Issue
The cluster starts normally with this configuration, and everything works as expected. However, when I simulate a master node failure, the standby coordinator nodes do not automatically take over as the master.
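For context, this is roughly how I simulate the failure and watch the standbys. It assumes Dremio runs as a systemd service named dremio and uses the RPM-style layout where logs land in /var/log/dremio (a tarball install logs under the Dremio home directory instead).

# On the master coordinator: stop Dremio to simulate a crash
sudo systemctl stop dremio

# On each standby coordinator: watch server.log for any election/takeover activity
tail -f /var/log/dremio/server.log

# Sanity check that the standbys can still reach the ZooKeeper ensemble
echo srvr | nc dremio-1.example.internal 2181 | grep Mode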
Has anyone encountered a similar issue? Am I missing any configuration to enable automatic failover?
Any help would be greatly appreciated!
Best regards,
Arman